There are many fish in the sea ... but there's only one bacon: April 2016

Saturday, April 30, 2016

Horrible Movie Review: The Switch

Yesterday I saw one of the stupidest movies I've ever seen. Not necessarily a "bad" movie, just one of the stupidest ones. This movie is "The Switch", a 2010 rom-com starring Jason Bateman and Jennifer Aniston.

WARNING SPOILERS AHEAD

In the movie, Wally (Bateman) is best friends with Kassie (Aniston). Kassie has decided she'd like to have a baby and wants to go the sperm bank route. Wally suggests he could be the father, because he's had a crush on Kassie for a long time. Kassie objects, but they remain friends.

Through a comedy of errors Wally hijacks the sperm bank donation and substitutes his own sperm. So ... Kassie's baby is actually his own.

Kassie begins dating the (presumed) biological father Roland at some point in the future. Despite Roland being in the picture, Wally becomes an even better father figure to Kassie's son (his biological son) than the presumed biological father.

At the climax of the movie, Wally reveals he is the biological father to both Kassie and Roland.

Now at this point in the movie, what do you think should happen?

A) Kassie enters a murderous rage and kills Wally
B) A more mild form of violence occurs, such as a beating, and possibly through proxy of another individual (such as Roland)
C) Kassie gets a lawyer and sues Wally for everything he owns
D) Kassie is pissed as hell, moves away, and they never see each other again
E) Kassie realizes Wally is her true love and they live happily ever after

Living in a civilized society and not believing Aniston wants to move into a new genre of film, I suppose 'C' or 'D' would be the most likely thing to happen. However, I think 'A' or 'B' would have been a reasonably expected response given the circumstances of how horrible this was.

But since this is a Hollywood rom-com, the answer is in fact 'E'. About the stupidest thing I think could possibly happen.

Thursday, April 28, 2016

HPC vs Big Data - Part 3

In several prior posts comparing Big Data to HPC, I asked a question about how HPC and Big Data were different and why it was difficult for HPC people to "get" Big Data and vice versa.

I had another idea of how to present it.

Earlier I stated that:

HPC: Computation Time >> IO Time

Big Data: IO Time >> Computation Time

However, this by itself can sometimes be confusing. Some algorithms are programmed in both the HPC world (e.g. MPI) and the Big Data world (e.g. Hadoop/Spark). So how can this dichotomy be represented?

The definition above can be confusing because it suggests things are black and white, which is of course not true. In reality, the trade-off between I/O time and computation time lies along a sliding scale. To represent this, let's look at this simple (terribly drawn) diagram.

In it, we have increasing computation on the X axis going right and increasing I/O on the Y axis going up. Based on my prior definition, we can consider HPC jobs to be those along the bottom right, where computation is much greater than the I/O. Big Data is the jobs in the upper left of this diagram, where I/O time is much greater than computation time.

Naturally, if your users and/or code sit significantly in one area rather than the other, you'll prioritize hardware, software, etc. differently than those in the other area. This leads to the very different universes of cluster computing.

What (hopefully) clarifies some of the confusion on HPC vs Big Data is the fact that at some point in the middle these two worlds sort of meet. There are Data + Computation combinations where it's nebulous which direction is superior. A user might in fact be indifferent about which type of hardware/software situation is better for them. Users in the middle are far more likely to simply gravitate towards the system they are more familiar with (e.g. if you know Java, use Hadoop/Spark; if you know C, use MPI) or simply have access to (if you are familiar with AWS, just use it; if you have access to an MPI cluster, just use it).
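To make the sliding scale concrete, here is a minimal sketch that places a job by comparing its computation time to its I/O time. The function name `classify_job` and the 10x cutoff are assumptions made up for illustration; nothing in the posts defines where ">>" actually begins.

```python
def classify_job(compute_seconds, io_seconds, ratio=10.0):
    """Place a job on the computation-vs-I/O scale.

    `ratio` is a hypothetical cutoff for how lopsided a job must be
    before it clearly belongs to one world or the other.
    """
    if compute_seconds >= ratio * io_seconds:
        return "HPC"        # Computation Time >> IO Time
    if io_seconds >= ratio * compute_seconds:
        return "Big Data"   # IO Time >> Computation Time
    return "either"         # the nebulous middle of the diagram


print(classify_job(3600, 30))   # an hour of compute, 30 seconds of I/O
print(classify_job(20, 1800))   # 20 seconds of compute, 30 minutes of I/O
print(classify_job(600, 400))   # roughly balanced
```

Jobs that land in "either" are the ones where familiarity and access, rather than the workload itself, will probably decide the platform.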

I hope this clarifies the confusion around implementations that may be done in both worlds.

Saturday, April 23, 2016

Good vs. Bad Movie Fight Scenes

I recently came upon the following video:

It talks about "rhythm" and sequencing in fight scenes. There's a great quote around 4:25: "the audience doesn't know the rhythm's there until it's not there."

I started to pay attention to this in some martial arts films, and you can't help but notice it more once you are looking for it. Instead of "rhythm", I like to think of it as "sequences of action" in a single film take. When a performer isn't good or the director is trying to save money on takes, "sequences of action" will very rarely be done in a single take. In other words, does only a single punch or kick happen in a take before they cut away to a different angle? Or do multiple kicks/punches occur within a single take as a sequence in a choreographed fight?

Let's start by looking at a bad fight scene. This scene is from The Medallion, a pretty terrible Jackie Chan film, and features Claire Forlani in a pretty awful fight.

Very rarely is more than a single punch or kick ever done in a take. One punch or kick is done, then the director cuts to a different angle. Only at 41 seconds into the video do they even bother to sequence about 3 kicks together in a take, and at about 46 seconds there are two kicks in the take.

In contrast, let's take a look at a fight that is perhaps a gold standard of excellence, the Michelle Yeoh vs Zhang Ziyi fight in Crouching Tiger Hidden Dragon.

This fight is wonderful. Throughout the fight you see sequences of multiple sword thrusts & blocks in a single take. The overhead sequence at 1:02 in the video is particularly wonderful. I counted about 15 actions (attacks/parries/blocks) sequenced together in a single take as the actresses move across the floor.

It's interesting to look at Jackie Chan fights that were directed in Hong Kong vs. America. In the first fight scene in Rush Hour, we get this very meh fight sequence.

Again, you can see that most of the takes contain a single punch or action. Only at about 1:05 do they bother to sequence about 2 attacks in a single shot and a few multiple actions in a shot around 1:20.

In contrast, I think of this incredible fight from First Strike

There are many sequences of 3-4 actions in a single take which give the fight a much better rhythm. The chair sequence at 1:06 is particularly wonderful. Does Jackie Chan really need to jump over a chair, duck a chair thrown, and catch one in a single take? No. But it adds something special to the scene.

Likewise with the ladder sequence. You could probably forgive Jackie Chan for only doing a single attack or block with a ladder given its hefty weight. However, at 3:38 he actually launches an attack with the ladder opening up and bothers to block two further attacks in a single take. That's the kind of thing that makes these fights far more special.

While looking at martial arts fights on YouTube, I thought I'd bring up one particularly awesome scene. It's the elevator fight scene from Ip Man 3.

Once they exit the elevator, there are sequences of 5+ punches/blocks/kicks in a single take. The overhead shot at 1:54 is particularly amazing. I count 15 actions that take place as the actors move down the stairs, around the hallway, and end with one actor getting kicked down the next batch of stairs. All in a single take.

Finally, just to show that it doesn't take trained martial artists in Hong Kong to do good fights, let's take a look at this Neo vs. Agent Smith fight in the Matrix.

To my knowledge, neither actor is actually trained in martial arts. But with some good choreography and good editing and willingness from the director, you can sequence many punches/blocks into a single take and get a much better fight scene out of it.

Friday, April 1, 2016

Beginners Guide to Hbase with Python & Thrift

I recently wanted to play around with Hbase and Python, which subsequently led me to use Thrift.

I know there are tons of guides on the web, but a number I found were outdated or based on downloaded versions of things instead of packaged distro versions. I eventually had to piece together information from several sources. So I thought I'd put it all together on this page for anyone looking for simple "cut and paste" instructions to begin with. I'm not going to go through the basics of Hbase and Thrift, as there are many guides out there, but I'll give updated instructions based on the following versions I used.

Hortonworks 2.2.6.0-2800
Hbase 0.98.4
Redhat 6.7

As an aside, there may be newer interfaces (such as happybase) that are now more popular. I may look into those later, but these are just my notes on this particular subject.

As another aside, my Hbase has already been populated with data, so there's no need to create/insert data, so I'm skipping that.

The two primary Hbase + Python + Thrift sources I used for this were:

Using Facebook’s Thrift with Python and HBase (posted July 2008)

and

How-to: Use the HBase Thrift Interface, Part 1 and Part 2 and Part 3 (posted September 2013)

So let's start.

First up, I downloaded thrift 0.9.3 and did the normal configure and make, but this didn't compile for me.

src/thrift/qt/moc_TQTcpServer.cpp:14:2: error: #error "This file was generated using the moc from 4.8.1. It"
src/thrift/qt/moc_TQTcpServer.cpp:15:2: error: #error "cannot be used with the include files from this version of Qt."
src/thrift/qt/moc_TQTcpServer.cpp:16:2: error: #error "(The moc has changed too much.)"

This was also the case with thrift 0.9.2, 0.9.1, and 0.9.0.

I went to thrift 0.8.0 and hit other build errors. These were maybe solvable, but being lazy I just downloaded 0.7.0 to try it, and it compiled fine. So I ended up using thrift 0.7.0.

Since this compiled, I need to install it somewhere. I'm going to install into a non-privileged directory, so set all of these prefixes appropriately when you configure and make install. If you're root and you can install anywhere, you can probably ignore all of this. Adjust appropriately if you use bash instead of tcsh.

setenv PY_PREFIX /yourprefix/thriftpy/
setenv JAVA_PREFIX /yourprefix/thriftpy/
setenv RUBY_PREFIX /yourprefix/thriftpy/
setenv PHP_PREFIX /yourprefix/thriftpy/
setenv PHP_CONFIG_PREFIX /yourprefix/thriftpy/
setenv PERL_PREFIX /yourprefix/thriftpy/
./configure --prefix=/yourprefix/thriftpy --exec-prefix=/yourprefix/thriftpy
make install

After that, you should hopefully have thrift installed into your appropriate path and /yourprefix/thriftpy/bin/thrift should be available to run.

Now run thrift to generate Python files for Hbase

> /yourprefix/thriftpy/bin/thrift --gen py /usr/hdp/current/hbase-client/include/thrift/hbase1.thrift

Now if you're wondering why hbase1.thrift instead of hbase2.thrift, it's because there is now a newer thrift interface available compared to the original. I use hbase1.thrift just to begin with.

Now there should be a "gen-py" sub-directory where you ran this.

Now we need to start the thrift server. With HDP you can do:

/usr/bin/hbase thrift start

I started this in another window so I can control+C it later on.

Ok, onto code. I started with code sort of from both the sites above and put together:

#!/usr/bin/env python
import sys

sys.path.append('../gen-py/')
sys.path.append('../thriftpy/lib64/python2.6/site-packages/')

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

from hbase import Hbase

# Make socket
transport = TSocket.TSocket('localhost', 9090)

# Buffering is critical. Raw sockets are very slow
transport = TTransport.TBufferedTransport(transport)

# Wrap in a protocol
protocol = TBinaryProtocol.TBinaryProtocol(transport)

client = Hbase.Client(protocol)

transport.open()

tablenames = client.getTableNames()

print "table names are: " + ",".join(tablenames)

You'll notice a few sys.path.append calls to add paths to my local dirs with libraries in them. You may need to adjust accordingly for your installed things.

Hopefully you can just run this without problems (unless you've never created any tables in Hbase, in which case you should create some first).
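One thing the script above never does is close the transport. A small sketch of one way to handle that is a context manager that opens the transport and always closes it, even if a call fails partway through. The helper name `opened` is made up for this example; it only relies on the transport having open() and close() methods, which the Thrift transports used above provide.

```python
from contextlib import contextmanager


@contextmanager
def opened(transport):
    """Open a Thrift transport and always close it afterwards."""
    transport.open()
    try:
        yield transport
    finally:
        # Runs on normal exit and when the body raises.
        transport.close()


# Intended use with the objects built in the script above:
#
#     with opened(transport):
#         tablenames = client.getTableNames()
#         print("table names are: " + ",".join(tablenames))
```

This would replace the bare transport.open() call, so the socket is released when the block ends.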

Ok, now to read some data. Again, pasting together some code from the above resources, this will get a single row of data.

rows = client.getRow('mytable', 'myrowkey')
 
for row in rows:
    message = row.columns.get('mycolumnfamily:mycolumnname').value
    print "message = " + message
    rowKey = row.row
    print "rowkey = " + rowKey

This didn't work for me and I got an error of

TypeError: getRow() takes exactly 4 arguments (3 given)

Hmmm. Now, if you look in the file hbase1.thrift from before, you can see what the interface for getRow looks like.

  /**
   * Get all the data for the specified table and row at the latest
   * timestamp. Returns an empty list if the row does not exist.
   *
   * @return TRowResult containing the row and map of columns to TCells
   */
  list<TRowResult> getRow(
    /** name of table */
    1:Text tableName,

    /** row key */
    2:Text row,

    /** Get attributes */
    3:map<Text, Text> attributes
  ) throws (1:IOError io)

Hmmm, it seems there is a new attributes argument. I couldn't figure out what this did by searching online. I didn't want to dig into the code too much at this point so I just passed in None to the getRow call like so:

rows = client.getRow('mytable', 'myrowkey', None)
 
for row in rows:
    message = row.columns.get('mycolumnfamily:mycolumnname').value
    print "message = " + message
    rowKey = row.row
    print "rowkey = " + rowKey

And lucky for me it worked.
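A caveat with the loop above: row.columns.get(...) returns None when the column is missing from a row, so the .value access would then fail with an AttributeError. Here is a minimal sketch of two helpers that guard against that and flatten a row into a plain dict. The helper names are made up, and the namedtuples are only stand-ins shaped like the result objects used above (a row with .row and .columns, a cell with .value) so the example runs without an Hbase server.

```python
from collections import namedtuple

# Stand-ins for the generated Thrift classes, for demonstration only.
Cell = namedtuple("Cell", ["value", "timestamp"])
Row = namedtuple("Row", ["row", "columns"])


def cell_value(row, column, default=None):
    """Return the value stored under `column`, or `default` if absent."""
    cell = row.columns.get(column)
    return default if cell is None else cell.value


def row_to_dict(row):
    """Flatten one row result into {column name: value}."""
    return dict((name, cell.value) for name, cell in row.columns.items())


example = Row(row="myrowkey",
              columns={"mycolumnfamily:mycolumnname": Cell("hello", 0)})

print("message = " + cell_value(example, "mycolumnfamily:mycolumnname"))
print("missing = " + str(cell_value(example, "mycolumnfamily:nope")))
print(row_to_dict(example))
```

With a real client, the same helpers would be applied to each row returned by client.getRow('mytable', 'myrowkey', None).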

Now getting one row of data is boring, we actually want to do scans. So I started with the following.

scan = Hbase.TScan(startRow="someStartPrefix", stopRow="someStopPrefix")
scannerId = client.scannerOpenWithScan(desiredtable, scan)

rowList = client.scannerGetList(scannerId, 5)

while rowList:
    for row in rowList:
        mydata = row.columns.get("mycolumnfamily:mycolumnname").value
        rowKey = row.row
        print "rowKey = " + rowKey + ", mydata = " + mydata
    rowList = client.scannerGetList(scannerId, 5)

Again, I hit

TypeError: scannerOpenWithScan() takes exactly 4 arguments (3 given)

Just like getRow, there is a similar attributes argument I don't know what to do with. So I add a None argument to scannerOpenWithScan like so.

scannerId = client.scannerOpenWithScan(desiredtable, scan, None)

And this works and I get results.
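The scan loop can also be wrapped in a generator so the batching is hidden and the scanner is released when iteration ends. This is a sketch rather than a definitive version: `scan_rows` is a made-up name, it passes None for the attributes argument just as above, and it assumes the generated client exposes the scannerClose call declared in hbase1.thrift.

```python
def scan_rows(client, table, scan, batch_size=5):
    """Yield rows from a scan one at a time, fetching them in batches."""
    scanner_id = client.scannerOpenWithScan(table, scan, None)
    try:
        while True:
            rows = client.scannerGetList(scanner_id, batch_size)
            if not rows:
                break  # an empty batch means the scan is exhausted
            for row in rows:
                yield row
    finally:
        # Release the server-side scanner when the generator finishes
        # or is closed early.
        client.scannerClose(scanner_id)


# Intended use with the client and TScan objects from above:
#
#     for row in scan_rows(client, desiredtable, scan):
#         print("rowKey = " + row.row)
```

A larger batch_size means fewer round trips to the Thrift server at the cost of more rows held in memory at once.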

Now, getting data from Hbase with start & stop rows is boring. It's far more interesting to do filters. How can we pass filters in Python? Looking at hbase1.thrift again, I can see what arguments TScan can take.

/**
 * A Scan object is used to specify scanner parameters when opening a scanner.
 */
struct TScan {
  1:optional Text startRow,
  2:optional Text stopRow,
  3:optional i64 timestamp,
  4:optional list<Text> columns,
  5:optional i32 caching,
  6:optional Text filterString,
  7:optional i32 batchSize,
  8:optional bool sortColumns
}

Hmmm, this filterString argument looks interesting. But what to fill it with? After some Googling, I figured out it can be filled with functions you can find in the Hbase thrift documentation.

So here's some examples.

scan = Hbase.TScan(filterString="RowFilter(>=, 'binary:FOO')")

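This returns the same rows as the following, although startRow lets the scan begin at that key rather than testing every row against a filter: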

scan = Hbase.TScan(startRow="FOO")

You can AND/OR things together. So for example:

scan = Hbase.TScan(filterString="(RowFilter(>=, 'binary:STARTPREFIX') AND RowFilter(<=, 'binary:ENDPREFIX')) AND (RowFilter(=, 'substring:FOO') OR RowFilter(=, 'substring:BAR'))")

Would find rows with the substring FOO or BAR within a range of STARTPREFIX and ENDPREFIX.
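Long filter strings are easy to get wrong by hand (a missing quote or parenthesis is all it takes), so here is a small sketch that builds them from pieces. The helper names `row_filter`, `all_of`, and `any_of` are invented for this example and are not part of Hbase or Thrift; they only assemble text in the RowFilter syntax shown above. The sketch does not escape quote characters inside values, so it assumes plain prefixes like the ones used here.

```python
def row_filter(op, comparator, value):
    """Build one RowFilter clause, e.g. RowFilter(>=, 'binary:FOO')."""
    return "RowFilter(%s, '%s:%s')" % (op, comparator, value)


def all_of(*filters):
    """Join clauses with AND inside one pair of parentheses."""
    return "(" + " AND ".join(filters) + ")"


def any_of(*filters):
    """Join clauses with OR inside one pair of parentheses."""
    return "(" + " OR ".join(filters) + ")"


filter_string = all_of(
    all_of(row_filter(">=", "binary", "STARTPREFIX"),
           row_filter("<=", "binary", "ENDPREFIX")),
    any_of(row_filter("=", "substring", "FOO"),
           row_filter("=", "substring", "BAR"))
)

print(filter_string)
```

The result is the same expression as the hand-written example with one extra pair of outer parentheses, and it would be passed along as Hbase.TScan(filterString=filter_string).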

Well, that's as far as I've gotten. There were a few gotcha points, so I hope that this helps somebody out there.
