Friday, April 1, 2016

Beginner's Guide to Hbase with Python & Thrift

I recently wanted to play around with Hbase and Python, which subsequently led me to use Thrift.

I know there are tons of guides on the web, but a number of the ones I found were outdated or based on downloaded versions of things instead of packaged distro versions.  I eventually had to piece together information from several sources, so I thought I'd put it all together on this page for anyone looking for simple "cut and paste" instructions to begin with.  I'm not going to go through the basics of Hbase and Thrift, as there are many guides out there, but I'll give updated instructions based on the following versions I used.

Hortonworks 2.2.6.0-2800
Hbase 0.98.4
Redhat 6.7

As an aside, there may be newer interfaces (such as happybase) that are now more popular.  I may look into those later, but these are just my notes on this particular subject.

As another aside, my Hbase has already been populated with data, so I'm skipping the steps to create tables and insert data.

The two primary Hbase + Python + Thrift sources I used for this were:

Using Facebook’s Thrift with Python and HBase (posted July 2008)

and

How-to: Use the HBase Thrift Interface, Part 1 and Part 2 and Part 3 (posted September 2013)

So let's start.

First up, I downloaded thrift 0.9.3 and did the normal configure and make, but this didn't compile for me.

src/thrift/qt/moc_TQTcpServer.cpp:14:2: error: #error "This file was generated using the moc from 4.8.1. It"
src/thrift/qt/moc_TQTcpServer.cpp:15:2: error: #error "cannot be used with the include files from this version of Qt."
src/thrift/qt/moc_TQTcpServer.cpp:16:2: error: #error "(The moc has changed too much.)"

This was also the case with thrift 0.9.2, 0.9.1, and 0.9.0.

I went to thrift 0.8.0 and hit other build errors.  These were maybe solvable, but being lazy I just downloaded 0.7.0 to try it, and it compiled fine.  So I ended up using thrift 0.7.0.

Since this compiled, I need to install it somewhere.  I'm going to install into a non-privileged directory, so set all of these prefixes appropriately when you configure and make install.  If you're root and you can install anywhere, you can probably ignore all of this.  Adjust appropriately if you use bash instead of tcsh.

setenv PY_PREFIX /yourprefix/thriftpy/
setenv JAVA_PREFIX /yourprefix/thriftpy/
setenv RUBY_PREFIX /yourprefix/thriftpy/
setenv PHP_PREFIX /yourprefix/thriftpy/
setenv PHP_CONFIG_PREFIX /yourprefix/thriftpy/
setenv PERL_PREFIX /yourprefix/thriftpy/
./configure --prefix=/yourprefix/thriftpy --exec-prefix=/yourprefix/thriftpy
make install


After that, you should hopefully have thrift installed into your appropriate path and /yourprefix/thriftpy/bin/thrift should be available to run.

Now run thrift to generate Python files for Hbase

> /yourprefix/thriftpy/bin/thrift --gen py /usr/hdp/current/hbase-client/include/thrift/hbase1.thrift

If you're wondering why hbase1.thrift instead of hbase2.thrift: hbase2.thrift is a newer thrift interface, and hbase1.thrift is the original one.  I use hbase1.thrift just to begin with.


Now there should be a "gen-py" sub-directory where you ran this.

Now we need to start the thrift server.  With HDP you can do:

/usr/bin/hbase thrift start

I started this in another window so I can Control+C it later on.

Ok, onto code.  I started with code sort of from both the sites above and put together:

#!/usr/bin/env python
import sys

sys.path.append('../gen-py/')
sys.path.append('../thriftpy/lib64/python2.6/site-packages/')

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

from hbase import Hbase

# Make socket
transport = TSocket.TSocket('localhost', 9090)

# Buffering is critical. Raw sockets are very slow
transport = TTransport.TBufferedTransport(transport)

# Wrap in a protocol
protocol = TBinaryProtocol.TBinaryProtocol(transport)

client = Hbase.Client(protocol)

transport.open()

tablenames = client.getTableNames()

print "table names are: " + ",".join(tablenames) 

You'll notice a few sys.path.append calls to add paths to my local dirs with libraries in them.  You may need to adjust accordingly for your installed things.
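As a minimal sketch (not taken from either of the guides above), the path setup can be made independent of the directory you run the script from. The helper name add_library_paths is hypothetical, and the directory names are just the ones used in this post.

```python
import os
import sys


def add_library_paths(base_dir, relative_paths):
    """Append each existing directory under base_dir to sys.path.

    Returns the list of absolute paths that were actually added.
    """
    added = []
    for rel in relative_paths:
        path = os.path.abspath(os.path.join(base_dir, rel))
        # Skip directories that do not exist or are already on sys.path,
        # so the caller can compare the result with what it expected.
        if os.path.isdir(path) and path not in sys.path:
            sys.path.append(path)
            added.append(path)
    return added


# Possible usage inside the script (paths are the ones from this post):
#   here = os.path.dirname(os.path.abspath(__file__))
#   add_library_paths(here, ['../gen-py',
#                            '../thriftpy/lib64/python2.6/site-packages'])
```

If the returned list is shorter than the list you passed in, one of the directories is missing or misspelled, which is easier to debug than a later ImportError.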

Hopefully, you should just be able to run this and there will be no problems (unless you've never created any tables in Hbase, in which case you should create some first).
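The script above opens the transport but never closes it. One possible way to tidy that up is a small context manager; this is only a sketch, the name opened is hypothetical, and it assumes nothing beyond the open() and close() methods that the transport object already provides.

```python
from contextlib import contextmanager


@contextmanager
def opened(transport):
    """Open a Thrift-style transport and always close it afterwards."""
    transport.open()
    try:
        yield transport
    finally:
        # Runs on normal exit and on exceptions alike.
        transport.close()


# Possible usage with the transport and client built above:
#   with opened(transport):
#       print("table names are: " + ",".join(client.getTableNames()))
```

The print call with a single parenthesized argument behaves the same under Python 2 and Python 3.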

Ok, now to read some data.  Again, pasting together some code from the above resources, this will get a single row of data.

rows = client.getRow('mytable', 'myrowkey')
 
for row in rows:
    message = row.columns.get('mycolumnfamily:mycolumnname').value
    print "message = " + message
    rowKey = row.row
    print "rowkey = " + rowKey


This didn't work for me and I got an error of

TypeError: getRow() takes exactly 4 arguments (3 given)

Hmmm.  Now, if you look in the file hbase1.thrift from before, you can see what the interface for getRow looks like.

  /**
   * Get all the data for the specified table and row at the latest
   * timestamp. Returns an empty list if the row does not exist.
   *
   * @return TRowResult containing the row and map of columns to TCells
   */
  list<TRowResult> getRow(
    /** name of table */
    1:Text tableName,

    /** row key */
    2:Text row,

    /** Get attributes */
    3:map<Text, Text> attributes
  ) throws (1:IOError io)


Hmmm, it seems there is a new attributes argument.  I couldn't figure out what this did by searching online.  I didn't want to dig into the code too much at this point so I just passed in None to the getRow call like so:

rows = client.getRow('mytable', 'myrowkey', None)
 
for row in rows:
    message = row.columns.get('mycolumnfamily:mycolumnname').value
    print "message = " + message
    rowKey = row.row
    print "rowkey = " + rowKey


And lucky for me it worked.
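One gotcha with the loop above: row.columns.get(...) returns None when the column is not present in the row, so the .value access raises an AttributeError. The following is a small defensive sketch; the helper names cell_value and first_row are hypothetical, and it assumes only the row.columns dictionary and .value attribute already used above.

```python
def cell_value(row, column, default=None):
    """Return the value of one cell from a TRowResult-like object.

    row.columns is expected to be a dict mapping 'family:qualifier'
    to an object with a .value attribute, as in the loop above.
    """
    cell = row.columns.get(column)
    if cell is None:
        # Column not present in this row.
        return default
    return cell.value


def first_row(rows):
    """Return the first row of a getRow() result, or None if it is empty."""
    return rows[0] if rows else None


# Possible usage with the client from above:
#   row = first_row(client.getRow('mytable', 'myrowkey', None))
#   if row is not None:
#       print("message = " + cell_value(row, 'mycolumnfamily:mycolumnname', ''))
```

Since getRow returns an empty list when the row does not exist, first_row gives a single place to handle that case.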

Now, getting one row of data is boring; we actually want to do scans.  So I started with the following.

scan = Hbase.TScan(startRow="someStartPrefix", stopRow="someStopPrefix")
scannerId = client.scannerOpenWithScan(desiredtable, scan)

rowList = client.scannerGetList(scannerId, 5)

while rowList:
    for row in rowList:
        mydata = row.columns.get("mycolumnfamily:mycolumnname").value
        rowKey = row.row
        print "rowKey = " + rowKey + ", mydata = " + mydata
    rowList = client.scannerGetList(scannerId, 5)


Again, I hit

TypeError: scannerOpenWithScan() takes exactly 4 arguments (3 given)

Just like getRow, there is a similar attributes argument I don't know what to do with. So I add a None argument to scannerOpenWithScan like so.

scannerId = client.scannerOpenWithScan(desiredtable, scan, None)

And this works and I get results.
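The scan loop can also be wrapped in a generator that closes the scanner when it finishes, which the loop above never does. This is a sketch rather than the way either guide does it: the function name scan_rows is hypothetical, and it relies on the scannerOpenWithScan, scannerGetList and scannerClose calls from the generated Hbase client.

```python
def scan_rows(client, table, scan, batch_size=5, attributes=None):
    """Yield rows from a scan one at a time, closing the scanner when done."""
    scanner_id = client.scannerOpenWithScan(table, scan, attributes)
    try:
        while True:
            batch = client.scannerGetList(scanner_id, batch_size)
            if not batch:
                # An empty list means the scanner is exhausted.
                break
            for row in batch:
                yield row
    finally:
        # Also runs if the caller stops iterating early and the
        # generator is closed or garbage collected.
        client.scannerClose(scanner_id)


# Possible usage with the client and scan objects from above:
#   for row in scan_rows(client, 'mytable', scan):
#       print("rowKey = " + row.row)
```

Fetching in batches keeps the number of round trips to the thrift server down while the caller still sees one row at a time.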

Now, getting data from Hbase with start & stop rows is boring.  It's far more interesting to do filters.  How can we pass filters in Python?  Looking at hbase1.thrift again, I can see what arguments TScan can take.

/**
 * A Scan object is used to specify scanner parameters when opening a scanner.
 */
struct TScan {
  1:optional Text startRow,
  2:optional Text stopRow,
  3:optional i64 timestamp,
  4:optional list<Text> columns,
  5:optional i32 caching,
  6:optional Text filterString,
  7:optional i32 batchSize,
  8:optional bool sortColumns
}


Hmmm, this filterString argument looks interesting.  But what to fill it with?  After some Googling, I figured out it can be filled with the filter functions you can find in the Hbase thrift documentation.

So here are some examples.

scan = Hbase.TScan(filterString="RowFilter(>=, 'binary:FOO')")

This returns the same rows as

scan = Hbase.TScan(startRow="FOO")

You can AND/OR things together.  So for example:

scan = Hbase.TScan(filterString="(RowFilter(>=, 'binary:STARTPREFIX') AND RowFilter(<=, 'binary:ENDPREFIX')) AND (RowFilter(=, 'substring:FOO') OR RowFilter(=, 'substring:BAR'))")

This would find rows whose keys contain the substring FOO or BAR and fall within the range STARTPREFIX to ENDPREFIX.
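Long filter strings are easy to mistype (a missing quote or parenthesis is enough), so it can help to assemble them from small pieces. The helpers below are hypothetical names for illustration only, and the quote-doubling rule in quote() is my understanding of the filter language rather than something tested here.

```python
def quote(value):
    """Wrap a value in single quotes for the filter language."""
    # Assumption: a literal single quote is escaped by doubling it.
    return "'" + value.replace("'", "''") + "'"


def row_filter(op, comparator, value):
    """Build a string such as RowFilter(>=, 'binary:FOO')."""
    return "RowFilter(%s, %s)" % (op, quote(comparator + ":" + value))


def all_of(*filters):
    """Join filter strings with AND inside one pair of parentheses."""
    return "(" + " AND ".join(filters) + ")"


def any_of(*filters):
    """Join filter strings with OR inside one pair of parentheses."""
    return "(" + " OR ".join(filters) + ")"


# Rebuild the combined example from above.
key_range = all_of(row_filter(">=", "binary", "STARTPREFIX"),
                   row_filter("<=", "binary", "ENDPREFIX"))
substrings = any_of(row_filter("=", "substring", "FOO"),
                    row_filter("=", "substring", "BAR"))
filter_string = key_range + " AND " + substrings
```

The resulting filter_string can then be passed as Hbase.TScan(filterString=filter_string).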

Well, that's as far as I've gotten.  There were a few gotcha points, so I hope that this helps somebody out there.

1 comment:

  1. Amazing post mahn, helped me a lot. Was blinded by the abstraction of happybase, especially the unequivocal power of hbase filters by the thrift api is laudable. Would really appreciate it if you could post a follow up blog of your further advancements. Thanks a ton again.
