Using HBase's Thrift interface with Ruby

In my continued fooling around with various key-value stores I've finally come across HBase. Naturally, since I do my day-to-day programming in Ruby I wanted to setup some basic examples. Though HBase does support a RESTful interface I thought I would get the Thrift interface working for some better throughput. If you need help Thrift running take a look at my post on Cassandra's thrift interface that has all the prerequisites listed. The example assumes a table "t1" and a column "f1".

Using Cassandra's Thrift interface with Ruby

If you've been trying to figure out how to work with Cassandra then you've probably come across Thrift. Thrift is a library written in the spirit of Google's protocol buffers, but developed by Facebook and then open-sourced in 2007. The quick and short of it is that Thrift enables you to create RPC style calls in a platform-independent and XML-free way that is extremely efficient and surprisingly easy to work with once you get all the pieces working. Rich Atkinson already has a great blog post on how to get up and running with Thrift on Snow Leopard. So if that's what you're running, I'm going to suggest you check it out. If you're running Ubuntu you'll need to satisfy the following dependencies:
sudo aptitude -q -y install libexpat1-dev libboost1.37-dev g++ autoconf automake libtool
and the source can be obtained with:
svn co http://svn.apache.org/repos/asf/incubator/thrift/trunk thrift
and then you can proceed with the standard "configure && make && make install". Hopefully at this point you have the Thrift native libraries installed. Since this is about Ruby, you should also install the Thrift gem that will take advantage of the native libraries.
sudo gem install thrift
Armed with both native library and gem, let's go ahead and navigate to your Cassandra install's interface directory (cassandra/interface) and build the ruby code:
thrift --gen rb:new_style cassandra.thrift
This will generate (as of this writing...) three files: gen-rb/cassandra.rb, gen-rb/cassandra_constants.rb, and gen-rb/cassandra_types.rb. At this point you can create a temp.rb file in the gen-rb folder to play around with connections. Here's a short example of how to make a GET request for a specific key: It's worth noting that there *is* a gem available on github from fauna/cassandra that creates a much easier-to-work-with client, but since the interface for Cassandra is still evolving and changing the client is broken at the moment. As far as I know this only applies to Cassandra 0.4.1 DEV and newer. I'm very much looking forward to a working update.

Provisioning script for Ubuntu Intrepid and Ruby 1.9.1

Here's a simple gist I use to provision either Amazon EC2 AMIs or Slicehost images running Ubuntu Intrepid. It'll setup all the requirements to build Ruby 1.9.1 from source since the official Ubuntu package isn't due out until Karmic Koala is released. It has a handful of constants at the top of the file you need to define for everything to work right. Of note are the application name and the machine's FQDN it should answer on. If your using EC2 you might have to tweak some configuration afterward since the FQDN in DNS probably won't match the IP of the machine's interface. Also, if you plan on using authorized_keys and deploying from a git repository it makes things a lot easier if you tar and gzip the relevant files and put them in an S3 bucket to pull from. The script handles this case as well. As always, make sure you understand what a provisioning script does before you accept it with blind faith. At Flatterline we use this as our base template.

Symmetric indices to make JOINs faster

I am frequently asked how to increase the performance of Rails, and here's a great starting point. This advice generalizes to just about any database or platform that relies on B-Tree indices. If your using MySQL out of the box, then this definitely applies. Consider the following three models which are a very basic "has and belongs to many" setup: So as you can see, a user can be in many groups and a group can have many users, all by way of the memberships join relationship. The two example use cases I'm going to work with are:
  1. Given a user, what groups is he/she in?, and
  2. Given a group, who are the members?
Both of these are pretty typical, but can yield surprisingly different results from the database's perspective. If we try and get this data from the console we either start with a user and navigate to group, or the reverse. Here's the MySQL EXPLAIN output from the console ( I recommend viewing the RAW output unless I can figure out how to make github display it correctly): Totally understanding the output of the EXPLAIN syntax is well outside the scope of this post, but we're going to need to cover the basics. The first thing you should notice is the word ALL in the type column and the NULL in possible_keys. This indicates that MySQL's query optimizer has no index to read from and will be forced to perform a table scan to return the result. In general, this will kill your performance. Note the rows value of 100. This value will be whatever the size of your table is. If you have 500,000 records, then the database will check all 500,000 rows. It's worth noting that for small datasets you'll see ALL and a possible_keys value. This means the optimizer believes that scanning the table will be faster than actually loading the index into memory. This is generally fine. So let's go ahead and add a composite index on [user_id,group_id]. The SQL is:
ALTER TABLE memberships ADD INDEX test_index(user_id,group_id)
Now let's repeat the previous queries. This is where I see most people stop when it comes to performance optimization. Note though that these tables aren't the same! If you join from users to groups (the first query) you see a massive speedup. Only one row is consulted (instead of 100) and it's in the index. A further benefit we see in both queries is the "Using index" in the Extra column. This means that MySQL can determine the query result without ever checking the actual table because all required info is in the index (ie no extra disk hits). Unfortunately, joining from groups to users (second query) still (sorta) sucks. It says index instead of ALL, but that just means it will have to scan the entire index rather than scan the entire table on disk. This is a marginal improvement at best, so 50% of our use cases still suck. Here's the explanation: B-Tree indices are unidimensional structures. That means that the interior nodes of the index tree are strongly ordered, and thus cannot be arbitrarily accessed out of order. If that doesn't make any sense, it means that joining from users to groups is not an equivalent operation to joining groups to users because of the ordering of the index elements. So let's cleanup the second use case by adding an additional index, exactly like the first, except the order of the elements is reversed. Here's the SQL:
ALTER TABLE memberships ADD INDEX test_index1(group_id,user_id)
Now we have symmetric indices. Let's run our queries again with our second index in place: Voila! Now it doesn't matter which way we join the tables because we have an index that is correctly ordered based on the directionality of the join. You can even see that the optimizer selects a different index (key) depending on which direction you join the tables, exactly as expected. Also, both queries now only require consulting the exact number of rows necessary and won't involve any further disk hits as both queries can be satisfied with data available entirely within the index.

Converting latitude and longitude to timezones

If you've fought it out with localization(l10n) of timezones then you know it can be a pain in the ass.  Further, suppose your localizing arbitrary information where all you've really been given is an address.  The relevant information isn't necessarily in your system and the user might be in the wrong timezone anyway, so no sense in using that. Here's a quick Ruby means to convert latitude and longitude into a timezone ActiveSupport recognizes.  This snippet only relies on Hpricot and the freely available Geonames API: GMT offsets are a convenient way for moving time data in and out of UTC as well as for not having to deal with arbitrary string names.