Cloud hosting vs colocation

Edit: I'm linking to the discussion on HN that has another 90 comments or so.  More discussion.

I understand that it might sound heretical to claim that traditional colocation is still the more cost-effective approach to hosting, but let's look at the numbers.

 

tl;dr

Cloud computing isn't that "cheap" unless you're talking about the tiny costs an early, early stage startup might face. Once you know you will require dedicated infrastructure, it's very likely worth it to get your own hardware.  You're employing technical people anyway; have some of them manage your infrastructure, since you'd have needed people managing your cloud hosting regardless.  Being able to throw stuff away is nice, but so is having authoritative control over your business operations and in-house accountability.

The actual numbers

Hypothetically, let's look at a scenario where we need 4 web servers, 50 app servers, and 2 reasonably high powered databases. For all intents and purposes, this is a small data center and the kind of configuration that cloud companies like Amazon claim they are designed to handle. It also mimics several of my clients' infrastructures, which are built on Rails, so it feels particularly apropos.

Next, let's suppose that each of these servers requires 8 cores to keep up with the workload, so altogether we have 56 servers at 8 cores apiece.  For the databases, let's assume those are also 8-core machines but will require 68GB of RAM apiece plus about 10TB of storage.  I think it would make more sense to add even more RAM, but 68GB is the largest amount you can get from Amazon right now on their Instance Types list.

We're going to talk about the total cost of ownership across a year for both options.  I am also going to make the assumption that the duty cycle of activity is fairly uniform.

Note: If you're going to disagree with one of my assumptions, it's this last one.  I am perfectly aware that a uniform duty cycle is unheard of for web applications, because it amounts to saying our traffic load is uniform across the day. I am, however, making this assumption because of all the companies I talk to that cite the ability to dynamically decommission hardware, none of them actually do it.  There seems to be a massive perceived risk around starting up and shutting down instances on the fly and whether the customer experience will suffer.  If you are currently cloud hosted and you decommission hardware daily then congratulations: anecdotally, you are extremely rare.

How many cores do we need?

54 * 8 
= 432 cores

While it's totally overkill (and awesome), I put together a linear optimization model using glpk.  If we optimize on server count, the fewest instances on Amazon that will meet this requirement are 54 High-CPU Extra Large instances at about $0.68 per hour.  Further, let's assume 1 month is 720 hours.  So for one year:

54 HCPUEL * 720hr * 12m * $0.68 
= $317,260.80

That covers our application servers. For our databases we're going to need high-memory instances.  The only one that offers 68GB of RAM is a "High-Memory Quadruple Extra Large Instance", which runs $2.00/hour.

2 HMQEL * 720hr * 12m * 2 
= $34,560

Now, these 2 databases are going to need 10TB of storage apiece, so let's add that in as provisioned EBS storage.  Checking the price sheet again: $0.10 per GB-month of provisioned storage.

$0.10 * 1024GB * 10TB * 2 servers * 12months 
= $24,576

 

We are currently up to $376,396.80 and we haven't talked about bandwidth or management.  Since we have a uniform duty cycle (I can hear you screaming), let's suppose that each of our 54 web and app servers pushes 1Mbps worth of traffic.  Not a ton, but certainly not trivial either.  At 1Mbps, each machine transfers a megabyte every 8 seconds.  How much is that over a year?

(1Mbps * 54 servers * (60sec * 60min * 24hr * 30days)) / (8 * 1024) 
= 17,085.9375 GB/month

or

17,085.9375GB * $0.15/GB * 12m
= $30,754.69

There's one last variable to account for, and that's EBS I/O activity.  You will be billed for every million I/O requests you make to an EBS volume.  With our uniform duty cycle, let's say we're looking at 300 reads and 150 writes per second, so 450 IOPS (once again, quite low).

(450IOPs * (60sec*60min*24hr*30days*12months) * 2 DB servers) * $0.10 / 1,000,000 
= $2,799.36
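
Pulling those pieces together, here's a quick Ruby sanity check of the EC2 side, using the on-demand rates quoted above (treat it as a back-of-the-envelope sketch, not a billing calculator):

# Back-of-the-envelope EC2 annual cost, using the on-demand rates quoted above.
HOURS_PER_MONTH = 720
MONTHS          = 12

app_servers = 54 * HOURS_PER_MONTH * MONTHS * 0.68    # High-CPU Extra Large
db_servers  = 2  * HOURS_PER_MONTH * MONTHS * 2.00    # High-Memory Quadruple XL
ebs_storage = 0.10 * 1024 * 10 * 2 * MONTHS           # $0.10/GB-month, 10TB x 2

seconds_per_month = 60 * 60 * 24 * 30
gb_per_month      = (1.0 * 54 * seconds_per_month) / (8 * 1024)   # 1Mbps per server
bandwidth         = gb_per_month * 0.15 * MONTHS

iops = (450 * seconds_per_month * MONTHS * 2) * 0.10 / 1_000_000  # EBS I/O requests

total = app_servers + db_servers + ebs_storage + bandwidth + iops
puts "EC2 yearly total: $%.2f" % total    # => roughly $409,950.85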

Thus, our total cost for the year:

$409,950.85

What would it cost to colocate these servers ourselves?  Let's look at some similar numbers.  The boys (and girls) over at Dell offer a reasonably priced 8 core server for about $1901.  Take a look for yourself over at Dell. I'm looking at the Dell R515 w/ 8 cores, 16GB of RAM, dual gigabit NICs, and redundant power.

Initial cost of servers:

54 servers * $1901 
= $102,654

Let's add in 10% of the cost of servers in additional networking hardware just in case our colocation option doesn't offer any beyond an uplink port.

$10,265.40

Since the R515 is a 2U server, we're going to need 108U worth of rack space.  A typical rack has 40U of space, but since we want headroom for power and our network hardware, let's say we need 4 racks.  Further, let's say we want to host this hardware in a Tier IV data center.  That's going to run about $1000/month per rack.

12m * $1000 * 4racks 
= $48,000

Depending on the data center they may give you a certain number of hours per month of "remote hands" where you can have techs handling your hardware, or you can colocate in your city and pay employees to interact with your hardware.  If touching hardware is that abhorrent to you, you probably wouldn't have read this far anyway.

Let's talk bandwidth fees.  Data centers don't usually bill like Amazon.  Amazon charges per gigabyte, while most DCs charge at what's called 95th percentile usage, or burstable billing.  So if 95% of the time you're only using 54Mbps, then you're billed for a 54Mbps link.  Let's err on the high side and get some premium, guaranteed bandwidth at $150 per Mbps (quite high).

54Mbps * $150 * 12m
= $97,200
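
If you've never dealt with burstable billing, the mechanics fit in a few lines of Ruby. The samples below are made up; real providers typically poll the link every 5 minutes and throw away the top 5% of samples before billing:

# 95th percentile (burstable) billing: sample the link (commonly every 5 minutes),
# sort the samples, drop the top 5%, and bill on the highest remaining sample.
def ninety_fifth_percentile(samples_in_mbps)
  sorted = samples_in_mbps.sort
  index  = (0.95 * sorted.length).ceil - 1
  sorted[index]
end

# A hypothetical month of 5-minute samples: mostly 40-59Mbps, with a hundred big spikes.
samples = Array.new(8640) { 40 + rand(20) } + Array.new(100) { 200 + rand(100) }
billable_mbps = ninety_fifth_percentile(samples)
puts "Billed at #{billable_mbps}Mbps * $150 = $#{billable_mbps * 150} per month"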

Lastly, let's choose some really sweet databases.  We can pump 64GB of RAM into an R515, so let's use that for comparison, bringing the cost to $3,831, and we'll add 10TB of high-performance direct-attached storage per server.  I customized a Dell PowerVault MD1220 with 10TB of SAS drives and a hardware buffered RAID controller for $12,200 apiece.

$3831 * 2servers + $12,200 * 2 servers 
= $32,062

Our servers, bandwidth, databases, and storage come out to a grand total of:

$290,181.40
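
And the same sanity check for the colocation side, using the Dell and data center prices above:

# Year-one colocation costs from the figures above.
servers     = 54 * 1901
networking  = servers * 0.10
rack_space  = 4 * 1000 * 12
bandwidth   = 54 * 150 * 12
databases   = (3831 + 12_200) * 2

colo_total = servers + networking + rack_space + bandwidth + databases
puts "Colocation year one: $%.2f" % colo_total                  # => $290,181.40
puts "Difference vs EC2:   $%.2f" % (409_950.85 - colo_total)   # => $119,769.45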

That's $119,769.45 less than the cost of the equivalent Amazon service.  So let's close out with a quick commentary on how to compare these numbers.

 

  1. Note that these numbers are for the first year of operation.  If you don't change anything in year 2 you will owe Amazon an additional $409,950 because it's effectively a subscription.  If you bought the hardware, it's yours; you'll just have the cost of colocation and bandwidth: $145,200. Have your accountant look into depreciating a capital asset and that number may come down further with tax advantages.
  2. I tried to achieve a certain level of "spec parity" between the two, but that isn't really fair.  The hardware we're getting from Dell won't be virtualized (unless we add it), shared, or completely abstracted away.  If you've ever had to deal with EBS volumes then you know the performance is questionable at best and you have zero visibility into it.  We have direct attached storage and no contention with other customers.  The same applies to our internal network bandwidth: only our machines are linked on private switches.
  3. The advantage of cloud computing is that you can just "throw away" instances that are running poorly or misbehaving.  With your own gear you will have to manage and periodically fix broken hardware.  Some of the hardware, like the PowerVault storage, includes a 3-year 24x7x365 support contract with a 4-hour maximum response time.  It's an opt-in for the general servers but isn't ridiculously expensive.  A glance at the AWS support options suggests we'd need Platinum support for something comparable.  That runs the greater of $15K a month or 10% of monthly usage, which works out to $15,000 a month here.
  4. Never under-estimate the power of taking on responsibility for your entire stack down to the hardware. Yes, you will have more responsibility and you will need to employ people.  You had to employ people with EC2 so you're just having the same people work on your own hardware.  Need more bandwidth? Add switches.  Need intrusion detection? Add it, you have control of the network.  Want to share IPs for high-availability configurations? Add it, the switches are yours.  Want to be PCI-compliant? Not a problem, your cabinets are yours and you can prove physical access restriction easily.
  5. You want to experiment with ideas and write code that can interface with your physical infrastructure? Then look at Amazon Virtual Private Cloud.  Add cloud instances and link them in to your network.  Throw them away when you're done and move them to dedicated hardware if they're doing something you plan on doing for the long-term.
  6. 37Signals manages their own hardware.  "We were able to cut response times to about 1/3 of their previous levels even when handling over 20% more requests per minute." If it's cool enough for them, then it's cool enough for you: http://37signals.com/svn/posts/1819-basecamp-now-with-more-vroom
  7. What if we say we only keep 25 nodes up full-time and make 29 of them half-time to reflect a more likely duty cycle?
    25 HCPUEL * 720hr * 12m * $0.68 + 29 HCPUEL * 360hr * 12m * $0.68 
    = $232,070.40
    A savings of $85,190.40, but that still leaves a yearly total of $324,760.45.
  8. Yes, you will have to build out your own data center presence and that will take some time; most colocation facilities will be more than happy to help you with this.  It will also take more effort than just clicking a button, but the result is most likely worth it.  Look at most companies that maintain tyrannical control over their products and integration and tell me they aren't generally perceived as market leaders.
  9. What about Reserved Instances? The pricing options will help, but you won't come out ahead of colocation.  Since the example above used High-CPU Extra Larges, the reserved instance costs $1,820 up front plus a "reduced" hourly rate, which is only about $80 less than purchasing a comparable server outright.  Plus, at the end of the term you'll have to make the investment again.

 

An argument against hashing credit cards

I'm still editing this, but thought I'd at least open it up to read in the meantime.

Also, special thanks for helping me review this one to:

Dave Hoover @redsquirrel

Blake Smith @blacksmith

Jason Casillas @RAGEBARRAGE

The setting

Periodically I'm asked whether it's considered safe to store hashes of credit cards, usually for a secondary form of user authentication or for fraud detection to see if a card is being used in different places by different users. In general, I am strongly against storing hashes of credit cards, because combinatorially it isn't that much safer than storing the number outright against a clever attacker.

The combinatorics

When it comes to security, or more specifically cryptanalysis, your system is only as safe as its narrowest/smallest brute-forceable surface.  Consider an arbitrary, ideal hash algorithm h that generates 160 bits of output. If you had the hashed output c and the function h, you would need to find an input p such that h(p) = c and you'd have all the secrets. The space of all possible inputs is infinite, but the output space isn't, so where's the upper bound on the search? It's 2^160 (the size of the hash: N bits can represent 2^N unique pieces of information; see Integers in Computer Science and the Pigeonhole principle).

Here we consider the difference between an infinite set and 2^160. Thanks to a great writeup by Jeff Bonwick we know that fully populating a 128 bit set would literally require more energy than it would take to boil the oceans of Earth, so 160 bits should be plenty safe, right?

Let's keep going and move on to look at the structure of a credit card number:

0000-0000-0000-0000

Consider that a typical credit card number is 16 digits long (and only digits, no letters/symbols), so that gives us 10^16 total possibilities.  Written out: 10,000,000,000,000,000. 10 quadrillion numbers.  Not quite at the point where we start calling things intractable, but still large.  Let's try to remove some entropy from this number.

First of all, there's the bank identification number (BIN), which accounts for the first 4-6 digits of the card number. I'm going to X out the digits we may know ahead of time.  In the case of a 4 digit bank number:

XXXX-0000-0000-0000

and a 6 digit:

XXXX-XX00-0000-0000

In the case of a 4 digit BIN we now have just 1,000,000,000,000 (10^12, 1 trillion) possibilities, and a 6 digit BIN leaves us with 10,000,000,000 (10^10, 10 billion) numbers.

So given a stack of 160-bit hashed credit card numbers, how much compute power would it take to reverse one of them out by brute-forcing the input? 10 quadrillion is certainly smaller than 2^160, so brute-forcing the card number is going to be the easier target, and if we select a specific bank it's even easier.  First, let's look at some code:

The CPU implementation

Assuming that we're going after a card with a 6 digit BIN, how long would it take us to try all the possibilities?  We have three tasks:

  1. Generate a number
  2. Determine if it's valid
  3. Calculate the hash of the input and find a collision

The code to do that is here https://github.com/cchandler/cc-hash-probe .
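
The linked project is the real implementation; purely as an illustration of those three steps, here's the same loop sketched in Ruby (SHA-1 is assumed as the hash here, and the BIN and target hash are placeholder values):

require 'digest/sha1'

# Luhn check: a card number is only "valid" if its check digit works out.
def luhn_valid?(digits)
  sum = digits.reverse.each_char.each_with_index.inject(0) do |acc, (ch, i)|
    d = ch.to_i
    d *= 2 if i.odd?
    d -= 9 if d > 9
    acc + d
  end
  sum % 10 == 0
end

target = "0" * 40  # stand-in for the stolen SHA-1 hex digest we're trying to reverse
bin    = "414716"  # the 6-digit BIN we're targeting: only 10 digits left to guess

(0..9_999_999_999).each do |suffix|
  candidate = bin + suffix.to_s.rjust(10, '0')     # step 1: generate a number
  next unless luhn_valid?(candidate)               # step 2: determine if it's valid
  if Digest::SHA1.hexdigest(candidate) == target   # step 3: hash and look for a collision
    puts "Recovered card number: #{candidate}"
    break
  end
end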

Let's see how long it takes my 2.8GHz Intel Core i7 to do the above steps 10,000,000 times:

14sec

So if it takes roughly 15 seconds to do 10 million iterations, scanning every possible number for a card with a 6 digit BIN would take us about 4.16 hours. For a 4 digit BIN we'd have to do about 100 times the work, or about 416 hours. Inconvenient, but not unthinkable, especially considering I have more cores I'm not using.

However, we have other faster/better options...

The GPU

My commodity laptop has an nVidia GeForce GT330M card inside it. I've highlighted the relevant stats:

[GeForce GT330M spec sheet]

That card, which most people aren't even using, has 48 cores, can execute 32 hardware threads at a time (the warp size), and runs at a clock rate of 1.1GHz.  How long does it take my GPU to do the above 3 steps 10M times?

1.9s

Only about 2 seconds to do an equivalent workload.  My GPU code is available here: https://github.com/cchandler/cc-hash-probe/blob/master/gpu.cu

Let's do some math: at 2 seconds per 10M hashes, we'll get through a 6 digit BIN in about 33 minutes, down from 4.16 hours, and a 4 digit BIN in 55.56 hours, down from 416.

But wait! It gets better.  Amazon semi-recently announced the general availability of their GPU-equipped cluster compute instances for high performance computing.  If you haven't had a chance to play with them yet, you'll discover they come equipped with dual nVidia Tesla C2050 (Fermi) cards.  For perspective, these cards have 448 cores apiece.  For each instance you fire up, at $2.30/hr, you get 896 cores @ 1.15GHz, or roughly 20 times the compute power of my laptop (I'm rounding up).

Let's suppose these cards do 20 times the workload of my laptop; then in 2 seconds we'll get through 200,000,000 hashes instead of 10,000,000.  How long will it take us to go through that 6 digit BIN now?  200M hashes every 2 seconds will yield all possible outputs in a little over a minute and a half (1.67 minutes). The 4 digit BIN will be ours in 2.7 hours.
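
Here's the arithmetic behind those estimates if you want to play with the numbers; the hash rates are the rough figures from above, with the Tesla rate being the 20x extrapolation:

# Rough time-to-exhaust-a-BIN estimates from the measured hash rates above.
def hours_to_exhaust(keyspace, hashes_per_batch, seconds_per_batch)
  (keyspace.to_f / hashes_per_batch) * seconds_per_batch / 3600
end

six_digit_bin  = 10**10   # 16 digits minus a 6-digit BIN
four_digit_bin = 10**12   # 16 digits minus a 4-digit BIN

{ "Core i7 (CPU)"       => [10_000_000, 15],
  "GT330M (laptop GPU)" => [10_000_000, 2],
  "Dual Tesla C2050"    => [200_000_000, 2] }.each do |name, (hashes, seconds)|
  printf "%-20s 6-digit BIN: %6.2f hrs   4-digit BIN: %7.2f hrs\n",
         name,
         hours_to_exhaust(six_digit_bin, hashes, seconds),
         hours_to_exhaust(four_digit_bin, hashes, seconds)
end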

A more clever attack

So now we know how fast we can potentially recover all the possible values using our fancy-pants Amazon GPU instances.  A total brute-force attack might still be a bit prohibitive because we don't want to shell out that much money for the compute instances.  

Thanks to institutional banking, it turns out that a massive amount of money is deposited in very few banks.  Here are the top 3:

  • Bank of America
  • JPMorgan Chase
  • Citigroup

What if we only bothered to go after credit cards at these banks?  A quick check at Wikipedia confirms that a list of known BIN numbers is available.

Here are all currently listed Bank of America BINs:

  • 377311 - MBNA Europe Bank (Bank of America) bmi plus Credit Card (UK)
  • 377311 - MBNA Europe Bank (Bank of America) Virgin Atlantic Credit Card (UK)
  • 41177 - Bank of America (US; formerly FleetBoston Financial|Fleet) VISA Debit Card
  • 414716 - Bank of America (US) - Alaska Airlines Signature Visa Credit Card
  • 417008-11 - Bank of America (USA; Formerly Fleet) - Business Visa Card 
  • 421764 to 66 - Bank of America VISA Debit Card
  • 4256 - Bank of America General Motors|GM Visa Check Card
  • 426428 to 29 - Bank of America (formerly MBNA) Platinum Visa Credit Card
  • 426451 to 52, 65 - Bank of America (formerly MBNA) Platinum Visa Credit Card
  • 430536, 44, 46, 50, 94 - Bank of America (formerly Fleet) Visa Credit Card
  • 431301 to 05, 07, 08- Bank of America (formerly MBNA) Preferred Visa & Visa Signature Credit Cards
  • 432624 to 30 - Bank of America (formerly Fleet National Bank) Visa Check Card, Debit
  • 4342 - Bank of America Classic Visa Credit Card
  • 4356 - Bank of America Visa Debit Card
  • 435680 to 90 - Bank of America, Visa, Platinum Check Card, Debit
  • 449533 - Bank of America (USA), National Association - Classic, Debit, Visa
  • 4635 - Bank of America Business Platinum Debit
  • 4744 - Bank of America Visa Debit
  • 474480 - Bank of America Visa Debit, Midwest USA
  • 4888** - Bank of America (US) - Visa Credit Card
  • 5401 - Bank of America (formerly MBNA) MasterCard Gold Credit Card
  • 549035 - MBNA American Bank [Now part of Bank of America]
  • 549099 - MBNA American Bank [Now part of Bank of America]
  • 587781 - Bank of America ATM Card

More generally: 37 six-digit BINs and 7 four-digit BINs. If we use the high-powered compute instances, reversing out all the Bank of America ranges would take:

2.7 hours * 7 cards + 0.02 hours * 37 cards = 19.64 hours

Just 20 hours of compute time to potentially recover all possible BofA numbers. Considering it's $2.30/hr for these instances that's about $46.00.

The takeaway

Don't hash credit card numbers. Or, if you insist on doing it, store/use them in a way that guarantees that even if they are reversed, the would-be attacker can't connect them back to personally identifiable information that would make them usable elsewhere (e.g. no foreign key to the users table, and no created_at/updated_at timestamps that can be correlated with other tables/columns).  GPU availability is only getting better, and manufacturers keep cramming more cores into each card.  Over the next few years you can expect the cost of this kind of computing to drop and the tooling to get easier and easier.  What's "irritating" today at 2.7 hours is going to be trivial in the not-so-distant future.

Code Kata 19 as a MapReduce job for Hadoop

I wanted to do a brief discussion of Dave Thomas's 19th Code Kata. The gist of the problem is that, given a wordlist, can you determine a path from a source word to an end word, changing only one letter at a time, such that every intermediate step is also a valid word in the dictionary. For instance, the supplied example from 'cat' to 'dog' is: cat, cot, cog, dog. The general idea behind these code katas is that they are designed to be well-formed coding exercises to give programmers a chance to stretch and develop new skills.

I've written more graph algorithms that run in a single-process space than I'd care to discuss (here's a simple graph implementation I wrote to learn Google's Go, for instance: Go-lang datastructures), so I wanted to solve this one using a different paradigm of problem solving. For this one I wanted to write it as a MapReduce job. I'm familiar with CouchDB and its built-in MapReduce implementation, but I wanted to go with something that works in a fully-distributed mode as well as serves as practice for Hadoop. I love all my NoSQL options. The code for this discussion, as well as instructions to set up and run it, is available at: codekata19-mapreduce. For the sake of discussion, let's decompose the problem into two subproblems: graph construction and graph traversal.

Problem 1 - Graph construction

Given a list of words, we need to construct a graph such that every vertex is a valid word and every edge represents a valid single-letter transform to the next vertex. To solve this with MapReduce we make the wordlist itself the input data set. The only "global" information that we need to pass around with the actual processing job is the dictionary, so each JVM/mapper in the system knows what constitutes a valid word (this is done using Hadoop's DistributedCache mechanism). The map function will emit a key for every single-letter transform we can determine is a valid word, so the output is essentially the edge set of our graph. The reduce function's only job in this case is to make sure we don't have duplicates and to format the data so that this MapReduce job's output can be used as the input to the next phase. The final results will look something like:

cat   cog,|-1|WHITE|

with an intentional trailing pipe. The code for this MapReduce job is available here: CodeKata19.java.
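
The real job is the Java linked above; just to make the map step concrete, here's the transform-generation logic in a few lines of Ruby, with a toy in-memory Set standing in for the wordlist the real job ships to each mapper via DistributedCache:

require 'set'

# For a given word, return every dictionary word reachable by changing one letter.
def neighbors(word, dictionary)
  ('a'..'z').flat_map do |letter|
    (0...word.length).map do |i|
      candidate = word.dup
      candidate[i] = letter
      candidate
    end
  end.uniq.select { |w| w != word && dictionary.include?(w) }
end

dictionary = Set.new(%w[cat cot cog dog dot bat])
p neighbors('cat', dictionary)   # => ["bat", "cot"]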

Problem 2 - Graph traversal

Graph traversal in this case is a brute-force breadth-first search of the entire graph. Cailin has done an excellent writeup of parallel, distributed breadth-first search (Breadth-first graph search using an iterative map reduce algorithm). The only modification I made to the proposed setup was including a complete path after the last pipe. The algorithm runs until no more GRAY nodes exist in the network, meaning all nodes reachable from the selected start point have been reached. It's worth noting that rather than being the "fastest" implementation, which could easily fit inside a single-process space, this is more of an exercise in distributed algorithms. One of the key advantages is that once it completes we have the single-source shortest path to every reachable destination; the final result contains all shortest paths to all feasible solutions. The code for this MapReduce job is available here: CodeKata19Search.java.
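
To give a flavor of what each iteration does, here's the per-record logic of one BFS pass sketched in Ruby (the actual implementation is the Java linked above, operating on the pipe-delimited records shown earlier):

# One map pass of the iterative BFS: expand every GRAY node, then mark it BLACK.
# Each record mirrors the word<TAB>edges|distance|COLOR|path lines from the
# construction job, parsed here into a plain hash.
def map_node(node)
  emitted = []
  if node[:color] == 'GRAY'
    node[:edges].each do |neighbor|
      emitted << { word: neighbor, edges: [],
                   distance: node[:distance] + 1,
                   color: 'GRAY',
                   path: node[:path] + [node[:word]] }
    end
    node = node.merge(color: 'BLACK')   # fully explored
  end
  emitted << node
end

# The reducer then merges all records for a word: keep the full edge list, the
# smallest non-negative distance, and the darkest color (BLACK > GRAY > WHITE).
# The job is rerun until an iteration emits no GRAY records.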

 

Using HBase's Thrift interface with Ruby

In my continued fooling around with various key-value stores I've finally come across HBase. Naturally, since I do my day-to-day programming in Ruby, I wanted to set up some basic examples. Though HBase does support a RESTful interface, I thought I would get the Thrift interface working for some better throughput. If you need help getting Thrift running, take a look at my post on Cassandra's Thrift interface, which has all the prerequisites listed. The example below assumes a table "t1" and a column family "f1".
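
Here's a rough sketch of the round trip. It assumes the Thrift server is on localhost:9090 and that you've generated the Ruby bindings from Hbase.thrift into gen-rb; the class and method names follow the generated code of that era and may differ slightly with your HBase version:

$LOAD_PATH.unshift 'gen-rb'
require 'thrift'
require 'hbase'   # the service file generated from Hbase.thrift

transport = Thrift::BufferedTransport.new(Thrift::Socket.new('localhost', 9090))
protocol  = Thrift::BinaryProtocol.new(transport)
client    = Apache::Hadoop::Hbase::Thrift::Hbase::Client.new(protocol)
transport.open

# Write a cell into table t1, column family f1
mutation = Apache::Hadoop::Hbase::Thrift::Mutation.new(:column => 'f1:greeting',
                                                       :value  => 'hello from ruby')
client.mutateRow('t1', 'row1', [mutation])

# Read it back; older Thrift definitions return a single TCell here,
# newer ones return a list of TCells
result = client.get('t1', 'row1', 'f1:greeting')
cell   = result.respond_to?(:first) ? result.first : result
puts cell.value

transport.close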

Using Cassandra's Thrift interface with Ruby

If you've been trying to figure out how to work with Cassandra then you've probably come across Thrift. Thrift is a library written in the spirit of Google's protocol buffers, but developed by Facebook and then open-sourced in 2007. The quick and short of it is that Thrift enables you to create RPC style calls in a platform-independent and XML-free way that is extremely efficient and surprisingly easy to work with once you get all the pieces working. Rich Atkinson already has a great blog post on how to get up and running with Thrift on Snow Leopard. So if that's what you're running, I'm going to suggest you check it out. If you're running Ubuntu you'll need to satisfy the following dependencies:
sudo aptitude -q -y install libexpat1-dev libboost1.37-dev g++ autoconf automake libtool
and the source can be obtained with:
svn co http://svn.apache.org/repos/asf/incubator/thrift/trunk thrift
and then you can proceed with the standard "configure && make && make install". Hopefully at this point you have the Thrift native libraries installed. Since this is about Ruby, you should also install the Thrift gem that will take advantage of the native libraries.
sudo gem install thrift
Armed with both native library and gem, let's go ahead and navigate to your Cassandra install's interface directory (cassandra/interface) and build the ruby code:
thrift --gen rb:new_style cassandra.thrift
This will generate (as of this writing...) three files: gen-rb/cassandra.rb, gen-rb/cassandra_constants.rb, and gen-rb/cassandra_types.rb. At this point you can create a temp.rb file in the gen-rb folder to play around with connections.

It's worth noting that there *is* a gem available on github from fauna/cassandra that creates a much easier-to-work-with client, but since the interface for Cassandra is still evolving and changing, the client is broken at the moment. As far as I know this only applies to Cassandra 0.4.1 DEV and newer; I'm very much looking forward to a working update. In the meantime, here's a short example of how to make a GET request for a specific key:
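
(A sketch, not gospel: Cassandra's Thrift interface was changing quickly at the time, so the get(keyspace, key, column_path, consistency_level) form below, the CassandraThrift namespace, and the Keyspace1/Standard1 names from the default storage-conf.xml are all assumptions that may not line up with whichever cassandra.thrift you generated from.)

$LOAD_PATH.unshift 'gen-rb'
require 'thrift'
require 'cassandra'   # the generated gen-rb/cassandra.rb

transport = Thrift::BufferedTransport.new(Thrift::Socket.new('localhost', 9160))
protocol  = Thrift::BinaryProtocol.new(transport)
client    = CassandraThrift::Cassandra::Client.new(protocol)
transport.open

# Fetch a single column for a specific key out of the Keyspace1/Standard1
# column family that ships in the default storage-conf.xml
path   = CassandraThrift::ColumnPath.new(:column_family => 'Standard1',
                                         :column        => 'name')
result = client.get('Keyspace1', 'some-key', path,
                    CassandraThrift::ConsistencyLevel::ONE)
puts result.column.value

transport.close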

Provisioning script for Ubuntu Intrepid and Ruby 1.9.1

Here's a simple gist I use to provision either Amazon EC2 AMIs or Slicehost images running Ubuntu Intrepid. It'll set up all the requirements to build Ruby 1.9.1 from source, since the official Ubuntu package isn't due out until Karmic Koala is released. It has a handful of constants at the top of the file you need to define for everything to work right. Of note are the application name and the FQDN the machine should answer on. If you're using EC2 you might have to tweak some configuration afterward, since the FQDN in DNS probably won't match the IP of the machine's interface. Also, if you plan on using authorized_keys and deploying from a git repository, it makes things a lot easier if you tar and gzip the relevant files and put them in an S3 bucket to pull from; the script handles this case as well. As always, make sure you understand what a provisioning script does before you accept it with blind faith. At Flatterline we use this as our base template.

Symmetric indices to make JOINs faster

I am frequently asked how to increase the performance of Rails, and here's a great starting point. This advice generalizes to just about any database or platform that relies on B-Tree indices; if you're using MySQL out of the box, then this definitely applies. Consider the following three models, which are a very basic "has and belongs to many" setup (sketched after the list below): a user can be in many groups and a group can have many users, all by way of the memberships join relationship. The two example use cases I'm going to work with are:
  1. Given a user, what groups is he/she in?, and
  2. Given a group, who are the members?
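
For reference, here's a minimal sketch of the three models being described (class and column names are assumed):

class User < ActiveRecord::Base
  has_many :memberships
  has_many :groups, :through => :memberships
end

class Group < ActiveRecord::Base
  has_many :memberships
  has_many :users, :through => :memberships
end

class Membership < ActiveRecord::Base
  belongs_to :user    # the memberships table carries user_id and group_id
  belongs_to :group
end
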
Both of these use cases are pretty typical, but can yield surprisingly different results from the database's perspective. If we try to get this data from the console we either start with a user and navigate to groups, or the reverse. Here's the MySQL EXPLAIN output from the console (I recommend viewing the RAW output unless I can figure out how to make github display it correctly):

Totally understanding the output of the EXPLAIN syntax is well outside the scope of this post, but we need to cover the basics. The first thing you should notice is the word ALL in the type column and the NULL in possible_keys. This indicates that MySQL's query optimizer has no index to read from and will be forced to perform a table scan to return the result. In general, this will kill your performance. Note the rows value of 100: this value will be whatever the size of your table is, so if you have 500,000 records, the database will check all 500,000 rows. It's worth noting that for small datasets you'll sometimes see ALL even with a possible_keys value; this means the optimizer believes that scanning the table will be faster than actually loading the index into memory, which is generally fine.

So let's go ahead and add a composite index on [user_id,group_id]. The SQL is:
ALTER TABLE memberships ADD INDEX test_index(user_id,group_id)
Now let's repeat the previous queries. This is where I see most people stop when it comes to performance optimization. Note, though, that the two EXPLAIN results aren't the same! If you join from users to groups (the first query) you see a massive speedup: only one row is consulted (instead of 100) and it's in the index. A further benefit we see in both queries is the "Using index" note in the Extra column. This means that MySQL can determine the query result without ever checking the actual table, because all required info is in the index (i.e. no extra disk hits). Unfortunately, joining from groups to users (the second query) still (sorta) sucks. It says index instead of ALL, but that just means it will scan the entire index rather than scan the entire table on disk. This is a marginal improvement at best, so 50% of our use cases still suck.

Here's the explanation: B-Tree indices are unidimensional structures. The interior nodes of the index tree are strongly ordered, and thus cannot be arbitrarily accessed out of order. If that doesn't make any sense, it means that joining from users to groups is not an equivalent operation to joining from groups to users, because of the ordering of the index elements. So let's clean up the second use case by adding an additional index, exactly like the first, except with the order of the elements reversed. Here's the SQL:
ALTER TABLE memberships ADD INDEX test_index1(group_id,user_id)
Now we have symmetric indices. Let's run our queries again with our second index in place: Voila! Now it doesn't matter which way we join the tables because we have an index that is correctly ordered based on the directionality of the join. You can even see that the optimizer selects a different index (key) depending on which direction you join the tables, exactly as expected. Also, both queries now only require consulting the exact number of rows necessary and won't involve any further disk hits as both queries can be satisfied with data available entirely within the index.

Ubuntu JeOS

If you do anything with virtual machines, Linux or not, then you should definitely check out Ubuntu JeOS (pronounced "juice"). JeOS is shorthand for "just enough operating system." It's a simple, short install process that's hidden in the server installation ISO: at startup you just press F4 and select "Minimal virtual machine". Rather than installing the entire Ubuntu server package you get an extremely pruned-down version that only includes the absolute bare essentials to get a server running. If you're like me and believe that VMs should be created for a singular purpose only, then it will make your life even easier to create a bunch of small VMs. According to their specs you should expect something like 380MB disk images running in about 128MB of RAM.

Entrepreneurs: regarding equity to developers

Dear enthusiastic entrepreneur, I would like first and foremost to wish you well on your endeavor. Starting a business is no small feat and you will certainly need all the help you can get to bring your idea to life. That being said, let's talk about money. You will require money if you plan on developing software.

If you've considered finding a developer and asking for his or her time in exchange for sweat equity, I'm going to strongly advise you against it. Not only are these requests often derided, they are often insulting to the developer. Worse yet, they make you appear to be an amateur with little understanding of how the technology field and development work. In the words of the Joker, "If you're good at something, never do it for free." All the engineers I trust to execute a project swiftly, professionally, and with quality, especially under startup conditions, will not touch an upfront equity deal with a 10-foot pole. They understand not only the value of their abilities, but that they could just as easily band together and make their own startup. A common trait among these engineers is an understanding of their own entrepreneurial value. Investing time in your goals, when they could be developing their own, comes across as a high opportunity cost with an extremely unlikely payoff. As great as you think your idea may be, 90% of startups still fail.

If you've ever tried to secure external business funding, whether it be angel or venture based, then you know that there's most likely weeks of negotiation coupled with presentations, milestones, and the unmistakable feeling that you're in a fish bowl being watched. This shouldn't come as a surprise since you're asking someone for a large sum of money. The lender wants to see mitigated risk as well as receive some ownership for their money. Equity to a developer works the same way, except it's often approached as hiring an employee instead of bringing on an investor. Be prepared with your business plan, term sheet, and probably a beer. An "I'm giving you the opportunity to get in on the ground floor" attitude is going to make you the target of contempt.

So when is equity OK? It will vary depending on the unique situation of the developer, but a good rule of thumb would be to not offer it until AFTER you've developed a successful working relationship in which you have demonstrated your ability to execute your business plan. Your stock options are worth zero until you successfully exit and, even then, only if the strike price is better than market value. Proof of execution is what matters today, not ideas. It's not that somebody is working for free, it's that they're working for a better long-term payoff, a payoff which you must prove is likely to happen. Also, just like an investment, the more risk that's on the table at the time of the offer, the more of your company you should be prepared to part with. If you offer a fractional percentage, don't expect a call back.

It's going to be a rough road, as it is with all startups, and I wish you the best of luck in your venture. -Chris

Converting latitude and longitude to timezones

If you've fought it out with localization (l10n) of timezones then you know it can be a pain in the ass.  Further, suppose you're localizing arbitrary information where all you've really been given is an address.  The relevant information isn't necessarily in your system, and the user might be in the wrong timezone anyway, so there's no sense in using that. GMT offsets are a convenient way of moving time data in and out of UTC, as well as of avoiding arbitrary string names.  Here's a quick Ruby way to convert latitude and longitude into a timezone ActiveSupport recognizes; the snippet relies only on Hpricot and the freely available GeoNames API:
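
This is a minimal sketch of the approach; it assumes the old unauthenticated ws.geonames.org endpoint (GeoNames now wants a registered username on API calls) and an ActiveSupport 2.x-style require:

require 'open-uri'
require 'hpricot'
require 'active_support'   # 2.x-era require; use 'active_support/all' on Rails 3+

# Ask GeoNames for the timezone at a lat/lng and hand the GMT offset to ActiveSupport
def timezone_for(lat, lng)
  xml = Hpricot.XML(open("http://ws.geonames.org/timezone?lat=#{lat}&lng=#{lng}").read)
  offset = xml.at('gmtOffset').inner_text.to_f
  ActiveSupport::TimeZone[offset]
end

zone = timezone_for(41.85, -87.65)   # roughly Chicago
puts zone.name                       # one of the zones at GMT-6, e.g. "Central America"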