Cloud hosting vs colocation

Edit: I'm linking to the discussion on HN that has another 90 comments or so.  More discussion.

I understand that it might sound heretical to claim that traditional colocation is still the more cost-effective approach to hosting, but let's look at the numbers.

 

tl;dr

Cloud computing isn't that "cheap" unless you're talking about tiny costs that early, early stage startups might face. Once you know you will require dedicated infrastructure it's very likely worth it to get your own hardware.  You're employing technical people anyway; have some of them manage your infrastructure.  You would have needed some people doing it with your cloud hosting anyway.  Being able to throw stuff away is nice, but so is having authoritative control over your business operations and in-house accountability.

The actual numbers

Hypothetically, let's look at a scenario where we need 4 web servers, 50 app servers, and 2 reasonably high powered databases. For all intents and purposes, this is a small data center and the kind of configuration that cloud companies like Amazon claim they are designed to handle. It also mimics several of my clients' infrastructures, which are built on Rails, so it feels particularly apropos.

Next, let's suppose that each of these servers requires 8 cores to keep up with the workload. So all together we have 56 servers at 8 cores apiece. For the databases, let's assume those are also 8 core machines but will require 68GB of RAM apiece plus about 10TB of storage. I think it would make more sense to actually add more RAM, but that's the largest amount you can get from Amazon right now on their Instance Types list.

We're going to talk about the total cost of ownership across a year for both options.  I am also going to make the assumption that the duty cycle of activity is fairly uniform.

Note: If you're going to disagree with one of my assumptions it's this last one. I am perfectly aware that a uniform duty cycle is unheard of when it comes to web applications because we're saying that our traffic load is uniform across the day. I am, however, making this assumption because for all the companies I talk to that cite the ability to dynamically decommission hardware, none of them do. There seems to be a massive perceived risk in starting up and shutting down instances on the fly, and a worry about whether or not the customer experience will suffer. If you are currently cloud hosted and you decommission hardware daily then congratulations: anecdotally you are extremely rare.

How many cores do we need?

54 * 8 
= 432 cores

While it's totally overkill (and awesome), I put together a linear optimization model using glpk. If we optimize for server count then the fewest instances in Amazon that will meet this requirement are 54 high CPU extra large instances costing about $0.68 per hour each. Further, let's assume 1 month is 720 hours. So for one year:

54 HCPUEL * 720hr * 12m * $0.68 
= $317,260.80
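The arithmetic so far can be checked with a few lines of Python. This is only a sketch of the post's own numbers; the variable names are mine, and `Decimal` is used so the cents come out exact.

```python
from decimal import Decimal

# Inputs taken from the text above; names are illustrative only.
WEB_AND_APP_SERVERS = 54          # 4 web servers + 50 app servers
CORES_PER_SERVER = 8
HOURS_PER_MONTH = 720             # the post's simplification of one month
HOURLY_RATE = Decimal("0.68")     # high CPU extra large, per instance-hour

total_cores = WEB_AND_APP_SERVERS * CORES_PER_SERVER
annual_instance_cost = WEB_AND_APP_SERVERS * HOURS_PER_MONTH * 12 * HOURLY_RATE

print(f"cores needed: {total_cores}")
print(f"annual instance cost: ${annual_instance_cost:,.2f}")
```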

That covers our web and application servers. For our databases we're going to need high-memory instances. The only one that offers 68GB of RAM is a "High-Memory Quadruple Extra Large Instance" which runs $2.00/hour.

2 HMQEL * 720hr * 12m * $2.00 
= $34,560

Now, these 2 databases are going to need 10TB of storage apiece, so let's add in 10TB of storage.  Checking the price sheet again: $0.10 per GB-month of provisioned storage.

$0.10 * 1024GB * 10TB * 2 servers * 12months 
= $24,576
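The two database line items follow the same pattern. A small sketch, again using only the figures quoted above (1TB is treated as 1024GB, as in the formula; the names are mine):

```python
from decimal import Decimal

DB_SERVERS = 2
HOURS_PER_YEAR = 720 * 12
DB_HOURLY_RATE = Decimal("2.00")          # high-memory quadruple extra large
STORAGE_PER_GB_MONTH = Decimal("0.10")    # provisioned storage
GB_PER_TB = 1024
TB_PER_SERVER = 10

db_instance_cost = DB_SERVERS * HOURS_PER_YEAR * DB_HOURLY_RATE
db_storage_cost = STORAGE_PER_GB_MONTH * GB_PER_TB * TB_PER_SERVER * DB_SERVERS * 12

print(f"database instances: ${db_instance_cost:,.2f}")
print(f"database storage:   ${db_storage_cost:,.2f}")
```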

 

We are currently up to $376,396.80 and we haven't talked about bandwidth or management. Since we have a uniform duty cycle (I can hear you screaming) let's suppose that each of our 54 web and app servers receives 1Mbps worth of traffic. Not a ton, but certainly not trivial either. Well, at 1Mbps then every 8 seconds we're transferring a megabyte per machine. How much is that per month, and then over a year?

(1Mbps * 54 servers * (60sec * 60min * 24hr * 30days)) / (8 * 1024) 
= 17,085.9375 GB/month
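If you want to plug in your own traffic numbers, the conversion from a sustained rate to monthly transfer can be wrapped in a function. This is a sketch under the same conventions as the formula above (30-day month, 8 bits per byte, 1024MB per GB); the function name is made up for illustration.

```python
def monthly_transfer_gb(mbps_per_server: float, servers: int, days: int = 30) -> float:
    """Convert a sustained per-server rate in Mbps into GB transferred per month."""
    seconds = 60 * 60 * 24 * days
    megabits = mbps_per_server * servers * seconds
    megabytes = megabits / 8          # 8 bits per byte
    return megabytes / 1024           # 1024 MB per GB, as in the text

print(monthly_transfer_gb(1, 54))
```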

or

17,085GB * $0.15/GB * 12m
= $21,754.69

There's one last variable to account for, and that's EBS I/O activity. You will be billed for every million I/O requests you make to an EBS volume. With a uniform duty cycle let's say we're looking at 300 reads and 150 writes per second on each database server. So 450 IOPS (once again, quite low).

(450IOPs * (60sec*60min*24hr*30days*12months) * 2 DB servers) * $0.10 / 1,000,000 
= $2,799.36
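The same I/O calculation as a reusable sketch, so you can try a busier database. The rate and the 30-day month come from the text; the function and parameter names are mine.

```python
from decimal import Decimal

SECONDS_PER_YEAR = 60 * 60 * 24 * 30 * 12   # twelve 30-day months

def annual_io_cost(iops: int, servers: int,
                   price_per_million: Decimal = Decimal("0.10")) -> Decimal:
    """Yearly charge for a constant I/O rate on each of `servers` volumes."""
    requests = iops * SECONDS_PER_YEAR * servers
    return requests * price_per_million / 1_000_000

print(f"${annual_io_cost(450, 2):,.2f}")
```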

Thus, our total cost for the year:

 

$400,950.85

 

What would it cost to colocate these servers ourselves?  Let's look at some similar numbers.  The boys (and girls) over at Dell offer a reasonably priced 8 core server for about $1901.  Take a look for yourself over at Dell. I'm looking at the Dell R515 w/ 8 cores, 16GB of RAM, dual gigabit NICs, and redundant power.

Initial cost of servers:

54 servers * $1901 
= $102,654

Let's add in 10% of the cost of servers in additional networking hardware just in case our colocation option doesn't offer any beyond an uplink port.

$10,265.40

Since the R515 is a 2U server we're going to need 108U worth of rack space. A typical rack has about 40U of space, but since we want room for power spikes and our network hardware let's say we need 4 racks. Further, let's say we want to host this hardware in a Tier IV data center. That's going to run about $1000/month per rack.

12m * $1000 * 4racks 
= $48,000

Depending on the data center they may give you a certain number of hours per month of "remote hands" where you can have techs handling your hardware, or you can colocate in your city and pay employees to interact with your hardware.  If touching hardware is that abhorrent to you, you probably wouldn't have read this far anyway.

Let's talk bandwidth fees. Data centers don't usually bill like Amazon. Amazon charges per gigabyte, and most DCs charge at what's called the 95th percentile usage or Burstable Billing. So if 95% of the time you're only using 54Mbps, then you're billed for a 54Mbps link. Let's choose a higher number for cost and get some premium, guaranteed bandwidth at $150 per Mbps per month (quite high).

54Mbps * $150 * 12m
= $97,200

Lastly, let's choose some really sweet databases.  We can pump 64GB of RAM in an R515 so let's use that for comparison bringing the cost to $3831 and we'll add 10TB of high-performance direct attached storage per server.  I customized a Dell PowerVault MD1220 with 10TB, and SAS drives with a hardware buffered RAID controller for $12,200 apiece.

$3831 * 2servers + $12,200 * 2 servers 
= $32,062

Our servers, bandwidth, databases, and storage come out to a grand total of:

$290,181.40
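Here is the colocation total rebuilt from its line items, as a quick sanity check. Every figure is one quoted above; only the dictionary labels are mine.

```python
from decimal import Decimal

servers = 54 * Decimal("1901")                      # Dell R515, 8 cores each
line_items = {
    "web/app servers": servers,
    "network hardware (10%)": servers * Decimal("0.10"),
    "rack space": 12 * Decimal("1000") * 4,         # 4 racks for a year
    "bandwidth": 54 * Decimal("150") * 12,          # 54Mbps at $150 per Mbps
    "databases + storage": 2 * (Decimal("3831") + Decimal("12200")),
}

colo_total = sum(line_items.values())
for label, cost in line_items.items():
    print(f"{label:<24} ${cost:>12,.2f}")
print(f"{'total':<24} ${colo_total:>12,.2f}")
```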

That's $110,769.45 less than the cost of the equivalent Amazon service. So let's close out with some quick commentary on how to compare these numbers.

 

  1. Note that these numbers are for the first year of operation.  If you don't change anything in year 2 you will have an additional $400,950 for Amazon because it's a subscription.  If you bought the hardware, it's yours.  You'll just have the cost of colocation and bandwidth: $145,200. Have your accountant look into depreciating a capital asset and that number may come down further with tax advantages.
  2. I tried to achieve a certain level of "spec parity" between the two, but that isn't really fair. The hardware we're getting from Dell won't be virtualized (unless we add it), shared, or completely abstracted away. If you've ever had to deal with EBS volumes then you know the performance is questionable at best and you have zero visibility into it. We have direct attached storage and no contention with other customers. The same applies to our internal network bandwidth: only our machines are linked on private switches.
  3. The advantage of cloud computing is you just "throw away" instances running poorly or misbehaving. With your own gear you will have to manage and periodically fix broken hardware. Some of the hardware, like the PowerVault storage, includes a 3 year, 24x7x365 support contract with a 4 hour max response time. It's an opt-in for general servers but isn't ridiculously expensive. A glance at the AWS support options suggests we'd need Platinum support for something comparable. That runs the greater of $15K a month or 10% of monthly usage. Our monthly usage is about $33,412, so 10% of it is well under the minimum. That's $15,000 a month.
  4. Never under-estimate the power of taking on responsibility for your entire stack down to the hardware. Yes, you will have more responsibility and you will need to employ people.  You had to employ people with EC2 so you're just having the same people work on your own hardware.  Need more bandwidth? Add switches.  Need intrusion detection? Add it, you have control of the network.  Want to share IPs for high-availability configurations? Add it, the switches are yours.  Want to be PCI-compliant? Not a problem, your cabinets are yours and you can prove physical access restriction easily.
  5. You want to experiment with ideas and write code that can interface with your physical infrastructure? Then look at Amazon Virtual Private Cloud.  Add cloud instances and link them in to your network.  Throw them away when you're done and move them to dedicated hardware if they're doing something you plan on doing for the long-term.
  6. 37Signals manages their own hardware.  "We were able to cut response times to about 1/3 of their previous levels even when handling over 20% more requests per minute." If it's cool enough for them, then it's cool enough for you: http://37signals.com/svn/posts/1819-basecamp-now-with-more-vroom
  7. What if we say we only keep 25 nodes up full-time and make 29 of them half-time to reflect a more likely duty cycle?
    25 HCPUEL * 720hr * 12m * $0.68 + 29 HCPUEL * 360hr * 12m * $0.68 
    $232,070.4
    A savings of $85,190 but that's still $315,760.85.
  8. Yes, you will have to build out your own data center space, that will take some time, and you can't just click a button. Most colocation facilities will be more than happy to help you with this. The result though is most likely worth it. Look at most companies that maintain a tyrannical control over their products and integration and tell me they aren't generally perceived as market leaders.
  9. What about Reserved Instances? The pricing options will help, but you won't come out ahead over colocation. The example above used high-CPU extra larges, and the cost of that reserved instance is $1820 plus a "reduced" hourly rate. That's only about $80 less than purchasing a comparable $1901 server outright. Plus, at the end of the year you'll have to make the investment again.
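
To put point 1 in numbers, here is a rough cumulative comparison over several years. It is a sketch built on the totals above and assumes nothing changes year to year: no hardware refresh, no price changes, and no staffing costs on either side.

```python
from decimal import Decimal

AWS_ANNUAL = Decimal("400950.85")        # total from the Amazon section
COLO_YEAR_ONE = Decimal("290181.40")     # hardware + first year of hosting
COLO_RECURRING = Decimal("145200")       # racks + bandwidth in later years

def cumulative(years: int) -> tuple[Decimal, Decimal]:
    """Return (AWS spend, colocation spend) after `years` full years."""
    aws = AWS_ANNUAL * years
    colo = COLO_YEAR_ONE + COLO_RECURRING * (years - 1)
    return aws, colo

for year in (1, 2, 3):
    aws, colo = cumulative(year)
    print(f"year {year}: AWS ${aws:,.2f}  colo ${colo:,.2f}  gap ${aws - colo:,.2f}")
```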

 

An argument against hashing credit cards

I'm still editing this, but thought I'd at least open it up to read in the meantime.

Also, special thanks for helping me review this one to:

Dave Hoover @redsquirrel

Blake Smith @blacksmith

Jason Casillas @RAGEBARRAGE

The setting

Periodically I'm asked whether or not it's considered safe to store hashes of credit card numbers. The goal might be a secondary form of user authentication, or fraud detection to see if a card is used in different places by different users. In general, I am strongly against storing hashes of credit card numbers because, combinatorially, it isn't that much safer against a clever attacker than storing the number outright.

The combinatorics

When it comes to security, or more specifically cryptanalysis, your system is only as safe as the narrowest/smallest brute-forceable surface. Consider an arbitrary, idealized hash algorithm h that generates 160 bits of output. If you had the hashed output c and the function h you would need to find an input p such that h(p) = c and you'd have all the secrets. Well, the set of all possible inputs is infinite, but no hash is collision-free because its output is finite, so where's the upper bound? It's 2^160 (the size of the hash; N bits can represent 2^N unique pieces of information: see Integers in Computer Science and the Pigeonhole principle).

Here we consider the difference between an infinite set and 2^160. Thanks to a great writeup by Jeff Bonwick we know that fully populating a 128 bit set would literally require more energy than it would take to boil the oceans of Earth, so 160 bits should be plenty safe, right?

Let's keep going and move on to look at the structure of a credit card number:

0000-0000-0000-0000

Consider a typical credit card number is 16 digits (and only digits, no letters/symbols) long, so that gives us 10^16 total possibilities.  Written out: 10,000,000,000,000,000. 10 quadrillion numbers.  Not necessarily at the point at which we start calling things intractable, but still prohibitively large.  Let's try and remove some entropy from this number.

First of all, there's the Bank Identification Number (BIN) that accounts for the first 4-6 digits of the credit card number. I'm going to X out the digits that we may know ahead of time. In the event of a 4 digit BIN:

XXXX-0000-0000-0000

and a 6 digit:

XXXX-XX00-0000-0000

In the case of a 4 digit BIN we now have just 1,000,000,000,000 (10^12, 1 trillion) possibilities, and the 6 digit BIN leaves us with 10,000,000,000 (10^10, 10 billion) numbers.
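
It helps to express these keyspaces in bits so they can be compared directly with the 160 bit hash output. A small sketch (the function name is mine):

```python
import math

HASH_BITS = 160

def keyspace_bits(unknown_digits: int) -> float:
    """Bits of entropy in a number with `unknown_digits` free decimal digits."""
    return math.log2(10 ** unknown_digits)

for label, digits in [("nothing known", 16), ("4 digits known", 12), ("6 digits known", 10)]:
    print(f"{label:<15} 10^{digits:<2} is about {keyspace_bits(digits):.1f} bits "
          f"(hash output: {HASH_BITS} bits)")
```

Even with nothing known, a card number carries only a little over 53 bits, so the attacker searches the input space and the width of the hash never comes into play.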

So given a stack of 160 bit hashed credit card numbers, how much computing power would it take to reverse one of them out by brute forcing the hash? 10 quadrillion is certainly smaller than 2^160 so brute forcing the number is going to be the easier target. If we select a specific bank it's even easier. First, let's look at some code:

The CPU implementation

Assuming that we're going after a card with a 6 digit BIN, how long would it take us to try all the possibilities? We have three tasks:

  1. Generate a number
  2. Determine if it's valid
  3. Calculate the hash of the input and find a collision

The code to do that is here https://github.com/cchandler/cc-hash-probe .
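
For readers who don't want to dig through the repository, the three steps can be sketched in Python as follows. This is not the code from the repo: I'm assuming the validity test is the standard Luhn checksum and that the stored hash is an unsalted SHA-1 (which happens to be 160 bits) of the digits as ASCII text. The function names are hypothetical.

```python
import hashlib
from typing import Optional

def luhn_valid(number: str) -> bool:
    """Step 2: standard Luhn checksum over a string of decimal digits."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:            # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def probe(target_hex: str, bin_prefix: str, start: int, stop: int,
          length: int = 16) -> Optional[str]:
    """Try account numbers in [start, stop) behind a known BIN prefix."""
    unknown = length - len(bin_prefix)
    for n in range(start, stop):
        candidate = bin_prefix + str(n).zfill(unknown)     # step 1: generate
        if not luhn_valid(candidate):                      # step 2: validate
            continue
        digest = hashlib.sha1(candidate.encode()).hexdigest()
        if digest == target_hex:                           # step 3: compare
            return candidate
    return None

# Demo with the well-known Visa test number, searching a tiny window around it.
stolen_hash = hashlib.sha1(b"4111111111111111").hexdigest()
print(probe(stolen_hash, "411111", 1_111_111_000, 1_111_112_000))
```

A real run would sweep the whole range for the prefix; the window here is kept small so the example finishes instantly.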

Let's see how long it takes to get my 2.8Ghz Intel Core i7 to do the above steps 10,000,000 times:

14sec

Rounding that up to 15 seconds per 10 million iterations, scanning every possible number for a 6 digit BIN would take us 4.16 hours. In the event of a 4 digit BIN we'd have to do 100 times the work, or about 416 hours. Inconvenient, but not unthinkable, especially considering I have more cores I'm not using.
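
The estimate is just the keyspace divided by the measured throughput. A sketch you can reuse with your own timing (the function name and defaults are mine; 15 seconds per batch is the rounded figure used above):

```python
def brute_force_hours(unknown_digits: int, seconds_per_batch: float,
                      batch_size: int = 10_000_000) -> float:
    """Hours to try every value of `unknown_digits` free decimal digits."""
    batches = 10 ** unknown_digits / batch_size
    return batches * seconds_per_batch / 3600

print(f"6 digit BIN: {brute_force_hours(10, 15):.2f} hours")
print(f"4 digit BIN: {brute_force_hours(12, 15):.2f} hours")
```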

However, we have other faster/better options...

The GPU

My commodity laptop has an nVidia GeForce GT330M card inside it. I've highlighted the relevant stats:

[Image: GeForce GT330M specs]

That card, which most people aren't using, has 48 cores, is capable of executing 32 hardware threads at a time (the warp size), and is running at a clock rate of 1.1Ghz by itself. How long does it take my GPU to calculate the above 3 steps 10M times?

1.9s

Only about 2 seconds to do an equivalent workload.  My GPU code is available here: https://github.com/cchandler/cc-hash-probe/blob/master/gpu.cu

Let's do some math: at 2 seconds / 10M hashes we'll get through a 6 digit BIN in 33 minutes down from 4.16 hours. The 4 digit BIN in 55.56 hours down from 416.

But wait! It gets better. Amazon semi-recently announced the general availability of their High Performance Computing GPU cluster instances. If you haven't had a chance to play with them yet, you'll discover they come equipped with dual nVidia Tesla C2050 (Fermi) cards. For perspective, these cards have 448 cores apiece. For each instance you fire up, at $2.30/hr, you get 896 cores @1.15Ghz or roughly 20 times the compute power of my laptop (I'm rounding up).

Let's suppose these cards do 20 times the workload of my laptop; then in 2 seconds we'll have 200,000,000 hashes instead of 10,000,000. How long will it take us to go through that 6 digit BIN now? 200M hashes in 2 seconds will yield all possible outputs in a little over a minute and a half (1.67m). The 4 digit BIN will be ours in about 2.7 hours.
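
Here is that scaling written out. Treat it as a back-of-the-envelope sketch: it assumes throughput grows linearly with core count, which is the simplification used above and not a benchmark of the Tesla cards.

```python
LAPTOP_CORES = 48
INSTANCE_CORES = 2 * 448                       # two Tesla C2050 cards
core_ratio = INSTANCE_CORES / LAPTOP_CORES     # about 18.67

ASSUMED_SPEEDUP = 20                           # the text rounds the ratio up
LAPTOP_RATE = 10_000_000 / 2                   # hashes per second on the laptop GPU
instance_rate = LAPTOP_RATE * ASSUMED_SPEEDUP  # hashes per second per instance

def seconds_to_exhaust(unknown_digits: int, hashes_per_second: float) -> float:
    """Seconds to try every value of `unknown_digits` free decimal digits."""
    return 10 ** unknown_digits / hashes_per_second

print(f"core ratio: {core_ratio:.2f}x")
print(f"6 digit BIN: {seconds_to_exhaust(10, instance_rate) / 60:.2f} minutes")
print(f"4 digit BIN: {seconds_to_exhaust(12, instance_rate) / 3600:.2f} hours")
```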

A more clever attack

So now we know how fast we can potentially recover all the possible values using our fancy-pants Amazon GPU instances.  A total brute-force attack might still be a bit prohibitive because we don't want to shell out that much money for the compute instances.  

Thanks to institutional banking, it turns out that a massive amount of money is deposited in very few banks.  Here are the top 3:

  • Bank of America
  • JPMorgan Chase
  • Citigroup

What if we only bothered to go after credit cards at these banks?  A quick check at Wikipedia confirms that a list of known BIN numbers is available.

Here are all currently listed Bank of America BINs:

  • 377311 - MBNA Europe Bank (Bank of America) bmi plus Credit Card (UK)
  • 377311 - MBNA Europe Bank (Bank of America) Virgin Atlantic Credit Card (UK)
  • 41177 - Bank of America (US; formerly FleetBoston Financial|Fleet) VISA Debit Card
  • 414716 - Bank of America (US) - Alaska Airlines Signature Visa Credit Card
  • 417008-11 - Bank of America (USA; Formerly Fleet) - Business Visa Card 
  • 421764 to 66 - Bank of America VISA Debit Card
  • 4256 - Bank of America General Motors|GM Visa Check Card
  • 426428 to 29 - Bank of America (formerly MBNA) Platinum Visa Credit Card
  • 426451 to 52, 65 - Bank of America (formerly MBNA) Platinum Visa Credit Card
  • 430536, 44, 46, 50, 94 - Bank of America (formerly Fleet) Visa Credit Card
  • 431301 to 05, 07, 08- Bank of America (formerly MBNA) Preferred Visa & Visa Signature Credit Cards
  • 432624 to 30 - Bank of America (formerly Fleet National Bank) Visa Check Card, Debit
  • 4342 - Bank of America Classic Visa Credit Card
  • 4356 - Bank of America Visa Debit Card
  • 435680 to 90 - Bank of America, Visa, Platinum Check Card, Debit
  • 449533 - Bank of America (USA), National Association - Classic, Debit, Visa
  • 4635 - Bank of America Business Platinum Debit
  • 4744 - Bank of America Visa Debit
  • 474480 - Bank of America Visa Debit, Midwest USA
  • 4888** - Bank of America (US) - Visa Credit Card
  • 5401 - Bank of America (formerly MBNA) MasterCard Gold Credit Card
  • 549035 - MBNA American Bank [Now part of Bank of America]
  • 549099 - MBNA American Bank [Now part of Bank of America]
  • 587781 - Bank of America ATM Card

More generally: 37 6-digits and 7 4-digits. If we used the high-powered compute instances that means reversing out all the Bank of America cards would take:

2.7 hours * 7 BINs + 0.02 hours * 37 BINs = 19.64 hours

Just 20 hours of compute time to potentially recover all possible BofA numbers. Considering it's $2.30/hr for these instances that's about $46.00.
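
The same tally in code, in case you want to swap in a different bank's BIN counts. The per-BIN hours and the hourly price are the figures used above; rounding up to whole instance-hours is my assumption about billing and matches the 20 hours in the text.

```python
import math
from decimal import Decimal

FOUR_DIGIT_BINS = 7
SIX_DIGIT_BINS = 37
HOURS_PER_FOUR_DIGIT_BIN = Decimal("2.7")
HOURS_PER_SIX_DIGIT_BIN = Decimal("0.02")
PRICE_PER_HOUR = Decimal("2.30")

total_hours = (FOUR_DIGIT_BINS * HOURS_PER_FOUR_DIGIT_BIN
               + SIX_DIGIT_BINS * HOURS_PER_SIX_DIGIT_BIN)
billed_hours = math.ceil(total_hours)          # assume whole-hour billing
total_cost = billed_hours * PRICE_PER_HOUR

print(f"{total_hours} hours of compute, billed as {billed_hours}: ${total_cost:,.2f}")
```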

The takeaway

Don't hash credit card numbers. Or, if you insist on doing it, store/use them in a way that guarantees that if they are reversed the would-be attacker can't connect them back to the personally identifiable information that would make them usable elsewhere (e.g. no foreign key to the user table, and no created_at/updated_at timestamps that can be matched against other tables/columns). GPU availability is only getting better and better, and they're cramming more and more cores into them. Over the next few years you can expect this kind of computing to get cheaper and easier. What's "irritating" today at 2.7 hours is going to be trivial in the not-so-distant future.