Saturday, April 16, 2011

A sharding model

A hybrid database model
When playing around with Cassandra, I realized that with its limited data model and lack of ad-hoc queries, pure Cassandra may not be the optimal solution for applications that need to join lots of different data and run complex queries. You would have to denormalize the data substantially, and that can become a management nightmare when something you need to update is stored in five different column families. So I decided to try to build a hybrid of the non-relational Cassandra and the great MySQL.

Sharding
To achieve near-infinite scaling and fault tolerance on relational databases, pretty much the only way to go that I know of is sharding. Sharding splits the entire schema up by rows, so there are many databases on different machines, each containing the full schema but only a subset of the rows - a slice of the total knowledge.

This way, if we know which database holds which data, we can read from or write to it directly, so both read and write load is nicely distributed among the machines. It also improves availability: if one machine out of, say, 20 simple cheap commodity boxes dies, only one twentieth of our service is disrupted.

Sharding is a big topic that I don't pretend to fully cover in this blog post, but I'll try to outline the general idea through my own experience.

How does it work
The first thing to figure out for a specific problem is what to shard your data by. This is an important choice, because you want to place the most tightly coupled data on the same shard so that you can still do joins over the different data, and that is only possible if the data to join lives in the same database. In a user-centric system such as a social network, it's usually a good idea to shard by user, so that everything to do with a single user, such as profile info, friends and pictures, is on the same shard - this way joins can still be performed for a given user.

Shard key
This brings us to the shard key, which is used to choose which shard to place the data on. A simple candidate could be the first letter of the user name, but that is actually a bad idea, as there are many more names starting with S than with Z and you want the keys to be distributed equally over the machines.

Since the data in a sharded system is divided between many machines, you have to figure out which box to write new data to and which machine to fetch and update it from. This brings us to the sharding scheme, which decides which shard should get which keys.

Sharding scheme
The simplest way would be to take some user property such as the email address, hash it for better distribution (a hash is a rather random-looking string, even though the same input always produces the same hash) and then take a modulo with the number of shards in the pool.

In PHP, this could look something like the following:

<?php
$shardCount = 10;
$shardKey = 'kallaspriit@gmail.com';
// use the first 15 hex characters of the hash so it fits into an integer
$hash = hexdec(substr(sha1($shardKey), 0, 15));
$shardId = $hash % $shardCount;

This divides the user emails equally among the shards and is very simple to implement. It works great right up to the point where we add or remove shards. That changes the shard count and thus the modulo result for most of the keys, which means we would have to relocate most of the data - a very expensive thing to do. So this approach is only reasonable if you are pretty sure the number of shards won't change, and if you are aiming for big scalability, that won't be the case.
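To get a feel for how bad this is, here is a quick experiment you could run (the exact percentage will vary with the sample keys used):

<?php
// count how many keys land on a different shard when going from 10 shards to 11
function shardFor($key, $shardCount)
{
    return hexdec(substr(sha1($key), 0, 15)) % $shardCount;
}

$moved = 0;
$total = 10000;

for ($i = 0; $i < $total; $i++) {
    if (shardFor('user-' . $i, 10) !== shardFor('user-' . $i, 11)) {
        $moved++;
    }
}

// prints something like 90%, since a key only stays on its old shard by chance
echo round(100 * $moved / $total) . '% of the keys would have to be relocated';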

Consistent hashing
One might wonder whether there is a way to minimize the number of keys that have to be relocated when adding or removing shards, and in fact there is. The method is called consistent hashing: with n shards, only about 1/n of the keys need to be relocated when a shard is added or removed.

How consistent hashing works, in short, is by hashing both the shards and the shard keys onto an imaginary circle; to find which shard a key belongs to, you just move along this circle until you reach the hash of one of the shards.
In the picture, A, B and C are the shards and the numbers are shard keys. They are hashed into a range that is connected end-to-end, so if no shard is found for a key by the end of the range, the search continues from the start. In the example above, keys 1 and 4 map to server A, 2 to B and 3 to C. If we were to remove server B, key 2 would also hash to server C, and the same logic applies for adding nodes. In practice, for better distribution, each node is hashed onto the circle many times (64 points per node worked well for me, with a low enough standard deviation).
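Here is a minimal sketch of the idea in PHP; it is simplified (no shard removal, no capacity weighting) but shows the ring lookup itself:

<?php
// shards and keys are hashed onto the same numeric ring; a key belongs to the
// first shard position found when moving along the ring
class ConsistentHashRing
{
    private $ring = array();      // ring position => shard name
    private $positions = array(); // sorted ring positions
    private $pointsPerShard;

    public function __construct($pointsPerShard = 64)
    {
        $this->pointsPerShard = $pointsPerShard;
    }

    private function position($value)
    {
        // map any string to an integer position on the ring
        return hexdec(substr(sha1($value), 0, 15));
    }

    public function addShard($name)
    {
        // hash each shard onto the ring many times for better distribution
        for ($i = 0; $i < $this->pointsPerShard; $i++) {
            $this->ring[$this->position($name . '#' . $i)] = $name;
        }

        $this->positions = array_keys($this->ring);
        sort($this->positions);
    }

    public function getShard($key)
    {
        $position = $this->position($key);

        // walk along the ring until we pass the position of the key
        foreach ($this->positions as $shardPosition) {
            if ($shardPosition >= $position) {
                return $this->ring[$shardPosition];
            }
        }

        // wrap around to the start of the ring
        return $this->ring[$this->positions[0]];
    }
}

$ring = new ConsistentHashRing();
$ring->addShard('shard-A');
$ring->addShard('shard-B');
$ring->addShard('shard-C');

echo $ring->getShard('kallaspriit@gmail.com'); // one of shard-A/B/C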

Dictionary lookup
I implemented the consistent hashing algorithm described above in PHP, but then I realized that it is still not good enough, as it only minimizes but does not eliminate the need to relocate data, which would be a pain. The most powerful way to implement sharding is to keep a lookup service that knows which shard each shard key maps to. This eliminates the relocation caused by changing the number of shards, but still supports relocating data very precisely should it ever be necessary.

The initial way I imagined this was to create a central web service that each server node could contact to find which shard to store or fetch data from. Then I remembered that "central" is a very bad concept in the world of distributed systems and such a service would become both a bottleneck and a single point of failure.

Then I decided to implement the mapping using a database that every server node can query. This way there is no single service node to become a bottleneck, and talking directly to a database is generally faster than going through a web service, even a lightweight one such as REST. There would also be security issues to solve.

Implementation
I implemented this in PHP on top of MySQL and, to test the concept, built a simplistic blog on it. The blog allows people to register, create their blog and add entries that others can read.

The first thing I discovered while building such a system is that it is not enough to map a single shard key (say the email address I used as the username) to a shard. To log a user in, I need to find the shard that the globally unique email address resolves to, but to display anyone's blog, I need to find the shard by blog name. To overcome this, I introduced an extra level of abstraction: any number of identifiers (email, blog name etc) map to a shard key, and the shard keys then map to the shards.
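Conceptually, the mapping then looks like this (the values are made up for illustration):

<?php
// two levels of mapping: any number of identifiers point to one shard key,
// and the shard key points to the shard the data lives on
$aliases = array(
    'foo@bar.com' => 'user-123', // used when logging in by email
    'foos-blog'   => 'user-123', // used when viewing the blog by its name
);

$shardKeys = array(
    'user-123' => 'shard-2',
);

echo $shardKeys[$aliases['foos-blog']]; // "shard-2"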

As a bigger system may have multiple separate components requiring separate schemas, I also introduced the concept of shard groups, each of which includes the SQL needed to set up new shards. This way it would be easy to build an administration panel where, by just entering the access information of a new shard database and choosing the group, the database could be set up automatically.

All of this led up to four tables for my MySQL-based sharding schema:


The group_id and identifier columns form the primary key of the alias table, as the identifier must be unique within a single group. This is also a convenient way to ensure that the email addresses users register with are unique - without it, one would have to query all of the shards to find out whether an email address is already taken.

The key table simply maps a shard key to a shard. The shard table contains the information required to contact the shard and a reference to its group, so a new shard can be set up initially with the SQL provided by that group. A shard also has a capacity field which shows the relative power of the shard machine. This makes it possible to shard over boxes of varying performance, as the load balancer makes sure that a server with a capacity index twice as large gets about twice the number of keys. The number of keys a shard is currently storing is also kept here and is used for load balancing: when a new shard is added, it starts to get slightly more keys than the rest until the number of stored rows balances out.
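For example, picking the shard for a new key can be as simple as choosing the one with the lowest key count relative to its capacity (the field names below are illustrative, not necessarily the exact column names):

<?php
// pick the shard whose key count is lowest relative to its capacity,
// so more powerful boxes end up holding proportionally more keys
function pickShardForNewKey(array $shards)
{
    $best = null;
    $bestRatio = null;

    foreach ($shards as $shard) {
        $ratio = $shard['key_count'] / $shard['capacity'];

        if ($bestRatio === null || $ratio < $bestRatio) {
            $bestRatio = $ratio;
            $best = $shard;
        }
    }

    return $best;
}

$shards = array(
    array('id' => 1, 'capacity' => 1, 'key_count' => 1000),
    array('id' => 2, 'capacity' => 2, 'key_count' => 1500), // twice as powerful
);

$target = pickShardForNewKey($shards); // shard 2, since 1500/2 < 1000/1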

In the actual implementation, there are two levels of cache between the clients and the lookup database. For a short period of time, the very fast local machine cache (PHP APC) is used. When that times out, a global pool of memcached servers is consulted, and only when that also misses is an actual query made against the database. This is of course only true for fetch operations; adding new entries always requires talking to the database. Fetches, however, should have a near 100% cache hit rate, so the database should not get much load.
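A simplified version of that lookup chain might look like the following; it assumes the APC and Memcached extensions and a PDO connection to the lookup database, and the cache key format, table names and timeouts are just examples:

<?php
// two-level cached lookup: local APC first, then the shared memcached pool,
// and only on a full miss do we query the lookup database itself
function getShardKeyCached($groupId, $identifier, Memcached $memcached, PDO $pdo)
{
    $cacheKey = 'shard-key.' . $groupId . '.' . $identifier;

    // level 1: very fast local cache, kept only for a short time
    $shardKey = apc_fetch($cacheKey, $success);
    if ($success) {
        return $shardKey;
    }

    // level 2: shared memcached pool
    $shardKey = $memcached->get($cacheKey);
    if ($shardKey !== false) {
        apc_store($cacheKey, $shardKey, 10); // repopulate local cache for 10 seconds
        return $shardKey;
    }

    // level 3: the lookup database
    $statement = $pdo->prepare(
        'SELECT shard_key FROM alias WHERE group_id = ? AND identifier = ?'
    );
    $statement->execute(array($groupId, $identifier));
    $shardKey = $statement->fetchColumn();

    if ($shardKey !== false) {
        $memcached->set($cacheKey, $shardKey, 600); // keep in memcached for 10 minutes
        apc_store($cacheKey, $shardKey, 10);
    }

    return $shardKey;
}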

What next
I have tested this setup and it works great. The only problem is that the single central lookup database does not scale, eventually becoming a bottleneck, and it is also a single point of failure.

This is where Cassandra finally comes in, as I plan to port the lookup system from the central relational database over to the distributed Cassandra, which should solve the problems of scaling and availability. The schema is simple enough that it should not be too hard to implement on Cassandra, so I will try it and see how it pans out. You should be able to read about it in coming posts, so stay tuned.

Saturday, April 9, 2011

The data model

The way Cassandra organizes, stores and retrieves its data is what makes it special. The model is conceptually different from what one might be used to from relational databases, but this is what makes its distributed nature possible.

Cassandra's data model trades ACID-compliant data practices for important advantages in performance, availability, and operational manageability.

Columns
Starting from the bottom up, the smallest increment of data is the column, which represents a tuple (well, a triplet actually) containing a name, a value and a timestamp.

A column JSON-encoded could look like this:

{
    "name": "email",
    "value": "foo@bar.com",
    "timestamp": 345678765
}

All the values are supplied by the client, including the timestamp, which means that the clocks on the client and server machines should be synchronized, as the timestamps are used for conflict resolution (remember, Cassandra embraces always-write, so conflicts are resolved on reads). Timestamps can be anything you like, but microseconds since 1970 is the convention.
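For example, a PHP client could build such a column and its microsecond timestamp like this:

<?php
// a column triplet with a microsecond-precision timestamp, which Cassandra
// uses for conflict resolution between concurrent writes
$column = array(
    'name'      => 'email',
    'value'     => 'foo@bar.com',
    'timestamp' => (int) round(microtime(true) * 1000000), // microseconds since 1970
);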

Super columns
Super columns can be used like columns, except that they add another layer to the hierarchy: instead of mapping keys to values, a super column maps keys to another set of columns. A super column is thus an associative array of columns, and its keys are also kept sorted.

One can thus think of columns and super columns in terms of maps: A row in a regular column family is basically a sorted map of column names to column values; a row in a super column family is a sorted map of super column names to maps of column names to column values.
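In PHP terms, the two could be pictured roughly like this (the keys and values are made up for illustration):

<?php
// a row in a regular column family: column name => column value
$regularRow = array(
    'email'    => 'foo@bar.com',
    'fullName' => 'Foo Bar',
);

// a row in a super column family: super column name => (column name => value)
$superRow = array(
    'home' => array('city' => 'Tartu',   'country' => 'Estonia'),
    'work' => array('city' => 'Tallinn', 'country' => 'Estonia'),
);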

Column families
A column family is the closest thing to a table in a relational system, as it is a collection of columns. Column families have to be defined in the storage configuration file and cannot be modified without restarting the Cassandra process. A column family holds an ordered list of columns that you can retrieve by name.

The way they are ordered can be chosen by the user, and this is important for operations that slice a collection of columns. Natively supported ordering implementations include ASCII, UTF-8, Long and UUID (lexical or time-based).

A column family may have a virtually unlimited set of column names, which makes it common to use the column names themselves as a piece of runtime-populated data.

Rows
Each column family is stored in a separate file that is sorted in row (i.e. key) major order, so related columns that you access together should be kept in the same column family. The row key determines which machine the data is stored on. Each row key can have data from multiple column families associated with it, but since these are logically distinct, the interface is oriented around accessing one column family per key at a time.

Keyspaces
Keyspaces are the first and largest dimension of the Cassandra hash system. A keyspace is a container of column families, which makes it the equivalent of a schema (a collection of tables) in a relational database. Keyspaces are the configuration and management point for column families, and also the structure on which batch inserts are applied.

Visually
Below is my attempt to visualize the data model.

Sunday, April 3, 2011

Cassandra elevator pitch

This is a good elevator pitch for the Cassandra database in 50 words written by Eben Hewitt:

"Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web".


Distributed and decentralized means that Cassandra is meant to be deployed on more than one, possibly hundreds of common commodity hardware servers, and it runs on them seemingly as a unified whole - an application can talk to it like a simple single-box database. It is decentralized because every node of the database is equal: there are no master servers coordinating the rest and thus no single point of failure. Every node holds part of the data and shares the total load equally.

Cassandra is elastically scalable, as it is easy to add new nodes as the application becomes more popular, or remove some should the opposite happen. You can just add new nodes and Cassandra will begin to send them work without the application even knowing that anything has changed.

Replication of data among the nodes means that if a node fails, the data is still available on other nodes and can be recovered. Users can tune the replication factor to choose between the overhead and peace of mind. In fact, there are several replication strategies. The simple, rack-unaware strategy places replicas on the next nodes along the ring, while if you are running a bigger deployment across several data centers or availability zones, you can use the rack-aware strategy: the second replica is placed on the first node along the ring that belongs to a different data center than the first, and the remaining N-2 replicas, if any, are placed on the first nodes along the ring in the same rack as the first. This sort of replication means that Cassandra is highly available, and since every node is equal and the failure of any random one does not make the whole system unstable, it is also fault tolerant.

Consistency means that whenever you read from the database, you get the most recently written value. Imagine buying something off eBay while someone else buys the exact same thing at almost the exact same moment. Whoever got their request processed first should be able to buy the item, while the other buyer should be notified that the product is already sold. This again seems like a logical thing to expect from a database, but it comes at a price in distributed databases: all update operations have to be executed synchronously, meaning they must block, locking all replicas until the operation completes, and locking in distributed systems is very expensive.

In many cases you don't actually need this. For example with replies in a forum, it won't hurt anyone if, right after a post is changed, someone views the thread and still sees the unchanged entry. In Cassandra, one can choose per operation whether absolute consistency is needed or not, and tune the value anywhere in between - this is why Cassandra can be called "tuneably consistent". Eventual consistency means that all updates propagate through the system to all replicas in some amount of time. The optimistic approach to replication is to not block the client and to propagate changes in the background, but this means we have to detect and resolve possible conflicts. Whether to do this while writing or while reading the data is an important design choice and dictates whether the system is always readable or always writable - again, in the world of compromises, you can't have it all.

Cassandra chose to be always writable, deferring the complexity of conflict resolution to read operations. By tuneably consistent, it is meant that clients can choose how many replicas to block for during update operations, or to consult during read operations. You could always set this consistency level to the number of replicas, but that would hurt performance and defeat the point of using something like Cassandra in the first place. Setting it below the replication factor means that operations can succeed even if some nodes are down, but you cannot expect to always read the most recently written value - as described earlier, this is not always necessary. The nice thing about Cassandra is that you can still opt for full consistency for the critical operations that require it.
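A useful rule of thumb here: with a replication factor of N, reading from R replicas and writing to W replicas guarantees that reads see the latest value whenever R + W > N, because at least one replica must be in both sets. A tiny check of that:

<?php
// the R + W > N rule of thumb: if the read set and write set must overlap,
// a read is guaranteed to hit at least one replica holding the latest write
function isStronglyConsistent($replicationFactor, $readReplicas, $writeReplicas)
{
    return $readReplicas + $writeReplicas > $replicationFactor;
}

var_dump(isStronglyConsistent(3, 2, 2)); // true: quorum reads and writes overlap
var_dump(isStronglyConsistent(3, 1, 1)); // false: fast, but reads may be stale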

Saturday, April 2, 2011

ACID, what and why?

One of the fundamental features of modern databases is transactions and conforming to the ACID properties. A transaction is a transformation from one consistent state to another that happens virtually in a single step.

ACID is an acronym for:

  • Atomic means all or nothing: when a transaction is executed, all the statements it contains must succeed or none of them can have any effect. There can be no partial failure where some parts succeed and others fail, as that would put the database in an inconsistent state. The common example is a money transfer, where it is obviously very important that if the amount is subtracted on one side, it is added on the other, even if the power goes out somewhere in the middle.
  • Consistent means that the database moves from one consistent state to another, with no possibility that some other client could read the data in some intermediate state.
  • Isolated means that transactions executed concurrently will not clash with each other; each executes in its own space, and if they wish to update the same data, one has to wait for the other to complete.
  • Durable means simply that once a transaction succeeds, the data is not lost: the modified data is available for future queries to read and modify.
These properties are obviously desirable, but in distributed systems they may be hard to enforce without sacrificing performance. In future entries we will learn how Cassandra clients can choose to tune between the different merits of consistency, availability and partition tolerance.

Consistency, availability and partition tolerance compose the CAP theorem, which states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
  • Consistency - all nodes see the same data at all times
  • Availability - node failures do not stop the rest of the system operating
  • Partition tolerance - the system can be split up to run on multiple nodes and can continue to function in spite of node and network failures
Cassandra has chosen availability and partition tolerance (AP), while relational databases are available and consistent (AC), and for example MongoDB, HBase and Google Bigtable chose CP.

Transactions can be implemented across several machines using two-phase commit, but as this locks all affected resources, it makes transactions even more expensive than they already are, which quickly makes it a poor fit for high-performance applications.

In Cassandra, one can choose the balance between consistency and latency but more on this later.

Next we will take a closer look at Cassandra's architecture and design philosophies.

Why consider non-relational databases?

Why would you want to use a non-relational database such as Cassandra in the first place, when we have great relational databases like MySQL that have proven themselves for decades, there is lots of know-how available and learning to be productive with them is rather easy? Relational databases give us great tools like data normalization, consistency, transactions, and complicated joins and queries.

The answer to "what's wrong with relational databases" is really "nothing". They work great for many applications and are simple to use. The problems with relational databases arise when we begin to need more performance out of the database than a normal RDBMS deployment can handle.

Modern web applications often connect millions of people in an interactive way, with lots of communication and data stored and exchanged at any given time, and unfortunately there is no simple way to scale the performance of relational databases horizontally by simply adding more nodes.

When an application becomes popular, performance issues often arise. These are first mitigated by throwing beefier machines at the problem, which can postpone it for some time until there are no more affordable machines that are powerful enough. Then a cache layer is often introduced into the application, if it is not already there; this can reduce the read load significantly but makes the application much more complex.

Read load can be further reduced by creating read-only replicas of the main database, so all reads go to one of the replicas and the main database is only used for writing. This complicates things further and brings problems of its own. Say you update something and redirect the user to a page that shows the updated content. On that page, you read from a replica and perhaps store the new value in a cache that you cleared when updating the record. Replicas lag slightly behind the main database, so what happens if you read from one before the change has had time to propagate? This can happen often, as the redirect is fast. Now you read old data and show it to the user, who may get confused about why the update had no effect. So instead of simply clearing caches when data is updated or deleted, you have to update them, which further complicates everything. Notice that so far we have only been able to improve read performance - all the writes still happen on the main database.

Now you've implemented caching and read replicas and optimized all the queries, but your site keeps getting more popular, so it's still not enough. You might now partition the database by functionality, for example by moving search, comments or any other loosely coupled part to another machine, but you can only do this so many times. I've been down this exact road myself and it's not a pretty ride.

At this point, you need a solution that really scales horizontally, meaning you can easily add or remove machines handling the data as your needs change. There are a couple of ways to do this, the most popular being sharding your database, so that different rows of the same tables live on different machines, or using a non-relational database. Sharding is actually very powerful, and when designing a truly scalable architecture, I would try to mix both sharding and non-relational databases to get the best of both worlds.

Friday, April 1, 2011

Foreword

My name is Priit Kallas and I'm an Information Technology student at Tartu Ülikool, Estonia, with a passion for PHP and, lately, web application performance and non-relational databases. I work with PHP for a living and have been actively using and learning it since the days of PHP3. I'm also a Zend PHP 5.3 Certified Engineer and in no way associated with the authors of Cassandra. Everything in this blog comes out of my own experience, reading other articles and the book "Cassandra: The Definitive Guide" by Eben Hewitt.

This blog is for other people enthusiastic about high-performance web applications, PHP and non-relational databases who are looking for a place to get started. It covers the concepts of the technology along with practical examples of how to use it in your own applications. It's best read in succession from the beginning, as each post builds slightly on the previous ones.