This is a howto for consistent hashing of keys in key-value stores, with a focus on cache servers. We have code to share, in the form of a set of patches to ketama on github. At Wayfair, we have been using this code, or slightly less evolved versions of it, with our production Redis servers for a long while, and our production Memcached servers for a short while. It is now mature enough that we can release the code and describe our results. The code is cache-server agnostic: there is nothing specific to Memcached or Redis in it, and it could be used with other servers. The internal project that led to this was one of those little efforts intended to smooth over some blips in our fault log, but it has now produced a very useful set of building blocks for scalable systems.
Before I get started, I have three public messages, for three audiences:
- To the non-free key-value-store salesmen of the world: one of my main purposes here is to continue to be able to use free key-value stores, as my team helps to scale Wayfair up into larger and more numerous data centers. This relationship that you keep trying to initiate is just not going to work. Please stop calling me.
- To Brad Fitzpatrick and the other authors of Memcached: Memcached, you’re still the one, after all these years! You rock the low-latency performance like nobody else.
- To the author of Redis, with gratitude I say this: I’m sure Redis cluster is going to be an awesome thing for a lot of users. But Salvatore, don’t sweat it too hard. We’ve got you covered, with this technique, for some scaling scenarios that people can go very far with. As long as plain old Redis is fast and awesome, we won’t be needing Redis cluster in order to go wide and large with Redis.
Consistent hashing is an idea that has been around for a while. It solves a common problem with sharded cache server clusters, which can be disrupted when a single node fails and sets off a festival of rehashing, key shuffling and cache-filling database calls. The history of the idea and its implementations is roughly this: Danny Lewin, an Akamai founder, thought of it. (I won’t dwell on the details, but if you don’t know the story, look it up: in addition to being a hero of applied mathematics for this invention, he is a hero, full stop, for what he tried to do just before he was killed during 9/11.) A group of MIT professors and students including him wrote a paper in 1997 (Karger, Leighton, Lewin et al.), and Akamai has an in-house implementation. Various people, at various times, have written versions in Perl, Python, PHP, and other languages. Richard Jones, then CTO of Last.fm, wrote an open source C library called ‘ketama’ in 2007, with wrappers or ports in a few languages. Akamai alumni at Basho have incorporated an Erlang version of the idea into Riak. Jeremy Zawodny and others have written excellent blog posts describing configuration niceties of the way the idea is used at Craigslist and other places. I have put a list of links at the end of this post. Of the implementations we looked at, ketama has the broadest language/platform coverage. We started using it early in our process of design and prototyping, and although we have found reasons to patch it, we never felt compelled to switch to something else or write an alternative from scratch. My thanks go to Peter Newell, now of Srsly Software, for finding it, and for convincing me that it was going to work. He did that by making sure you could write the keys from client code in any one of the supported languages, and read them from any of the other languages, and be sure they would always be on the right servers.
I’m not sure about the other implementations, but the Riak and ketama versions strictly observe the original Akamai design of scattering a large number of points (160 of them) around a unit circle, a new set for each shard. Ketama calls this circle the ‘continuum’. I call it the ‘ring’. This is done so that when a node disappears, the keys from the vanished shard are distributed to the other members of the ring in even quantities. If you have n shards, with m as the maximum amount of memory used on any one of them, you only need m/(n-1) extra space on each shard to hold all the extra keys for a single node failure, m/(n-2) to support 2 node failures, etc. If you were to divide the key space naively as with a pizza cutter, with one slice per shard, you would need 2m on each shard to support a single node failure, 3m to support 2 failures (worst case, when the nodes are contiguous in the ring), etc.
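The mechanics above can be sketched in a few lines of Python. This is an illustration of the idea, not ketama’s actual code or wire format: the point count matches the 160 described above, but the shard names and the 4-byte slice of MD5 are assumptions made for the sketch.

```python
import bisect
import hashlib

POINTS_PER_SHARD = 160  # the per-shard point count described above

def _hash(value: str) -> int:
    # ketama derives ring positions from MD5; here we take the first
    # 4 bytes of the digest as a position on a 32-bit circle.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")

class Ring:
    """A consistent-hash ring: many points per shard on one circle."""

    def __init__(self, shards):
        self._points = []  # sorted list of (position, shard)
        for shard in shards:
            self.add(shard)

    def add(self, shard):
        for i in range(POINTS_PER_SHARD):
            bisect.insort(self._points, (_hash(f"{shard}-{i}"), shard))

    def remove(self, shard):
        self._points = [p for p in self._points if p[1] != shard]

    def get(self, key):
        # A key lands on the first point clockwise from its own hash,
        # wrapping around at the top of the circle.
        i = bisect.bisect(self._points, (_hash(key), ""))
        return self._points[i % len(self._points)][1]
```

The payoff is exactly the redistribution property described above: when a shard is removed, only the keys that lived on it move, and they spread roughly evenly over the survivors, because the survivors’ points stay where they were.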
What’s up with the name ‘ketama’? A Google search tells us that ‘ketama’ can be a type of marijuana or a flamenco band. Last.fm is a music site, so the flamenco interpretation is tempting. On the other hand ‘hashing’ and ‘hash’ make an obvious play on words. A quick look at the code removes any ambiguity. There are functions in the library called ‘ketama_roll’ (create) and ‘ketama_smoke’ (destroy). Draw your own conclusions about people in the music business, their sly senses of humor, and what they might be smoking at parties. Not to mention that there were a couple of half-baked things in the code. Heh. Absent our patches it used to read from the configuration files too often, didn’t use shared memory in exactly the way we wanted, and could not modify its configuration on the fly. We’re actually not making a pull request yet, because we have a not-quite-fully-tested version that eliminates shared memory from the system altogether. But these are quibbles, about problems we were easily able to solve, or to work around with wrapper code up the stack. Mad props, and many thanks, to Mr. Jones for his excellent library.
Pre-existing state at Wayfair
Before implementing the consistent hash, we had been using Memcached at Wayfair in a very simple configuration: twin large-memory machines, one master, one slave. Why not a sharded system? Simple. We could afford the big boxes, and we hadn’t ever had a very large key space. But we knew that we would eventually come to a maximum height for those twin mounds of memory, and that we would want to move to a larger number of cache servers that we could scale horizontally. As it says in all the Memcached documentation: if you need more space, just add more nodes! So we decided to get ahead of the problem.
Why both Memcached and Redis, you may ask? The short answer is that we’re in a period of experimenting with key-value stores for various purposes. For now, we’re using Memcached for completely transient in-memory data, where every cache key has a lazy-initialization path. We use Redis for in-memory data that is backed by disk storage, where each Redis node replicates to a slave, which writes to the disk. The master does not write to disk, and when failover occurs to the slave, they switch roles. This is typically, but not exclusively, for values that do not change very often, and are populated by a batch process of some kind. If there’s no lazy-initialization path for a key, it’s going in Redis, at least for now.
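In redis.conf terms, the persistence split described above might look something like this. The hostname and save intervals are illustrative, not our production values; `slaveof` was the replication directive in Redis of that era:

```
# master: serves the ring, keeps everything in memory, never touches disk
save ""
appendonly no

# slave (a separate redis.conf): replicates from the master and
# snapshots to disk, ready to be promoted on failover
slaveof cache-master.example.com 6379
save 900 1
save 300 10
```

On failover the roles switch: the promoted node stops persisting, and its new slave takes over the disk writes.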
How should the clients find and talk to the nodes of the cluster? Put the list of nodes in a configuration file, and you’re all set, right? Heh. After looking at the literature and giving the matter some thought, we came up with three candidate architectures.
- 3-tiered design. All the clients talk to a central dispatcher (with a SPOF-avoiding twin, naturally), which keeps track of which nodes are in the cluster, probably with some sort of monitor background thread that updates an in-memory configuration. The dispatcher helps with race conditions and ‘thundering herd’ problems in failover situations, and reduces the number of connections among different servers.
- Fully distributed configuration. The clients are occasionally updated with fresh lists of all the nodes in the cluster, and they are set up to try a progression of candidates if their default choice is unavailable. This is also good for thundering herds and race conditions. However, it might lead to a lot of running through lists of possible shard locations under stress. This seems wasteful, but is perhaps not disastrous.
- Distributed configuration, but with an external nanny, or let’s say an ‘animal trainer’. In this scenario the clients would check a primary shard location, possibly with a locally configured failover-to-secondary option. But the overall system would depend on an independent monitor of the members of the ring. Zookeeper is good at this kind of thing, so we started experimenting with it in this role.
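Whichever of these wins, the “progression of candidates” behavior from the second option is easy to picture: hash the key, then walk the ring clockwise, trying each distinct shard until one answers. A hypothetical Python sketch, where `is_alive` stands in for a real connection attempt:

```python
import bisect
import hashlib

def candidates(points, key):
    """Yield distinct shards clockwise from the key's hash position.

    `points` is a sorted list of (position, shard) tuples, one entry
    per point on the ring, as a consistent-hash client would hold.
    """
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    start = bisect.bisect(points, (h, ""))
    seen = set()
    for j in range(len(points)):
        shard = points[(start + j) % len(points)][1]
        if shard not in seen:
            seen.add(shard)
            yield shard

def pick_node(points, key, is_alive):
    # Try the key's home shard first, then each fallback in ring order.
    for shard in candidates(points, key):
        if is_alive(shard):
            return shard
    raise RuntimeError("no cache node reachable")
```

The appeal of this scheme is that every client, given the same node list, computes the same fallback order, so keys written during an outage land in a predictable place. The worry noted above is the cost of running through that list on every request while a node is down.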
With the persistent cache, you can’t simply let a failed node’s keys repopulate lazily from the source, because there is no lazy-initialization path for the cached items. We configure our persistent cache ring with the consistent hash for simplicity’s sake, and so we can scale all this out horizontally as far as we like. But if an arc disappears, the failover is to a waiting hot spare slave backup, which can be hustled into service quickly by Zookeeper, whose agents give the IP address of the shard to the slave node, and turn it into the master.
The first operating system on which we started to use ketama was FreeBSD. The library wouldn’t compile, so I wrote some autotools scripts, and hacked the code to get it to build on Mac OS X, Solaris, FreeBSD, and Linux. Ketama uses MD5 as its hashing algorithm by default, but we discovered a patch file already in the tarball that implemented a faster algorithm, FNV-1a. So Steve and Viktoras from our team merged all those changes together, ran the resulting system through Valgrind to look at memory usage, and made some improvements. They also changed the way the code loads its configuration, to make it easier to change the list of servers on the fly.
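For reference, this is the standard 32-bit FNV-1a algorithm in Python (the shape of the hash that patch swaps in, not a copy of the patch’s code). It replaces MD5’s block-based digest with one xor and one multiply per byte, which is where the speedup comes from:

```python
# Standard FNV-1a constants for the 32-bit variant.
FNV_OFFSET_BASIS = 0x811C9DC5
FNV_PRIME = 0x01000193

def fnv1a_32(data: bytes) -> int:
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte                       # xor first, then multiply:
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # that order is what makes it "1a"
    return h
```

Either hash works for placing points on the ring; the only hard requirement, as noted earlier, is that every client library in every language computes the same one, or keys will land on different servers depending on who wrote them.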
In the meantime we switched our servers from FreeBSD to Linux. Thanks anyway, FreeBSD! Your strict, old-school C compiler and standard library made it a lot easier to see the need for some improvements that will help us even on Linux, which had been allowing the looser code to run.
Rolling this out was quite a thing. We handed the project off to Andrii from Site Reliability Engineering, and Aaron, from our Storefront team, to tame the animal trainer, as it were, write a Zookeeper script, and refactor the cache client wrapper code until it behaved as desired. Here is a picture of a test we ran, of the spiffy new 14-node cluster in our Massachusetts data center (it has a twin in Washington state, and more coming soon). In these graphs, we’re counting requests to the different nodes of the cluster. The gaps are periods where we forced a particular node, or more than one, offline. You can see that traffic to the other nodes picks up when that happens. We set Zookeeper to fail over upon 5 consecutive failed health checks, so it won’t be too twitchy. The failover takes 5-7 seconds, and during that time, it’s not as if the requests are failing; they are just going back to the source, or origin, as the content delivery network people would say.
Happy hashing with this, if it fits the size and performance requirements of the cache that you have. And if it doesn’t, and you’re willing to use some even more involved techniques, you can always scale Memcached to Facebook scale! The Facebookies describe all that in this blog post, which has a link to a Usenix paper they wrote.
Here’s a little bibliography of consistent hashing. It includes the original research paper, some blog posts about its use at various sites, and links to a bunch of code:
Lecture: Consistent Hashing, Danny Lewin, and the Creation of Akamai, by Tom Leighton, Akamai founder.