Generating key for an object

Kresten Krab Thorup krab at
Fri Apr 8 06:57:44 EDT 2011

We just had an enlightening experience, that I'd like to share regarding "choosing keys" ...

In one of the systems we're currently prototyping, we're moving a searchable audit log to Riak; and we want to size it to handle a sustained throughput of 500 writes/second (N=3), keeping the data around for 5 years (thats ~45 billion keys x N=3).  The problem is that with this rate, there is no chance that we can keep all the keys in memory, and so bitcask was not really an option.  Well, yes, ... bitcask could handle it with by just adding extra machines, but it is also OK if GETs are relatively slow.

Enter innostore.  As opposed to bitcask, it's slower, but lets you define how much memory to allocate for the in-memory representation of key/values, and lets us put much more data onto our hardware.

We were using 160-bit random keys (easily safe enough to not run into the "birthday problem"), but throughput was not good enough ( fell dramatically to ~200/sec on our 4-node macmini dev cluster, when we ran out of memory after only ~10M key/values ).

Innostore uses a B-tree, and we realized that it was really suffering from the random keys, because it then needs to do I/O on random nodes of the B-tree.

So we changed the keys to be    "<<timestamp>>:<<random-bits>>"   i.e., such that successive writes have keys that are lexicographically close.  The random bits are there to make the chance of conflict small enough.   

Using such keys cause the underlying B-tree to only writes to a few nodes at a time, and ideally innostore only needs to keep tree-nodes in memory corresponding to a path from the root of the tree to the node currently being added to.

Now our little play-cluster with 4 mac mini's can handle sustained 2000 writes/sec, or 10x throughput of our first setup.  All ... just by choosing some better keys.



More information about the riak-users mailing list