Bitcask - large keydirs

Aphyr aphyr at
Thu Mar 10 18:36:34 EST 2011

TLDR: hey, what about using extendible hashing for bitcask keydirs? 
Constant-time lookups with two disk seeks end-to-end, much larger 
keyspaces than currently supportable, but without the total rehashing 
cost. Also avoids the O(log N) insertion/search/deletion costs of b-trees.

At length:

I've been thinking a lot recently about how to do quick lookups of keys 
where the space is much larger than memory--say, a few billion 32-byte 
keys. Similarly, bitcask is going to need to store more keys than can 
fit in an in-memory hashtable at some point.

One possibility is constructing bytewise (or multi-byte-wise) tries from 
the keys. These have the advantage of being orderable (hmm, range 
queries? faster bucket listing?), reasonably short, and supporting 
log(n) operations. You could cache the initial levels of the trie in 
memory and drop to disk for the leaves. An adaptive caching algorithm 
could also be used to maintain frequently accessed leaf nodes in memory. 
(the FS cache may actually provide acceptable results as well). It also 
takes advantage of the relatively low entropy of most Riak keys, and 
similar keys could be fast to access if they reside in nearby pages.

The major disadvantage is that trees can involve a lot of O(log N) churn 
for insertions, which... theoretically... sucks on disk. Obviously there 
are ways to make it perform well because ReiserFS and most DB indexes 
make use of them, but... maybe there are alternatives.

Ideally we want constant time operations, but hash tables usually come 
with awkward rehashing periods or insane space requirements. O(N) 
rehashing can block other operations, which blows latency through the 
roof when disks are involved. Not a good property for a k/v store.

So I started doodling some hybrid tree-hash structures, browsing through 
NIST's datastructures list, and lo and behold, there is actually a 
structure which combines some of the advantages of tries but behaves 
well for disk media!

You store values on disk in buckets which are small multiples of the 
page size. Finding a value involves choosing the bucket, reading the 
bucket from disk, and a linear search for the value.

To choose the bucket you use a hash table which specifies the on-disk 
address of the right bucket for your value. Here's the catch: you keep 
the index table small. In fact, it's a bitwise trie (or, even faster, a 
flattened hashtable) of the least significant bits of the hash of the 
key. As buckets fill up, you split them in half and (possibly) increase 
the depth of the index. Hence growth/shrinking is incremental and only 
operates on one bucket at a time.

In the case of bitcask, where values can be variable length, it probably 
makes sense to store the file ID/offset in the bucket, and take the hit 
of a second seek to support faster/more predictable searching over each 

The downside is that this is still an in-memory hash table and can only 
store an additional (values/bucket). Perhaps dropping the index to disk 
as well and taking advantage of the FS cache over it could work?

--Kyle Kingsbury

More information about the riak-users mailing list