Bitcask - large keydirs
aphyr at aphyr.com
Thu Mar 10 18:36:34 EST 2011
TLDR: hey, what about using extendible hashing for bitcask keydirs?
Constant-time lookups with two disk seeks end-to-end, much larger
keyspaces than currently supportable, but without the total rehashing
cost. Also avoids the O(log N) insertion/search/deletion costs of b-trees.
I've been thinking a lot recently about how to do quick lookups of keys
where the space is much larger than memory--say, a few billion 32-byte
keys. Similarly, bitcask is going to need to store more keys than can
fit in an in-memory hashtable at some point.
One possibility is constructing bytewise (or multi-byte-wise) tries from
the keys. These have the advantage of being orderable (hmm, range
queries? faster bucket listing?), reasonably short, and supporting
log(n) operations. You could cache the initial levels of the trie in
memory and drop to disk for the leaves. An adaptive caching algorithm
could also be used to maintain frequently accessed leaf nodes in memory.
(the FS cache may actually provide acceptable results as well). It also
takes advantage of the relatively low entropy of most Riak keys, and
similar keys could be fast to access if they reside in nearby pages.
The major disadvantage is that trees can involve a lot of O(log N) churn
for insertions, which... theoretically... sucks on disk. Obviously there
are ways to make it perform well because ReiserFS and most DB indexes
make use of them, but... maybe there are alternatives.
Ideally we want constant time operations, but hash tables usually come
with awkward rehashing periods or insane space requirements. O(N)
rehashing can block other operations, which blows latency through the
roof when disks are involved. Not a good property for a k/v store.
So I started doodling some hybrid tree-hash structures, browsing through
NIST's datastructures list, and lo and behold, there is actually a
structure which combines some of the advantages of tries but behaves
well for disk media!
You store values on disk in buckets which are small multiples of the
page size. Finding a value involves choosing the bucket, reading the
bucket from disk, and a linear search for the value.
To choose the bucket you use a hash table which specifies the on-disk
address of the right bucket for your value. Here's the catch: you keep
the index table small. In fact, it's a bitwise trie (or, even faster, a
flattened hashtable) of the least significant bits of the hash of the
key. As buckets fill up, you split them in half and (possibly) increase
the depth of the index. Hence growth/shrinking is incremental and only
operates on one bucket at a time.
In the case of bitcask, where values can be variable length, it probably
makes sense to store the file ID/offset in the bucket, and take the hit
of a second seek to support faster/more predictable searching over each
The downside is that this is still an in-memory hash table and can only
store an additional (values/bucket). Perhaps dropping the index to disk
as well and taking advantage of the FS cache over it could work?
More information about the riak-users