Storing large collections in Riak (or any distributed store)

Scott Lystig Fritchie fritchie at snookles.com
Wed Feb 9 12:28:48 EST 2011


Nathan Sobo <nsobo at pivotallabs.com> wrote:

ns> Is a key-value store actually inappropriate for this problem?

No.  One way to do it is to use a single KV key to store multiple
addresses worth of info.  Pick a relatively big number, 50K
subscribers/key, though it may vary.  Use a key naming scheme so that
you can pre-calculate all keys for a given list, e.g. bucket =
list-subscribers, key = name + range index #, or perhaps list name
+ start-of-hash-range + end-of-hash-range.

How do you know the range index # or start & ends of range?  One method
would be hashing, MD5 or SHA1 or whatever.  If you store all addresses
for a list with a fixed number of hash hunks, e.g. 100, then each hash
hunk will have roughly 20K entries for a 2M subscriber list.  To find
all subscribers, fetch 100 known keys.  

If you want to keep addresses in sorted order, it's more work but also
doable.  A naive plan is to make your hash function F(addr) = first
letter of 'addr'.  Keys get clumpy that way, but only slightly more
creativity can get around it.

To find a particular subscriber, hash that subscriber's address and
fetch 1 key.  You're also getting a lot of uninteresting data with that
key's value, but if event is uncommon, it's not a problem.  (If that
event actually is common, consider moving the commonly-queried data
elsewhere.  Or duplicate that info in another smaller key somewhere
else.)  Similar logic for list maintenance events (add subscriber,
delete subscriber).

-Scott




More information about the riak-users mailing list