Storing large collections in Riak (or any distributed store)
Scott Lystig Fritchie
fritchie at snookles.com
Wed Feb 9 12:28:48 EST 2011
Nathan Sobo <nsobo at pivotallabs.com> wrote:
ns> Is a key-value store actually inappropriate for this problem?
No. One way to do it is to use a single KV key to store multiple
addresses worth of info. Pick a relatively big number, 50K
subscribers/key, though it may vary. Use a key naming scheme so that
you can pre-calculate all keys for a given list, e.g. bucket =
list-subscribers, key = name + range index #, or perhaps list name
+ start-of-hash-range + end-of-hash-range.
How do you know the range index # or start & ends of range? One method
would be hashing, MD5 or SHA1 or whatever. If you store all addresses
for a list with a fixed number of hash hunks, e.g. 100, then each hash
hunk will have roughly 20K entries for a 2M subscriber list. To find
all subscribers, fetch 100 known keys.
If you want to keep addresses in sorted order, it's more work but also
doable. A naive plan is to make your hash function F(addr) = first
letter of 'addr'. Keys get clumpy that way, but only slightly more
creativity can get around it.
To find a particular subscriber, hash that subscriber's address and
fetch 1 key. You're also getting a lot of uninteresting data with that
key's value, but if event is uncommon, it's not a problem. (If that
event actually is common, consider moving the commonly-queried data
elsewhere. Or duplicate that info in another smaller key somewhere
else.) Similar logic for list maintenance events (add subscriber,
More information about the riak-users