Storing large collections in Riak (or any distributed store)

Alexander Sicular siculars at gmail.com
Wed Feb 9 13:09:02 EST 2011


The only way this is functional is if you implement a uniformly random
hash function so that you know which key any given address will hash
to. Separately, churn will eat you up if you constantly need to take
addys out of your keys. Also, as mentioned elsewhere, Riak links won't
work at these numbers.

Check out more tech slides/blogs on how twitter does this. Basically
double/reverse/reciprocal look up lists with recipient notification.
When @aplusk tweets twitter does like 4million (or however many
followers he has) writes. I recommend @rk's qcon sf 2010 talk on nosql
at twitter.

Best, alexander

On 2011-02-09, Scott Lystig Fritchie <fritchie at snookles.com> wrote:
> Nathan Sobo <nsobo at pivotallabs.com> wrote:
>
> ns> Is a key-value store actually inappropriate for this problem?
>
> No.  One way to do it is to use a single KV key to store multiple
> addresses worth of info.  Pick a relatively big number, 50K
> subscribers/key, though it may vary.  Use a key naming scheme so that
> you can pre-calculate all keys for a given list, e.g. bucket =
> list-subscribers, key = name + range index #, or perhaps list name
> + start-of-hash-range + end-of-hash-range.
>
> How do you know the range index # or start & ends of range?  One method
> would be hashing, MD5 or SHA1 or whatever.  If you store all addresses
> for a list with a fixed number of hash hunks, e.g. 100, then each hash
> hunk will have roughly 20K entries for a 2M subscriber list.  To find
> all subscribers, fetch 100 known keys.
>
> If you want to keep addresses in sorted order, it's more work but also
> doable.  A naive plan is to make your hash function F(addr) = first
> letter of 'addr'.  Keys get clumpy that way, but only slightly more
> creativity can get around it.
>
> To find a particular subscriber, hash that subscriber's address and
> fetch 1 key.  You're also getting a lot of uninteresting data with that
> key's value, but if event is uncommon, it's not a problem.  (If that
> event actually is common, consider moving the commonly-queried data
> elsewhere.  Or duplicate that info in another smaller key somewhere
> else.)  Similar logic for list maintenance events (add subscriber,
> delete subscriber).
>
> -Scott
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>

-- 
Sent from my mobile device




More information about the riak-users mailing list