Storing large collections in Riak (or any distributed store)

Jeremiah Peschka jeremiah.peschka at gmail.com
Wed Feb 9 13:12:33 EST 2011


Incidentally, this is also how flickr handles writes - when you upload a photo it gets written to wherever your other photos go. When someone tags it or adds it to a group, it gets copied into that group.

Unless, of course, it's all changed since the last time I looked for information about how flickr actually works.

-- 
Jeremiah Peschka
Sent with Sparrow
On Wednesday, February 9, 2011 at 10:09 AM, Alexander Sicular wrote: 
> The only way this is functional is if you implement a uniformly random
> hash function so that you know which key any given address will hash
> to. Separately, churn will eat you up if you constantly need to take
> addys out of your keys. Also, as mentioned elsewhere, Riak links won't
> work at these numbers.
> 
> Check out more tech slides/blogs on how twitter does this. Basically
> double/reverse/reciprocal look up lists with recipient notification.
> When @aplusk tweets twitter does like 4million (or however many
> followers he has) writes. I recommend @rk's qcon sf 2010 talk on nosql
> at twitter.
> 
> Best, alexander
> 
> On 2011-02-09, Scott Lystig Fritchie <fritchie at snookles.com> wrote:
> > Nathan Sobo <nsobo at pivotallabs.com> wrote:
> > 
> > ns> Is a key-value store actually inappropriate for this problem?
> > 
> > No. One way to do it is to use a single KV key to store multiple
> > addresses worth of info. Pick a relatively big number, 50K
> > subscribers/key, though it may vary. Use a key naming scheme so that
> > you can pre-calculate all keys for a given list, e.g. bucket =
> > list-subscribers, key = name + range index #, or perhaps list name
> > + start-of-hash-range + end-of-hash-range.
> > 
> > How do you know the range index # or start & ends of range? One method
> > would be hashing, MD5 or SHA1 or whatever. If you store all addresses
> > for a list with a fixed number of hash hunks, e.g. 100, then each hash
> > hunk will have roughly 20K entries for a 2M subscriber list. To find
> > all subscribers, fetch 100 known keys.
> > 
> > If you want to keep addresses in sorted order, it's more work but also
> > doable. A naive plan is to make your hash function F(addr) = first
> > letter of 'addr'. Keys get clumpy that way, but only slightly more
> > creativity can get around it.
> > 
> > To find a particular subscriber, hash that subscriber's address and
> > fetch 1 key. You're also getting a lot of uninteresting data with that
> > key's value, but if event is uncommon, it's not a problem. (If that
> > event actually is common, consider moving the commonly-queried data
> > elsewhere. Or duplicate that info in another smaller key somewhere
> > else.) Similar logic for list maintenance events (add subscriber,
> > delete subscriber).
> > 
> > -Scott
> > 
> > _______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> -- 
> Sent from my mobile device
> 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20110209/d662c010/attachment.html>


More information about the riak-users mailing list