Storing large collections in Riak (or any distributed store)

Alexander Sicular siculars at
Wed Feb 9 13:19:02 EST 2011

Not surprising, re. Flickr. Don't be too clever when disk is cheap and  
only getting cheaper. Remember, we in nosql land where denormilization  
is ... the norm.

@siculars on twitter

Sent from my iPhone

On Feb 9, 2011, at 13:12, Jeremiah Peschka  
<jeremiah.peschka at> wrote:

> Incidentally, this is also how flickr handles writes - when you  
> upload a photo it gets written to wherever your other photos go.  
> When someone tags it or adds it to a group, it gets copied into that  
> group.
> Unless, of course, it's all changed since the last time I looked for  
> information about how flickr actually works.
> -- 
> Jeremiah Peschka
> Sent with Sparrow
> On Wednesday, February 9, 2011 at 10:09 AM, Alexander Sicular wrote:
>> The only way this is functional is if you implement a uniformly  
>> random
>> hash function so that you know which key any given address will hash
>> to. Separately, churn will eat you up if you constantly need to take
>> addys out of your keys. Also, as mentioned elsewhere, Riak links  
>> won't
>> work at these numbers.
>> Check out more tech slides/blogs on how twitter does this. Basically
>> double/reverse/reciprocal look up lists with recipient notification.
>> When @aplusk tweets twitter does like 4million (or however many
>> followers he has) writes. I recommend @rk's qcon sf 2010 talk on  
>> nosql
>> at twitter.
>> Best, alexander
>> On 2011-02-09, Scott Lystig Fritchie <fritchie at> wrote:
>>> Nathan Sobo <nsobo at> wrote:
>>> ns> Is a key-value store actually inappropriate for this problem?
>>> No. One way to do it is to use a single KV key to store multiple
>>> addresses worth of info. Pick a relatively big number, 50K
>>> subscribers/key, though it may vary. Use a key naming scheme so that
>>> you can pre-calculate all keys for a given list, e.g. bucket =
>>> list-subscribers, key = name + range index #, or perhaps list name
>>> + start-of-hash-range + end-of-hash-range.
>>> How do you know the range index # or start & ends of range? One  
>>> method
>>> would be hashing, MD5 or SHA1 or whatever. If you store all  
>>> addresses
>>> for a list with a fixed number of hash hunks, e.g. 100, then each  
>>> hash
>>> hunk will have roughly 20K entries for a 2M subscriber list. To find
>>> all subscribers, fetch 100 known keys.
>>> If you want to keep addresses in sorted order, it's more work but  
>>> also
>>> doable. A naive plan is to make your hash function F(addr) = first
>>> letter of 'addr'. Keys get clumpy that way, but only slightly more
>>> creativity can get around it.
>>> To find a particular subscriber, hash that subscriber's address and
>>> fetch 1 key. You're also getting a lot of uninteresting data with  
>>> that
>>> key's value, but if event is uncommon, it's not a problem. (If that
>>> event actually is common, consider moving the commonly-queried data
>>> elsewhere. Or duplicate that info in another smaller key somewhere
>>> else.) Similar logic for list maintenance events (add subscriber,
>>> delete subscriber).
>>> -Scott
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users at
>> -- 
>> Sent from my mobile device
>> _______________________________________________
>> riak-users mailing list
>> riak-users at
> _______________________________________________
> riak-users mailing list
> riak-users at
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <>

More information about the riak-users mailing list