Storing large collections in Riak (or any distributed store)
siculars at gmail.com
Wed Feb 9 13:19:02 EST 2011
Not surprising, re. Flickr. Don't be too clever when disk is cheap and
only getting cheaper. Remember, we in nosql land where denormilization
is ... the norm.
@siculars on twitter
Sent from my iPhone
On Feb 9, 2011, at 13:12, Jeremiah Peschka
<jeremiah.peschka at gmail.com> wrote:
> Incidentally, this is also how flickr handles writes - when you
> upload a photo it gets written to wherever your other photos go.
> When someone tags it or adds it to a group, it gets copied into that
> Unless, of course, it's all changed since the last time I looked for
> information about how flickr actually works.
> Jeremiah Peschka
> Sent with Sparrow
> On Wednesday, February 9, 2011 at 10:09 AM, Alexander Sicular wrote:
>> The only way this is functional is if you implement a uniformly
>> hash function so that you know which key any given address will hash
>> to. Separately, churn will eat you up if you constantly need to take
>> addys out of your keys. Also, as mentioned elsewhere, Riak links
>> work at these numbers.
>> Check out more tech slides/blogs on how twitter does this. Basically
>> double/reverse/reciprocal look up lists with recipient notification.
>> When @aplusk tweets twitter does like 4million (or however many
>> followers he has) writes. I recommend @rk's qcon sf 2010 talk on
>> at twitter.
>> Best, alexander
>> On 2011-02-09, Scott Lystig Fritchie <fritchie at snookles.com> wrote:
>>> Nathan Sobo <nsobo at pivotallabs.com> wrote:
>>> ns> Is a key-value store actually inappropriate for this problem?
>>> No. One way to do it is to use a single KV key to store multiple
>>> addresses worth of info. Pick a relatively big number, 50K
>>> subscribers/key, though it may vary. Use a key naming scheme so that
>>> you can pre-calculate all keys for a given list, e.g. bucket =
>>> list-subscribers, key = name + range index #, or perhaps list name
>>> + start-of-hash-range + end-of-hash-range.
>>> How do you know the range index # or start & ends of range? One
>>> would be hashing, MD5 or SHA1 or whatever. If you store all
>>> for a list with a fixed number of hash hunks, e.g. 100, then each
>>> hunk will have roughly 20K entries for a 2M subscriber list. To find
>>> all subscribers, fetch 100 known keys.
>>> If you want to keep addresses in sorted order, it's more work but
>>> doable. A naive plan is to make your hash function F(addr) = first
>>> letter of 'addr'. Keys get clumpy that way, but only slightly more
>>> creativity can get around it.
>>> To find a particular subscriber, hash that subscriber's address and
>>> fetch 1 key. You're also getting a lot of uninteresting data with
>>> key's value, but if event is uncommon, it's not a problem. (If that
>>> event actually is common, consider moving the commonly-queried data
>>> elsewhere. Or duplicate that info in another smaller key somewhere
>>> else.) Similar logic for list maintenance events (add subscriber,
>>> delete subscriber).
>>> riak-users mailing list
>>> riak-users at lists.basho.com
>> Sent from my mobile device
>> riak-users mailing list
>> riak-users at lists.basho.com
> riak-users mailing list
> riak-users at lists.basho.com
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the riak-users