Is it inefficient to map over a small bucket when you have millions of other buckets?
deinspanjer at mozilla.com
Sun Jul 11 12:56:16 EDT 2010
I'm thinking about the pros and cons of Riak vs HBase for Mozilla's
Weave (now Firefox Sync) 2.0 engine.
The primary use case is that when a user's client performs a sync, it
needs to retrieve all the new items since the last time it synced for
each collection (bookmarks, tabs, history, etc.) that the client is
configured to sync.
If a particular client doesn't sync often, there might be
thousands of items to retrieve, which means that using
links *might* run into issues.
HBase's use of ordered keys pushes for a schema where you'd have the
modified timestamp in the key. That would allow for quick and easy
scanning of just the new items.
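To make that concrete, here's a small sketch of the idea (plain Python standing in for an HBase scan; the `user/collection/timestamp` key layout is just an illustration, not a settled schema): with lexicographically ordered keys, everything a client needs since its last sync falls in one contiguous key range.

```python
from bisect import bisect_left

# Hypothetical composite row key: "<user>/<collection>/<zero-padded modified-ts>".
# Zero-padding keeps numeric timestamp order and lexicographic order in sync.
def make_key(user, collection, ts):
    return "%s/%s/%010d" % (user, collection, ts)

def scan_since(sorted_keys, user, collection, last_sync_ts):
    """Return keys for items modified strictly after last_sync_ts,
    the way an ordered-key store can with a start/stop row scan."""
    start = make_key(user, collection, last_sync_ts + 1)
    stop = make_key(user, collection, 9999999999)
    lo = bisect_left(sorted_keys, start)  # seek directly to the range start
    return [k for k in sorted_keys[lo:] if k <= stop]

keys = sorted([
    make_key("alice", "bookmarks", 100),
    make_key("alice", "bookmarks", 250),
    make_key("alice", "history", 300),
    make_key("bob", "bookmarks", 200),
])
print(scan_since(keys, "alice", "bookmarks", 150))
# -> ['alice/bookmarks/0000000250']
```

The point is that the cost of the scan is proportional to the number of new items for that one user/collection, not to the total keyspace.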
Riak, however, has a few interesting features, such as the on-demand
creation of new buckets, that might make it much more flexible... if
there is a highly performant mechanism for the client to retrieve new data.
What prompted me to post this message was something I thought I
remembered seeing regarding mapping over buckets in Riak. Unfortunately
I can't find the reference now.
Is it true that in order to map over all the keys in a single bucket,
the Riak cluster must actually traverse the entire global keyspace of
all buckets to find the keys that are part of the desired bucket?
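For concreteness, the kind of job I mean is a whole-bucket MapReduce over the HTTP interface, something like the following (the bucket name is illustrative; `Riak.mapValuesJson` is one of the built-in JavaScript map functions):

```
POST /mapred
Content-Type: application/json

{"inputs": "user-12345-bookmarks",
 "query": [{"map": {"language": "javascript",
                    "name": "Riak.mapValuesJson",
                    "keep": true}}]}
```

My question is about what listing the inputs costs when `"inputs"` is a bare bucket name rather than an explicit list of bucket/key pairs.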
In the case where you have tens of millions of users, and you have
either one bucket per user or (if it were feasible) one bucket per user
per collection, it seems like it would be impossible to efficiently
perform a map reduce on one user's bucket.
That seems like such a common scenario that I must have misinterpreted
what I read. I'd really appreciate some clarification there and also would
be very interested in any schema proposals or thoughts you might have
about this use case.
More information about the riak-users mailing list