Is it inefficient to map over a small bucket when you have millions of other buckets?

Daniel Einspanjer deinspanjer at mozilla.com
Sun Jul 11 12:56:16 EDT 2010


I'm thinking about the pros and cons of Riak vs. HBase for Mozilla's
Weave (now Firefox Sync) 2.0 engine.
https://wiki.mozilla.org/Labs/Weave/Sync/2.0/API

The primary use case is that when a user's client performs a sync, it 
needs to retrieve all the new items since the last time it synced for 
each collection (bookmarks, tabs, history, etc.) that the client is 
configured to sync.
If a particular client doesn't sync often, there might be thousands of
items to retrieve, which means that using links *might* run into issues.
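
To make that concern concrete, here is roughly the retrieval pattern I
have in mind, sketched in Python against Riak's HTTP link-walking
interface (the bucket, key, and tag names are made up for illustration):

    import urllib.request

    # Walk every link tagged "item" out of a hypothetical per-user index
    # object; "_" is a wildcard for the bucket, and "1" keeps the results.
    url = "http://localhost:8098/riak/users/abc123/_,item,1"

    with urllib.request.urlopen(url) as resp:
        # The response is multipart/mixed, one part per linked object, so
        # a client with thousands of new items gets them all in one body.
        body = resp.read()

Since the links themselves live in the object's Link header, an index
object carrying thousands of links is itself a worry, quite apart from
the size of the multipart response.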

HBase's use of ordered keys pushes toward a schema that puts the
modified timestamp in the key.  That would allow quick and easy
scanning of just the new items.
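
For concreteness, the kind of key layout I'm imagining looks something
like this (a Python sketch; the separator, padding width, and field
order are just assumptions on my part):

    def row_key(user_id: str, collection: str, modified_ms: int) -> bytes:
        # Zero-pad the timestamp so lexicographic byte order matches
        # time order.
        return f"{user_id}!{collection}!{modified_ms:013d}".encode()

    def scan_bounds(user_id: str, collection: str, last_sync_ms: int):
        # An HBase scan over [start, stop) touches only rows modified
        # after the last sync; '~' sorts after every digit, so it closes
        # the range for this user and collection.
        start = row_key(user_id, collection, last_sync_ms + 1)
        stop = f"{user_id}!{collection}!~".encode()
        return start, stop

    # e.g. everything "abc123" changed in "bookmarks" since the last sync
    start, stop = scan_bounds("abc123", "bookmarks", 1278851776000)

With keys like that, the cost of a sync scan is proportional to the
number of new items rather than to the size of the whole collection.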

Riak, however, has a few interesting features, such as the on-demand
creation of new buckets, that might make it much more flexible... if
there is a highly performant mechanism for the client to retrieve new
data.

What prompted me to post this message was something I thought I 
remembered seeing regarding mapping over buckets in Riak.  Unfortunately 
I can't find the reference now.

Is it true that in order to map over all the keys in a single bucket, 
the Riak cluster must actually traverse the entire global keyspace of 
all buckets to find the keys that are part of the desired bucket?
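
For reference, the kind of job I mean would name a whole bucket as the
input to Riak's /mapred interface, roughly like this Python sketch (the
bucket name and "modified" field are assumptions from my schema above;
Riak.mapValuesJson is one of the built-in JavaScript map functions):

    import json
    import urllib.request

    def new_items(bucket: str, last_sync_ms: int):
        job = {
            # Naming a whole bucket as the input is the part in question:
            # if Riak must walk the global keyspace to find this bucket's
            # keys, the cost scales with all users' data, not this user's.
            "inputs": bucket,
            "query": [
                {"map": {"language": "javascript",
                         "name": "Riak.mapValuesJson"}},
                # Keep only items modified since the last sync; filtering
                # is idempotent, so a re-run of the reduce is harmless.
                {"reduce": {"language": "javascript",
                            "source": ("function(values) {"
                                       "  return values.filter(function(v) {"
                                       "    return v.modified > "
                                       + str(last_sync_ms) + ";"
                                       "  });"
                                       "}")}},
            ],
        }
        req = urllib.request.Request(
            "http://localhost:8098/mapred",
            data=json.dumps(job).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

Even if the filter itself is cheap, the question above is about the
cost of assembling the input key list in the first place.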

In the case where you have tens of millions of users and either one
bucket per user or (if it were feasible) one bucket per user per
collection, it seems like it would be impossible to efficiently run a
MapReduce job over a single user's bucket.

That seems like such a common scenario that I must have misinterpreted
what I read.  I'd really appreciate some clarification there, and I'd
also be very interested in any schema proposals or thoughts you might
have about this use case.

-Daniel



