Is it inefficient to map over a small bucket when you have millions of other buckets?

Alexander Sicular siculars at
Mon Jul 12 10:44:02 EDT 2010

I believe the number of buckets is basically unlimited, as long as you
use the default bucket properties (which can be changed in the conf
file), e.g. the number of replicas. If you change bucket properties on
the fly, that info is propagated around the cluster over the gossip
channel.
Afaik, buckets are simply virtual namespaces. Your data structure
could have a bucket per user listing that user's collections. Map over
that bucket, then use each collection value to derive a bucket/key
pair to pass into additional map phases in an m/r chain.
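As a rough illustration of that chain, here is a sketch that builds a
map/reduce query for Riak's HTTP /mapred interface. The bucket naming
scheme (user_<id>_collections), the JSON layout of the collection
values, and the inline JavaScript function body are all my own
assumptions for the example; only Riak.mapValuesJson is an actual Riak
built-in.

```python
import json

def build_mapred_query(user_id):
    """Build a two-phase m/r query: map over one user's collections
    bucket, derive [bucket, key] pairs, then fetch those objects.
    Bucket name and value layout are illustrative assumptions."""
    return {
        # Input: every key in the per-user bucket (a virtual namespace).
        "inputs": "user_%s_collections" % user_id,
        "query": [
            {
                # Phase 1: each collection value is assumed to hold a
                # {"bucket": ..., "key": ...} JSON object; emit it as a
                # [bucket, key] pair to feed the next map phase.
                "map": {
                    "language": "javascript",
                    "source": (
                        "function(v) {"
                        "  var coll = JSON.parse(v.values[0].data);"
                        "  return [[coll.bucket, coll.key]];"
                        "}"
                    ),
                }
            },
            {
                # Phase 2: Riak's built-in function that returns the
                # JSON-decoded values of the objects named above.
                "map": {
                    "language": "javascript",
                    "name": "Riak.mapValuesJson",
                }
            },
        ],
    }

# POST this payload to http://<node>:8098/mapred with
# Content-Type: application/json to run the chain.
payload = json.dumps(build_mapred_query("12345"))
```
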

And no, I seriously doubt that Riak traverses the entire key space for
every m/r, in the sense that it touches every key in the system
regardless of bucket. That is why Riak requires an input parameter for
each m/r. I would love to hear otherwise.


On 2010-07-11, Daniel Einspanjer <deinspanjer at> wrote:
> I'm thinking about the pros and cons of Riak vs. HBase for Mozilla's
> Weave (now Firefox Sync) 2.0 engine.
> The primary use case is that when a user's client performs a sync, it
> needs to retrieve all the new items since the last time it synced for
> each collection (bookmarks, tabs, history, etc.) that the client is
> configured to sync.
> If a particular client doesn't sync often, there might be thousands
> of items to retrieve, which means that using links *might* run into
> issues.
> HBase's use of ordered keys pushes for a schema where you'd have the
> modified timestamp in the key.  That would allow for quick and easy
> scanning of just the new items.
> Riak, however, has a few interesting features, such as the on-demand
> creation of new buckets, that might make it much more flexible... if
> there is a highly performant mechanism for the client to retrieve new
> data.
> What prompted me to post this message was something I thought I
> remembered seeing regarding mapping over buckets in Riak.  Unfortunately
> I can't find the reference now.
> Is it true that in order to map over all the keys in a single bucket,
> the Riak cluster must actually traverse the entire global keyspace of
> all buckets to find the keys that are part of the desired bucket?
> In the case where you have tens of millions of users, and you have
> either one bucket per user or (if it were feasible) one bucket per user
> per collection, it seems like it would be impossible to efficiently
> perform a map reduce on one user's bucket.
> That seems like such a common scenario that I must have
> misinterpreted what I read.  I'd really appreciate some clarification
> there, and I would also be very interested in any schema proposals or
> thoughts you might have about this use case.
> -Daniel
> _______________________________________________
> riak-users mailing list
> riak-users at

Sent from my mobile device
