Is it inefficient to map over a small bucket when you have millions of other buckets?

Nicolas Fouché nicolas at silentale.com
Tue Jul 13 06:02:11 EDT 2010


Giving just a bucket WILL traverse the entire keyspace.
Here is a conversation I had on IRC yesterday (GMT+1).

[4:18pm] danoyoung: this may be obvious, but I just want to confirm,
on the QA session from the last webinar, it was stated that listing
keys for a given bucket requires examining keys that don't belong to
the requested bucket.....
[4:18pm] danoyoung: So if I have a number of buckets, one with say
1000 objects in it, but I have 25 million objects scatter throughout a
number of buckets (on a given node), then a M/R query would need to
visit all 25 million even though I'm only interested in the 1000
objects?
[4:21pm] justinsheehy: yes, that's correct.  (if you specify the M/R
by bucket instead of getting the input keylist some other way)
[4:21pm] justinsheehy: this is because the backends aren't required to
separate the buckets in any way, and in some cases simply use {bucket,
key} as their internal key.
[4:21pm] justinsheehy: that's what allows buckets to be "free"
[4:23pm] nfo: justinsheehy: so same behavior with  "GET
/riak/bucket?keys=true" I suppose.
[4:23pm] danoyoung: hmm..ok, thanx. so it seems that I'll need to have
a secondary index somewhere (w/n riak or else) to keep track of all my
keys w/n a given bucket...
[4:24pm] justinsheehy: nfo: yes, that's effectively the same thing
[4:24pm] danoyoung: to avoid expensive list keys operations....
[4:27pm] nfo: justinsheehy: about this, I find "sad", for beginners,
that the Fast Track gives an example of a bucket listing without any
warning. It's the very first map/reduce Riak query a person sees, and
it's not efficient.
[4:27pm] nfo: {"inputs":"goog",
[4:27pm] nfo: https://wiki.basho.com/display/RIAK/Loading+Data+and+Running+MapReduce+Queries
[4:28pm] justinsheehy: nfo: fair enough, I'll make sure that gets (at
least) a note.
[4:28pm] danoyoung: if I had buckets with tens of thousands of object
in them, then what would be the most effecient way to run M/R against
them all, break up into multiple M/R jobs that process say 500
bucket/key values at a time...tand then aggregate these results....
[4:28pm] danoyoung: i meant 5000...
[4:28pm] seancribbs: danoyoung: knowing the keys AOT is crucial
[4:31pm] seancribbs: so if you have some way to build objects that
have links to the objects you want to query across, and the first ones
have intuitive keys, you'll avoid the cost of list-keys

-Nicolas

On Mon, Jul 12, 2010 at 4:44 PM, Alexander Sicular <siculars at gmail.com> wrote:
> I believe the number of buckets are basically unlimited - as long as
> you use the default bucket properties (which can be changed by conf
> file), re. number of replicas. If you change bucket properties on the
> fly that info runs around the cluster in the gossip channel.
> Afaik, buckets are simply virtual namespaces. Your data structure
> could have a bucket per user listing collections. Map over that then
> use the collection value to deriv bucket/key pair to pass into
> additional map phases in an m/r chain.
>
> And no, I seriously doubt that riak traverses the entire key space for
> every m/r in the sense that it touches every key in the system
> regardless of bucket. That is why riak has an input param required for
> each m/r. I would love to hear otherwise.
>
> -Alexander
>
> On 2010-07-11, Daniel Einspanjer <deinspanjer at mozilla.com> wrote:
>>   I'm thinking about the pros and cons of Riak vs HBase for Mozilla's
>> Weave (now Firefox Sync) 2.0 engine.
>> https://wiki.mozilla.org/Labs/Weave/Sync/2.0/API
>>
>> The primary use case is that when a user's client performs a sync, it
>> needs to retrieve all the new items since the last time it synced for
>> each collection (bookmarks, tabs, history, etc.) that the client is
>> configured to sync.
>>      If a particular client doesn't sync often, it is possible that
>> there might be thousands of items to retrieve, this means that using
>> links *might* run into issues.
>>
>> HBase's use of ordered keys pushes for a schema where you'd have the
>> modified timestamp in the key.  That would allow for quick and easy
>> scanning of just the new items.
>>
>> Riak however, has a few interesting features such as the on-demand
>> creation of new buckets that might make it much more flexible... if
>> there is a highly performant mechanism for the client to retrieve new data.
>>
>> What prompted me to post this message was something I thought I
>> remembered seeing regarding mapping over buckets in Riak.  Unfortunately
>> I can't find the reference now.
>>
>> Is it true that in order to map over all the keys in a single bucket,
>> the Riak cluster must actually traverse the entire global keyspace of
>> all buckets to find the keys that are part of the desired bucket?
>>
>> In the case where you have tens of millions of users, and you have
>> either one bucket per user or (if it were feasible) one bucket per user
>> per collection, it seems like it would be impossible to efficiently
>> perform a map reduce on one user's bucket.
>>
>> That seems like such a common scenario, I must have misinterpreted what
>> I read.  I'd really appreciate some clarification there and also would
>> be very interested in any schema proposals or thoughts you might have
>> about this use case.
>>
>> -Daniel
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>
> --
> Sent from my mobile device
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>




More information about the riak-users mailing list