map-reduce Problem ?

Alexander Sicular siculars at gmail.com
Mon Nov 15 18:50:37 EST 2010


Thanks Dan and Kevin,

How about a custom hint file (IIRC, the hint file is what gets read into memory with all the keys) that stems by bucket? By "stem" I mean order, with offsets. Redis does all kinds of things like this in memory and persists to disk via an append-only file. That way Riak could grab only the specific keys in one go, skipping the scan over the union of all buckets. This kind of method could eventually be extended to keys.
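
To make the idea concrete, here is a rough Python sketch. It is not bitcask's actual hint file format, and `build_bucket_index` and `keys_for_bucket` are names made up for illustration. The (bucket, key) pairs are ordered by bucket, and a small table records where each bucket's run starts and how long it is, so listing one bucket becomes a single slice:

```python
from itertools import groupby

def build_bucket_index(pairs):
    """Sort (bucket, key) pairs by bucket and record each bucket's (offset, length)."""
    ordered = sorted(pairs)
    offsets = {}
    position = 0
    for bucket, run in groupby(ordered, key=lambda pair: pair[0]):
        length = sum(1 for _ in run)
        offsets[bucket] = (position, length)
        position += length
    return ordered, offsets

def keys_for_bucket(ordered, offsets, bucket):
    """Return the keys of one bucket by slicing its contiguous run."""
    start, length = offsets.get(bucket, (0, 0))
    return [key for _, key in ordered[start:start + length]]

# Hypothetical sample data: two buckets with interleaved writes.
pairs = [("B", "k2"), ("A", "k1"), ("B", "k1"), ("A", "k3"), ("B", "k3")]
ordered, offsets = build_bucket_index(pairs)
print(offsets)
print(keys_for_bucket(ordered, offsets, "A"))
```

The cost moves to write time and to rebuilding the ordering, which is presumably where the real engineering work would be.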

Thanks, Alexander


On Nov 15, 2010, at 5:51 PM, Kevin Smith wrote:

> In general, Riak backends combine the bucket name and key into a single value used as the primary key. For the Basho-written backends the combined value is an Erlang tuple of the form {BucketName, Key}.
> 
> When you list all the keys in a bucket, the backends execute a fold over all their data. The function executed within the fold examines each tuple and keeps the tuples that include the desired bucket name. This arrangement means listing keys is still not as scalable as it could be. Consider the case of a Riak cluster storing two buckets, A and B. Bucket A has 5000 keys while bucket B has 250000 keys. Listing the keys for bucket A means examining 255000 keys (bucket A + bucket B) to find the 5000 you really want.
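
A rough sketch of the fold described above, in Python rather than Erlang, with an in-memory list standing in for a backend. The names `list_keys_by_fold` and `store` are hypothetical and are not Riak APIs:

```python
from functools import reduce

def list_keys_by_fold(store, wanted_bucket):
    """Fold over every (bucket, key) tuple, keeping keys of the wanted bucket.

    Returns the matching keys and the number of tuples examined.
    """
    def step(acc, pair):
        keys, examined = acc
        bucket, key = pair
        if bucket == wanted_bucket:
            keys.append(key)
        return keys, examined + 1

    return reduce(step, store, ([], 0))

# Bucket A has 5000 keys, bucket B has 250000 keys, as in the example above.
store = [("A", f"a{i}") for i in range(5000)] + [("B", f"b{i}") for i in range(250000)]
keys, examined = list_keys_by_fold(store, "A")
print(len(keys), examined)  # prints: 5000 255000
```

Every tuple is visited regardless of which bucket is requested, so the cost grows with the total number of keys on the node rather than with the size of the bucket being listed.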
> 
> The first round of listing keys improvements focused on reducing the memory footprint of folding over the keys in each backend and aggregating their results. Before the improvements we would manifest the aggregated list in memory which didn't scale for large datasets.
> 
> The next thing to improve, IMHO, is adding "bucket awareness" to each backend and the vnode API such that Riak no longer has to examine each datum in order to find the ones belonging to a specific bucket. This is what my patch to Innostore does. Innostore is already "bucket aware" internally so I was able to leverage this fact and constrain the scope of key access to only the desired bucket.
> 
> --Kevin
> On Nov 15, 2010, at 5:11 PM, Alexander Sicular wrote:
> 
>> So I get that riak is not bucket aware. When you pass a bucket as an
>> input in an m/r, as riak sifts through all the keys, how does riak
>> isolate bucket specific keys? Are keys stored as /bucket/key internally
>> and there is a string comparison on split(key,'/')? Or is there
>> something else going on?
>> 
>> Thank you.
>> 
>> 
>> 
>> On 2010-11-15, Kevin Smith <ksmith at basho.com> wrote:
>>> We are giving some thought to how to do that. The main issue wrt
>>> bitcask's key listing performance is that bitcask is not bucket aware and
>>> lacks the notion of secondary indices. Not being bucket aware means bitcask
>>> has to examine all bucket/key pairs to find the ones related to a given
>>> bucket. This isn't to say we won't address the problem but merely to point
>>> out there's some engineering work required to solve the problem correctly.
>>> 
>>> innostore is moderately bucket-aware right now so I've forked it
>>> (http://github.com/kevsmith/innostore) and added bucket-aware key listing.
>>> Based on some very basic testing I'm seeing a 2.5x speedup in overall key
>>> listing performance compared to the official version. I'm hoping the patch,
>>> or a modified form of it, will make the next release. If you can handle inno
>>> being a bit slower than bitcask and slightly more difficult to set up and
>>> tune then this might be an option for you.
>>> 
>>> I've done some basic vetting of the code but I want to emphasize this is a
>>> prototype only and hasn't received anything even close to the normal amount
>>> of testing we put into a release. Please keep this in mind if you decide to
>>> use my forked repo.
>>> 
>>> --Kevin
>>> On Nov 15, 2010, at 11:57 AM, Greg Steffensen wrote:
>>> 
>>>> Along these lines, are there any ideas floating around about how to speed
>>>> up the listing of keys in a bucket?  For the bitcask backend, it seems
>>>> like an index of keys-by-bucket ought to be the kind of thing that could
>>>> be stored in the hints files to speed this up without affecting
>>>> performance for live reads and writes.
>>>> 
>>>> Greg
>>>> 
>>>> On Mon, Nov 15, 2010 at 11:46 AM, Sean Cribbs <sean at basho.com> wrote:
>>>> This is possible with Riak's MapReduce but you will likely have increasing
>>>> difficulty as your dataset grows, because of the impact of needing to list
>>>> keys in a bucket and then eliminate data points you aren't interested in.
>>>> In the longer term, there will be improvements to MapReduce such that if
>>>> your keys are meaningful, you will be able to filter them more easily
>>>> (without examining the data first).  You might find Kevin Smith's overview
>>>> enlightening: http://www.slideshare.net/hemulen/riak-mapred-preso
>>>> 
>>>> Sean Cribbs <sean at basho.com>
>>>> Developer Advocate
>>>> Basho Technologies, Inc.
>>>> http://basho.com/
>>>> 
>>>> On Nov 15, 2010, at 11:34 AM, Prometheus WillSurvive wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> We have a huge database (around 4 billion records, 30 TB) storing
>>>>> video watch information, i.e. view count, comments, favorites, etc. I
>>>>> want to produce a daily report of view counts for all videos. That
>>>>> means I need to look at two days, today and yesterday, and subtract
>>>>> yesterday's view count from today's view count to find the daily
>>>>> impressions. Our Fat DB team does this with a few complex queries. I
>>>>> would like to ask whether this is possible the Riak map-reduce way. I
>>>>> want to give the team a demonstration of this.
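
The daily calculation described above could be sketched as a map step followed by a reduce step. This is plain Python, not Riak's MapReduce API, and the record layout (`video_id`, `day`, `view_count`) is an assumption made purely for illustration:

```python
from collections import defaultdict

def map_record(record, today, yesterday):
    """Emit (video_id, signed count): positive for today, negative for yesterday."""
    if record["day"] == today:
        return [(record["video_id"], record["view_count"])]
    if record["day"] == yesterday:
        return [(record["video_id"], -record["view_count"])]
    return []  # records from other days do not contribute

def reduce_counts(emitted):
    """Sum the signed counts per video: today's count minus yesterday's."""
    totals = defaultdict(int)
    for video_id, signed_count in emitted:
        totals[video_id] += signed_count
    return dict(totals)

# Hypothetical cumulative view counts per video per day.
records = [
    {"video_id": "v1", "day": "2010-11-14", "view_count": 100},
    {"video_id": "v1", "day": "2010-11-15", "view_count": 140},
    {"video_id": "v2", "day": "2010-11-14", "view_count": 7},
    {"video_id": "v2", "day": "2010-11-15", "view_count": 9},
    {"video_id": "v1", "day": "2010-11-13", "view_count": 80},
]
emitted = [pair for r in records for pair in map_record(r, "2010-11-15", "2010-11-14")]
daily = reduce_counts(emitted)
print(daily)  # prints: {'v1': 40, 'v2': 2}
```

Summing signed counts is associative and commutative, so partial results from different nodes can be combined in any order, which is what a reduce phase needs.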
>>>>> 
>>>>> This is the scenario. We have similar data models for other things. This
>>>>> could be a start.
>>>>> 
>>>>> We have a farm of 30 HP DL380 servers (32 GB RAM) to test this scenario.
>>>>> 
>>>>> Perhaps a member experienced with Riak map-reduce can share some ideas
>>>>> on this.
>>>>> 
>>>>> Regards
>>>>> 
>>>>> Prometheus
>>>>> _______________________________________________
>>>>> riak-users mailing list
>>>>> riak-users at lists.basho.com
>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com




