map-reduce Problem?

Kevin Smith ksmith at basho.com
Mon Nov 15 12:55:18 EST 2010


We are giving some thought to how to do that. The main issue with bitcask's key listing performance is that bitcask is not bucket-aware and lacks the notion of secondary indices. Not being bucket-aware means bitcask has to examine every bucket/key pair to find the ones belonging to a given bucket. This isn't to say we won't address the problem, merely that some engineering work is required to solve it correctly.
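To make the cost concrete, here is a minimal illustrative sketch (Python, not bitcask's actual Erlang internals) of what a non-bucket-aware key listing boils down to: every entry is keyed by the full bucket/key pair, so listing one bucket means filtering a scan over all of them.

    # Illustrative only: not bitcask code, just the shape of the problem.
    # The keydir maps (bucket, key) -> entry, so listing the keys of one
    # bucket still costs a pass over every key in the store.
    def list_keys_not_bucket_aware(keydir, wanted_bucket):
        for (bucket, key) in keydir:
            if bucket == wanted_bucket:
                yield key

    if __name__ == "__main__":
        keydir = {
            ("videos", "v1"): "entry-1",
            ("videos", "v2"): "entry-2",
            ("users", "u1"): "entry-3",
        }
        print(list(list_keys_not_bucket_aware(keydir, "videos")))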

innostore is moderately bucket-aware already, so I've forked it (http://github.com/kevsmith/innostore) and added bucket-aware key listing. Based on some very basic testing, I'm seeing a 2.5x speed-up in overall key listing performance compared to the official version. I'm hoping the patch, or a modified form of it, will make the next release. If you can live with inno being a bit slower than bitcask and slightly more difficult to set up and tune, then this might be an option for you.
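For intuition about why bucket awareness helps, here is a rough sketch (again illustrative Python, not the actual innostore patch): once entries are kept ordered by (bucket, key), one bucket's keys form a contiguous range, so listing them is a bounded range scan rather than a pass over the whole keyspace.

    # Illustrative only: an ordered index lets a bucket listing seek to the
    # first entry for that bucket and stop as soon as the bucket changes.
    import bisect

    def list_keys_bucket_aware(sorted_entries, wanted_bucket):
        # sorted_entries is a sorted list of (bucket, key) tuples; the empty
        # string sorts before any real key, so this finds the range start.
        start = bisect.bisect_left(sorted_entries, (wanted_bucket, ""))
        for bucket, key in sorted_entries[start:]:
            if bucket != wanted_bucket:
                break
            yield key

    if __name__ == "__main__":
        entries = sorted([("users", "u1"), ("videos", "v1"), ("videos", "v2")])
        print(list(list_keys_bucket_aware(entries, "videos")))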

I've done some basic vetting of the code but I want to emphasize this is a prototype only and hasn't received anything even close to the normal amount of testing we put into a release. Please keep this in mind if you decide to use my forked repo.

--Kevin
On Nov 15, 2010, at 11:57 AM, Greg Steffensen wrote:

> Along these lines, are there any ideas floating around about how to speed up the listing of keys in a bucket?  For the bitcask backend, it seems like an index of keys-by-bucket ought to be the kind of thing that could be stored in the hints files to speed this up without affecting performance for live reads and writes.
> 
> Greg
> 
> On Mon, Nov 15, 2010 at 11:46 AM, Sean Cribbs <sean at basho.com> wrote:
> This is possible with Riak's MapReduce, but it will likely become increasingly difficult as your dataset grows, because feeding a bucket in as input means listing all of its keys and then eliminating the data points you aren't interested in.  In the longer term there will be improvements to MapReduce such that, if your keys are meaningful, you will be able to filter on them more easily (without examining the data first).  You might find Kevin Smith's overview enlightening: http://www.slideshare.net/hemulen/riak-mapred-preso
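
A rough illustration of the kind of whole-bucket MapReduce job described above, submitted over Riak's HTTP /mapred interface. The bucket name, the view_count field, and the host/port are assumptions made up for this sketch; Riak.mapValuesJson and Riak.reduceSum are built-in JavaScript helpers that ship with Riak.

    # Hypothetical sketch: sum the view_count field across a whole bucket.
    # Using the bucket name as the input forces the key listing discussed
    # in this thread, which is where the scaling pain comes from.
    import json
    import urllib.request

    job = {
        "inputs": "video_stats",
        "query": [
            {"map": {"language": "javascript",
                     "source": "function(v) {"
                               " var o = Riak.mapValuesJson(v)[0];"
                               " return [o.view_count]; }"}},
            {"reduce": {"language": "javascript",
                        "name": "Riak.reduceSum"}},
        ],
    }

    req = urllib.request.Request(
        "http://127.0.0.1:8098/mapred",
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    print(urllib.request.urlopen(req).read())

A daily figure like the one asked about below would then come from running such a job against today's and yesterday's counts (for example, with the date encoded in the key, which is the kind of "meaningful key" filtering mentioned above) and subtracting the two totals.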
> 
> Sean Cribbs <sean at basho.com>
> Developer Advocate
> Basho Technologies, Inc.
> http://basho.com/
> 
> On Nov 15, 2010, at 11:34 AM, Prometheus WillSurvive wrote:
> 
>> Hi,
>> 
>> We have a huge database (around 4 billion records, 30 TB) storing video watch information, i.e. view count, comments, favorites, etc. I want to produce a daily report of view counts for all videos. That means I need to look at two days, today and yesterday, and subtract yesterday's view count from today's to get the daily impressions. Our Fat DB team does this with a few complex queries. I would like to ask whether this is possible the Riak map-reduce way. I want to put together a demonstration to show the team.
>> 
>> This is the scenario. We have similar data models for other things; this could be a start.
>> 
>> We have a farm of 30 HP DL380 servers with 32 GB of RAM each to test this scenario.
>> 
>> Perhaps a member with Riak map-reduce experience could share some ideas on this.
>> 
>> Regards
>> 
>> Prometheus
> 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com