Riak Map Reduce Performance

Fisher, Ryan rfisher at cyberpointllc.com
Tue Aug 23 15:08:54 EDT 2011


Hi all,

We have been using Riak for a few months now (we started on 0.14.0 and
have recently upgraded to 0.14.2).  Development of our app has been going
well and I am now integrating my code w/ a larger system.  The testing of
the overall read / write performance of our cluster seems good as well.

I am now starting to dive further into map reduce queries, and unlike the
regular reads / writes, which seem to perform very fast, map reduce
performance is getting worse as our data set grows.

The query I am using to test the map / reduce speed and get a key count is
this:

map = function (v) { return [1]; }
reduce = Riak.reduceSum
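
For reference, here is a minimal sketch of how the two phases above might be
packaged as a job for Riak's HTTP /mapred endpoint, together with a local
stand-in for what they compute.  The helper names buildCountJob and
simulateCount are made up for illustration, and the bucket name "incoming"
is the one described later in this message.  Nothing here talks to a
cluster.

```javascript
// Sketch only: builds the JSON body that would be POSTed to /mapred.
// buildCountJob and simulateCount are hypothetical helper names.
function buildCountJob(bucket) {
  return {
    // A bare bucket name as "inputs" means every key in that bucket.
    inputs: bucket,
    query: [
      { map: { language: "javascript", source: "function (v) { return [1]; }" } },
      { reduce: { language: "javascript", name: "Riak.reduceSum" } }
    ]
  };
}

// Local stand-in for the two phases: every object maps to [1], and the
// reduce phase sums whatever list it is handed.
function simulateCount(objects) {
  const mapped = objects.flatMap(() => [1]);
  return [mapped.reduce((a, b) => a + b, 0)];
}

const job = buildCountJob("incoming");
console.log(JSON.stringify(job, null, 2));
console.log(simulateCount(["a", "b", "c"]));
```

Because the job's input is the whole bucket, the query has to visit every
key in it before any counting happens.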


It takes 138 seconds using that query on a bucket w/ 50,000 keys.
It takes around 20 seconds using that query on a bucket w/ 108 keys.
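
Expressed as a per-key cost, the two timings look quite different.  The
sketch below is plain arithmetic on the numbers above; msPerKey is just an
illustrative helper name.

```javascript
// The two measurements quoted above.
const runs = [
  { keys: 50000, seconds: 138 },
  { keys: 108, seconds: 20 }
];

// Average wall-clock milliseconds spent per key in the bucket.
function msPerKey(run) {
  return (run.seconds * 1000) / run.keys;
}

for (const run of runs) {
  console.log(run.keys + " keys: " + msPerKey(run).toFixed(2) + " ms per key");
}
```

The small bucket costs far more per key than the large one, which suggests
that a sizeable part of the query time does not depend on how many keys the
bucket holds.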

Do these query times for map reduce seem appropriate?

I'll try and give an overall picture of how we currently use riak and
maybe someone can say if the performance of our map / reduce operations is
on par or if there are things I could tweak to try and get the query times
to come down a bit.

The system we have sends data to riak at a fairly fast pace and we need to
keep all incoming data for 30 minutes, so we can examine the data and
retrieve any individual keys.  After 30 minutes we can aggregate messages
into groups to reduce the overall number of keys and data.

We currently have an "incoming" bucket where keys are written at a rate of
around 20 / second.  An archiving thread periodically checks for keys
that are older than 30 minutes; if it finds any, it removes them from the
'incoming' bucket and aggregates them into an 'archive' bucket for the
given hour.
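
A rough sketch of the selection step in that archiving pass, to make the
description concrete.  The entry shape ({ key, writtenAtMs }), the hour key
format, and the name planArchive are all assumptions made for illustration,
not our actual code.

```javascript
// Hypothetical sketch: entry shape and hour-key format are assumptions.
const THIRTY_MINUTES_MS = 30 * 60 * 1000;

// Splits entries into those still inside the 30-minute window and those
// to archive, grouping the latter by the UTC hour they were written in.
function planArchive(entries, nowMs) {
  const keep = [];
  const archiveByHour = {};
  for (const entry of entries) {
    if (nowMs - entry.writtenAtMs <= THIRTY_MINUTES_MS) {
      keep.push(entry);
      continue;
    }
    // First 13 characters of the ISO timestamp: date plus hour.
    const hourKey = new Date(entry.writtenAtMs).toISOString().slice(0, 13);
    (archiveByHour[hourKey] = archiveByHour[hourKey] || []).push(entry.key);
  }
  return { keep, archiveByHour };
}

const now = Date.UTC(2011, 7, 23, 19, 0, 0);
const plan = planArchive(
  [
    { key: "k1", writtenAtMs: now - 10 * 60 * 1000 },
    { key: "k2", writtenAtMs: now - 45 * 60 * 1000 },
    { key: "k3", writtenAtMs: now - 50 * 60 * 1000 },
    { key: "k4", writtenAtMs: now - 90 * 60 * 1000 }
  ],
  now
);
console.log(plan);
```

Grouping by hour is what keeps the number of 'archive' keys tied to elapsed
time rather than to the incoming message rate.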

As you can imagine this causes the bitcask files to fragment and grow
fairly large, however it seems like the best way to maintain some
granularity of the data, but not be forced to keep every single data point
that flows into the system.  It also gives us a predictable growth rate
for the 'archive' bucket even if the 'incoming' data increases beyond the
20 / second. 

One thing I was planning to try is lowering the bitcask merge trigger and
threshold settings to keep the files a little smaller.  Might that help
map reduce performance?  Or should it not matter, since I'm only looking
through keys and those should be in memory anyway?

We currently have a 4 node cluster running Ubuntu 11.04 x64.  Each Riak
node has 8 GB of memory, and 'free -m' on the nodes reports around 4000 MB
used and 4000 MB free on average.

Basho Bench using the riakc_pb.config with a get=1, update=2, and put=3
for 5 minutes seems to be good... Here is the graph:
http://tinypic.com/r/30tm977/7

So does anyone have a similarly sized system where they use map reduce?
Or can anyone recommend performance tweaks that would help accelerate
these queries?

Thank you,
-ryan