Performance issues with small dataset

Sean Cribbs sean at
Thu Jan 13 09:27:34 EST 2011

> Unfortunately, even if additional nodes yield linear performance
> gains, the m/r overhead seems very large -- if I'm getting 1.5 seconds
> to process 1,000 items on one node, it seems apparent that I should
> get roughly 1.5 seconds to process 3,000 items on 3 nodes, which
> still is awfully slow.
> Do you know how Riak compares to HBase, MongoDB or Cassandra for large
> dataset processing and analysis with m/r, when talking hundreds of
> millions, or even billions of keys? It would seem that key traversal
> performance would prevent Riak from competing in that space. Maybe
> you could do something with Riak Search, but I'm not sure if it would
> be comparable.

To be fair, you can't run a microbenchmark and then extrapolate it to large datasets; things change at scale. Also, key listing has been a known limitation of Riak for a long time, and one we have been quite vocal about. There have been improvements recently, but it's still an O(N) operation, where N is the total number of keys stored in the cluster. Therefore, it's important to structure your data so that you limit the use of key lists. Compare performance after you have done that, and run your benchmark on something other than a single node (a cluster of 4 or more is best), with a dataset that approximates the target size.
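
As a rough illustration of limiting key lists: Riak's HTTP MapReduce interface accepts "inputs" either as a bucket name, which forces a key listing, or as an explicit list of bucket/key pairs, which does not. The sketch below only builds the two job descriptions so they can be compared. The helper mapred_job, the bucket name "events", the one-object-per-day key scheme, and the map function are all assumptions made up for the example, not anything from the original benchmark.

```python
import json


def mapred_job(inputs, map_source):
    """Build a Riak MapReduce job description as a JSON string.

    `inputs` is either a bucket name (Riak must list every key to find
    the bucket's contents) or a list of (bucket, key) pairs (Riak can
    fetch exactly those objects, with no key listing).
    """
    if isinstance(inputs, str):
        job_inputs = inputs
    else:
        job_inputs = [[bucket, key] for bucket, key in inputs]
    query = [{"map": {"language": "javascript",
                      "source": map_source,
                      "keep": True}}]
    return json.dumps({"inputs": job_inputs, "query": query})


# Hypothetical map function: parse each stored object as JSON.
MAP_SRC = "function(v) { return [JSON.parse(v.values[0].data)]; }"

# Hypothetical key scheme: one object per day, so the keys for a date
# range can be computed by the client instead of listed by the cluster.
week_keys = [("events", "2011-01-%02d" % day) for day in range(1, 8)]

whole_bucket_job = mapred_job("events", MAP_SRC)   # triggers a key listing
known_keys_job = mapred_job(week_keys, MAP_SRC)    # no key listing

print(known_keys_job)
```

Either string would be sent as the body of a POST to the /mapred resource with a Content-Type of application/json. The point of the second form is that the client, not the cluster, works out which keys to read.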

Sean Cribbs <sean at>
Developer Advocate
Basho Technologies, Inc.

More information about the riak-users mailing list