Performance issues with small dataset
nx at nu-ex.com
Thu Jan 13 10:37:18 EST 2011
I think you can avoid listing all keys in a bucket by maintaining a
separate object that contains a list of the current keys. I usually
append the keys to a "/bucket/_collection" object.
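
A minimal sketch of that idea in Python follows. The Bucket class is a hypothetical in-memory stand-in, not the real Riak client API, and the helper names put_with_index and list_keys are made up for illustration. Only the "_collection" key name comes from the convention above.

```python
import json

# Name of the index object, following the "/bucket/_collection" convention.
COLLECTION_KEY = "_collection"


class Bucket:
    """Hypothetical in-memory stand-in for a Riak bucket (not the real client)."""

    def __init__(self):
        self._objects = {}

    def get(self, key):
        # Returns the stored string, or None if the key does not exist.
        return self._objects.get(key)

    def put(self, key, value):
        self._objects[key] = value


def put_with_index(bucket, key, value):
    """Store the object, then record its key in the index object."""
    bucket.put(key, json.dumps(value))
    raw = bucket.get(COLLECTION_KEY)
    keys = json.loads(raw) if raw is not None else []
    if key not in keys:
        keys.append(key)
        bucket.put(COLLECTION_KEY, json.dumps(keys))


def list_keys(bucket):
    """Read the key list from the index object instead of listing the bucket."""
    raw = bucket.get(COLLECTION_KEY)
    return json.loads(raw) if raw is not None else []
```

The keys read back this way can be passed to a map/reduce job as explicit bucket/key inputs rather than giving it the whole bucket. Note that this sketch ignores concurrent writers; with a real cluster, two clients updating the index object at the same time would probably need some form of conflict resolution.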
On Thu, Jan 13, 2011 at 9:27 AM, Sean Cribbs <sean at basho.com> wrote:
>> Unfortunately, even if additional nodes yield linear performance
>> gains, the m/r overhead seems very large -- if I'm getting 1.5 seconds
>> to process 1,000 items on one node, it seems apparent that I should
>> get roughly 1.5 seconds to process 3,000 items on 3 nodes, which
>> still is awfully slow.
>> Do you know how Riak compares to HBase, MongoDB or Cassandra for large
>> dataset processing and analysis with m/r, when talking hundreds of
>> millions, or even billions of keys? It would seem that key traversal
>> performance would prevent Riak from competing in that space. Maybe
>> you could do something with Riak Search, but I'm not sure if it would
> To be fair, you can't do a microbenchmark and then try to extrapolate it to large datasets; things change at scale. Also, key-listing has been a known limitation of Riak for a long time, and one we have been quite vocal about. There have been improvements recently, but it's still an O(N) computation where N is the total number of keys stored in the cluster. Therefore, it's important to structure your data such that you limit the use of key lists. Compare performance after you have done that, and run your benchmark on something other than a single node (4 or more in a cluster is best), with a dataset that approximates the target size.
> Sean Cribbs <sean at basho.com>
> Developer Advocate
> Basho Technologies, Inc.