Data loads

Reid Draper reiddraper at gmail.com
Thu Aug 30 11:29:36 EDT 2012


Welcome to the list Pinney :)

On Aug 30, 2012, at 10:59 AM, Pinney Colton <pinney.colton at bitwisedata.com> wrote:

> Hi all -
> 
> This is my first post to the list.  I'm a relative Riak newbie, though I have some experience working with multi-terabyte datasets on other platforms.  Last night, I kicked off my first "large" load of data as a test of the platform with about 1,000,000 json objects being loaded into a bucket.  I had a couple performance issues, so I'm wondering if someone on the list could be kind enough to answer a few questions that will help me troubleshoot.
> 
> a) I haven't analyzed all of my load log data yet, but it looks like writes went from about 0.02 seconds per object to a couple minutes per object!  This is the typical "dev" setup from the tutorials, and I forgot to divide available RAM by 4 to arrive at a number per node - is this likely the result of a memory constraint, or should I be looking elsewhere, beyond just bumping the memory on my VM?  I looked at the logs, but I'm not sure what I should be looking for.

I'm wondering if you're starting to swap. Have you set swappiness (the vm.swappiness kernel setting) to 0 on the machine? If not, I'd recommend that change.
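
One quick way to check both things on a Linux box is to read the kernel's own counters. The snippet below is only a sketch: it assumes the standard /proc/sys/vm/swappiness and /proc/meminfo files are present, and the function names are made up for illustration.

```python
from pathlib import Path


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness value as an int (Linux only)."""
    return int(Path(path).read_text().strip())


def swap_used_kb(meminfo_text):
    """Return swap currently in use, in kB, given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the numeric value.
            fields[name.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


if __name__ == "__main__":
    # Only attempt the real files where they exist (i.e. on Linux).
    if Path("/proc/sys/vm/swappiness").exists() and Path("/proc/meminfo").exists():
        print("vm.swappiness:", read_swappiness())
        print("swap in use (kB):", swap_used_kb(Path("/proc/meminfo").read_text()))
```

If swap in use keeps climbing while the load runs, that would fit the slowdown you describe.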

> 
> b) I am using protocol buffers, and I saw similar initial performance when running the load from a separate machine vs. having the data on the riak machine itself.  Is that what you would recommend?  I'm wondering if there is any hard/fast rule re: CPU/Memory contention on the machine vs. network performance of loading from a different machine.

I don't think there is a hard and fast rule here, but I would try doing it over the network rather than on the same node.

> 
> c) I'm using a sha256 hash as my bucket name.  I read that buckets and keys are concatenated internally and that all objects have just one "bucketkey".  Am I putting significantly more pressure on memory by using such a long bucket name?  Or is Riak managing that for me via some sort of compression?  If that long hash is being replicated for each of those million objects, I can see where my memory estimates would have been low.  I can always use an integer ID for my bucket name, the hash just existed elsewhere in my application, so I used it without thinking about it too much.

This depends on the backend you use. Bitcask holds every bucket/key pair in memory, so their size matters. LevelDB doesn't have this constraint,
and it also compresses shared key prefixes.
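
To get a feel for how much the bucket name matters under Bitcask, here is a back-of-the-envelope sketch. The numbers are assumptions rather than measurements: per_key_overhead=40 bytes is a placeholder for Bitcask's fixed per-entry cost (check the capacity planning docs for your version), n_val=3 is Riak's default replica count, and the 16-byte key length is invented for the example. On a single-machine dev cluster, all replicas share the same RAM.

```python
def keydir_bytes(num_keys, bucket_len, key_len, per_key_overhead=40, n_val=3):
    """Rough estimate of RAM needed to hold bucket/key pairs in memory.

    per_key_overhead is an assumed fixed cost per entry, not a measured value.
    n_val is the number of replicas stored for each object.
    """
    return num_keys * n_val * (per_key_overhead + bucket_len + key_len)


if __name__ == "__main__":
    objects = 1_000_000
    key_len = 16  # hypothetical key size

    long_bucket = keydir_bytes(objects, bucket_len=64, key_len=key_len)  # sha256 hex digest
    short_bucket = keydir_bytes(objects, bucket_len=1, key_len=key_len)  # short ID

    print("sha256 bucket name:", long_bucket, "bytes")
    print("1-byte bucket name:", short_bucket, "bytes")
    print("difference:", long_bucket - short_bucket, "bytes")
```

Under these assumptions the long bucket name roughly doubles the estimate, so a shorter name is worth trying if you stay on Bitcask.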

> 
> Thanks in advance for your help!  Loving Riak so far, in spite of these trivial hurdles.
> 
> Regards,
> Pinney
> 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com




