Random but frequent crashes

Michael Jakl development at semanticlabs.at
Fri Nov 18 06:27:09 EST 2011

I'm test-driving Riak (1.0.1) using Bitcask and a lot of data
(currently ~25 million documents). I've deployed Riak on three
machines with an n_val of 3 (actually, I left it at the default).

Soon after I started an import process, Riak began crashing about
every 6 million documents (sometimes more frequently), leaving no
obvious cause in the logfiles. I've opened a ticket (Bug 1282 [1]),
but maybe it's better to discuss it here since I don't have much
information about it. The only node that crashes is the one I'm adding
the data to; the other two nodes haven't crashed yet. The importer and
the (crashing) Riak node are on the same machine and I'm currently
using the HTTP Java client (before that, I was using the PBC Java
client).

It seems that the crashes occur after a long-running GC alert in the
logfiles, yet that may be unrelated (the memory usage on my machine
does not go up).

I'm running Riak on machines with 24GB of RAM; the bucket name is
about 10 chars long and the keys are 20 chars. I expect about 200
million documents with roughly 20k of data each, but so far I've
imported only 24 million. The first crash happened after 6 million
documents.

The nofile limit for Riak is 32000, on Debian 6 (Linux) with all
updates installed. The capacity planning page tells me that I have
enough RAM (recommendation: 3 nodes with 14GB of RAM). The bitcask
directory has about 180GB of data and contains 344 files.
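
As a rough cross-check of that recommendation, here is a
back-of-the-envelope sketch of the Bitcask keydir estimate. It assumes
a fixed per-key overhead of about 40 bytes, which is my assumption and
may differ between Riak versions; the function and parameter names are
made up for illustration and are not part of Riak.

```python
# Rough estimate of Bitcask keydir RAM per node.
# ASSUMPTION: ~40 bytes of fixed per-key overhead; the actual value
# depends on the Riak/Bitcask version, so treat this as illustrative.
ASSUMED_OVERHEAD_BYTES = 40


def keydir_bytes_per_node(num_keys, bucket_len, key_len, n_val, nodes,
                          overhead=ASSUMED_OVERHEAD_BYTES):
    """Return the estimated keydir size in bytes for each node."""
    per_key = overhead + bucket_len + key_len
    total = per_key * num_keys * n_val  # every key is stored n_val times
    return total // nodes               # assumes an even spread over nodes


if __name__ == "__main__":
    # 200 million keys, 10-char bucket name, 20-char keys, n_val 3, 3 nodes
    est = keydir_bytes_per_node(200_000_000, 10, 20, n_val=3, nodes=3)
    print(f"{est / 1e9:.1f} GB per node")
```

With the numbers from this post, that works out to about 14 GB per
node under the stated assumption, which is in line with the
recommendation quoted above.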

I've tried switching to eleveldb, thinking it might be a memory issue,
but that used up more disk space than I have available. My migration
plan was to install Riak on another machine, set it up with eleveldb,
and join it to the node running bitcask; disk consumption went from
200G to over a terabyte during the process.

I've upgraded to Riak 1.0.2, but the changelog does not mention
anything related to this problem.

What could I do to identify the problem? Are there any debugging
switches I could turn on (I've recently activated the
sasl_error_logger)? I'm thinking of activating the heartbeat
management in vm.args, but that wouldn't fix the root cause. I've
just restarted Riak using 1.0.2 and cleaned all logfiles. Up until
now, crashes have been frequent enough that I should be able to
provide a set of logfiles on Monday, but are there any obvious things
I might have forgotten?


[1] https://issues.basho.com/show_bug.cgi?id=1282

