Riak performance problems when LevelDB database grows beyond 16GB

Matthew Von-Maszewski matthewv at basho.com
Thu Oct 18 08:21:15 EDT 2012


I am currently responsible for tuning Google's leveldb implementation for Riak.  I have read through most of the thread and have a few information requests.  After those, I will try to address various questions and comments from the thread.  In general, you are filling leveldb faster than its background compaction (optimization) can keep up.  I am willing to work with you to figure out why and what can be done about it.

Questions / requests:

1.  Execute the following on one of the servers:

    sort /home/riak/leveldb/*/LOG* >log_jan.txt

    Tar/gzip the log_jan.txt and email it back.

2.  Execute the following on one of the servers:

    grep -i flags /proc/cpuinfo

    Include the output (actually just one line will do) in a reply.

3.  On a running server that is processing data, execute:

    grep -i swap /proc/meminfo

    Include the full output (3 lines) in a reply.

4.  Pick a server, then one directory in /home/riak/leveldb.  Select 3 of the largest *.sst files.  Tar/gzip those and email back.
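
For request 4, something like the following could do the file selection.  This is only a sketch: largest_sst is a made-up helper name, the vnode directory is a placeholder you must fill in, and the tar line assumes GNU tar (for "-T -").

```shell
#!/bin/sh
# Sketch only: list the N largest *.sst files in one leveldb vnode directory.
# largest_sst is a hypothetical helper name, not part of Riak or leveldb.
largest_sst() {
    dir=$1
    count=${2:-3}
    # ls -S sorts by file size, largest first.  leveldb .sst file names are
    # plain digits, so reading ls output line by line is safe here.
    ls -S "$dir"/*.sst 2>/dev/null | head -n "$count"
}

# Example (the directory is an assumption; substitute a real vnode directory):
#   largest_sst /home/riak/leveldb/<vnode> 3 | tar -czf sst_sample.tar.gz -T -
```

The function only prints file names, so you can check the list before anything is archived.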

Notes about other messages on this thread:

a.  the gdb stack traces are nice!  They clearly indicate that leveldb has intentionally entered a "stall" state because compaction is not keeping up with the input stream.  Riak 1.2.1rc1 contains code that attempts to slow the write rate to allow the background compactions to catch up.  It is not working in your case.

b.  there is a performance bug in the cache code, though it is not your main problem.  This is why Evan asked you to reduce the cache size from 377,487,360 bytes (360 MB).  Yes, I created the bug and will get it addressed soon.

c.  the compaction process is disk and cpu intensive.  The fact that your CPUs are not heavily loaded, yet the client/request code is stalled waiting for compaction to catch up, suggests the disk is thrashing or otherwise needs some help.  Again, this is why Evan had you adjust some configuration settings there.

d.  your comment about using O_NOATIME is valid.  The issue is that the flag is relatively new.  We are supporting some really old compilers and Linux/Solaris versions.  It is easier to ask everyone to set noatime at the mount level than to have conditional code for some and mount level tuning for others.  But your comment is still correct.

e.  a non-zero sized lost/BLOCKS.bad means data corruption.  It looks like you already figured that out.  Either the crc code or the decompression code found an issue during compaction and moved the bad data to the side.

f.  max_open_files in 1.1 was a hard limit on the number of open files per vnode (per subdirectory in /home/riak/leveldb).  1.2 treats the number more as a suggestion for memory consumption per file.  A future release will drop the option and substitute something like "file_cache_size".  Memory is the critical resource, not file handles (at least for Riak … I am told Google uses this code in Android, so it might be critical there).
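
Regarding note e, a quick way to see which vnodes have a non-empty lost/BLOCKS.bad might look like the sketch below.  find_bad_blocks is a made-up helper name, and the layout (a lost/ directory somewhere under each vnode directory) is an assumption; adjust the root path to your install.

```shell
#!/bin/sh
# Sketch only: print every non-empty lost/BLOCKS.bad under a leveldb data root.
# find_bad_blocks is a hypothetical helper name, not part of Riak or leveldb.
find_bad_blocks() {
    root=$1
    # -size +0c matches files larger than zero bytes, so empty
    # BLOCKS.bad files (no corruption recorded) are skipped.
    find "$root" -type f -path '*/lost/BLOCKS.bad' -size +0c -print
}

# Example (the root path is an assumption taken from this thread):
#   find_bad_blocks /home/riak/leveldb
```

No output means no non-empty BLOCKS.bad file was found under that root.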

What issues did I miss?


