Demetri Mouratis dmourati at
Wed May 16 13:52:44 EDT 2012


We have a three-node Riak cluster set up in a pre-production environment 
with LevelDB configured as the backend.  The systems are beefy: dual 6-core, 
96 GB RAM, running all SSDs.  Preliminary testing showed long latencies 
(~10-30 seconds and increasing) in node_get_fsm_time_100.  We raised our 
initial concerns at the Riak workgroup in San Francisco last week.
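
For anyone who wants to watch the same metric outside of Ganglia, here is a 
minimal Python sketch.  It assumes the node exposes Riak's HTTP /stats 
endpoint on the default port 8098 and that the node_get_fsm_time_* values 
are reported in microseconds.  The base URL and the sample numbers are 
hypothetical and are not measurements from our cluster.

```python
import json
from urllib.request import urlopen

# GET FSM timing keys as they appear in Riak's stats output.
GET_FSM_KEYS = (
    "node_get_fsm_time_mean",
    "node_get_fsm_time_median",
    "node_get_fsm_time_95",
    "node_get_fsm_time_99",
    "node_get_fsm_time_100",
)


def get_fsm_seconds(stats):
    """Return the GET FSM timings in seconds, assuming microsecond inputs.

    Keys that are missing or non-numeric (for example "undefined") are skipped.
    """
    return {
        key: stats[key] / 1_000_000
        for key in GET_FSM_KEYS
        if isinstance(stats.get(key), (int, float))
    }


def fetch_stats(base_url="http://localhost:8098"):
    """Fetch the stats document from a node (base_url is an assumption)."""
    with urlopen(base_url + "/stats", timeout=5) as resp:
        return json.load(resp)


# Hypothetical sample: a small mean next to a very large maximum.
sample = {
    "node_get_fsm_time_mean": 2500,
    "node_get_fsm_time_median": 1800,
    "node_get_fsm_time_95": 6000,
    "node_get_fsm_time_99": 250000,
    "node_get_fsm_time_100": 12500000,
}
timings = get_fsm_seconds(sample)
for key in GET_FSM_KEYS:
    print(f"{key}: {timings[key]:.4f} s")
```

Against a live node, get_fsm_seconds(fetch_stats()) would replace the sample.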

After the workgroup, we made the following changes to our configuration:

1.  Tuned /etc/security/limits.conf to add:

riak            soft    nofile          2048
riak            hard    nofile          10240
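
A quick way to confirm the new limits took effect is to read them back from 
a process started as the riak user after a fresh login.  The sketch below 
uses Python's standard resource module; the helper names are made up for 
illustration, and the targets are simply the values from limits.conf above.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets_target(soft, hard, want_soft=2048, want_hard=10240):
    """Check limits against the targets from limits.conf.

    An unlimited value (RLIM_INFINITY) always satisfies the target.
    """
    def ok(actual, wanted):
        return actual == resource.RLIM_INFINITY or actual >= wanted

    return ok(soft, want_soft) and ok(hard, want_hard)


soft, hard = nofile_limits()
print(f"nofile soft={soft} hard={hard} ok={meets_target(soft, hard)}")
```

The result only reflects the riak user's limits if the script is run as 
that user.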

2.  Added noatime to riak filesystem mount (running on 6-device RAID 
6/RAID 10 Intel 710 200 GB SSD)

/dev/mapper/vg_raid10-lv_riak on /var/lib/riak type ext4 (rw,noatime)
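
To check the option on several nodes without eyeballing each one, a line of 
mount output can be parsed as in the sketch below.  The regular expression 
assumes the classic "device on mountpoint type fstype (options)" layout shown 
above; the function name is hypothetical.

```python
import re

# Matches lines such as: /dev/sda1 on /data type ext4 (rw,noatime)
MOUNT_LINE = re.compile(
    r"^(?P<device>\S+) on (?P<mountpoint>\S+) "
    r"type (?P<fstype>\S+) \((?P<options>[^)]*)\)$"
)


def mount_options(line):
    """Return (mountpoint, fstype, set of options) for one mount line."""
    match = MOUNT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized mount line: {line!r}")
    return (
        match["mountpoint"],
        match["fstype"],
        set(match["options"].split(",")),
    )


line = "/dev/mapper/vg_raid10-lv_riak on /var/lib/riak type ext4 (rw,noatime)"
mountpoint, fstype, options = mount_options(line)
print(mountpoint, fstype, "noatime" in options)
```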

3.  Edited eleveldb config to set write_buffer_size and cache_size

  %% eLevelDB Config
  {eleveldb, [
              {data_root, "/var/lib/riak/leveldb"},
              {write_buffer_size, 16777216},
              {cache_size, 1073741824}
             ]}
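
The byte values above are easier to reason about in MiB and GiB.  The sketch 
below converts them and estimates a per-node total under two assumptions that 
should be verified for your version: that eleveldb applies these settings per 
vnode rather than per node, and that the ring uses Riak's default size of 64 
partitions spread evenly across the three nodes.

```python
import math

WRITE_BUFFER_SIZE = 16777216   # bytes, from the eleveldb config
CACHE_SIZE = 1073741824        # bytes, from the eleveldb config

MIB = 1024 ** 2
GIB = 1024 ** 3

RING_SIZE = 64   # assumption: default ring_creation_size
NODES = 3        # three-node cluster


def to_mib(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / MIB


def per_node_total(per_vnode_bytes, vnodes_per_node):
    """Total bytes on one node if the setting applies to every vnode."""
    return per_vnode_bytes * vnodes_per_node


# Worst case: the node that owns the most partitions.
vnodes_per_node = math.ceil(RING_SIZE / NODES)
cache_total = per_node_total(CACHE_SIZE, vnodes_per_node)

print(f"write_buffer_size = {to_mib(WRITE_BUFFER_SIZE):.0f} MiB")
print(f"cache_size        = {to_mib(CACHE_SIZE):.0f} MiB")
print(f"vnodes per node   = {vnodes_per_node}")
print(f"cache, all vnodes = {cache_total / GIB:.0f} GiB")
```

If the per-vnode assumption holds, the combined cache should be compared 
against the 96 GB of RAM on each machine.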

At first blush, this tuning seemed to correct the problem.  Basho Bench 
testing failed to uncover any latency.  The get_fsm_time returned to 
near zero.  However, over the weekend and into this week, the peak delays 
started to creep back up linearly.  See graphs from Ganglia:

Average get times remain constant.  Put times do not show a similar delay.

In talking with Basho folks, we learned the behavior is likely caused by 
"LevelDB Compaction."


What can we do to reduce/eliminate the latency shown in 


