Riak Memory Usage Constantly Growing

Shane McEwan shane at mcewan.id.au
Tue Oct 2 07:08:24 EDT 2012


G'day!

Just recently we've noticed memory usage in our Riak cluster constantly 
increasing.

The memory usage reported by the Riak stats "memory_total" parameter has 
been less than 100MB for nearly a year but has recently increased to 
over 1GB.
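
A minimal sketch of how this figure can be read, assuming the stats are 
available as a JSON object over HTTP. The URL in the comment and the 
function name are illustrative only, not taken from our configuration.

```python
import json

def memory_total_mib(stats_json):
    """Return the "memory_total" stat (reported in bytes) as MiB.

    stats_json is the raw JSON text of the stats output.
    """
    stats = json.loads(stats_json)
    return stats["memory_total"] / (1024 * 1024)

# Hypothetical usage against a live node (URL is an assumed example):
#   from urllib.request import urlopen
#   raw = urlopen("http://127.0.0.1:8098/stats", timeout=5).read().decode("utf-8")
#   print(memory_total_mib(raw))
```

Keeping the parsing separate from the fetch makes it easy to run the same 
function over stats output that has already been saved to disk.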

If we restart the cluster, memory usage usually returns to what we 
would call "normal", but after a week or so of stability it 
starts gradually growing again. Sometimes, after a growth spurt over a 
few days, the memory usage will plateau and be stable again for a week or 
two and then put on another growth spurt. The memory usage starts 
increasing at the same moment on all 4 nodes.

This graph [http://imagebin.org/230614] shows what I mean. The green 
line shows the memory usage as reported by "memory_total" (left-hand 
y-axis scale). The red line shows the memory used by Riak's beam.smp 
process (right-hand y-axis scale).

Also notice that the gradient of the recent growth seems to be 
increasing compared to the memory increases we had in August.
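
To put a number on "gradient" rather than judging it by eye, one option 
is a least-squares slope over the samples from each growth period. This 
is only a sketch: the function name and the sample values in the usage 
note are made up for illustration, not real measurements.

```python
def growth_rate(samples):
    """Least-squares slope of (time_seconds, memory_bytes) samples.

    Returns the fitted growth rate in bytes per second.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    # Covariance of time and memory divided by variance of time.
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den
```

For example, growth_rate([(0, 100), (60, 160), (120, 220)]) fits a line 
through three hypothetical one-minute samples. Computing the slope 
separately for each growth spurt gives two numbers that can be compared 
directly.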

We might have just assumed that the memory usage was normal Riak 
behaviour. Perhaps we had just tipped over the threshold of some 
internal buffer or cache, causing more memory to be allocated. However, 
whenever we notice the memory usage increasing it always coincides with 
the "riak-admin top" command failing to run.

We try to run "riak-admin top" to diagnose what is using the memory but 
it returns: "Output server crashed: connection_lost". If we restart the 
cluster the top command works fine (but, of course, there's nothing 
interesting to see after a restart!).

So our theory at the moment is that some sort of instability or race 
condition is causing Riak to start consuming more and more memory. A 
side effect of this instability is that the internal processes needed 
for running the top command are not working correctly. The actual 
functionality of Riak doesn't seem to be affected. Our application is 
running fine. We see a slight increase in "FSM Put" times and CPU usage 
during the memory growth phases but all other parameters we're 
monitoring on the system seem unaffected.

There's nothing abnormal in the logs. We get a lot of 
"riak_pipe_builder_sup {sink_died,normal}" messages but they can be 
ignored, apparently. The cluster is under constant load, so we would 
expect to see either a gradual memory increase or a steady state, but 
not both. Erlang process count, open file handles, etc. are stable.

So I was wondering if anyone has seen similar behaviour before?
Is there anything else we can do to diagnose the problem?
I'm accessing the stats URL once per minute; could that have any side 
effects?
We'll be upgrading to Riak 1.2 and new hardware in the next few weeks, so 
should we just ignore it and hope it goes away?
Any other ideas?
Or is this just normal?

Riak config:
4 VMware nodes
ring_creation_size, 256
n_val, 3
eleveldb backend:
   max_open_files, 20
   cache_size, 15728640
"riak_kv_version":"1.1.1",
"riak_core_version":"1.1.1",
"stdlib_version":"1.17.4",
"kernel_version":"2.14.4"
Erlang R14B03 (erts-5.8.4)
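
As a rough back-of-envelope check on the settings above, the following 
sketch works out the per-node totals. It ASSUMES that the eleveldb 
cache_size and max_open_files settings apply per vnode and that the 
partitions are spread evenly across the 4 nodes. It makes no claim about 
whether this memory shows up in "memory_total" or only in the beam.smp 
process size.

```python
# Values copied from the config listing above.
ring_creation_size = 256
nodes = 4
cache_size = 15728640      # bytes
max_open_files = 20

# Assumption: partitions (vnodes) are evenly distributed across nodes.
vnodes_per_node = ring_creation_size // nodes

# Assumption: cache_size and max_open_files are per-vnode settings.
cache_bytes_per_node = vnodes_per_node * cache_size
cache_mib_per_node = cache_bytes_per_node / (1024 * 1024)
open_files_per_node = vnodes_per_node * max_open_files

print(vnodes_per_node, cache_mib_per_node, open_files_per_node)
```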

Thanks!

Shane.






