Kelly McLaughlin kelly at basho.com
Tue Oct 2 10:55:18 EDT 2012

John and Shane,

I have been looking into some memory issues lately, and I would be very interested in more
information about your particular problems. If either of you is able to capture some output
from etop using the -sort memory option while memory usage is elevated, it
would be very helpful to see. I know that etop sometimes fails with the connection_lost
message, but I have found that if you keep trying, it may succeed
after a few attempts.
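
If retrying by hand gets tedious, it can be scripted. Below is a minimal sketch in Python, not a tested tool. The names run_with_retry and _run_once, the attempt count, the delay, and the timeout are all arbitrary choices of mine. It assumes that a failed attempt exits quickly and prints connection_lost, while a successful etop keeps running until the timeout stops it.

```python
import subprocess
import time


def _run_once(cmd, timeout):
    """Run cmd once and return whatever it printed (stdout plus stderr).

    A timeout is treated as "still running", which for etop is assumed to
    mean it connected; the partial output captured so far is returned.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return (proc.stdout or "") + (proc.stderr or "")
    except subprocess.TimeoutExpired as exc:
        out = exc.stdout or ""
        # Partial output can arrive as bytes even when text=True was requested.
        return out.decode(errors="replace") if isinstance(out, bytes) else out


def run_with_retry(cmd, attempts=5, delay=2.0, timeout=30.0,
                   runner=None, sleep=time.sleep):
    """Retry cmd until its output does not contain 'connection_lost'.

    Returns the output of the first clean attempt, or None if every
    attempt reported connection_lost.
    """
    if runner is None:
        def runner(c):
            return _run_once(c, timeout)

    for attempt in range(1, attempts + 1):
        output = runner(cmd)
        if "connection_lost" not in output:
            return output
        if attempt < attempts:
            sleep(delay)  # brief pause before the next try
    return None
```

On a node, the call would be something like run_with_retry(["riak-admin", "top", "-sort", "memory"]), assuming that invocation matches your install. The runner and sleep parameters exist only so the retry logic can be exercised without a live node.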

Is either of you using MapReduce? I see that John is using 2I. Shane, do you also use 2I?
Finally, do you notice a lot of messages in the console or console log that contain either the
phrase 'monitor large_heap' or 'monitor long_gc'?
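
A quick way to answer that last question is to count the matching lines. Here is a rough sketch in Python that looks only for the two phrases above; the function name is mine, and it makes no assumption about the rest of the log line format.

```python
import re
from collections import Counter

# Matches the two phrases of interest and captures which one was seen.
MONITOR_RE = re.compile(r"monitor (large_heap|long_gc)")


def count_monitor_messages(lines):
    """Count lines mentioning 'monitor large_heap' or 'monitor long_gc'.

    Returns a Counter keyed by 'large_heap' and 'long_gc'.
    """
    counts = Counter()
    for line in lines:
        match = MONITOR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

To use it, open your console log (the path depends on your install, so I won't guess at it) and pass the file object to count_monitor_messages. A Counter with large numbers for either key would be worth reporting back.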


On Oct 2, 2012, at 6:11 AM, "John E. Vincent" <lusis.org+riak-users at gmail.com> wrote:

> I would highly suggest you upgrade to 1.2 when possible. We were, up
> until recently, running on 1.4 and seeing the same problems you
> describe. Take a look at this graph:
> http://i.imgur.com/0RtsU.png
> That's just one of our nodes but all of them exhibited the same
> behavior. The falloffs are where we had to bounce riak.
> This is what one of our nodes looks like now and has looked like since
> the upgrade:
> http://i.imgur.com/pm7Nk.png
> The change was SO dramatic that I seriously thought /stats was broken.
> I've verified outside of Riak and inside. The memory usage change was
> very positive. Evidently there's even still a memory leak.
> We're heavy 2i users. No multi backend.
> On Tue, Oct 2, 2012 at 4:08 AM, Shane McEwan <shane at mcewan.id.au> wrote:
>> G'day!
>> Just recently we've noticed memory usage in our Riak cluster constantly
>> increasing.
>> The memory usage reported by the Riak stats "memory_total" parameter has
>> been less than 100MB for nearly a year but has recently increased to over
>> 1GB.
>> If we restart the cluster, memory usage usually returns to what we would
>> call "normal" but after a week or so of stability the memory usage starts
>> gradually growing again. Sometimes after a growth spurt over a few days the
>> memory usage will plateau and be stable again for a week or two and then put
>> on another growth spurt. The memory usage starts increasing at the same
>> moment on all 4 nodes.
>> This graph [http://imagebin.org/230614] shows what I mean. The green shows
>> the memory usage as reported by "memory_total" (left-hand y-axis scale). The
>> red line shows the memory used by Riak's beam.smp process (right-hand y-axis
>> scale).
>> Also notice that the gradient of the recent growth seems to be increasing
>> compared to the memory increases we had in August.
>> We might have just assumed that the memory usage was normal Riak behaviour.
>> Perhaps we have just tipped over some sort of internal buffer or cache and
>> that causes some more memory to be allocated. However, whenever we notice
>> the memory usage increasing it always coincides with the "riak-admin top"
>> command failing to run.
>> We try to run "riak-admin top" to diagnose what is using the memory but it
>> returns: "Output server crashed: connection_lost". If we restart the cluster
>> the top command works fine (but, of course, there's nothing interesting to
>> see after a restart!).
>> So our theory at the moment is that some sort of instability or race
>> condition is causing Riak to start consuming more and more memory. A side
>> effect of this instability is that the internal processes needed for running
>> the top command are not working correctly. The actual functionality of Riak
>> doesn't seem to be affected. Our application is running fine. We see a
>> slight increase in "FSM Put" times and CPU usage during the memory growth
>> phases but all other parameters we're monitoring on the system seem
>> unaffected.
>> There's nothing abnormal in the logs. We get a lot of "riak_pipe_builder_sup
>> {sink_died,normal}" messages but they can be ignored, apparently. The
>> cluster is under constant load so we would expect to see either gradual
>> memory increase or a steady state but not both. Erlang process count, open
>> file handles, etc are stable.
>> So I was wondering if anyone has seen similar behaviour before?
>> Is there anything else we can do to diagnose the problem?
>> I'm accessing the stats URL once per minute, could that have any side
>> effects?
>> We'll be upgrading to Riak 1.2 and new hardware in the next few weeks so
>> should we just ignore it and hope it goes away?
>> Any other ideas?
>> Or is this just normal?
>> Riak config:
>> 4 VMware nodes
>> ring_creation_size, 256
>> n_val, 3
>> eleveldb backend:
>>  max_open_files, 20
>>  cache_size, 15728640
>> "riak_kv_version":"1.1.1",
>> "riak_core_version":"1.1.1",
>> "stdlib_version":"1.17.4",
>> "kernel_version":"2.14.4"
>> Erlang R14B03 (erts-5.8.4)
>> Thanks!
>> Shane.
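
For reference, the eleveldb settings quoted above can be turned into a rough per-node figure. The sketch below is back-of-the-envelope arithmetic in Python, not a measurement. It assumes that cache_size and max_open_files apply to each vnode separately (my understanding of eleveldb in this release line) and that the 256 partitions are spread evenly across the 4 nodes. The function name is mine.

```python
# Values taken from the configuration quoted above.
RING_CREATION_SIZE = 256
NODES = 4
CACHE_SIZE_BYTES = 15728640   # eleveldb cache_size (15 MiB)
MAX_OPEN_FILES = 20           # eleveldb max_open_files


def per_node_leveldb_budget(ring_size, nodes, cache_size, max_open_files):
    """Estimate per-node totals, assuming per-vnode settings and an even ring."""
    vnodes_per_node = ring_size // nodes
    return {
        "vnodes_per_node": vnodes_per_node,
        "cache_bytes": vnodes_per_node * cache_size,
        "open_files": vnodes_per_node * max_open_files,
    }


budget = per_node_leveldb_budget(
    RING_CREATION_SIZE, NODES, CACHE_SIZE_BYTES, MAX_OPEN_FILES
)
print(budget["vnodes_per_node"])              # 64
print(budget["cache_bytes"] / (1024 * 1024))  # 960.0 (MiB)
print(budget["open_files"])                   # 1280
```

Under those assumptions the block caches alone could account for up to 960 MiB per node. As far as I understand, leveldb allocates that memory outside the Erlang allocators, so it would show up in the beam.smp process size rather than in memory_total. It is therefore only part of the picture and does not by itself explain the memory_total growth.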
