advice on debugging OOM?

Michael Radford mrad at blorf.com
Mon Apr 23 13:49:28 EDT 2012


Yesterday three of four riak nodes in my cluster died due to running
out of memory. Unfortunately, the beam processes were killed by the
kernel oom-killer, so they didn't have a chance to write an
erl_crash.dump.

I'm wondering if anyone has advice on figuring out what the culprit
might be if this happens again, or knows of ways that queries could
easily cause sudden increases in memory use.

My only "nonstandard" usages (apart from normal read/write/delete) are:
1. a list-keys operation, interspersed with reads, for creating a backup
2. search queries feeding into riak_kv_mapreduce:reduce_identity, to
get lists of keys matching a query (sketched below, after the map function)
3. multiple-key lookups, implemented by feeding lists of keys into
this map function:

%% Map phase for multi-key lookups: emits one {KeyData, Value} pair per
%% input key, tagging missing keys as not_found.
map_key_data_object_value({error, notfound}, KeyData, _Arg) ->
  [{KeyData, not_found}];
map_key_data_object_value(RiakObject, KeyData, _Arg) ->
  [{KeyData, riak_object:get_value(RiakObject)}].
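For reference, the query in (2) is essentially the following via the
native client (the bucket and query strings are placeholders, and I'm
going from memory on the mapred_search input form):

{ok, C} = riak:local_client(),
%% Search inputs feed matching bucket/key pairs into the reduce phase,
%% which just passes them through.
{ok, Keys} =
    C:mapred({modfun, riak_search, mapred_search,
              [<<"mybucket">>, <<"field:value">>]},
             [{reduce, {modfun, riak_kv_mapreduce, reduce_identity},
               none, true}]).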

An additional wrinkle is that I'm doing all this via Erlang rpc to
functions that invoke the native Erlang client, to work around
performance issues with map/reduce via the protobufs API. But the code
for that is very simple: no loops or anything else that would even have
the potential for unbounded memory use.
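To make the multiple-key lookups in (3) concrete, the wrapper I call
over rpc is roughly this shape (module and function names here are
just illustrative):

%% Runs on the Riak node; invoked from the app node with
%% rpc:call(RiakNode, my_riak_rpc, multiget, [Bucket, Keys]).
-module(my_riak_rpc).
-export([multiget/2]).

multiget(Bucket, Keys) ->
    {ok, C} = riak:local_client(),
    %% Pass each key through as KeyData so the map function can tag
    %% missing keys as not_found.
    Inputs = [{{Bucket, K}, K} || K <- Keys],
    %% my_mapfuns stands in for wherever the map function above lives.
    C:mapred(Inputs,
             [{map, {modfun, my_mapfuns, map_key_data_object_value},
               none, true}]).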

Is there any additional logging I could turn on, or any other ideas
besides periodically collecting memory usage stats and hoping to catch
something before it crashes the node?
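By "collecting memory usage stats" I mean something along these lines,
plain OTP with nothing Riak-specific, run inside riak attach or via rpc
from another node:

%% Overall breakdown of the VM's memory use:
erlang:memory().

%% The N processes currently using the most memory:
TopN = fun(N) ->
    ByMem = [{M, P} || P <- erlang:processes(),
                       {memory, M} <- [erlang:process_info(P, memory)]],
    lists:sublist(lists:reverse(lists:sort(ByMem)), N)
end,
TopN(10).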

Also, I am very tempted to enable heart in /etc/riak/vm.args, since
(a) it's clearly possible for this to happen again, and (b) the
failure seemed to cascade from one node to the next. As of last
August, the advice from this list was not to enable heart, because of
the potential to get stuck in a tight restart loop. But I don't see
how that is necessarily worse than not attempting to restart at all.
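For reference, my understanding is that turning heart on is just a
matter of adding the flag to vm.args (the timeout line is optional, and
this isn't a recommendation either way):

## in /etc/riak/vm.args
-heart
## optional: seconds before heart decides the VM is gone
-env HEART_BEAT_TIMEOUT 30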

Thanks,
Mike



