my cluster spontaneously loses a node after ~48hrs

Jason Golubock jason at soundhound.com
Mon May 4 00:20:47 EDT 2015


Hi,

I'm attempting to set up a simple 3-node cluster.
I have Riak 2.0.5 installed on all 3 nodes, and Riak starts up
fine using "service riak start". Joining nodes works fine.
Everything looks good.

That is, until about 48 hours have passed. Whether the cluster sits idle
or is under heavy read/write load doesn't seem to matter:
one of the nodes inevitably gets "stuck". The beam.smp process is still
running and is now using 100% CPU, but riak ping and
riak-admin cluster status both report "Node is not running!", and the node
does not respond to read/write requests.
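
A minimal sketch of how the stuck state described above could be told apart
from a healthy or cleanly stopped node. It is not part of the original setup:
the function names, the 90% threshold, and the use of Python 3.7+ are
assumptions made for illustration. It relies only on "riak ping" printing
"pong" when the node answers, and on ps reporting the beam.smp process.

```python
import subprocess


def classify_node(ping_output, beam_running, cpu_percent, cpu_threshold=90.0):
    """Classify a node from `riak ping` output and beam.smp process info.

    cpu_threshold is an arbitrary illustrative cutoff, not a Riak setting.
    """
    if ping_output.strip() == "pong":
        return "ok"
    if beam_running and cpu_percent >= cpu_threshold:
        # The symptom from the post: VM alive and spinning, but not answering.
        return "stuck"
    if beam_running:
        return "unresponsive"
    return "down"


def probe():
    """Gather live inputs; assumes `riak` and a procps-style `ps` are on PATH."""
    ping = subprocess.run(["riak", "ping"],
                          capture_output=True, text=True).stdout
    # Note: ps reports CPU averaged over the process lifetime, so a node that
    # has only just started spinning may still show a low figure here.
    ps = subprocess.run(["ps", "-C", "beam.smp", "-o", "%cpu="],
                        capture_output=True, text=True).stdout
    cpus = [float(value) for value in ps.split()]
    return classify_node(ping, bool(cpus), max(cpus, default=0.0))
```

Running probe() from cron on each node and logging the result with a
timestamp would at least narrow down when the lock-up happens.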

This happens with a totally fresh install of riak,
brand new ring, and zero records read or written.
I just start up the cluster, let it sit there,
and eventually one of the nodes locks up by itself.

There is no error log or crash report.
I can't find any corresponding "event" in any log file.
I am not aware of any regular network partition or anything that
might account for this on my end (though obviously something isn't right).

I'm using all defaults, with these exceptions:
- switched the storage backend to leveldb
- tried anti-entropy set both ways: "active" and "passive"
- search is "on"
- set the log level to debug
- tried with both a 2-node and a 3-node cluster
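
For reference, the non-default settings listed above would look roughly like
this in riak.conf. This is a sketch, not a copy of the actual file: the key
names are the Riak 2.0 riak.conf names as best recalled and should be checked
against the installed config.

```
## riak.conf (sketch of the non-default settings listed above)
storage_backend = leveldb
## tried both "active" and "passive"
anti_entropy = active
search = on
log.console.level = debug
```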

The console log is filled with these messages:

2015-04-19 00:01:47.754 [debug] <0.22331.7>@riak_core_metadata_exchange_fsm:exchange:163 completed metadata exchange with 'riak@10.1.6.126'. nothing repaired
2015-04-19 00:01:57.753 [debug] <0.202.0>@riak_core_broadcast:exchange:424 started riak_core_metadata_manager exchange with 'riak@10.1.6.126' (<0.22348.7>)
2015-04-19 00:01:57.754 [debug] <0.22348.7>@riak_core_metadata_exchange_fsm:exchange:163 completed metadata exchange with 'riak@10.1.6.126'. nothing repaired


This result is 100% repeatable and I'm dead in the
water with no idea what the problem is. An exhaustive search has provided
no answers.
I'm running CentOS 6.6 on fairly beefy hardware with 128GB RAM.
Many thanks to anyone who can help me figure out what's going on!

~ Jason




