Possible handoff stalls

Armon Dadgar armon.dadgar at gmail.com
Mon Mar 19 14:10:10 EDT 2012

I wanted to ping the mailing list and see if anybody else has encountered
stalls in partition handoffs on Riak 1.1. We added a new node to our cluster
last Friday, but noticed that the partition handoffs appeared to stop
after about 7-8 hours.

Most of the handoffs completed; the only ones that remained were from node 3 to node 2.
The ring claimant (node 1) indicated that node 3 was unreachable (via ring_status).
However, Riak Control did not indicate that node 3 was unreachable, and in fact the
node was live and continuing to serve requests.

To resolve this, I tried to simply restart node 3. I ran "riak stop" multiple times, but this
did not seem to do anything (the node continued to run and serve requests).
Next, I attached to the node and ran "init:stop()." This started to shut down various
sub-systems, but the node was still running. Sending a SIGTERM signal to the BEAM VM
finally killed it. Restarting the node with "riak start" worked as expected;
the node promptly resumed the handoffs, which finished in a few hours.
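
The escalation described above (ask for a graceful stop, wait, then fall back
to SIGTERM) can be sketched as a small script. This is only an illustration
under stated assumptions, not what was actually run: the function names, the
timeout values and the way the process ID is obtained are all hypothetical,
and the intermediate "init:stop()" step is left out because it needs an
attached console.

```python
import os
import signal
import subprocess
import time


def pid_alive(pid):
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, it sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_exit(is_alive, timeout_s, poll_s):
    """Poll is_alive() until it returns False or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not is_alive():
            return True
        time.sleep(poll_s)
    return not is_alive()


def stop_with_escalation(stop_cmd, pid, is_alive=None, timeout_s=60.0, poll_s=0.5):
    """Try a graceful stop command first, then fall back to SIGTERM.

    stop_cmd is an argument list such as ["riak", "stop"]; pid is the OS
    process ID of the VM. Both are supplied by the caller (assumptions of
    this sketch). Returns "graceful", "sigterm" or "still running".
    """
    if is_alive is None:
        is_alive = lambda: pid_alive(pid)

    # Step 1: ask the node to stop on its own and give it time to do so.
    subprocess.run(stop_cmd, check=False)
    if wait_for_exit(is_alive, timeout_s, poll_s):
        return "graceful"

    # Step 2: the graceful stop had no effect, so send SIGTERM to the process.
    os.kill(pid, signal.SIGTERM)
    if wait_for_exit(is_alive, timeout_s, poll_s):
        return "sigterm"

    return "still running"
```

A call might look like stop_with_escalation(["riak", "stop"], beam_pid), where
beam_pid is a placeholder for the process ID of the running VM and "riak" is
assumed to be on the PATH. The function only reports which step ended the
process; it does not restart the node.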

I'm not sure exactly what the issue was, but something seemed to cause a
stalling of the handoffs.

I've attached the contents of our console.log, erlang.log, error.log and crash.log
from the relevant times in case they are useful.

Best Regards,

Armon Dadgar

