Possible handoff stalls
armon.dadgar at gmail.com
Mon Mar 19 14:40:33 EDT 2012
Okay, good to know this is a known issue. I attached the
logs for the last time this occurred in my original email.
I'll try to capture this information if the problem occurs again.
On Mar 19, 2012, at 11:36 AM, Jon Meredith wrote:
> Hi Armon,
> We've recently patched an issue that affects handoffs here https://github.com/basho/riak_core/pull/153
> If the issue repeats for you, it would be very useful if, in addition to the logs, you could follow the instructions from the pull request above to run the 'riak_core_handoff_manager:status().' command against all nodes.
> The pull request works around an issue where it looks like the kernel has closed a socket (no evidence of it any longer with netstat/ss) but the Erlang process is still stuck in a receive call on it (gen_tcp:recv/2 to be more precise).
> Please let us know if you hit it again.
> Best, Jon.
> On Mon, Mar 19, 2012 at 12:10 PM, Armon Dadgar <armon.dadgar at gmail.com> wrote:
> I wanted to ping the mailing list and see if anybody else has encountered
> stalls in the partition handoffs on Riak 1.1. We added a new node to our cluster
> last Friday, but noticed that the partition handoffs appear to have stopped
> after about 7-8 hours.
> Most of the handoffs completed, and the only handoffs that remained were from node 3 to node 2.
> The ring claimant (node 1) indicated that node 3 was unreachable (via ring_status).
> However, Riak Control did not indicate that node 3 was unreachable, and in fact it was
> actually live and continuing to serve requests.
> To resolve this, I tried to just restart node 3. I ran "riak stop" multiple times, but this did
> not actually seem to do anything (The node was continuing to run and serve requests).
> Next, I attached to the node and ran "init:stop()." This started to shut down various
> sub-systems, but the node was still running. Sending a SIGTERM signal to the beam vm
> finally killed it. Restarting the node with "riak start" worked as expected,
> and the node promptly resumed the handoffs, and finished in a few hours.
> I'm not sure exactly what the issue was, but something seemed to cause a
> stalling of the handoffs.
> I've attached the contents of our console.log, erlang.log, error.log and crash.log
> from the relevant times if that is useful.
> Best Regards,
> Armon Dadgar
> riak-users mailing list
> riak-users at lists.basho.com
> Jon Meredith
> Platform Engineering Manager
> Basho Technologies, Inc.
> jmeredith at basho.com