Possible handoff stalls

Michael Clemmons glassresistor at gmail.com
Tue Mar 20 19:26:09 EDT 2012


Jon,
Thanks so much.  Yes, beam.smp was maxing out CPU and memory on that one
node.  I managed to get it to exit, and now it's a 3-node cluster.  I'll
take your advice on changing the handoff concurrency.

On Tue, Mar 20, 2012 at 3:57 PM, Jon Meredith <jmeredith at basho.com> wrote:

> Hi Michael,
>
> When you say 'only this one node is failing', do you mean the hardware is
> failing (drives etc.) or that there's a problem with Riak itself?  If it's
> a Riak problem, sharing the error messages from the logs would be helpful.
>
> The fix will be included in the next point release, but we haven't set a
> date for that yet.  The files for packaging Riak are included with the
> distribution, but you'll also need Erlang set up and built to be able to
> build it ("make package"; you'll have to set a few variables our build
> system uses).  That said, I'd recommend holding off until we do an
> official release.
>
> If you'd like to increase the amount of concurrent handoff, you can add
> {handoff_concurrency, 4} to the riak_core section of app.config, which
> will take effect on the next restart, or you can attach to the riak
> console (riak attach) and run
>
>   rpc:multicall(riak_core_handoff_manager, set_concurrency, [4], 5000).
>
> (the dot is important), then ^D to disconnect.
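>
> For reference, the app.config change might look roughly like this (a
> minimal sketch of the file layout; your other settings stay where they
> are):
>
>   [
>    {riak_core, [
>                 %% raise the limit from the post-1.0.3 default of 1
>                 {handoff_concurrency, 4}
>                 %% ... your existing riak_core settings ...
>                ]}
>    %% ... other application sections (riak_kv, etc.) ...
>   ].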
>
> The handoff concurrency value was reduced from 4 to 1 for the 1.0.3
> release over concerns that users building larger clusters would overwhelm
> newly added nodes, since the value only applied to outbound handoff.  For
> 1.1 we've changed things so that the value applies to both inbound and
> outbound handoff, so it is safer to set it higher.
>
> As part of the pull request above we've also changed the logging slightly,
> so that 'Starting handoff' messages are not printed until the handoff
> succeeds.  In 1.1.0/1.1.1, when the handoff concurrency limit is exceeded
> you may see repeated 'Starting handoff' messages as the destination node
> denies the transfer for hitting the limit.
>
> Cheers, Jon.
>
>
> On Tue, Mar 20, 2012 at 4:28 PM, Michael Clemmons <glassresistor at gmail.com> wrote:
>
>> [apologies for the delay; this email was first sent to Armon only]
>>
>> I'm having similar issues on a testing cluster running 1.1.1rc1.  One of
>> my four nodes has failed multiple times and is not restarting well, and
>> there are around 100 pending transfers.  Only this one node is failing.
>> I've stopped pointing traffic at the nodes and have attempted to remove
>> this machine from the cluster.
>> It is leaving, but very slowly for not much data.  The metadata is
>> important, and losing any of it would be a significant time sink (though
>> obviously not fatal, since we used a day-old build).
>> What's the likelihood that the pull request will make it into a deb build
>> in the near future, or will the makefile generate a deb?
>> -Michael
>>
>>
>> On Mon, Mar 19, 2012 at 11:40 AM, Armon Dadgar <armon.dadgar at gmail.com> wrote:
>>
>>> Okay, good to know this is a known issue. I attached the
>>> logs for the last time this occurred in my original email.
>>>
>>> I'll try to capture this information if the problem occurs again.
>>> Thanks.
>>>
>>> Best Regards,
>>>
>>> Armon Dadgar
>>>
>>> On Mar 19, 2012, at 11:36 AM, Jon Meredith wrote:
>>>
>>> Hi Armon,
>>>
>>> We've recently patched an issue that affects handoffs here
>>> https://github.com/basho/riak_core/pull/153
>>>
>>> If the issue repeats for you then, as well as the logs, it would be very
>>> useful if you could follow the instructions from the pull request above
>>> and run the 'riak_core_handoff_manager:status().' command against all
>>> nodes.
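>>>
>>> If it helps, one way to run that on all nodes at once from a single
>>> attached console is an rpc:multicall (a sketch; the 5000 is just a
>>> per-call timeout in milliseconds):
>>>
>>>   rpc:multicall(riak_core_handoff_manager, status, [], 5000).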
>>>
>>> The pull request works around an issue where it looks like the kernel
>>> has closed a socket (no evidence of it any longer with netstat/ss) but
>>> the Erlang process is still stuck in a receive call on it
>>> (gen_tcp:recv/2, to be more precise).
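>>>
>>> Roughly, the stuck call is a blocking receive with no timeout.  A sketch
>>> of the contrast for illustration only, not the actual patch; handle_data/1
>>> and the 60-second timeout are placeholders:
>>>
>>>   %% the stuck pattern: blocks indefinitely if the socket is already gone
>>>   {ok, Data} = gen_tcp:recv(Socket, 0),
>>>
>>>   %% a bounded receive lets the handoff process notice and give up
>>>   case gen_tcp:recv(Socket, 0, 60000) of
>>>       {ok, Data2}      -> handle_data(Data2);
>>>       {error, timeout} -> exit(handoff_receive_timeout);
>>>       {error, Reason}  -> exit(Reason)
>>>   end.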
>>>
>>> Please let us know if you hit it again.
>>>
>>> Best, Jon.
>>>
>>> On Mon, Mar 19, 2012 at 12:10 PM, Armon Dadgar <armon.dadgar at gmail.com> wrote:
>>>
>>>> I wanted to ping the mailing list and see if anybody else has
>>>> encountered stalls in partition handoffs on Riak 1.1.  We added a new
>>>> node to our cluster last Friday, but noticed that the partition
>>>> handoffs appear to have stopped after about 7-8 hours.
>>>>
>>>> Most of the handoffs completed, and the only ones that remained were
>>>> from node 3 to node 2.  The ring claimant (node 1) indicated that
>>>> node 3 was unreachable (via ring_status).  However, Riak Control did
>>>> not indicate that node 3 was unreachable, and in fact it was live and
>>>> continuing to serve requests.
>>>>
>>>> To resolve this, I tried to just restart node 3.  I ran "riak stop"
>>>> multiple times, but this did not actually seem to do anything (the
>>>> node kept running and serving requests).  Next, I attached to the node
>>>> and ran "init:stop()".  This started to shut down various subsystems,
>>>> but the node was still running.  Sending a SIGTERM signal to the beam
>>>> VM finally killed it.  Restarting the node with "riak start" worked as
>>>> expected, and the node promptly resumed the handoffs and finished in a
>>>> few hours.
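>>>>
>>>> For reference, the escalation looked roughly like this (a sketch;
>>>> commands only, output elided, node name and pid are placeholders):
>>>>
>>>>   $ riak stop                   # reported ok, but the node kept running
>>>>   $ riak attach
>>>>   (riak@node3)1> init:stop().   # partial shutdown; node survived
>>>>   $ kill -TERM <beam pid>       # finally killed the VM
>>>>   $ riak start                  # handoffs resumed and completed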
>>>>
>>>> I'm not sure exactly what the issue was, but something caused the
>>>> handoffs to stall.
>>>>
>>>> I've attached the contents of our console.log, erlang.log, error.log,
>>>> and crash.log from the relevant times, in case they are useful.
>>>>
>>>> Best Regards,
>>>>
>>>> Armon Dadgar
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Jon Meredith
>>> Platform Engineering Manager
>>> Basho Technologies, Inc.
>>> jmeredith at basho.com
>>>
>>>
>>>
>>>
>>
>>
>> --
>> -Michael
>>
>>
>
>
> --
> Jon Meredith
> Platform Engineering Manager
> Basho Technologies, Inc.
> jmeredith at basho.com
>
>


-- 
-Michael