Possible handoff stalls

Jon Meredith jmeredith at basho.com
Tue Mar 20 18:57:06 EDT 2012


Hi Michael,

When you say 'Only thing node is failing', do you mean the hardware is
failing (drives etc.) or a problem with Riak itself?  If it's a Riak
problem, sharing the error messages from the logs would be helpful.

The fix will be included in the next point release, but we haven't set a
date for that yet.  The files for packaging Riak are included with the
distribution, but you'll also need to get Erlang set up and built in order
to build it yourself (make package; you'll have to set a few variables our
build system uses).  I'd recommend holding off until we release an official
version.

If you'd like to increase the amount of handoff, you can add
 {handoff_concurrency, 4} to the riak_core section of app.config, which
will take effect on the next restart, or you can attach to the Riak console
(riak attach) and run

  rpc:multicall(riak_core_handoff_manager, set_concurrency, [4], 5000).

(the dot is important), then ^D to disconnect.
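For reference, here's a minimal sketch of how that setting might sit in app.config (assuming an otherwise default file; all other sections and entries are elided):

```erlang
%% app.config (fragment) -- hypothetical minimal example
[
 {riak_core, [
   %% Allow up to 4 concurrent handoffs on this node.
   {handoff_concurrency, 4}
   %% ...other riak_core settings elided...
 ]}
 %% ...other application sections elided...
].
```

The rpc:multicall route above changes the running nodes immediately but does not persist across restarts; the app.config entry persists but only takes effect after a restart, so you may want both.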

The handoff concurrency value was reduced from 4 to 1 for the 1.0.3 release
over concerns that users building larger clusters would overwhelm newly
added nodes, since the concurrency value applied only to outbound handoff.
For 1.1 we've changed things so that the value applies to both inbound and
outbound handoff, so it is safer to set it higher.

As part of the pull request above we've also changed the logging slightly
to avoid printing handoff starting messages until handoff succeeds.  In
1.1.0/1.1.1, when handoff concurrency is exceeded you may see repeated
'Starting handoff' messages if the destination node denies the transfer
because it has hit the limit.

Cheers, Jon.

On Tue, Mar 20, 2012 at 4:28 PM, Michael Clemmons
<glassresistor at gmail.com>wrote:

> [apologies for the delay; this email was first sent to Armon only]
>
> I'm having similar issues on a testing cluster for 1.1.1rc1.  I have 1
> out of 4 nodes failing multiple times and not restarting well; there are
> like 100 pending transfers.  Only thing node is failing.  I've stopped
> pointing traffic at the nodes and have attempted to remove this machine
> from the cluster.
> It's slowly leaving, but moving very slowly for not much data; the
> metadata is important and losing any would be a significant time
> sink (but obviously not vital since we used a day-old build).
> What's the likelihood that pull request will make it into a deb build in
> the near future, or will the makefile generate a deb?
> -Michael
>
>
> On Mon, Mar 19, 2012 at 11:40 AM, Armon Dadgar <armon.dadgar at gmail.com>wrote:
>
>> Okay, good to know this is a known issue. I attached the
>> logs for the last time this occurred in my original email.
>>
>> I'll try to capture this information if the problem occurs again.
>> Thanks.
>>
>>  Best Regards,
>>
>> Armon Dadgar
>>
>> On Mar 19, 2012, at 11:36 AM, Jon Meredith wrote:
>>
>> Hi Armon,
>>
>> We've recently patched an issue that affects handoffs here
>> https://github.com/basho/riak_core/pull/153
>>
>> If the issue repeats for you, as well as the logs it would be very useful
>> if you could follow the instructions from the pull request above or run
>> the 'riak_core_handoff_manager:status().' command against all nodes.
>>
>> The pull request works around an issue where it looks like the kernel has
>> closed a socket (no evidence of it any longer with netstat/ss) but the
>> Erlang process is still stuck in a receive call on it (gen_tcp:recv/2,
>> to be more precise).
>>
>> Please let us know if you hit it again.
>>
>> Best, Jon.
>>
>> On Mon, Mar 19, 2012 at 12:10 PM, Armon Dadgar <armon.dadgar at gmail.com>wrote:
>>
>>> I wanted to ping the mailing list and see if anybody else has encountered
>>> stalls in the partition handoffs on Riak 1.1. We added a new node to our
>>> cluster
>>> last Friday, but noticed that the partition handoffs appear to have
>>> stopped
>>> after about 7-8 hours.
>>>
>>> Most of the handoffs completed, and the only handoffs that remained were
>>> from node 3 to node 2.
>>> The ring claimant (node 1) indicated that node 3 was unreachable (via
>>> ring_status).
>>> However, Riak Control did not indicate that node 3 was unreachable; in
>>> fact it was live and continuing to serve requests.
>>>
>>> To resolve this, I tried simply restarting node 3. I ran "riak stop"
>>> multiple times, but this did not actually seem to do anything (the node
>>> kept running and serving requests).
>>> Next, I attached to the node and ran "init:stop()." This started to
>>> shut down various sub-systems, but the node was still running. Sending
>>> a SIGTERM to the beam VM finally killed it. Restarting the node with
>>> "riak start" worked as expected, and the node promptly resumed the
>>> handoffs, finishing in a few hours.
>>>
>>> I'm not sure exactly what the issue was, but something seemed to cause a
>>> stalling of the handoffs.
>>>
>>> I've attached the contents of our console.log, erlang.log, error.log and
>>> crash.log
>>> from the relevant times if that is useful.
>>>
>>>  Best Regards,
>>>
>>> Armon Dadgar
>>>
>>>
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users at lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>
>>
>> --
>> Jon Meredith
>> Platform Engineering Manager
>> Basho Technologies, Inc.
>> jmeredith at basho.com
>>
>>
>>
>>
>>
>
>
> --
> -Michael
>
>


-- 
Jon Meredith
Platform Engineering Manager
Basho Technologies, Inc.
jmeredith at basho.com