Hinted handoff failed because of TCP errors

Ryan Maclear ryanm at miranetworks.net
Tue Nov 1 03:43:47 EDT 2016


Good Day,

We have a 4-node Riak cluster running inside AWS. The cluster runs Riak KV
2.1.2 with AAE enabled on Ubuntu 14.04.4 LTS.

We are in the process of replacing one node with another using the process
described here:

http://docs.basho.com/riak/kv/2.1.4/using/cluster-operations/replacing-node/

We have successfully replaced two of the nodes so far but we are having a
problem with the third. If we look at /var/log/riak/console.log we see the
start of the hinted handoff, and some time later (sometimes minutes and
sometimes hours) we see:

2016-10-31 06:30:40.090 [error] <0.19834.2101>@riak_core_handoff_sender:start_fold:272 hinted transfer of riak_kv_vnode from 'riak at aew54.miranetworks.net' 274031556999544297163190906134303066185487351808 to 'riak at aew75.miranetworks.net' 274031556999544297163190906134303066185487351808 failed because of TCP recv timeout
2016-10-31 06:30:40.090 [error] <0.187.0>@riak_core_handoff_manager:handle_info:303 An outbound handoff of partition riak_kv_vnode 274031556999544297163190906134303066185487351808 was terminated for reason: {shutdown,timeout}

So the handoff was terminated due to a TCP timeout. The handoff then starts
again from the beginning.

This has been going on for some time (about two weeks now).
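
To quantify how often this happens, a small script along the following lines could tally the failures in console.log. This is only a sketch: the regular expression is an assumption inferred from the single riak_core_handoff_sender entry quoted above, it expects each log entry on one line, and the names `FAILURE_RE`, `summarize_failures` and `summarize_file` are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the one riak_core_handoff_sender entry quoted above;
# other Riak versions may word the message differently (assumption).
FAILURE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \[error\] .*?"
    r"(?P<kind>\w+) transfer of (?P<vnode>\w+) from '(?P<src>[^']+)' "
    r"(?P<partition>\d+) to '(?P<dst>[^']+)' \d+ "
    r"failed because of (?P<reason>.+)$"
)


def summarize_failures(lines):
    """Count failed handoffs per (partition, reason) pair."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.match(line.strip())
        if match:
            counts[(match.group("partition"), match.group("reason"))] += 1
    return counts


def summarize_file(path):
    """Convenience wrapper; assumes one log entry per line in the file."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return summarize_failures(log)
```

Calling `summarize_file("/var/log/riak/console.log")` would return a `Counter` keyed by partition and failure reason, which makes it easy to see whether it is always the same partition and always the same timeout.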

The current member status is as follows:

riak-admin member-status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
leaving     0.0%      --      'riak at aew54.miranetworks.net'
valid      25.0%      --      'riak at aew59.miranetworks.net'
valid      25.0%      --      'riak at aew73.miranetworks.net'
valid      25.0%      --      'riak at aew74.miranetworks.net'
valid      25.0%      --      'riak at aew75.miranetworks.net'
-------------------------------------------------------------------------------
Valid:4 / Leaving:1 / Exiting:0 / Joining:0 / Down:0


Here are some questions:

1. What is the default TCP timeout for handoff?
2. Is there any way to increase this timeout?
3. Is there any way to increase the rate of handoff?
4. Are there any other parameters we can tune to try to avoid this?

The output from riak-admin transfers is as follows:

'riak at aew54.miranetworks.net' waiting to handoff 1 partitions

Active Transfers:

transfer type: hinted
vnode type: riak_kv_vnode
partition: 274031556999544297163190906134303066185487351808
started: 2016-11-01 05:30:47 [2.10 hr ago]
last update: 2016-11-01 07:36:51 [3.03 s ago]
total size: 78393086512 bytes
objects transferred: 11440967

                         1513 Objs/s
riak at aew54.miranetworks.n  =======>  riak at aew75.miranetworks.n
et                                   et
        |======                                     |  15%
                          1.53 MB/s
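
As a rough sanity check on these figures, the arithmetic below extrapolates how long the whole partition would take at the reported rates. It is a back-of-the-envelope sketch only: it assumes the displayed rates are averages that stay constant, and that "MB" in the output means 2**20 bytes (with 10**6 bytes the total comes out slightly higher).

```python
# Figures copied from the `riak-admin transfers` output above.
TOTAL_BYTES = 78393086512
OBJECTS_DONE = 11440967
OBJECTS_PER_SEC = 1513
MB_PER_SEC = 1.53
BYTES_PER_MB = 1024 * 1024  # assumption: "MB" here means 2**20 bytes


def hours(seconds):
    """Convert seconds to hours."""
    return seconds / 3600.0


# Time spent so far, derived from the object count and object rate.
elapsed_hours = hours(OBJECTS_DONE / OBJECTS_PER_SEC)

# Time the full partition would need at the reported byte rate.
total_hours = hours(TOTAL_BYTES / (MB_PER_SEC * BYTES_PER_MB))

remaining_hours = total_hours - elapsed_hours
fraction_done = elapsed_hours / total_hours

print(f"elapsed:   {elapsed_hours:.2f} h")
print(f"total:     {total_hours:.2f} h")
print(f"remaining: {remaining_hours:.2f} h")
print(f"done:      {fraction_done:.0%}")
```

The elapsed time and percentage computed this way line up with the "2.10 hr ago" and 15% shown above, and the total comes to somewhere between 13 and 14 hours. In other words, this one partition would need that long without a single TCP recv timeout, which fits with the transfer never completing.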


Thanks,
Ryan Maclear