TCP recv timeout and handoffs almost all the time

Guido Medina guido.medina at temetra.com
Thu Jul 18 14:55:43 EDT 2013


Follow the white rabbit:

http://docs.basho.com/riak/latest/cookbooks/Linux-Performance-Tuning/

Most of the recommended parameters are listed on that page.
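
To check whether a node already has the values from that page applied, something like the sketch below could work. It is only an illustration: the function names are made up for this example, and the single entry in `expected` is a placeholder rather than a value taken from the page, so fill in the real parameters yourself. It assumes a Linux host where sysctl values are readable under /proc/sys.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def sysctl_path(name: str) -> Path:
    """Map a dotted sysctl name (e.g. 'net.core.somaxconn') to its /proc/sys file."""
    return Path("/proc/sys") / name.replace(".", "/")


def read_sysctl(name: str) -> Optional[str]:
    """Return the current value with whitespace normalised, or None if unreadable."""
    try:
        return " ".join(sysctl_path(name).read_text().split())
    except OSError:
        return None


def find_mismatches(
    expected: Dict[str, str], current: Dict[str, Optional[str]]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (expected, current)} for every parameter whose value differs."""
    return {
        name: (want, current.get(name))
        for name, want in expected.items()
        if current.get(name) != want
    }


if __name__ == "__main__":
    # Placeholder only: replace with the parameters from the tuning page.
    expected = {"vm.swappiness": "0"}
    current = {name: read_sysctl(name) for name in expected}
    for name, (want, have) in find_mismatches(expected, current).items():
        print(f"{name}: expected {want}, found {have}")
```

Run it on each node; parameters that already match are not printed. riak-admin diag remains the authoritative check, this is just a quick way to compare many nodes.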

HTH,

Guido.

On 18/07/13 19:48, Simon Effenberg wrote:
> Sounds like zdbbl. I'm running 1.3.1, but it started after I added 6 more
> nodes to the previously 12-node cluster. So maybe it is because it is now
> an 18-node cluster?
>
> I'll try the zdbbl stuff. Any other hints would be welcome (if the new
> kernel parameters are also good for 1.3.1, could you provide them?).
>
> Cheers
> Simon
>
> On Thu, 18 Jul 2013 19:34:18 +0100
> Guido Medina <guido.medina at temetra.com> wrote:
>
>> If what you are describing is happening on 1.4, run riak-admin diag
>> to see the new recommended kernel parameters. Also, in vm.args,
>> uncomment the +zdbbl 32768 parameter, since what you are describing is
>> similar to what happened to us when we upgraded to 1.4.
>>
>> HTH,
>>
>> Guido.
>>
>> On 18/07/13 19:21, Simon Effenberg wrote:
>>> Hi @list,
>>>
>>> I sometimes see log entries saying "hinted_handoff transfer of .. failed because of TCP recv timeout".
>>> Also, riak-admin transfers shows me many handoffs (is it possible to get some insight into "how many" handoffs happened through "riak-admin status"?).
>>>
>>> - Is it normal behavior to have up to 30 handoffs from/to different nodes?
>>> - How can I get to the bottom of the TCP recv timeout? I'm not sure if this is a network problem or if the other node is too slow. The load on the machines is OK (some IOwait, but not 100%). Could it be interfering with AAE?
>>>
>>> Here is the log output for the TCP recv timeout. It does not occur that often, but handoffs happen really often:
>>>
>>> 2013-07-18 16:22:05.654 UTC [error] <0.28933.14>@riak_core_handoff_sender:start_fold:216 hinted_handoff transfer of riak_kv_vnode from 'riak@10.46.109.207' 1118962191081472546749696200048404186924073353216 to 'riak@10.46.109.205' 1118962191081472546749696200048404186924073353216 failed because of TCP recv timeout
>>> 2013-07-18 16:22:05.673 UTC [error] <0.202.0>@riak_core_handoff_manager:handle_info:282 An outbound handoff of partition riak_kv_vnode 1118962191081472546749696200048404186924073353216 was terminated for reason: {shutdown,timeout}
>>>
>>>
>>> Thanks in advance
>>> Simon
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users at lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com




