TCP recv timeout and handoffs almost all the time

Simon Effenberg seffenberg at team.mobile.de
Fri Jul 19 10:08:44 EDT 2013


Only after restarting the Riak instance on this node were the pending
handoffs processed. This is weird :(

On Fri, 19 Jul 2013 15:55:43 +0200
Simon Effenberg <seffenberg at team.mobile.de> wrote:

> It looked good for some hours, but now we got this again:
> 
> 2013-07-19 13:27:07.800 UTC [error] <0.18747.29>@riak_core_handoff_sender:start_fold:216 hinted_handoff transfer of riak_kv_vnode from 'riak at 10.46.109.207' 1136089163393944065322395631681798128560666312704 to 'riak at 10.47.109.202' 1136089163393944065322395631681798128560666312704 failed because of TCP recv timeout
> 
> and on the destination host I see:
> 
> 
> 2013-07-19 13:25:04.455 UTC [error] <0.28632.25>@riak_core_handoff_receiver:handle_info:80 Handoff receiver for partition 1136089163393944065322395631681798128560666312704 exited abnormally after processing 2 objects: {timeout,{gen_fsm,sync_send_all_state_event,[<0.1107.0>,{handoff_data,<<141,146,205,110,211,64,20,133,237,4,211,132,2,170,80,69,37,150,22,203,186,216,249,105,210,172,42,149,95,137,162,2,5,177,129,232,120,102,156,153,137,61,78,237,113,72,10,172,186,101,195,51,176,224,1,120,12,158,130,55,97,198,173,68,83,177,192,35,223,197,55,231,156,185,158,235,27,155,36,87,115,86,148,208,34,87,227,146,145,130,233,242,206,173,46,153,204,59,60,18,125,61,91,208,123,223,188,51,190,70,157,86,49,206,99,201,136,206,28,199,249,167,209,110,172,122,83,67,92,222,164,78,187,24,27,135,102,74,243,54,117,174,81,65,52,60,108,152,213,194,17,66,190,33,175,60,220,189,204,108,78,195,150,117,123,198,205,139,168,64,47,103,12,26,12,11,83,31,96,134,20,128,128,170,245,91,86,186,254,46,120,37,48,13,222,30,99,130,1,158,152,213,67,132,199,168,240,26,7,72,12,123,134,23,198,25,154,247,33,30,225,16,18,39,56,56,63,210,173,139,205,241,132,162,108,33,175,226,205,139,248,231,40,117,112,152,83,145,8,70,121,51,54,134,15,177,211,252,252,59,118,218,223,127,94,114,93,183,174,53,194,81,148,76,227,13,142,77,43,1,134,82,90,254,227,147,111,238,212,31,69,219,126,44,168,63,242,211,124,206,210,101,86,149,130,116,250,251,147,12,34,221,33,121,230,111,251,101,189,207,243,100,63,143,89,161,4,83,59,148,25,30,151,6,79,39,162,43,62,46,79,213,105,181,103,181,150,173,140,197,64,208,58,33,234,134,123,195,97,212,11,13,210,70,23,117,7,189,78,103,216,31,12,118,67,211,6,169,69,187,211,98,113,50,226,18,75,213,77,184,255,229,252,115,120,195,246,220,58,186,251,244,236,101,182,117,159,55,224,42,207,193,215,247,191,110,203,191,67,118,255,127,200,114,229,122,169,227,145,148,65,153,32,93,84,76,74,243,19,85,102,8,137,80,140,254,1>>},60000]}}
> 
> So both sides show a timeout. How can I track this down?
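
To see which node pairs are affected, the sender-side errors can be tallied from console.log. This is only a minimal Python sketch, assuming the lines look like the riak_core_handoff_sender entry quoted above; the name tally_handoff_failures and the regular expression are my own and not part of Riak.

```python
import re
from collections import Counter

# Matches the message part of a riak_core_handoff_sender error line, e.g.
# "hinted_handoff transfer of riak_kv_vnode from '<node>' <index>
#  to '<node>' <index> failed because of TCP recv timeout"
HANDOFF_FAILURE = re.compile(
    r"(?P<kind>\w+) transfer of (?P<vnode>\w+) "
    r"from '(?P<src>[^']+)' (?P<src_idx>\d+) "
    r"to '(?P<dst>[^']+)' (?P<dst_idx>\d+) "
    r"failed because of (?P<reason>.+)$"
)


def tally_handoff_failures(lines):
    """Count failed handoffs per (source node, destination node, reason)."""
    counts = Counter()
    for line in lines:
        match = HANDOFF_FAILURE.search(line.rstrip())
        if match:
            counts[(match["src"], match["dst"], match["reason"])] += 1
    return counts
```

Passing an open log file (any iterable of lines) returns a Counter, so `most_common()` lists the worst node pairs first.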
> 
> - Could this happen when many read repairs occur (through AAE)?
> 
> Also, our FSM PUT time keeps rising while the GET time barely changes. Is this normal behavior under load or during read repair?
> 
> Also, is this a bigger problem with eLevelDB, or would it be the same with Bitcask?
> 
> Cheers
> Simon
> 
> 
> On Fri, 19 Jul 2013 10:25:05 +0200
> Simon Effenberg <seffenberg at team.mobile.de> wrote:
> 
> > once again with the list included... argh
> > 
> > Hey Christian,
> > 
> > So it could also be an Erlang limit? I found out why my Riak instances
> > all have different process limits: my mcollectived daemons have
> > different limits, and when I triggered a Puppet run through
> > mcollective, the Riak instances picked up that process limit as well.
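
A quick way to check which limit a running process really inherited is to read its limits file. The following is a rough Python sketch, assuming Linux and the usual /proc/<pid>/limits layout (limit name, soft limit, hard limit, units); the function names are made up for illustration.

```python
def _to_int(value):
    """Convert a limits column to int; 'unlimited' becomes None."""
    return None if value == "unlimited" else int(value)


def max_processes(limits_text):
    """Return (soft, hard) for the 'Max processes' row, or None if absent.

    limits_text is the content of /proc/<pid>/limits for the process
    of interest, for example the Erlang VM that runs Riak.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            # Columns: "Max", "processes", soft, hard, units
            soft, hard = line.split()[2:4]
            return _to_int(soft), _to_int(hard)
    return None
```

Running it against the limits file of every Riak node makes differing limits easy to spot. Note that this is the operating system limit, which is separate from the Erlang +P setting.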
> > 
> > Also in the crash log I see:
> > 
> > exception exit: {{system_limit,[{erlang,spawn
> > 
> > for the "too many processes" error. So it doesn't look like an Erlang
> > limit, does it? But I will keep this +P in mind! Thanks a lot.
> > 
> > The zdbbl is now at 100MB.
> > 
> > Cheers
> > Simon
> > 
> > On Fri, 19 Jul 2013 08:49:50 +0100
> > Christian Dahlqvist <christian at basho.com> wrote:
> > 
> > > Hi Simon,
> > > 
> > > If you have objects that can be as big as 15MB, it is probably wise to increase the size of +zdbbl in order to avoid filling up buffers when these large objects need to be transferred between nodes. The appropriate value depends a lot on the size distribution of your data and your access patterns, so I would recommend benchmarking to find a suitable one.
> > > 
> > > Erlang also has a default process limit of 32768 (at least in R15B01), which may be what you are hitting. You can override this to 256k by adding the following line to the vm.args file:
> > > 
> > >     +P 262144
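
Both flags live in vm.args, and as far as I can tell from the Erlang erl documentation, +zdbbl is given in kilobytes. Here is a small Python sketch for reading the two values back and for converting a size in MB; the helper names are made up for illustration, and treating lines that start with # as comments is an assumption about the file layout.

```python
def read_vm_flags(text, flags=("+P", "+zdbbl")):
    """Return {flag: int value} for the given flags found in vm.args text."""
    found = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and commented-out settings.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] in flags and parts[1].isdigit():
            found[parts[0]] = int(parts[1])
    return found


def zdbbl_kb_for_megabytes(megabytes):
    """+zdbbl takes kilobytes, so a buffer of N MB is N * 1024."""
    return megabytes * 1024
```

With these units, a 100MB buffer corresponds to +zdbbl 102400, and +zdbbl 32768 corresponds to 32MB.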
> > > 
> > > Best regards,
> > > 
> > > Christian
> > > 
> > > 
> > > 
> > > On 19 Jul 2013, at 08:24, Simon Effenberg <seffenberg at team.mobile.de> wrote:
> > > 
> > > > The +zdbbl parameter helped a lot, but the hinted handoffs didn't
> > > > disappear completely. I have no more busy dist port errors in the
> > > > _console.log_ (why aren't they in the error.log? To me they look like a
> > > > serious problem; at least our cluster was not behaving well).
> > > > 
> > > > I'll try to increase the buffer size further because my guess is
> > > > that objects sent from one node to another are also stored in it,
> > > > and we sometimes have objects of up to 15MB.
> > > > 
> > > > But in the last 6 hours I also saw some crashes, on only two machines,
> > > > complaining about too many processes:
> > > > 
> > > > ----------------
> > > > console.log
> > > > 2013-07-19 02:04:21.962 UTC [error] <0.12813.29> CRASH REPORT Process <0.12813.29> with 15 neighbours exited with reason: {system_limit
> > > > 
> > > > crash.log
> > > > 2013-07-19 02:04:21 UTC =ERROR REPORT====
> > > > Too many processes
> > > > ----------------
> > > > 
> > > > The process has a process limit of 95142, so I will increase it now. But I never saw any information about such problems on the Linux tuning page. Am I missing something?
> > > > 
> > > > Cheers
> > > > Simon
> > > > 
> > > > 
> > > > On Thu, 18 Jul 2013 19:34:18 +0100
> > > > Guido Medina <guido.medina at temetra.com> wrote:
> > > > 
> > > > >> If what you are describing is happening on 1.4, run riak-admin diag 
> > > > >> and check the new recommended kernel parameters. Also, in vm.args, 
> > > > >> uncomment the +zdbbl 32768 parameter, since what you are describing is 
> > > > >> similar to what happened to us when we upgraded to 1.4.
> > > >> 
> > > >> HTH,
> > > >> 
> > > >> Guido.
> > > >> 
> > > >> On 18/07/13 19:21, Simon Effenberg wrote:
> > > >>> Hi @list,
> > > >>> 
> > > > >>> I sometimes see log entries saying "hinted_handoff transfer of .. failed because of TCP recv timeout".
> > > > >>> Also, riak-admin transfers shows me many handoffs (is it possible to see how many handoffs happened through "riak-admin status"?).
> > > > >>> 
> > > > >>> - Is it normal to have up to 30 handoffs from/to different nodes?
> > > > >>> - How can I get to the bottom of the TCP recv timeout? I'm not sure if this is a network problem or if the other node is too slow. The load is OK on the machines (some IOwait but not 100%). Maybe it is interfering with AAE?
> > > > >>> 
> > > > >>> Here is the log output for the TCP recv timeout. It does not occur that often, but handoffs happen really often:
> > > >>> 
> > > >>> 2013-07-18 16:22:05.654 UTC [error] <0.28933.14>@riak_core_handoff_sender:start_fold:216 hinted_handoff transfer of riak_kv_vnode from 'riak at 10.46.109.207' 1118962191081472546749696200048404186924073353216 to 'riak at 10.46.109.205' 1118962191081472546749696200048404186924073353216 failed because of TCP recv timeout
> > > >>> 2013-07-18 16:22:05.673 UTC [error] <0.202.0>@riak_core_handoff_manager:handle_info:282 An outbound handoff of partition riak_kv_vnode 1118962191081472546749696200048404186924073353216 was terminated for reason: {shutdown,timeout}
> > > >>> 
> > > >>> 
> > > >>> Thanks in advance
> > > >>> Simon
> > > >>> 
> > > >>> _______________________________________________
> > > >>> riak-users mailing list
> > > >>> riak-users at lists.basho.com
> > > >>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > >> 
> > > >> 
> > > > 
> > > > 
> > > > -- 
> > > > Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> > > > Fon:     + 49-(0)30-8109 - 7173
> > > > Fax:     + 49-(0)30-8109 - 7131
> > > > 
> > > > Mail:     seffenberg at team.mobile.de
> > > > Web:    www.mobile.de
> > > > 
> > > > Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> > > > 
> > > > 
> > > > Geschäftsführer: Malte Krüger
> > > > HRB Nr.: 18517 P, Amtsgericht Potsdam
> > > > Sitz der Gesellschaft: Kleinmachnow 
> > > > 
> > > 
> > 
> > 
> 
> 


-- 
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon:     + 49-(0)30-8109 - 7173
Fax:     + 49-(0)30-8109 - 7131

Mail:     seffenberg at team.mobile.de
Web:    www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany


Geschäftsführer: Malte Krüger
HRB Nr.: 18517 P, Amtsgericht Potsdam
Sitz der Gesellschaft: Kleinmachnow 



