TCP recv timeout and handoffs almost all the time

Simon Effenberg seffenberg at team.mobile.de
Fri Jul 19 03:35:05 EDT 2013


Wow, now I have something to search for:

riak46-1 Max processes             unlimited            unlimited            processes 
riak46-2 Max processes             unlimited            unlimited            processes 
riak46-3 Max processes             unlimited            unlimited            processes 
riak46-4 Max processes             unlimited            unlimited            processes 
riak46-5 Max processes             unlimited            unlimited            processes 
riak46-6 Max processes             unlimited            unlimited            processes 
riak46-7 Max processes             95142                95142                processes 
riak46-8 Max processes             unlimited            unlimited            processes 
riak46-9 Max processes             95142                95142                processes 
riak47-1 Max processes             191896               191896               processes 
riak47-2 Max processes             192920               192920               processes 
riak47-3 Max processes             unlimited            unlimited            processes 
riak47-4 Max processes             unlimited            unlimited            processes 
riak47-5 Max processes             unlimited            unlimited            processes 
riak47-6 Max processes             unlimited            unlimited            processes 
riak47-7 Max processes             95142                95142                processes 
riak47-8 Max processes             95142                95142                processes 
riak47-9 Max processes             95142                95142                processes 


riak46-{7..9}, riak47-1 and riak47-{7..9} were reinstalled quite recently, but all of them with puppet, and in theory there is nothing special about them compared to the other ones.

I need to have a look and probably try to enforce an "unlimited" process limit.
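A minimal sketch of how the listing above could be checked automatically, assuming each line has the shape "<host> Max processes <soft> <hard> processes" (the per-host prefix and the helper names parse_limits and hosts_with_finite_limit are made up for illustration; the sample is a subset of the lines above):

```python
# Sample lines in the same shape as the listing above:
# "<host> Max processes <soft> <hard> processes"
SAMPLE = """\
riak46-6 Max processes             unlimited            unlimited            processes
riak46-7 Max processes             95142                95142                processes
riak47-1 Max processes             191896               191896               processes
riak47-3 Max processes             unlimited            unlimited            processes
"""


def _to_limit(value):
    """Map 'unlimited' to None and anything else to an int."""
    return None if value == "unlimited" else int(value)


def parse_limits(text):
    """Return {host: (soft, hard)}; a limit of None means 'unlimited'."""
    limits = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected fields: host, "Max", "processes", soft, hard, "processes"
        if len(parts) != 6 or parts[1:3] != ["Max", "processes"]:
            continue  # skip anything that is not a "Max processes" line
        limits[parts[0]] = (_to_limit(parts[3]), _to_limit(parts[4]))
    return limits


def hosts_with_finite_limit(limits):
    """Hosts whose soft process limit is a number rather than 'unlimited'."""
    return sorted(host for host, (soft, _) in limits.items() if soft is not None)


if __name__ == "__main__":
    print(hosts_with_finite_limit(parse_limits(SAMPLE)))
```

Run against the full listing, this would single out the nodes that still need their limit raised.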

Cheers
Simon

On Fri, 19 Jul 2013 09:24:07 +0200
Simon Effenberg <seffenberg at team.mobile.de> wrote:

> The +zdbbl parameter helped a lot, but the hinted handoffs didn't
> disappear completely. I no longer see busy dist port errors in the
> _console.log_ (why aren't they in the error.log? To me it looks like a
> serious problem; at least our cluster was not behaving well at
> all).
> 
> I'll try to increase the buffer size to a higher value, because my
> guess is that the objects sent from one node to another are also
> stored in it, and we sometimes have objects of up to 15MB.
> 
> But in the last 6 hours I have also seen some crashes, on only two
> machines, complaining about too many processes:
> 
> ----------------
> console.log
> 2013-07-19 02:04:21.962 UTC [error] <0.12813.29> CRASH REPORT Process <0.12813.29> with 15 neighbours exited with reason: {system_limit
> 
> crash.log
> 2013-07-19 02:04:21 UTC =ERROR REPORT====
> Too many processes
> ----------------
> 
> The process has a process limit of 95142, so I will increase it now. But I never saw any information about such problems on the Linux tuning page. Am I missing something?
> 
> Cheers
> Simon
> 
> 
> On Thu, 18 Jul 2013 19:34:18 +0100
> Guido Medina <guido.medina at temetra.com> wrote:
> 
> > If what you are describing is happening on 1.4, run riak-admin diag 
> > and check the new recommended kernel parameters. Also, in vm.args, 
> > uncomment the +zdbbl 32768 parameter, since what you are describing is 
> > similar to what happened to us when we upgraded to 1.4.
> > 
> > HTH,
> > 
> > Guido.
> > 
> > On 18/07/13 19:21, Simon Effenberg wrote:
> > > Hi @list,
> > >
> > > I sometimes see log entries saying "hinted_handoff transfer of .. failed because of TCP recv timeout".
> > > Also, riak-admin transfers shows me many handoffs (is it possible to get some insight into how many handoffs happened through "riak-admin status"?).
> > >
> > > - Is it normal behavior to have up to 30 handoffs from/to different nodes?
> > > - How can I get to the bottom of the TCP recv timeout problem? I'm not sure if this is a network problem or if the other node is too slow. The load on the machines is ok (some IOwait, but not 100%). Could it be interference from AAE?
> > >
> > > Here is the log output about the TCP recv timeout. The timeout itself does not occur that often, but handoffs happen really often:
> > >
> > > 2013-07-18 16:22:05.654 UTC [error] <0.28933.14>@riak_core_handoff_sender:start_fold:216 hinted_handoff transfer of riak_kv_vnode from 'riak@10.46.109.207' 1118962191081472546749696200048404186924073353216 to 'riak@10.46.109.205' 1118962191081472546749696200048404186924073353216 failed because of TCP recv timeout
> > > 2013-07-18 16:22:05.673 UTC [error] <0.202.0>@riak_core_handoff_manager:handle_info:282 An outbound handoff of partition riak_kv_vnode 1118962191081472546749696200048404186924073353216 was terminated for reason: {shutdown,timeout}
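
To see whether the timeouts cluster on particular node pairs, log lines like the first one above could be tallied with a small script. This is only a sketch: the regular expression is derived from that single sample line, node names are assumed to be in name@host form (the list archive renders "@" as " at "), and the helper names parse_failure and count_by_pair are made up for illustration.

```python
import re
from collections import Counter

# One sample line in the shape of the riak_core_handoff_sender error above.
LINE = (
    "2013-07-18 16:22:05.654 UTC [error] "
    "<0.28933.14>@riak_core_handoff_sender:start_fold:216 "
    "hinted_handoff transfer of riak_kv_vnode "
    "from 'riak@10.46.109.207' 1118962191081472546749696200048404186924073353216 "
    "to 'riak@10.46.109.205' 1118962191081472546749696200048404186924073353216 "
    "failed because of TCP recv timeout"
)

PATTERN = re.compile(
    r"(?P<kind>\w+) transfer of (?P<vnode>\w+) "
    r"from '(?P<src>[^']+)' (?P<src_idx>\d+) "
    r"to '(?P<dst>[^']+)' (?P<dst_idx>\d+) "
    r"failed because of (?P<reason>.+)$"
)


def parse_failure(line):
    """Return the fields of a failed-transfer line as a dict, or None."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


def count_by_pair(lines):
    """Count failed transfers per (source node, destination node) pair."""
    pairs = Counter()
    for line in lines:
        record = parse_failure(line)
        if record:
            pairs[(record["src"], record["dst"])] += 1
    return pairs


if __name__ == "__main__":
    print(parse_failure(LINE))
    print(count_by_pair([LINE, "unrelated line", LINE]))
```

If one destination dominates the counts, that would point at a slow node rather than a general network problem.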
> > >
> > >
> > > Thanks in advance
> > > Simon
> > >
> > > _______________________________________________
> > > riak-users mailing list
> > > riak-users at lists.basho.com
> > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > 
> > 
> 
> 
> -- 
> Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> Fon:     + 49-(0)30-8109 - 7173
> Fax:     + 49-(0)30-8109 - 7131
> 
> Mail:     seffenberg at team.mobile.de
> Web:    www.mobile.de
> 
> Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> 
> 
> Geschäftsführer: Malte Krüger
> HRB Nr.: 18517 P, Amtsgericht Potsdam
> Sitz der Gesellschaft: Kleinmachnow 
> 

