Handoffs are too slow after netsplit

Charlie Voiselle cvoiselle at basho.com
Fri Feb 24 12:20:19 EST 2017


Andrey:

Another thing that can stall handoff is running commands that reset the vnode inactivity timer. The `riak-admin vnode-status` command resets this timer, so it should never be run more frequently than the vnode inactivity timeout; if it is, handoff can stall indefinitely. We have seen this before at a customer site where they were collecting the statistics from the vnode-status command into their metrics system.
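
A toy model shows why polling faster than the timeout stalls handoff for as long as the polling continues. The Python sketch below is a simplification, not a description of riak_core's implementation. It assumes that every request reaching a vnode (including a status query) resets a single inactivity timer, and that handoff may begin only once that timer expires. `polling_times` and `first_handoff_time` are hypothetical helpers.

```python
from typing import Iterable, List, Optional


def polling_times(interval: float, horizon: float) -> List[float]:
    """Times (seconds) at which a hypothetical metrics job queries the vnode."""
    count = int(horizon // interval)
    return [k * interval for k in range(1, count + 1)]


def first_handoff_time(activity_times: Iterable[float], timeout: float,
                       horizon: float) -> Optional[float]:
    """Earliest time the vnode has been idle for `timeout` seconds, or None.

    Toy model (an assumption, not riak_core code): the partition heals at
    t = 0, any activity resets the inactivity timer, and handoff may start
    as soon as the timer expires. A request arriving exactly when the timer
    expires is treated as too late to reset it.
    """
    last_activity = 0.0
    for t in sorted(activity_times):
        if t - last_activity >= timeout:
            break  # the timer expired before this request arrived
        last_activity = t  # the request resets the timer
    expiry = last_activity + timeout
    return expiry if expiry <= horizon else None


# 60 s timeout, observed for one hour after the partition heals.
print(first_handoff_time(polling_times(30, 3600), timeout=60, horizon=3600))  # None
print(first_handoff_time(polling_times(90, 3600), timeout=60, horizon=3600))  # 60.0
```

Polling every 30 s never leaves a 60 s idle gap, so the timer never fires. Polling every 90 s leaves the vnode idle for the first 60 s, so handoff can begin at t = 60.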

Regards,
Charlie Voiselle


> On Feb 23, 2017, at 7:14 AM, Douglas Rohrer <drohrer at basho.com> wrote:
> 
> Andrey:
> 
> It's waiting for 60 seconds, literally...
> 
> Handoff is not initiated until a vnode has been inactive for the specified inactivity period. See how vnode_inactivity_timeout is used in riak_core: https://github.com/basho/riak_core/search?utf8=%E2%9C%93&q=vnode_inactivity_timeout
> 
> For demonstration purposes, if you want to reduce this time, you could lower the riak_core.vnode_inactivity_timeout setting, which can be set in advanced.config. Also note that, depending on the backend you use, other settings that are set lower than the vnode inactivity timeout can actually prevent handoff completely; see http://docs.basho.com/riak/kv/2.2.0/setup/planning/backend/bitcask/#sync-strategy for an example.
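
For a demo, lowering the timeout might look like the advanced.config fragment below. This is a sketch that assumes the value is given in milliseconds (riak_core's default is 60000, matching the 60 seconds mentioned above). The value 10000 is an arbitrary illustration. If advanced.config already has a riak_core section, the setting belongs inside that section rather than in a second one.

```erlang
%% advanced.config: a list of {Application, [{Key, Value}]} tuples,
%% terminated by a period.
[
 {riak_core, [
     %% Milliseconds a vnode must be idle before handoff can start.
     %% 10000 (10 s) is an illustrative demo value; the default is 60000.
     {vnode_inactivity_timeout, 10000}
 ]}
].
```

advanced.config is read when a node starts, so each node needs a restart to pick up the new value. A timeout this short is meant for demos, not production clusters.
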
> 
> Hope this helps.
> 
> Doug
> 
> On Thu, Feb 23, 2017 at 6:40 AM Andrey Ershov <andrershov at gmail.com> wrote:
> Hi, guys!
> 
> I'd like to follow up on handoff behaviour after a netsplit. The problem is that right after the network partition is healed, the "riak-admin transfers" command says that there are X partitions waiting to transfer from one node to another, and Y partitions waiting to transfer in the opposite direction. What are they waiting for? The Active Transfers section is always empty. It takes about one minute for the transfers to occur. I've increased transfer_limit to 100, and it does not help.
> I've also tried attaching to the Erlang VM and executing riak_core_vnode_manager:force_handoff() on each node. The command returns 'ok', but it seems to have no effect right after the network is healed. After some time (30-60 s), force_handoff() works as expected, but that is the same latency as in the automatic handoff case.
> 
> So what is it waiting for? Any ideas?
> 
> I'm preparing a live coding demo to be shown at a conference, so waiting a whole minute for handoff to occur for just a couple of keys is too long...
> -- 
> Thanks,
> Andrey
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
