Simultaneous handoff and merge

Joe Caswell jcaswell at basho.com
Thu Apr 18 17:15:35 EDT 2013


Yuri,

  Bitcask merging is normal, but combined with incoming handoff, it may be
overloading the node.
Two things you might try:
- Reduce handoff_concurrency to 1 on all nodes to lessen the impact of
  handoff (http://docs.basho.com/riak/latest/references/Configuration-Files/)
- Restrict when Bitcask is allowed to merge by setting merge_window on
  the node that is being overloaded
  (http://docs.basho.com/riak/latest/tutorials/choosing-a-backend/Bitcask/#Configuring-Bitcask)
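
For reference, here is a rough sketch of how those two settings might look
in app.config. The 1-to-5 merge window is only an illustrative assumption;
pick hours that match your own low-traffic period. The comments stand in for
whatever settings each section already contains, and changes to app.config
are picked up when the node is restarted.

```erlang
%% app.config (fragment): merge these entries into the existing sections
%% rather than replacing the file.
[
 {riak_core, [
     %% Allow only one concurrent handoff so transfers compete less
     %% with other work on the node.
     {handoff_concurrency, 1}
     %% ... existing riak_core settings ...
 ]},
 {bitcask, [
     %% Only allow merges during these hours of the day (0-23).
     %% {1, 5} is an example value, not a recommendation.
     {merge_window, {1, 5}}
     %% ... existing bitcask settings ...
 ]}
].
```

merge_window also accepts the atoms always and never, so setting it to never
on the overloaded node until handoff completes is another option to weigh.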

Joe Caswell

From:  Yuri Lukyanov <snaky at aboutecho.com>
Date:  Thursday, April 18, 2013 5:07 AM
To:  "riak-users at lists.basho.com" <riak-users at lists.basho.com>
Subject:  Simultaneous handoff and merge

Hi,

I have a cluster of 17 riak (1.2.1) nodes with bitcask as a backend.

Recently one of the nodes was down for a while. After the node was
restarted, the cluster began doing handoffs as expected. But then a merge
process also began on the same node. I know this from log messages like
this:

2013-04-18 08:14:09.061 [info] <0.22952.79> Merged
["/var/lib/riak/bitcask/496682197061674038608283517368424307461195825152"


And then something went wrong (the logs on the same node):


2013-04-18 08:39:22.217 [error] <0.31842.70> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.4000.80> exit with reason {timeout,{gen_server,call,[riak_core_handoff_manager,{add_outbound,riak_kv_vnode,208378163135070142634509751539626289911881007104,riak at nsto2r5,<0.4000.80>}]}} in context child_terminated

2013-04-18 08:42:46.067 [error] <0.5154.80> gen_server <0.5154.80>
terminated with reason:
{timeout,{gen_server,call,[riak_core_handoff_manager,{add_inbound,[]}]}}
2013-04-18 08:42:52.790 [error] <0.5154.80> CRASH REPORT Process
riak_core_handoff_listener with 1 neighbours exited with reason:
{timeout,{gen_server,call,[riak_core_handoff_manager,{add_inbound,[]}]}} in
gen_server:terminate/6 line 747
2013-04-18 08:42:53.450 [error] <0.31847.70> Supervisor
riak_core_handoff_listener_sup had child riak_core_handoff_listener started
with riak_core_handoff_listener:start_link() at <0.5154.80> exit with reason
{timeout,{gen_server,call,[riak_core_handoff_manager,{add_inbound,[]}]}} in
context child_terminated


The node itself intermittently reported as not running:

# riak-admin ring-status
Node is not running!

The beam process was still running though.

Maybe it's not related to handoffs and merging; that was just a guess.


Any information and advice on this would be greatly appreciated. It's still
happening right now, and I can gather more details if someone wants me to.

Thanks in advance.