Simultaneous handoff and merge

Yuri Lukyanov snaky at aboutecho.com
Sat Apr 20 03:33:35 EDT 2013


More information on this.

We have merge_window set to {6, 9}. Every day during this window our
cluster is heavily overloaded, and ring-status often shows many nodes
unreachable.

What would you suggest? Would {merge_window, always} be better, since the
nodes would then be merging at different times? I still have concerns about
this: even when only one node is merging, it looks like the whole cluster
is significantly affected.
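
To make the comparison concrete, here is a minimal sketch of the two
settings, assuming the usual bitcask section of app.config. The data_root
is taken from the path in the log messages below; all other settings are
omitted, and the comments reflect my understanding of the option rather
than anything confirmed.

```erlang
%% app.config, bitcask section (sketch only, other settings omitted)
{bitcask, [
    {data_root, "/var/lib/riak/bitcask"},

    %% Current setting: merges are allowed only between hours 6 and 9
    %% (local time on each node), so every node merges in the same window.
    {merge_window, {6, 9}}

    %% Alternative being considered: no time restriction, a node merges
    %% whenever its merge triggers fire.
    %% {merge_window, always}
]}
```

Only one merge_window entry should be active at a time; the alternative is
left commented out here.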


On Thu, Apr 18, 2013 at 1:07 PM, Yuri Lukyanov <snaky at aboutecho.com> wrote:

> Hi,
>
> I have a cluster of 17 Riak (1.2.1) nodes with bitcask as a backend.
>
> Recently one of the nodes was down for a while. After the node was
> restarted, the cluster started doing handoffs as expected. But then a merge
> process also began on the same node. I know this from log messages like
> this:
>
> 2013-04-18 08:14:09.061 [info] <0.22952.79> Merged
> ["/var/lib/riak/bitcask/496682197061674038608283517368424307461195825152"
>
>
> And then something went wrong (the logs on the same node):
>
>
> 2013-04-18 08:39:22.217 [error] <0.31842.70> Supervisor
> riak_core_vnode_sup had child undefined started with
> {riak_core_vnode,start_link,undefined} at <0.4000.80> exit with reason
> {timeout,{gen_server,call,[riak_core_handoff_manager,{add_outbound,riak_kv_vnode,208378163135070142634509751539626289911881007104,riak@nsto2r5,<0.4000.80>}]}}
> in context child_terminated
>
> 2013-04-18 08:42:46.067 [error] <0.5154.80> gen_server <0.5154.80>
> terminated with reason:
> {timeout,{gen_server,call,[riak_core_handoff_manager,{add_inbound,[]}]}}
> 2013-04-18 08:42:52.790 [error] <0.5154.80> CRASH REPORT Process
> riak_core_handoff_listener with 1 neighbours exited with reason:
> {timeout,{gen_server,call,[riak_core_handoff_manager,{add_inbound,[]}]}} in
> gen_server:terminate/6 line 747
> 2013-04-18 08:42:53.450 [error] <0.31847.70> Supervisor
> riak_core_handoff_listener_sup had child riak_core_handoff_listener started
> with riak_core_handoff_listener:start_link() at <0.5154.80> exit with
> reason
> {timeout,{gen_server,call,[riak_core_handoff_manager,{add_inbound,[]}]}} in
> context child_terminated
>
>
> The node itself was disappearing from time to time:
>
> # riak-admin ring-status
> Node is not running!
>
> The beam process was still running though.
>
> Maybe it's not related to handoffs & merge. It was just a guess.
>
>
> Any information and advice on this would be greatly appreciated. It's
> still happening right now, and I can gather more details if someone wants
> me to.
>
> Thanks in advance.
>