Handoff stalled on 1.0.2 riak cluster

Mark Phillips mark at basho.com
Sun Jun 3 22:43:20 EDT 2012


Hi John,

Assuming things aren't back to normal... A few things:

Attach to any running node and run this:

rpc:multicall([node() | nodes()], riak_core_vnode_manager, force_handoffs, []).

This will attempt to force handoff. If this restarts handoff, you've
got a new issue that we'll need to track down. Please report back if
this gets handoffs running again.
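
If you want to confirm that the call actually reached every node, here
is a minimal sketch for the same attached shell. It relies only on the
fact that rpc:multicall/4 returns a {Results, BadNodes} tuple; the
variable names are just illustrative. Note that it runs force_handoffs
again, so use it instead of the one-liner above rather than after it.

```erlang
%% Forget any earlier shell bindings so the pattern match below cannot fail.
f(Results), f(BadNodes).
%% Same call as above, but keep the per-node results and the unreachable nodes.
{Results, BadNodes} = rpc:multicall([node() | nodes()], riak_core_vnode_manager, force_handoffs, []).
%% An empty list here means every node answered the call.
BadNodes.
```

Any node listed in BadNodes did not run force_handoffs at all, which
would be worth mentioning when you report back.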

Another possible fix:

Take a look at https://github.com/basho/riak_core/pull/153

This was fixed in 1.1, but it might be what's hitting you (though,
admittedly, your issue does seem like a perfect match for the issue
from the 1.0.2 release notes).

If this is what's ailing you, there's a work-around here:
https://github.com/basho/riak_core/pull/153#issuecomment-4527706

If neither of these works, let us know and we'll take a deeper look.
Specifically:

a) any log files you could send along would be helpful
b) the output of the following diagnostic:

f(Members).
Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
[{N, rpc:call(N, riak_core_handoff_manager, status, [])} || N <- Members].
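
If the output of that diagnostic is long, a variant along these lines
may make it easier to spot nodes where the call itself failed. This is
only a sketch: it assumes nothing about what status returns on success
and relies only on rpc:call/4 returning {badrpc, Reason} on failure.
Statuses is just an illustrative variable name.

```erlang
%% Forget any earlier shell bindings.
f(Members), f(Statuses).
Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
%% Keep the per-node results so they can be inspected more than once.
Statuses = [{N, rpc:call(N, riak_core_handoff_manager, status, [])} || N <- Members].
%% Nodes where the rpc call failed (node down, module missing, and so on).
[N || {N, {badrpc, _}} <- Statuses].
```

An empty list from the last line means every member answered; the full
results are still in Statuses to paste into your reply.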

Thanks, John.

Mark



On Sun, Jun 3, 2012 at 5:06 AM, John Axel Eriksson <john at insane.se> wrote:

> Hi.
>
> We had an issue where one of the riak servers died (had to be force
> removed from cluster). After we did that things got really bad and most
> data was unreachable for hours. I added a new node to replace the old one
> at one point as well - that never got any data and even now about a day
> later it hasn't gotten any data.
> What seems to be the issue now is that a few nodes are waiting
> on handoff of 1 partition. When I look at ring_status I see this:
>
> Attempting to restart script through sudo -u riak
> ================================== Claimant ===================================
> Claimant:  'riak at r-001.x.x.x
> Status:     up
> Ring Ready: true
>
> ============================== Ownership Handoff ==============================
> Owner:      riak at r-004.x.x.x
> Next Owner: riak at r-003.x.x.x
>
> Index: 930565495644285842450002452081070828921550798848
>   Waiting on: []
>   Complete:   [riak_kv_vnode,riak_pipe_vnode,riak_search_vnode]
>
>
> -------------------------------------------------------------------------------
>
> ============================== Unreachable Nodes ==============================
> All nodes are up and reachable
>
>
> Ok, so it looks like the problem described in the Release Notes for 1.0.2
> here https://github.com/basho/riak/blob/1.0.2-release/RELEASE-NOTES.org.
> Unfortunately I've run that code (through riak attach) with no result.
>
> It's been in this state for 12 hours now I think. What can we do to fix
> our cluster?
>
> I upgraded to 1.0.3 hoping it would fix our problems but that didn't help.
> I cannot upgrade to 1.1.x because we mainly use Luwak for large object
> support and that's discontinued in 1.1.x as far as I know.
>
> Thanks for your help,
> John