Handoff stalled on 1.0.2 riak cluster

John Axel Eriksson john at insane.se
Mon Jun 4 03:10:58 EDT 2012


Thanks Mark,

I think I've pinpointed the problem. When one node became unresponsive, the
rest of our cluster went down with it. The nodes started crashing, and kept
crashing about a minute after each restart (the logs said something about
neighbours having crashed).

So in a panic I disabled riak search, since we don't use it and I think I
saw some mention of it in the logs. After a few hours we got things running
again, but unfortunately we're not entirely sure what the cause was; it might
have been running out of file descriptors. I then added a new node, which
never got any data, and the handoff was stalled. Nothing worked to get it to
"unstall"... until I remembered that I had disabled riak search. As soon as I
enabled it again, the cluster started behaving as expected. Exactly why
that was I don't know.
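
For anyone wanting to rule out the file descriptor theory mentioned above, here is a minimal sketch of checking the open-file limits from a POSIX shell. The function name fd_limits is only illustrative and is not part of Riak. It reports the limits of the shell it runs in, which may differ from those of an already running Riak process.

```shell
# Hypothetical helper (not a Riak command): print the soft and hard
# open-file limits of the current shell session.
fd_limits() {
  printf 'soft=%s hard=%s\n' "$(ulimit -Sn)" "$(ulimit -Hn)"
}

fd_limits
```

To get a meaningful answer, this should be run as the same user that starts Riak (the ring_status output quoted below suggests that is the riak user), since limits are per user and per session.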

Also, how stable/unstable would you say Luwak is? We depend heavily on it;
we know it's not supported anymore, and we haven't yet found a good
replacement. Should we be worried about our data? We've got maybe 700-800
GB in the cluster, in large files from 2 MB to 700-800 MB.

Best,
John

On Mon, Jun 4, 2012 at 4:43 AM, Mark Phillips <mark at basho.com> wrote:

> Hi John,
>
> Assuming things aren't back to normal... A few things:
>
> Attach to any running node and run this:
>
> rpc:multicall([node() | nodes()], riak_core_vnode_manager, force_handoffs, []).
>
> This will attempt to force handoff. If this restarts handoff, you've got a new issue that we'll need to track down. Please report back if this gets handoffs running again.
>
> Another possible fix:
>
> Take a look at https://github.com/basho/riak_core/pull/153
>
> This was fixed on 1.1, but it might be what's hitting you (though, admittedly, your issue does seem like a perfect match for the issue from the 1.0.2 release notes).
>
> If this is what's ailing you, there's a work-around here:
> https://github.com/basho/riak_core/pull/153#issuecomment-4527706
>
> If neither of these works, let us know and we'll take a deeper look. Specifically:
>
> a) any log files you could send along would be helpful
> b) the output of the following diagnostic:
>
> f(Members).
> Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
> [{N, rpc:call(N, riak_core_handoff_manager, status, [])} || N <- Members].
>
> Thanks, John.
>
> Mark
>
>
>
> On Sun, Jun 3, 2012 at 5:06 AM, John Axel Eriksson <john at insane.se> wrote:
>
>> Hi.
>>
>> We had an issue where one of the riak servers died (had to be force
>> removed from cluster). After we did that things got really bad and most
>> data was unreachable for hours. I added a new node to replace the old one
>> at one point as well - that never got any data and even now about a day
>> later it hasn't gotten any data.
>> What seems to be the issue now is that a few nodes are waiting
>> on handoff of 1 partition. When I look at ring_status I see this:
>>
>> Attempting to restart script through sudo -u riak
>> ================================== Claimant ===================================
>> Claimant:  'riak at r-001.x.x.x
>> Status:     up
>> Ring Ready: true
>>
>> ============================== Ownership Handoff ==============================
>> Owner:      riak at r-004.x.x.x
>> Next Owner: riak at r-003.x.x.x
>>
>> Index: 930565495644285842450002452081070828921550798848
>>   Waiting on: []
>>   Complete:   [riak_kv_vnode,riak_pipe_vnode,riak_search_vnode]
>>
>>
>> -------------------------------------------------------------------------------
>>
>> ============================== Unreachable Nodes ==============================
>> All nodes are up and reachable
>>
>>
>> Ok, so it looks like the problem described in the Release Notes for 1.0.2
>> here https://github.com/basho/riak/blob/1.0.2-release/RELEASE-NOTES.org.
>> Unfortunately I've run that code (through riak attach) with no result.
>>
>> It's been in this state for 12 hours now I think. What can we do to fix
>> our cluster?
>>
>> I upgraded to 1.0.3 hoping it would fix our problems but that didn't
>> help. I cannot upgrade to 1.1.x because we mainly use Luwak for large
>> object support and that's discontinued in 1.1.x as far as I know.
>>
>> Thanks for your help,
>> John
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>