two nodes stuck leaving / transferring data to each other, jamming up cluster

Swinney, Austin Austin at vimeo.com
Mon Jun 4 18:36:47 EDT 2012


To answer my own question…

I did a `stop node` and `mark down` through the Riak Control interface on the node stuck at 0.2% data (235), which was trying to send files to 234.

leaving     0.2%      0.0%    'riak at 10.0.0.235'
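
For reference, the equivalent steps from the command line would look roughly like this (just a sketch using the node name shown above, not a transcript of what I ran):

# on 10.0.0.235: stop the stuck node (same effect as "stop node" in Control)
riak stop

# then, from any running member: mark the stopped node as down
riak-admin down riak@10.0.0.235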

Then restarted 234:

leaving    15.6%     15.8%    'riak at 10.0.0.234'

And joined it back to the cluster.  (It thought it was a cluster of 1, though it was not reported that way through Control or member_status.)
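
The restart-and-rejoin from the shell would be roughly the following (again just a sketch, using the claimant from the ring_status output in the quoted message below as the join target):

# on 10.0.0.234: start the node back up
riak start

# on 10.0.0.234: point it at an existing cluster member
riak-admin join riak@10.0.0.232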

Now it thinks it is back and ready for duty.

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
down        0.2%      0.2%    'riak at 10.0.0.235'
down       16.6%     16.6%    'riak at 10.0.0.83'
valid       2.7%      9.0%    'riak at 10.0.0.168'
valid       2.9%      8.2%    'riak at 10.0.0.169'
valid       2.1%      6.8%    'riak at 10.0.0.170'
valid      17.8%     11.3%    'riak at 10.0.0.231'
valid      17.0%     14.5%    'riak at 10.0.0.232'
valid      13.1%     12.9%    'riak at 10.0.0.233'
valid      13.5%     10.5%    'riak at 10.0.0.234'
valid      14.1%     10.0%    'riak at 10.0.0.84'
-------------------------------------------------------------------------------
Valid:8 / Leaving:0 / Exiting:0 / Joining:0 / Down:2


I'm pretty sure I had done both of these things before (marking as down, restarting, etc.), but I guess the timing wasn't right.

That is fine.  At least the cluster is blowing chunks back and forth again like a good little party animal.

Best Regards,

Austin

On Jun 4, 2012, at 2:16 PM, Swinney, Austin wrote:

Hi All,

The following is about leveldb, riak (1.1.1 2012-03-07) on RedHat x86_64, and one riak newbie known as me!

I had this problem over the weekend whereby two nodes were leaving and both got stuck trying to send transfers to each other.

I had backed them up with tar, and after they became stuck, I tried launching new instances from those leveldb tar file backups.  But those new hosts, although listed in connected_nodes, are not in the ring_members.
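
Both of those fields show up in `riak-admin status`, so a quick way to compare them on one of the new hosts is something along the lines of:

riak-admin status | grep -E 'connected_nodes|ring_members'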

Are there any workarounds for resolving the stuck ownership handoff between the two leaving nodes?  I tried different scenarios of marking them as down, but that didn't seem to help; ring_status indicated it wanted them down, then it wanted them back online, and so on.
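
Besides ring_status, the per-node handoff backlog can also be inspected with `riak-admin transfers`, which lists what each node is still waiting to hand off; e.g.:

riak-admin transfers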

I don't really need either one.  I'd like to eject them both from the cluster and have it rebalance onto the new nodes.


Both of these were asked to leave:
Owner:      riak at 10.0.0.235
Next Owner: riak at 10.0.0.234

ring_status output:

[root at ip-10-0-0-171 riak]# riak-admin ring_status
Attempting to restart script through sudo -u riak
================================== Claimant ===================================
Claimant:  'riak at 10.0.0.232'
Status:     up
Ring Ready: false

============================== Ownership Handoff ==============================
Owner:      riak at 10.0.0.235
Next Owner: riak at 10.0.0.234

Index: 727896323280039539339725844419242519555200778240
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode,riak_search_vnode]

Index: 876330083321459366969787585241990013739006427136
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode,riak_search_vnode]

-------------------------------------------------------------------------------

============================== Unreachable Nodes ==============================
All nodes are up and reachable


And member_status output:

[root at ip-10-0-0-168 ~]# riak-admin member_status
Attempting to restart script through sudo -u riak
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
down       16.6%     16.6%    'riak at 10.0.0.83'
leaving     0.2%      0.0%    'riak at 10.0.0.235'
leaving    15.6%     15.8%    'riak at 10.0.0.234'
valid       0.0%      0.0%    'riak at 10.0.0.168'
valid       0.0%      0.0%    'riak at 10.0.0.169'
valid      18.0%     18.0%    'riak at 10.0.0.231'
valid      17.4%     17.4%    'riak at 10.0.0.232'
valid      15.4%     15.4%    'riak at 10.0.0.233'
valid      16.8%     16.8%    'riak at 10.0.0.84'
-------------------------------------------------------------------------------
Valid:6 / Leaving:1 / Exiting:0 / Joining:1 / Down:1


Thanks for your input!

Austin

_______________________________________________
riak-users mailing list
riak-users at lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
