Help with handling Riak disk failure

Leo scicomplete at gmail.com
Tue Sep 19 13:31:38 EDT 2017


Dear Riak users and experts,

I really appreciate any help with my questions below.

I have a 3-node Riak cluster, each node with approximately 1 TB of disk
usage. Suddenly, one node's hard disk failed unrecoverably, so I added
a new node using the following steps:

1) riak-admin cluster join
2) down the failed node
3) riak-admin cluster force-replace failed-node new-node
4) riak-admin cluster plan
5) riak-admin cluster commit
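Concretely, the commands I ran looked like the following (the node
names are placeholders for my actual hosts):

```shell
# On the new node: stage a join to any healthy cluster member
riak-admin cluster join riak@healthy1.example.com

# On a healthy node: mark the dead node as down
riak-admin down riak@failed.example.com

# Stage a force-replace so the new node takes over the
# failed node's partition ownership without handoff from it
riak-admin cluster force-replace riak@failed.example.com riak@new.example.com

# Review the staged changes, then commit them
riak-admin cluster plan
riak-admin cluster commit
```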

This almost fixed the problem, except that after lots of data
transfers and handoffs, not all three nodes show 1 TB of disk usage.
Only two of them do; the third is almost empty (a few tens of GBs).
This means there are no longer three copies of the data on disk. My
data is completely random (no two keys have the same value associated
with them, so compression cannot explain the smaller footprint).

I also tried the "riak-admin cluster replace failednode newnode"
command, which makes the leaving node hand off its data to the joining
node. That is no help, however, when the leaving node's hard disk has
failed. What I want is for the remaining live vnodes to help the new
node recreate the lost data from their replica copies.
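From the Basho docs I gather that a manual partition repair can be
triggered from the "riak attach" console roughly as follows (the node
name is a placeholder, and I am not sure this fully rebuilds all lost
replicas in my situation):

```erlang
%% Fetch the current ring
{ok, Ring} = riak_core_ring_manager:get_my_ring().

%% Collect the partitions owned by the replacement node
Partitions = [P || {P, N} <- riak_core_ring:all_owners(Ring),
                   N =:= 'riak@new.example.com'].

%% Repair each partition from the replicas on neighboring vnodes
[riak_kv_vnode:repair(P) || P <- Partitions].
```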

I have three questions:

1) What commands should I run to forcefully ensure there are three
replicas on disk, without waiting for read-repair or active
anti-entropy to recreate them? Bandwidth and CPU usage are not a big
concern for me.

2) I would also be grateful if someone could list the commands I can
run via "riak attach" to clear the AAE trees and force all data to
have three copies.

3) What commands should I run to verify that all data has three
replicas on disk after the disk failure, rather than just using the
disk-space usage on each node as a hint?
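For reference, the only checks I currently know of are indirect ones
(these commands exist in recent Riak versions, but as far as I can
tell none of them directly counts replicas):

```shell
# Show in-flight and pending handoffs; an idle cluster is a
# prerequisite for the replica counts to be meaningful
riak-admin transfers

# Show AAE tree build times and last exchange per partition
riak-admin aae-status

# Sanity-check ring ownership and membership after the replace
riak-admin ring-status
riak-admin member-status
```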

Thanks,
Leo




More information about the riak-users mailing list