Help with handling Riak disk failure
scicomplete at gmail.com
Tue Sep 19 13:31:38 EDT 2017
Dear Riak users and experts,
I would really appreciate any help with my questions below.
I have a 3-node Riak cluster, with each node holding approx. 1 TB of
data on disk. All of a sudden, one node's hard disk failed
unrecoverably, so I added a new node using the following steps:
1) riak-admin cluster join <new-node>
2) riak-admin down <failed-node>
3) riak-admin cluster force-replace <failed-node> <new-node>
4) riak-admin cluster plan
5) riak-admin cluster commit
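For clarity, the exact sequence looked roughly like the following (the
node names are placeholders, not my actual hostnames):

```shell
# Join the replacement node to the cluster (run on the new node)
riak-admin cluster join riak@node1.example.com

# Mark the dead node as down (run from a healthy node)
riak-admin down riak@failed.example.com

# Have the new node take over the failed node's partitions in place
riak-admin cluster force-replace riak@failed.example.com riak@new.example.com

# Review the staged changes, then apply them
riak-admin cluster plan
riak-admin cluster commit
```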
This almost fixed the problem, except that after lots of data
transfers and handoffs, not all three nodes have 1 TB of disk usage
anymore. Only two of them do; the third is almost empty (a few tens of
GBs). This means there are no longer 3 copies on disk. My data is
completely random (no two keys have the same data associated with
them), so compression cannot be the reason for the reduced disk usage.
I also tried the "riak-admin cluster replace failed-node new-node"
command, so that the leaving node hands off its data to the joining
node. This, however, does not help when the leaving node's hard disk
has failed. What I want is for the remaining live vnodes to help the
new node recreate the lost data from their replica copies.
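For comparison, this is the replace variant I tried (again with
placeholder node names):

```shell
# 'replace' makes the leaving node hand off its data to the new node --
# this only works while the leaving node's disk is still readable
riak-admin cluster replace riak@failed.example.com riak@new.example.com
riak-admin cluster plan
riak-admin cluster commit
```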
I have three questions:
1) What commands should I run to forcefully ensure there are three
replicas on disk, without waiting for read-repair or active
anti-entropy to make the three copies? Bandwidth and CPU usage are not
a big concern for me.
2) I would also be very grateful if someone could list the commands I
can run via "riak attach" to clear the AAE trees and forcefully ensure
all data has 3 copies.
3) Finally, what commands should I run to verify that all data
actually has 3 replicas on disk after the disk failure, instead of
just using the disk space usage on each node as a hint?