High disk usage on node

Nicholas Adams nicholas.adams at tiot.jp
Thu Nov 1 10:23:47 EDT 2018

Hi Travis,
I see you have encountered the disk space pickle. Just for the record, the safest way to run Riak in production is to keep all resources (CPU, RAM, Network, Disk Space, I/O etc.) below 70% utilization on all nodes at all times. The reason behind this is to compensate for when one or more nodes go down. When this does happen, the remaining nodes have to carry all the load of the offline node(s) on top of their existing load and therefore need to have sufficient free resources available to do it. If you are running right on the limit for any resource then you need to expect issues like this or worse to happen on a regular basis.
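
To put a rough number on that headroom, here is a small Python sketch. It assumes the load of a failed node spreads evenly over the survivors, which is a simplification of what a real cluster does, and the function name is purely illustrative.

```python
def utilization_after_failure(nodes: int, failed: int, utilization: float) -> float:
    """Per-node utilization once `failed` nodes drop out.

    Assumption: the total load stays the same and is spread evenly
    over the surviving nodes.
    """
    if not 0 <= failed < nodes:
        raise ValueError("failed must be between 0 and nodes - 1")
    return utilization * nodes / (nodes - failed)


if __name__ == "__main__":
    for start in (0.70, 0.85):
        after = utilization_after_failure(nodes=5, failed=1, utilization=start)
        print(f"{start:.0%} on 5 nodes -> {after:.1%} on 4 nodes")
```

Under that assumption, five nodes at 70% end up at 87.5% each when one node goes down, while five nodes at 85% would need more than 100% and cannot absorb the failure.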

Potential initial prevention
When you join all the nodes together to form your cluster, you run `riak-admin cluster plan` and it shows you how the partitions will be distributed. If you like the plan, run `riak-admin cluster commit` and partitions will be moved around accordingly. If not, you can cancel the plan, generate a new one, and keep doing so until you are happy with the distribution. With an unfortunate combination of partition count, node count and ring size, one node can end up with its fair share and then some, but a replan will often make the split much less painful.
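
As a quick way to judge a plan, the sketch below compares a proposed split against the most even split that simple arithmetic allows, using the 64-partition, five-node figures from the question quoted below. It only counts partitions; it does not model how Riak's claim algorithm actually assigns them, and both function names are made up for this example.

```python
def even_split(ring_size: int, nodes: int) -> list[int]:
    """Most even split of ring_size partitions over nodes (arithmetic only)."""
    base, extra = divmod(ring_size, nodes)
    return [base + 1] * extra + [base] * (nodes - extra)


def overload(claimed: list[int]) -> float:
    """Fraction by which the busiest node exceeds the average share."""
    mean = sum(claimed) / len(claimed)
    return max(claimed) / mean - 1


if __name__ == "__main__":
    reported = [12, 12, 12, 12, 16]   # split described in the question
    ideal = even_split(64, 5)         # best case, not a Riak guarantee
    print("reported:", reported, f"busiest node is {overload(reported):.1%} over average")
    print("ideal:   ", ideal, f"busiest node is {overload(ideal):.1%} over average")
```

With these numbers the node holding 16 partitions carries 25% more than the average share, whereas the most even split (four nodes with 13 and one with 12) would leave the busiest node only about 1.6% over.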

Temporary escape from current situation
Before beginning here, let me mention that this method does have the potential to go horribly wrong, so proceed at your own risk. For the server with the full disk, you can follow the steps below:

  *   Stop Riak
  *   Attach additional storage (USB, additional disks, NAS, whatever)
  *   Copy partition directories from the backend's data directory (presumably bitcask) to the additional storage
  *   Once the copy has completed, delete the copied data from the node's regular hard disk
  *   Create a symlink at the location you just deleted the data from, pointing to the copy on the additional storage
  *   Repeat until you have freed up sufficient disk space (new data may still be copied onto this node, so make sure you really do have enough space)
  *   Start Riak
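
The copy, delete and symlink steps in the list could be scripted roughly as follows. This is only a sketch under a few assumptions: Riak is already stopped, each partition is a directory under the backend's data directory, and the helper names are invented for this example. It compares file names and sizes before deleting anything, which is a sanity check rather than a full integrity check.

```python
import os
import shutil
from pathlib import Path


def _listing(root: Path) -> dict[str, int]:
    """Map each file path (relative to root) to its size in bytes."""
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def relocate_partition(partition_dir: Path, external_root: Path) -> Path:
    """Move one partition directory to external storage and leave a
    symlink at the original path pointing to the new location."""
    partition_dir = Path(partition_dir)
    if partition_dir.is_symlink():
        raise ValueError(f"{partition_dir} is already a symlink")

    target = Path(external_root).resolve() / partition_dir.name
    shutil.copytree(partition_dir, target)  # raises if target already exists

    # Only delete the original once the copy looks complete.
    if _listing(partition_dir) != _listing(target):
        raise RuntimeError(f"copy of {partition_dir} does not match the original")

    shutil.rmtree(partition_dir)  # frees space on the full disk
    os.symlink(target, partition_dir, target_is_directory=True)
    return target
```

A call might look like `relocate_partition(Path("/var/lib/riak/bitcask/<partition-id>"), Path("/mnt/extra"))`, where both paths are placeholders for your own layout. Note that `shutil.copytree` does not preserve file ownership, so check that the user Riak runs as can still read and write the copied files before starting the node.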

The above should bring your server back in touch with the cluster. Monitor transfers and once they have all finished, add your new node to the cluster. After this new node has been added and all transfers have finished, take the previously full node offline and reverse the steps above until you are able to remove the additional storage.

Note: running a mixed version cluster for a prolonged period of time is not recommended in production. Out of preference, I would suggest installing the same version of Riak on the new node, going through the above and then looking at upgrading the cluster once everything is stable.

Good luck,

From: riak-users <riak-users-bounces at lists.basho.com> On Behalf Of Travis Kirstine
Sent: 01 November 2018 22:25
To: riak-users at lists.basho.com
Subject: High disk usage on node

I'm running riak (v2.14) in a 5 node cluster and for some reason one of the nodes has higher disk usage than the other nodes.  The problem seems to be related to how riak distributes the partitions: in my case I'm using the default 64, and riak has given each node 12 partitions except one node that gets 16 (4x12+16=64).  As a result the node with 16 partitions has filled its disk and become 'un-reachable'.

I have a node on standby with roughly the same disk space as the failed node; my concern is that if I add it to the cluster it will overflow as well.

How do I recover the failed node and add a new node without destroying the cluster? BTW, just to make things more fun, the new node is at a newer version of riak, so I need to perform a rolling upgrade at the same time.

Any help would be greatly appreciated!