Node Recovery Questions

sean mcevoy sean.mcevoy at
Thu Aug 9 18:25:33 EDT 2018

Hi Martin,
Thanks for taking the time.
Yes, by "size of the bitcask directory" I mean I did a "du -h --max-depth=1
bitcask", so I think that would cover all the vnodes. We don't use any
other backends.
Those answers are helpful, will get back to this in a few days and see what
I can determine about where our data physically lies. Might have more
questions then.
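
For the per-vnode breakdown, here is a minimal Python sketch of roughly what that "du" call reports. It assumes each immediate subdirectory of the bitcask directory holds one vnode's files, which is an assumption about the on-disk layout rather than something confirmed in this thread. The helper names are made up for illustration, and it sums apparent file sizes, so the numbers can differ slightly from du's block-based figures.

```python
import os

def dir_size_bytes(path):
    """Sum the apparent sizes of regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # A file may disappear mid-walk, e.g. if a merge removes it.
                pass
    return total

def per_vnode_sizes(bitcask_dir):
    """Map each immediate subdirectory (assumed: one per vnode) to its size in bytes."""
    return {
        entry.name: dir_size_bytes(entry.path)
        for entry in os.scandir(bitcask_dir)
        if entry.is_dir(follow_symlinks=False)
    }
```

Calling per_vnode_sizes("bitcask") from the data directory and sorting the result should show whether any single vnode is holding most of the space.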

On Wed, Aug 8, 2018 at 6:05 PM, Martin Sumner <martin.sumner at>
wrote:

> Based on a quick read of the code, compaction in bitcask is performed only
> on "readable" files, and the current active file for writing is excluded
> from that list.  With default settings, that active file can grow to 2GB.
> So it is possible that if objects had been replaced/deleted many times
> within the active file, that space will not be recovered if all the
> replacements amount to < 2GB per vnode.  So at these small data sizes - you
> may get a relatively significant discrepancy between an old and recovered
> node in terms of disk space usage.
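
The arithmetic behind that point can be sketched as below: if the active file is never merged, each vnode can hold up to one full active file of replaced or deleted data. The 2GB limit is taken here as 2 GiB, and the vnode count of 64 is purely illustrative; neither figure is measured from this cluster.

```python
GIB = 2**30

def max_unreclaimed_bytes(vnode_count, max_file_size=2 * GIB):
    """Upper bound on dead space left in active files, one active file per vnode.

    Assumes the active file is excluded from compaction, as described above.
    """
    return vnode_count * max_file_size

# Hypothetical node running 64 vnodes with the default file size limit.
print(max_unreclaimed_bytes(64) / GIB)  # 128.0
```

So on a node with many vnodes and a small data set, the space that compaction cannot reach could plausibly exceed the size of the live data.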
> On 8 August 2018 at 17:37, Martin Sumner <martin.sumner at>
> wrote:
>> Sean,
>> Some partial answers to your questions.
>> I don't believe force-replace itself will sync anything up - it just
>> reassigns ownership (hence handoff happens very quickly).
>> Read repair would synchronise a portion of the data.  So if 10% of your
>> data is read regularly, this might explain some of what you see.
>> AAE should also repair your data.  But if nothing has happened for 4
>> days, then that doesn't seem to be the case.  It would be worth checking
>> the aae-status page (
>> /2.2.3/using/admin/riak-admin/#aae-status) to confirm things are
>> happening.
>> I don't know if there are any minimum levels of data before bitcask will
>> perform compaction.  There's nothing obvious in the code that wouldn't be
>> triggered way before 90%.  I don't know if it will merge on the active file
>> (the one currently being written to), but that is 2GB max size (configured
>> through bitcask.max_file_size).
>> When you say the size of the bitcask directory - is this the size shared
>> across all vnodes on the node?  I guess if each vnode has a single file
>> <2GB, and there are multiple vnodes - something unexpected might happen
>> here, if bitcask does indeed not merge the file active for writing?
>> In terms of distribution around the cluster, if you have an n_val of 3
>> you should normally expect to see a relatively even distribution of the
>> data on failure (certainly not it all going to one).  Worst case scenario
>> is that 3 nodes get all the load from that one failed node.
>> When a vnode is inaccessible, 3 (assuming n=3) fallback vnodes are
>> selected to handle the load for that 1 vnode (as that vnode would normally
>> be in 3 preflists, and commonly a different node will be asked to start a
>> vnode for each preflist).
>> I will try and dig later into bitcask merge/compaction code, to see if I
>> spot anything else.
>> Martin
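
The fallback behaviour described above can be illustrated with a toy model. This is a simplified sketch and not Riak's actual preflist code: it assumes a ring whose partitions are assigned to nodes round-robin, a preflist made of n_val consecutive partitions, and a fallback chosen as the first reachable owner found by walking the ring past the primaries. The function name, the 16-partition ring and the node names are all hypothetical.

```python
def fallback_nodes(ring, down, partition, n_val=3):
    """Return the node picked as fallback in each preflist that contains `partition`.

    ring: list of owner node names indexed by partition number.
    down: name of the unreachable node.
    Toy model only; it assumes at most one primary per preflist is down.
    """
    q = len(ring)
    chosen = []
    for offset in range(n_val):
        # A partition appears in n_val preflists, starting at
        # partition, partition - 1, ..., partition - (n_val - 1).
        start = (partition - offset) % q
        # Walk the ring past the primaries until a reachable owner is found.
        for step in range(n_val, q):
            owner = ring[(start + step) % q]
            if owner != down:
                chosen.append(owner)
                break
    return chosen

# Hypothetical 16-partition ring spread round-robin over 5 nodes.
ring = [f"node{i % 5}" for i in range(16)]
print(fallback_nodes(ring, "node0", 5))  # ['node3', 'node2', 'node1']
```

In this model the one unreachable vnode ends up covered by three different nodes, one per preflist, which matches the expectation that load spreads out rather than landing on a single node.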

More information about the riak-users mailing list