Multiple disks

Nico Meyer nico.meyer at
Wed Mar 23 20:15:00 EDT 2011

After reading today's recap, I am a bit unsure:

> 5) Q --- Would Riak handle an individual vnode failure the same way as
> an entire node failure? (from grourk via #riak)
>     A --- Yes. The request to that vnode would fail and will be routed
> to the next available vnode

Is it really handled the same way? I don't believe handoff will occur. 
The R/W values still apply of course, but I think there will be one less 
replica of the keys that map to the failed vnode until the situation is 
resolved.
I have delved quite a bit into the riak code, but if I really missed 
something I would be glad if someone could point me to the place where a 
vnode failure is detected. As far as I can see, the heavy lifting 
happens in riak_kv_util:try_cast/5, which only checks whether the whole 
node is up.
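To make the R/W point concrete: with N=3 replicas and one vnode returning errors, a read with R=2 can still succeed, while R=3 cannot. A minimal sketch of the quorum arithmetic (illustrative Python, not Riak's actual code):

```python
# Illustrative quorum check: a get succeeds if at least R of the
# N vnodes in the preflist return a value, even when one vnode
# on a failed disk returns errors. Not Riak's implementation.

def get_succeeds(replies, r):
    """replies: list of ('ok', value) or ('error', reason), one per vnode."""
    ok = sum(1 for status, _ in replies if status == 'ok')
    return ok >= r

# N=3, one vnode on a bad disk returns an error:
replies = [('ok', 'v1'), ('error', 'disk'), ('ok', 'v1')]
print(get_succeeds(replies, r=2))  # -> True
print(get_succeeds(replies, r=3))  # -> False
```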

On 24.03.2011 00:56, Nico Meyer wrote:
> Hi Greg,
> I don't think the vnodes will always die. I have seen some situations
> (disk full, filesystem becoming read only due to device errors,
> corrupted bitcask files after a machine crash) where the vnode did not
> crash, but the get and/or put requests returned errors.
> Even if the process crashes, it will just be restarted, possibly over
> and over again.
> Also, the handoff logic only operates on the level of a whole node, not
> individual vnodes, which makes monitoring and detecting disk failures
> very important.
> We were also thinking about how to use multiple disks per node. But it's
> not a very pressing problem for us, since we have a lot of relatively
> small entries (~1000 bytes), so the RAM used by bitcask is a problem
> long before we can even fill one disk.
> Cheers,
> Nico
> On 23.03.2011 23:50, Greg Nelson wrote:
>> Hi Joe,
>> With a few hours of investigation today, your patch is looking
>> promising. Maybe you can give some more detail on what you did in your
>> experiments a few months ago?
>> What I did was set up an Ubuntu VM with three loopback file systems. Then
>> built Riak 0.14.1 with your patch, configured as you described to spread
>> across the three disks. I ran a single node, and it correctly spread
>> partitions across the disks.
>> I then corrupted the file system on one of the disks (by zeroing out the
>> loop device), and did some more GETs and PUTs against Riak. In the logs
>> it looks like the vnode processes that had bitcasks on that disk died,
>> as expected, and the other vnodes continued to operate.
>> I need to do a bit more investigation with more than one node, but given
>> how well it handled this scenario, it seems like we're on the right
>> track.
>> Oh, one thing I noticed is that while Riak starts up, if there's a bad
>> disk then it will shut down (the whole node), at this line:
>> That makes sense, but I'm wondering if it's possible to let the node
>> start since some of its vnodes would be able to open their bitcasks just
>> fine. I wonder if it's as simple as removing that line?
>> Greg
>> On Tuesday, March 22, 2011 at 9:54 AM, Joseph Blomstedt wrote:
>>> You're forgetting how awesome riak actually is. Given how riak is
>>> implemented, my patches should work without any operational headaches
>>> at all. Let me explain.
>>> First, there was the one issue from yesterday. My initial patch didn't
>>> reuse the same partition bitcask on the same node. I've fixed that in
>>> a newer commit:
>>> Now, about how this all works in operation.
>>> Let's consider a simple scenario under normal riak. The key concept
>>> here is to realize that riak's vnodes are completely independent, and
>>> that failure and partition ownership changes are handled through
>>> handoff alone.
>>> Let's say we have an 8-partition ring with 3 riak nodes:
>>> n1 owns partitions 1,4,7
>>> n2 owns partitions 2,5,8
>>> n3 owns partitions 3,6
>>> ie: Ring = (1/n1, 2/n2, 3/n3, 4/n1, 5/n2, 6/n3, 7/n1, 8/n2)
>>> Each node runs an independent vnode for each partition it owns, and
>>> each vnode will set up its own bitcask:
>>> vnode 1/1: {n1-root}/data/bitcask/1
>>> vnode 1/4: {n1-root}/data/bitcask/4
>>> ...
>>> vnode 2/2: {n2-root}/data/bitcask/2
>>> ...
>>> vnode 3/6: {n3-root}/data/bitcask/6
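The ownership layout above amounts to a round-robin assignment of partitions to nodes, with each vnode keeping its bitcask under its own node's data root. A toy sketch (illustrative Python; the directory scheme follows the example above, not the real implementation):

```python
# Illustrative partition-to-node assignment and per-vnode bitcask
# directories, matching the 8-partition / 3-node example. Not Riak's code.

def build_ring(num_partitions, nodes):
    """Assign partitions 1..num_partitions to nodes round-robin."""
    return {p: nodes[(p - 1) % len(nodes)]
            for p in range(1, num_partitions + 1)}

def bitcask_dir(node, partition):
    """Each vnode keeps its own bitcask under its node's data root."""
    return "{%s-root}/data/bitcask/%d" % (node, partition)

ring = build_ring(8, ['n1', 'n2', 'n3'])
print(ring[1], ring[4], ring[7])  # -> n1 n1 n1
print(bitcask_dir('n1', 1))       # -> {n1-root}/data/bitcask/1
```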
>>> Reads/writes are routed to the appropriate vnodes and to the
>>> appropriate bitcasks. Under failure, hinted handoff comes into play.
>>> Let's have a write to preflist [1,2,3] while n2 is down/split. Since
>>> n2 is down, riak will send the write meant for partition 2 to another
>>> node, let's say n3. n3 will spawn a new vnode for partition 2 which is
>>> initially empty:
>>> vnode 3/2: {n3-root}/data/bitcask/2
>>> and write the incoming data to the new bitcask.
>>> Later, when n2 rejoins, n3 will eventually engage in handoff, and send
>>> all (k,v) in its data/bitcask/2 to n2, which writes them into its
>>> data/bitcask/2. After handing off data, n3 will shutdown it's 3/2
>>> vnode and delete the bitcask directory {n3-root}/data/bitcask/2.
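The handoff sequence described above can be sketched roughly as follows (illustrative Python, with a dict standing in for each vnode's bitcask; not Riak's actual code):

```python
# Illustrative sketch of hinted handoff: a fallback vnode takes writes
# while the owner is down, then hands the data back and deletes its
# bitcask. A dict models each per-partition bitcask.

class Node:
    def __init__(self, name):
        self.name = name
        self.bitcasks = {}   # partition -> {key: value}

    def put(self, partition, key, value):
        # A vnode (and its bitcask) is created on demand.
        self.bitcasks.setdefault(partition, {})[key] = value

    def handoff(self, partition, target):
        # Send all (k, v) for the partition to the rejoined owner,
        # then drop the fallback vnode's bitcask entirely.
        for k, v in self.bitcasks.pop(partition, {}).items():
            target.put(partition, k, v)

n2, n3 = Node('n2'), Node('n3')
# n2 is down/split: the fallback vnode 3/2 on n3 takes the write.
n3.put(2, 'mykey', 'myval')
# n2 rejoins: n3 hands off and deletes its partition-2 bitcask.
n3.handoff(2, n2)

print(n2.bitcasks[2]['mykey'])  # -> myval
print(2 in n3.bitcasks)         # -> False
```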
>>> Under node rebalancing / ownership changes, a similar event occurs.
>>> For example, if a new node n4 takes ownership of partition 4, then n1
>>> will hand off its data to n4 and then shut down its vnode and delete
>>> its {n1-root}/data/bitcask/4.
>>> If you take the above scenario, and change all the directories of the
>>> form:
>>> {NODE-root}/data/bitcask/P
>>> to:
>>> /mnt/DISK-N/NODE/bitcask/P
>>> and allow DISK-N to be any randomly chosen directory in /mnt, then the
>>> scenario plays out exactly the same provided that riak always selects
>>> the same DISK-N for a given P on a given node (across nodes doesn't
>>> matter, vnodes are independent). My new commit handles this. A simple
>>> configuration could be:
>>> n1-vars.config:
>>> {bitcask_data_root, {random, ["/mnt/bitcask/disk1/n1",
>>> "/mnt/bitcask/disk2/n1", "/mnt/bitcask/disk3/n1"]}}
>>> n2-vars.config:
>>> {bitcask_data_root, {random, ["/mnt/bitcask/disk1/n2",
>>> "/mnt/bitcask/disk2/n2", "/mnt/bitcask/disk3/n2"]}}
>>> (...etc...)
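The key invariant is that a given partition always lands on the same disk on a given node. One way to picture this is a sticky random choice: pick a disk at random on first use and reuse it afterwards (illustrative Python; how the patch itself records the choice is not shown here):

```python
import random

class DiskPicker:
    """Sticky random disk selection: a partition's bitcask lands on a
    randomly chosen data root on first use and stays there afterwards.
    Illustrative only, not the actual patch."""

    def __init__(self, roots):
        self.roots = roots
        self.chosen = {}   # partition -> root

    def root_for(self, partition):
        if partition not in self.chosen:
            self.chosen[partition] = random.choice(self.roots)
        return self.chosen[partition]

roots = ['/mnt/bitcask/disk1/n1', '/mnt/bitcask/disk2/n1',
         '/mnt/bitcask/disk3/n1']
picker = DiskPicker(roots)
first = picker.root_for(4)
# Repeated lookups for the same partition always return the same disk:
print(all(picker.root_for(4) == first for _ in range(10)))  # -> True
```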
>>> There is no inherent need for symlinks, or needing to pre-create any
>>> initial links per partition index. riak already creates and deletes
>>> partition bitcask directories on demand. If a disk fails, then all
>>> vnodes with bitcasks on that disk fail in the same manner as a disk
>>> failure under normal riak. Standard read repair, handoff, and node
>>> replacement apply.
>>> -Joe
>>> On Tue, Mar 22, 2011 at 9:53 AM, Alexander Sicular <siculars at> wrote:
>>>> Ya, my original message just highlighted the standard 0,1,5 that most
>>>> people/hardware should know/be able to support. There are better
>>>> options and
>>>> 10 would be one of them.
>>>> @siculars on twitter
>>>> Sent from my iPhone
>>>> On Mar 22, 2011, at 8:43, Ryan Zezeski <rzezeski at> wrote:
>>>> On Tue, Mar 22, 2011 at 10:01 AM, Alexander Sicular
>>>> <siculars at> wrote:
>>>>> Save your ops dudes the headache and just use raid 5 and be done
>>>>> with it.
>>>> Depending on the number of disks available I might even argue running
>>>> software RAID 10 for better throughput and less chance of data loss
>>>> (as long
>>>> as you can afford to cut your avail storage in half on every
>>>> machine). It's
>>>> not too hard to setup on modern Linux distros (mdadm); at least I was
>>>> doing
>>>> it 5 years ago and I'm no sys admin.
>>>> -Ryan
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users at
