Multiple disks

Joseph Blomstedt Joseph.Blomstedt at gmail.com
Wed Mar 23 20:58:54 EDT 2011


Sorry, I don't have a lot of time right now. I'll try to write a more
detailed response later.

>>> With a few hours of investigation today, your patch is looking
>>> promising. Maybe you can give some more detail on what you did in your
>>> experiments a few months ago?

I'll try to write something up when I have the time. I need to find my
notes. In general, the focus was mostly on performance tuning,
although I did look into error/recovery a bit as well. My main goal at
the time was to reduce disk seeks as much as possible. Bitcask is
great because it is an append-only store, but if you have multiple
bitcasks being written to on the same disk, you can still end up with
disk seeking depending on how the underlying file system works. I was
trying to mitigate this as much as possible for a project that used
bitcask in a predominantly write-only mode (basically as a transaction
log that was only written to, and read only in failure conditions). BTW,
concerning RAID, I recall seeing better performance during write-heavy
bursts when spreading vnode bitcasks across several smaller RAID arrays
rather than using a single larger RAID array.
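
To be concrete about spreading vnode bitcasks across arrays, something
like the following app.config fragment is what I have in mind. This is
illustrative only; the key name and format here aren't necessarily what
the patch uses (I'm assuming the backend accepts a list of data roots
and assigns each vnode's bitcask to one of them):

    %% Illustrative only -- the multi-dir patch may use a different key/format.
    %% Each root lives on a separate disk or RAID array; the backend would
    %% assign each vnode's bitcask to one of these roots.
    {bitcask, [
        {data_root, ["/data/array0/bitcask",
                     "/data/array1/bitcask",
                     "/data/array2/bitcask"]}
    ]}.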

>>> Oh, one thing I noticed is that while Riak starts up, if there's a bad
>>> disk then it will shutdown (the whole node), at this line:
>>>
>>>
>>> https://github.com/jtuple/riak_kv/blob/jdb-multi-dirs/src/riak_kv_bitcask_backend.erl#L103
>>>
>>>
>>> That makes sense, but I'm wondering if it's possible to let the node
>>> start since some of its vnodes would be able to open their bitcasks just
>>> fine. I wonder if it's as simple as removing that line?
>>>

You don't want to remove that line; Riak expects every vnode to come
online, and if one can't, the whole node is killed. You would need to
have a vnode failure trigger an ownership change if you really wanted
things to behave properly.

The better approach is to not have the vnode fail when other working
disks exist. That's an easy change that I'll throw together when I
have time. Basically, when a vnode starts, have it pick a bitcask
directory; if that directory fails, have it pick a different
directory; if all configured directories fail, then call riak:stop.
Thus, if a disk fails and a vnode restarts, it should create a new,
empty bitcask on a working disk. Read repair will then slowly rewrite
your data as it is accessed (handoff won't occur, though, unless
that's added in the patch).
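
In rough terms, the start-up logic I have in mind looks something like
this. It's just a sketch, not the actual patch code; the function name
and the way the directory list is passed in are made up:

    %% Sketch: try each configured data root until one can be opened for
    %% this partition; if none work, shut the whole node down.
    open_bitcask(_Partition, []) ->
        riak:stop("no usable bitcask data directories");
    open_bitcask(Partition, [Root | Rest]) ->
        Dir = filename:join(Root, integer_to_list(Partition)),
        case bitcask:open(Dir, [read_write]) of
            {error, Reason} ->
                error_logger:warning_msg("bitcask open failed in ~s: ~p~n",
                                         [Dir, Reason]),
                open_bitcask(Partition, Rest);
            Ref ->
                {ok, Ref}
        end.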

> After reading todays recap, I am a bit unsure:
>
>> 5) Q --- Would Riak handle an individual vnode failure the same way as
>> an entire node failure? (from grourk via #riak)
>>
>>    A --- Yes. The request to that vnode would fail and will be routed
>> to the next available vnode
>
> Is it really handled the same way? I don't believe handoff will occur. The
> R/W values still apply of course, but I think there will be one less replica
> of the keys that map to the failed vnode until the situation is resolved.
> I have delved quite a bit into the riak code, but if I really missed
> something I would be glad if someone could point me to the place where a
> vnode failure is detected. As far as I can see, the heavy lifting happens in
> riak_kv_util:try_cast/5 (
> https://github.com/basho/riak_kv/blob/riak_kv-0.14.1/src/riak_kv_util.erl#L78),
> which only checks if the whole node is up.

I don't think handoff occurs either. Maybe folks at Basho can look
into this further, or someone can test it. I'll test it
tonight/tomorrow if I have the time. It looks like the cast will
occur, but never return. So, your overall write may fail depending on
your W-val. Is there something we're both missing here?
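
To make the W-val point concrete: the write path effectively waits for
W acknowledgements out of the N casts, and a vnode whose backend is
gone simply never replies. Greatly simplified (an illustration, not the
actual riak_kv FSM code):

    %% Illustration only: wait for W acks; a dead vnode never sends one,
    %% so the request times out if fewer than W vnodes respond.
    wait_for_w(0, _Timeout) ->
        ok;
    wait_for_w(W, Timeout) ->
        receive
            {w_ack, _Partition} ->
                wait_for_w(W - 1, Timeout)
        after Timeout ->
            {error, timeout}
        end.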

-Joe



