Multiple disks

Dan Reverri dan at basho.com
Sun Mar 27 21:06:00 EDT 2011


Hi Joe,

Your observation regarding question 5 is correct. The coordinating FSM
would attempt to send the request to the failed vnode and receive
either an error or no reply. A request may still succeed if enough of
the other vnodes respond; "enough" would be determined by the "r",
"w", "dw", or "rw" setting of the request. Handoff would not occur in
this scenario.
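
To illustrate the "enough" rule above, here is a minimal Python sketch.
It is not Riak's FSM code: the function name, the reply encoding ("ok",
"error", or None for no reply), and the example values are all
assumptions made for illustration only.

```python
from typing import Optional, Sequence


def quorum_met(replies: Sequence[Optional[str]], required: int) -> bool:
    """Return True if at least `required` vnodes replied successfully.

    Hypothetical encoding: each entry is "ok", "error", or None
    (no reply arrived before the request timed out).
    """
    successes = sum(1 for reply in replies if reply == "ok")
    return successes >= required


# Three replicas; one vnode has failed and never answers.
replies = ["ok", "ok", None]
print(quorum_met(replies, required=2))  # prints True: a quorum of 2 is still reachable
print(quorum_met(replies, required=3))  # prints False: all 3 replicas were required
```

With three replicas and one failed vnode, a request that needs two
replies can still succeed, while one that needs all three cannot.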

Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
dan at basho.com


On Wed, Mar 23, 2011 at 5:58 PM, Joseph Blomstedt
<Joseph.Blomstedt at gmail.com> wrote:
>
> Sorry, I don't have a lot of time right now. I'll try to write a more
> detailed response later.
>
> >>> With a few hours of investigation today, your patch is looking
> >>> promising. Maybe you can give some more detail on what you did in your
> >>> experiments a few months ago?
>
> I'll try to write something up when I have the time. I need to find my
> notes. In general, the focus was mostly on performance tuning,
> although I did look into error/recovery a bit as well. My main goal at
> the time was trying to reduce disk seeks as much as possible. Bitcask
> is awesome as it is an append only store, but if you have multiple
> bitcasks being written to on the same disk you still end up with disk
> seeking depending on how the underlying file system works. I was
> trying to mitigate this as much as possible, given a project that used
> bitcask in a predominately write-only mode (basically as a transaction
> log that was only written to; read only in failure conditions). BTW,
> concerning RAID, I recall seeing better performance spreading vnode
> bitcasks across several smaller RAID arrays than using a single larger
> RAID array during write-heavy bursts.
>
> >>> Oh, one thing I noticed is that while Riak starts up, if there's a bad
> >>> disk then it will shut down (the whole node), at this line:
> >>>
> >>>
> >>> https://github.com/jtuple/riak_kv/blob/jdb-multi-dirs/src/riak_kv_bitcask_backend.erl#L103
> >>>
> >>>
> >>> That makes sense, but I'm wondering if it's possible to let the node
> >>> start since some of its vnodes would be able to open their bitcasks just
> >>> fine. I wonder if it's as simple as removing that line?
> >>>
>
> You don't want to remove that line; Riak expects the vnode to come
> online or kill the entire node. You would need to have a vnode failure
> trigger an ownership change if you really wanted things to behave
> properly.
>
> The better case is to not have the vnode fail if there are other
> existing disks. That's an easy change that I'll throw together when I
> have time. Basically, when a vnode starts, have it pick a bitcask
> directory; if that directory fails, have it pick a different
> directory. If all configured directories fail, then call riak:stop.
> Thus, if a disk fails and a vnode restarts, it should create a new
> empty bitcask on a working disk. Then read repair will slowly rewrite
> your data depending on data access (handoff won't occur though, unless
> that's added in a patch).
>
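
The directory fallback described above could look roughly like the
following Python sketch. It only illustrates the idea and is not the
riak_kv or bitcask code: pick_data_dir, NoUsableDirectory, and the
write probe are hypothetical, and raising an exception stands in for
the riak:stop call mentioned in the message.

```python
import os
import tempfile
from typing import Iterable


class NoUsableDirectory(Exception):
    """Raised when every configured data directory fails (hypothetical)."""


def pick_data_dir(candidates: Iterable[str], partition: str) -> str:
    """Return the first candidate root where the partition directory is usable.

    A directory counts as usable if it can be created (or already exists)
    and a temporary file can be written inside it.
    """
    for root in candidates:
        path = os.path.join(root, partition)
        try:
            os.makedirs(path, exist_ok=True)
            # Probe writability; the temporary file is removed on close.
            with tempfile.TemporaryFile(dir=path):
                pass
            return path
        except OSError:
            # This disk looks bad; fall through to the next candidate.
            continue
    # Stand-in for stopping the node when no configured directory works.
    raise NoUsableDirectory("all configured data directories failed")
```

A vnode restarting after a disk failure would call this with its list
of configured directories and open a new, empty store at the returned
path.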
> > After reading today's recap, I am a bit unsure:
> >
> >> 5) Q --- Would Riak handle an individual vnode failure the same way as
> >> an entire node failure? (from grourk via #riak)
> >>
> >>    A --- Yes. The request to that vnode would fail and will be routed
> >> to the next available vnode
> >
> > Is it really handled the same way? I don't believe handoff will occur. The
> > R/W values still apply of course, but I think there will be one less replica
> > of the keys that map to the failed vnode until the situation is resolved.
> > I have delved quite a bit into the riak code, but if I really missed
> > something I would be glad if someone could point me to the place where a
> > vnode failure is detected. As far as I can see, the heavy lifting happens in
> > riak_kv_util:try_cast/5 (
> > https://github.com/basho/riak_kv/blob/riak_kv-0.14.1/src/riak_kv_util.erl#L78),
> > which only checks if the whole node is up.
>
> I don't think handoff occurs either. Maybe folks at Basho can look
> into this further, or someone can test it. I'll test it
> tonight/tomorrow if I have the time. It looks like the cast will
> occur, but never return. So, your overall write may fail depending on
> your W-val. Is there something we're both missing here?
>
> -Joe
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com



