'not found' after join

Andy Gross andy at basho.com
Thu May 5 16:06:28 EDT 2011


Alex's description roughly matches up with some of our plans to address this
issue.

As with almost anything, this comes down to a tradeoff between consistency
and availability.   In the case of joining nodes, making the
join/handoff/ownership claim process more "atomic" requires a higher degree
of consensus from the machines in the cluster.  The current process (which
is clearly non-optimal) allows nodes to join the ring as long as they can
contact one current ring member.  A more atomic process would introduce
consensus issues that might prevent nodes from joining in partitioned
scenarios.

A good solution would probably involve some consistency knobs around the
join process to deal with a spectrum of failure/partition scenarios.
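
Purely as a hypothetical sketch of what such a knob might look like (the
join_quorum setting and these function names are invented, illustrative only):

    %% Only proceed with a join when at least `join_quorum` of the known
    %% cluster members are reachable.  A default of 1 reproduces today's
    %% behaviour (any single reachable member is enough); raising it trades
    %% join availability for a more consistent ring.
    maybe_join(Members) ->
        Quorum = case application:get_env(riak_core, join_quorum) of
                     {ok, Q}   -> Q;
                     undefined -> 1
                 end,
        Reachable = [N || N <- Members, net_adm:ping(N) =:= pong],
        case length(Reachable) >= Quorum of
            true  -> riak_core:join(hd(Reachable));
            false -> {error, not_enough_members_reachable}
        end.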

This is something we are acutely aware of, and we are actively pursuing
solutions for a near-term release.

- Andy


On Thu, May 5, 2011 at 12:22 PM, Alexander Sicular <siculars at gmail.com> wrote:

> I'm really loving this thread. Generating great ideas for the way
> things should be... in the future. It seems to me that "the ring
> changes immediately" is actually the problem, as Ryan astutely
> mentions. One way the future could look is:
>
> - a new node comes online
> - introductions are made
> - candidate vnodes are selected for migration (<- insert pixie dust magic
> here)
> - the number of simultaneous migrations is configurable, fewer for
> limited interruption or more for quicker completion
> - vnodes are migrated
> - once migration is completed, ownership is claimed
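>
> In code, that ordering might look roughly like this (pseudocode in Erlang
> clothing; every function name below is made up):
>
>     migrate_then_claim(NewNode, Ring) ->
>         %% the pixie-dust step: decide which vnodes should move
>         Candidates = select_candidates(NewNode, Ring),
>         %% bound how many transfers run at once (the configurable knob)
>         MaxInFlight = 2,
>         ok = migrate_in_batches(Candidates, MaxInFlight),
>         %% only after every transfer has completed is ownership claimed
>         claim_ownership(NewNode, Candidates).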
>
> Selecting vnodes for migration is where the unicorn cavalry attack the
> dragon's den. If done right(er), the algorithm could be swappable to
> optimize for different strategies. Don't ask me how to implement it,
> I'm only a yellow belt in erlang-fu.
>
> Cheers,
> Alexander
>
> On Thu, May 5, 2011 at 13:33, Ryan Zezeski <rzezeski at basho.com> wrote:
> > John,
> >
> > All great points.  The problem is that the ring changes immediately
> > when a node is added.  So now, all of a sudden, the preflist is
> > potentially pointing to nodes that don't have the data, and they won't
> > have that data until handoff occurs.  The faster that data gets
> > transferred, the less time your clients have to hit 'notfound'.
> >
> > However, I agree completely with what you're saying.  This is just a
> > side effect of how the system currently works.  In a perfect world we
> > wouldn't care how long handoff takes, and we would also do some sort of
> > automatic congestion control akin to TCP Reno or something.  The
> > preflist would still point to the "old" partitions until all data has
> > been successfully handed off, and then and only then would we flip the
> > switch for that vnode.  I'm pretty sure that's where we are heading (I
> > say "pretty sure" b/c I just joined the team and haven't been heavily
> > involved in these specific talks yet).
> >
> > It's all coming down the pipe...
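> >
> > As a strawman, resolving the preflist against in-flight transfers could
> > look something like this (handoff_in_progress/1 and old_owner/2 are
> > invented here):
> >
> >     %% keep serving from the old vnode until its handoff has finished
> >     owner_for(Idx, Ring) ->
> >         case handoff_in_progress(Idx) of
> >             true  -> old_owner(Idx, Ring);
> >             false -> riak_core_ring:index_owner(Ring, Idx)
> >         end.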
> >
> > As for your specific I/O question re handoff_concurrency, you might be
> > right.  I would think it depends on hardware/platform/etc.  I was
> > offering it as a possible stopgap to minimize Greg's pain.  It's
> > certainly a cure to a symptom, not the problem itself.
> > -Ryan
> >
> > On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <me at jdrowell.com> wrote:
> >>
> >> Hi Ryan, Greg,
> >>
> >> 2011/5/5 Ryan Zezeski <rzezeski at basho.com>
> >>>
> >>> 1. For example, riak_core has a `handoff_concurrency` setting that
> >>> determines how many vnodes can concurrently handoff on a given node.
> >>> By default this is set to 4.  That's going to take a while with your
> >>> 2048 vnodes and all :)
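> >>>
> >>> For anyone who wants to experiment, that knob sits in the riak_core
> >>> section of app.config; a sketch (the value 8 is just an example):
> >>>
> >>>     {riak_core, [
> >>>         %% max vnodes handing off at once on this node (default 4)
> >>>         {handoff_concurrency, 8}
> >>>     ]}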
> >>
> >> Won't that make the handoff situation potentially worse? From the
> >> thread I understood that the main problem was that the cluster was
> >> shuffling too much data around and thus becoming unresponsive and/or
> >> returning unexpected results (like "not founds"). I'm attributing the
> >> concerns more to an excessive I/O situation than to how long the
> >> handoff takes. If the handoff can be made transparent (no or little
> >> side effects) I don't think most people will really care (e.g. the
> >> "fix the cluster tomorrow" anecdote).
> >>
> >> How about using a percentage of available I/O to throttle the vnode
> >> handoff concurrency? Start with 1, and monitor the node's I/O (kinda
> >> like 'atop' does, collecting CPU, disk and network metrics); if it is
> >> below the expected usage, then increase the vnode handoff concurrency,
> >> and vice-versa.
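> >>
> >> Roughly, the feedback loop I have in mind (every function here is
> >> imaginary, just to show the shape of it):
> >>
> >>     adjust(Concurrency) ->
> >>         Busy = io_utilisation(),  %% 0.0..1.0, sampled atop-style
> >>         New = if
> >>                   Busy < 0.6 -> Concurrency + 1;            %% gentle increase
> >>                   Busy > 0.8 -> max(1, Concurrency div 2);  %% back off hard
> >>                   true       -> Concurrency
> >>               end,
> >>         set_handoff_concurrency(New),
> >>         timer:sleep(10000),
> >>         adjust(New).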
> >>
> >> I for one would be perfectly happy if the handoff took several hours
> >> (even days) if we could maintain the core riak_kv characteristics
> >> intact during those events. We've all seen looooong RAID rebuild
> >> times, and it's usually better to just sit tight and keep the rebuild
> >> speed low (slower I/O) while keeping all of the dependent systems
> >> running smoothly.
> >>
> >> cheers
> >> -jd