'not found' after join
andy at basho.com
Thu May 5 16:06:28 EDT 2011
Alex's description roughly matches up with some of our plans to address this issue.
As with almost anything, this comes down to a tradeoff between consistency
and availability. In the case of joining nodes, making the
join/handoff/ownership claim process more "atomic" requires a higher degree
of consensus from the machines in the cluster. The current process (which
is clearly non-optimal) allows nodes to join the ring as long as they can
contact one current ring member. A more atomic process would introduce
consensus issues that might prevent nodes from joining in partitioned scenarios.
A good solution would probably involve some consistency knobs around the
join process to deal with a spectrum of failure/partition scenarios.
We are acutely aware of this problem and are actively pursuing
solutions for a near-term release.
On Thu, May 5, 2011 at 12:22 PM, Alexander Sicular <siculars at gmail.com> wrote:
> I'm really loving this thread. Generating great ideas for the way
> things should be... in the future. It seems to me that "the ring
> changes immediately" is actually the problem as Ryan astutely
> mentions. One way the future could look is :
> - a new node comes online
> - introductions are made
> - candidate vnodes are selected for migration (<- insert pixie dust magic)
> - the number of simultaneous migrations is configurable, fewer for
> limited interruption or more for quicker completion
> - vnodes are migrated
> - once migration is completed, ownership is claimed
> Selecting vnodes for migration is where the unicorn cavalry attack the
> dragon's den. If done right(er) the algorithm could be swappable to
> optimize for different strategies. Don't ask me how to implement it,
> I'm only a yellow belt in erlang-fu.
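
A minimal sketch of the join sequence listed above, written in Python rather than Erlang purely for illustration. Everything here is an assumption and not Riak code: the `Ring` class, `select_candidates`, and `staged_join` are hypothetical names, and the "take an even share" selection strategy is only a placeholder for the swappable algorithm the message asks for. The one point it tries to show is the ordering: data is copied first, and ownership is claimed only after the copy has finished.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Ring:
    """Toy ring: vnode index -> owning node, plus each node's stored data."""
    owners: Dict[int, str]
    data: Dict[str, Dict[int, dict]] = field(default_factory=dict)


def select_candidates(ring: Ring, new_node: str) -> List[int]:
    """Hypothetical placeholder strategy: give the new node an even share."""
    nodes = set(ring.owners.values()) | {new_node}
    share = len(ring.owners) // len(nodes)
    return sorted(ring.owners)[:share]


def staged_join(ring: Ring, new_node: str, max_concurrent: int = 4,
                select: Callable[[Ring, str], List[int]] = select_candidates
                ) -> List[List[int]]:
    """Migrate vnodes in batches; claim ownership only after each copy."""
    candidates = select(ring, new_node)
    ring.data.setdefault(new_node, {})
    batches = []
    for start in range(0, len(candidates), max_concurrent):
        batch = candidates[start:start + max_concurrent]
        # Step 1: copy the data while the old owner keeps serving requests.
        for vnode in batch:
            old_owner = ring.owners[vnode]
            source = ring.data.get(old_owner, {}).get(vnode, {})
            ring.data[new_node][vnode] = dict(source)
        # Step 2: only now does the ring change for these vnodes.
        for vnode in batch:
            ring.owners[vnode] = new_node
        batches.append(batch)
    return batches
```

A smaller `max_concurrent` means more, smaller batches (less interruption); a larger one finishes sooner. A different strategy can be passed through the `select` parameter without touching the migration loop.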
> On Thu, May 5, 2011 at 13:33, Ryan Zezeski <rzezeski at basho.com> wrote:
> > John,
> > All great points. The problem is that the ring changes immediately when a
> > node is added. So now, all of a sudden, the preflist is potentially pointing
> > to nodes that don't have the data and they won't have that data until
> > handoff occurs. The faster that data gets transferred, the less time
> > clients have to hit 'notfound'.
> > However, I agree completely with what you're saying. This is just a side
> > effect of how the system currently works. In a perfect world we wouldn't
> > care how long handoff takes and we would also do some sort of automatic
> > congestion control akin to TCP Reno or something. The preflist would
> > point to the "old" partitions until all data has been successfully handed
> > off, and then and only then would we flip the switch for that vnode. I'm
> > pretty sure that's where we are heading (I say "pretty sure" b/c I just
> > joined the team and haven't been heavily involved in these specific talks
> > yet).
> > It's all coming down the pipe...
> > As for your specific I/O question re handoff_concurrency, you might be right.
> > I would think it depends on hardware/platform/etc. I was offering it as a
> > possible stopgap to minimize Greg's pain. It's certainly a cure to a
> > symptom, not the problem itself.
> > -Ryan
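
The "flip the switch only after handoff" idea in this message can be sketched as follows. This is a toy model in Python under stated assumptions, not how riak_core is implemented: `Partition`, `begin_handoff`, `complete_handoff`, and `lookup` are hypothetical names, and each node's storage is just a dictionary.

```python
class Partition:
    """Toy partition that keeps routing to the old owner during handoff."""

    def __init__(self, owner):
        self.owner = owner          # node that requests are routed to
        self.pending_owner = None   # node that takes over after handoff

    def begin_handoff(self, new_owner):
        # Record the future owner, but do not change routing yet.
        self.pending_owner = new_owner

    def complete_handoff(self):
        # Flip ownership only once all data has been transferred.
        if self.pending_owner is None:
            raise RuntimeError("no handoff in progress")
        self.owner, self.pending_owner = self.pending_owner, None


def lookup(partition, stores, key):
    """Read a key from whichever node the partition currently routes to."""
    return stores[partition.owner].get(key, "notfound")


stores = {"old": {"k": "v"}, "new": {}}
p = Partition("old")
p.begin_handoff("new")
during = lookup(p, stores, "k")      # still served by "old"
stores["new"].update(stores["old"])  # the handoff itself
p.complete_handoff()
after = lookup(p, stores, "k")       # now served by "new"
```

In this model the read never lands on a node that lacks the data. If `owner` were switched inside `begin_handoff` instead, the first lookup would hit the empty "new" store and return "notfound", which is the behaviour described at the top of this message.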
> > On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <me at jdrowell.com> wrote:
> >> Hi Ryan, Greg,
> >> 2011/5/5 Ryan Zezeski <rzezeski at basho.com>
> >>> 1. For example, riak_core has a `handoff_concurrency` setting that
> >>> determines how many vnodes can hand off concurrently on a given node. By
> >>> default this is set to 4. That's going to take a while with your 2048
> >>> vnodes and all :)
> >> Won't that make the handoff situation potentially worse? From the thread I
> >> understood that the main problem was that the cluster was shuffling too much
> >> data around and thus becoming unresponsive and/or returning unexpected
> >> results (like "not founds"). I'm attributing the concerns more to an
> >> excessive I/O situation than to how long the handoff takes. If the handoff
> >> can be made transparent (no or little side effects) I don't think most
> >> people will really care (e.g. the "fix the cluster tomorrow" anecdote).
> >> How about using a percentage of available I/O to throttle the vnode
> >> handoff concurrency? Start with 1, and monitor the node's I/O (kinda like
> >> 'atop' does, collecting CPU, disk and network metrics); if it is below
> >> expected usage, then increase the vnode handoff concurrency, and vice versa.
> >> I for one would be perfectly happy if the handoff took several hours (or
> >> days) if we could maintain the core riak_kv characteristics intact during
> >> those events. We've all seen looooong RAID rebuild times, and it's
> >> better to just sit tight and keep the rebuild speed low (slower I/O) while
> >> keeping all of the dependent systems running smoothly.
> >> cheers
> >> -jd
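
The throttling idea in this message (start at 1, raise concurrency while I/O is below the expected level, lower it otherwise) could look roughly like the sketch below. It is only an illustration with assumed names and numbers: `next_concurrency`, the 0.7 utilization target, and the ceiling of 16 are hypothetical, and halving on overload is borrowed from TCP-style additive-increase/multiplicative-decrease rather than anything stated in the thread. How the utilization figure is measured is left to the caller.

```python
def next_concurrency(current, io_utilization, target=0.7, floor=1, ceiling=16):
    """Return the handoff concurrency to use for the next interval.

    io_utilization is a fraction in [0, 1] supplied by the caller
    (for example, derived from disk or network metrics).
    """
    if io_utilization < target:
        # Headroom available: add one more concurrent handoff.
        return min(current + 1, ceiling)
    # At or above the target: back off sharply by halving.
    return max(current // 2, floor)


# Replay a hypothetical series of utilization samples, starting from 1.
concurrency = 1
trace = []
for sample in [0.2, 0.3, 0.4, 0.9, 0.5]:
    concurrency = next_concurrency(concurrency, sample)
    trace.append(concurrency)
# trace is [2, 3, 4, 2, 3]
```

Backing off faster than ramping up favours keeping the node responsive over finishing handoff quickly, which matches the "sit tight and keep the rebuild speed low" preference above.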
> > _______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com