'not found' after join

Mike Oxford moxford at gmail.com
Thu May 5 16:28:22 EDT 2011


As someone not familiar with Riak's internals...

You have xN replication.
Make the transition nodes part of an x(N+1) set and write-only (e.g., they
don't count toward the read quorum).

If you're set up for x3 replication, then the transition bucket ends up as
part of an x4 replication.
As more queries come in, you will get up to 4 responses, but the write-only
response gets tossed.
A split would give you at most a 2x2 that's really 2x1+1, and you can
rebuild normally.
Once your x4 replication is consistent, you remove one node from the
replication set, taking you back to x3.
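
A tiny Erlang sketch of the quorum accounting (entirely hypothetical;
read_ok/2 and the member/transition tags are made up, not riak_kv
internals):

    %% Hypothetical sketch, not riak_kv code: a read succeeds once R
    %% replies arrive from full ring members; replies from the
    %% write-only transition replica are tossed.
    -module(transition_quorum).
    -export([read_ok/2]).

    %% Responses: [{member | transition, Reply}]
    read_ok(Responses, R) ->
        length([Reply || {member, Reply} <- Responses]) >= R.

With R=2, read_ok([{member, v}, {transition, v}, {member, v}], 2) returns
true, and the transition reply never counts toward the quorum.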

This, while not trivial, seems to leverage many of the pieces you already
have in place, and it avoids the problem of "slow replication fubars
requests for indeterminate amounts of time."

Just a suggestion from the peanut gallery... :)

-mox



On Thu, May 5, 2011 at 1:06 PM, Andy Gross <andy at basho.com> wrote:

>
> Alex's description roughly matches up with some of our plans to address
> this issue.
>
> As with almost anything, this comes down to a tradeoff between consistency
> and availability.  In the case of joining nodes, making the
> join/handoff/ownership claim process more "atomic" requires a higher degree
> of consensus from the machines in the cluster.  The current process (which
> is clearly non-optimal) allows nodes to join the ring as long as they can
> contact one current ring member.  A more atomic process would introduce
> consensus issues that might prevent nodes from joining in partitioned
> scenarios.
>
> A good solution would probably involve some consistency knobs around the
> join process to deal with a spectrum of failure/partition scenarios.
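>
> Purely as illustration, such knobs might look like an app.config stanza
> along these lines (neither setting exists today; the names are made up):
>
>     {riak_core, [
>         %% hypothetical: how many ring members must acknowledge an
>         %% ownership change before it takes effect
>         {join_consensus_quorum, majority},
>         %% hypothetical: permit the current one-member gossip join as
>         %% a fallback when the cluster is partitioned
>         {join_partition_fallback, true}
>     ]}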
>
> This is something we are acutely aware of, and we are actively pursuing
> solutions for a near-term release.
>
> - Andy
>
>
> On Thu, May 5, 2011 at 12:22 PM, Alexander Sicular <siculars at gmail.com> wrote:
>
>> I'm really loving this thread. Generating great ideas for the way
>> things should be... in the future. It seems to me that "the ring
>> changes immediately" is actually the problem, as Ryan astutely
>> mentions. One way the future could look is:
>>
>> - a new node comes online
>> - introductions are made
>> - candidate vnodes are selected for migration (<- insert pixie dust
>> magic here)
>> - the number of simultaneous migrations is configurable: fewer for
>> limited interruption, or more for quicker completion
>> - vnodes are migrated
>> - once migration is completed, ownership is claimed
>>
>> Selecting vnodes for migration is where the unicorn cavalry attacks the
>> dragon's den. If done right(er), the algorithm could be swappable to
>> optimize for different strategies. Don't ask me how to implement it,
>> I'm only a yellow belt in erlang-fu.
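>>
>> If the selection algorithm were swappable, a minimal Erlang behaviour
>> could pin down the contract; a hypothetical sketch (nothing like this
>> exists in riak_core today):
>>
>>     %% Hypothetical pluggable claim strategy: the callback picks which
>>     %% vnodes should migrate to the newly joined node.
>>     -module(claim_strategy).
>>     -callback candidate_vnodes(Ring :: term(), NewNode :: node()) ->
>>         [non_neg_integer()].
>>
>> Different strategies (fewest moves, best balance, rack awareness) would
>> then just be different modules implementing that callback.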
>>
>> Cheers,
>> Alexander
>>
>> On Thu, May 5, 2011 at 13:33, Ryan Zezeski <rzezeski at basho.com> wrote:
>> > John,
>> > All great points.  The problem is that the ring changes immediately
>> > when a node is added.  So now, all of a sudden, the preflist is
>> > potentially pointing to nodes that don't have the data, and they
>> > won't have that data until handoff occurs.  The faster that data gets
>> > transferred, the less time your clients have to hit 'notfound'.
>> > However, I agree completely with what you're saying.  This is just a
>> > side effect of how the system currently works.  In a perfect world we
>> > wouldn't care how long handoff takes, and we would also do some sort
>> > of automatic congestion control akin to TCP Reno or something.  The
>> > preflist would still point to the "old" partitions until all data has
>> > been successfully handed off, and then and only then would we flip
>> > the switch for that vnode.  I'm pretty sure that's where we are
>> > heading (I say "pretty sure" b/c I just joined the team and haven't
>> > been heavily involved in these specific talks yet).
>> > It's all coming down the pipe...
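>> >
>> > As a purely hypothetical sketch of that switch (illustrative only,
>> > not actual riak_core code):
>> >
>> >     %% Hypothetical: keep routing to the old owner until handoff for
>> >     %% the partition reports complete, then flip to the new owner.
>> >     -module(preflist_flip).
>> >     -export([owner_for/2]).
>> >
>> >     %% Ring: #{Partition => {OldOwner, NewOwner, HandoffDone}}
>> >     owner_for(Partition, Ring) ->
>> >         case maps:get(Partition, Ring) of
>> >             {_Old, New, true}  -> New;  %% handoff done: flip
>> >             {Old, _New, false} -> Old   %% still handing off
>> >         end.
>> >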
>> > As for your specific I/O question re handoff_concurrency, you might
>> > be right.  I would think it depends on hardware/platform/etc.  I was
>> > offering it as a possible stopgap to minimize Greg's pain.  It's
>> > certainly a cure for a symptom, not the problem itself.
>> > -Ryan
>> >
>> > On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <me at jdrowell.com> wrote:
>> >>
>> >> Hi Ryan, Greg,
>> >>
>> >> 2011/5/5 Ryan Zezeski <rzezeski at basho.com>
>> >>>
>> >>> 1. For example, riak_core has a `handoff_concurrency` setting that
>> >>> determines how many vnodes can concurrently hand off on a given
>> >>> node.  By default this is set to 4.  That's going to take a while
>> >>> with your 2048 vnodes and all :)
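>> >>>
>> >>> Raising it is just an app.config tweak, e.g. (the value here is
>> >>> only an example):
>> >>>
>> >>>     {riak_core, [
>> >>>         %% allow more simultaneous vnode handoffs per node
>> >>>         {handoff_concurrency, 8}
>> >>>     ]}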
>> >>
>> >> Won't that make the handoff situation potentially worse? From the
>> >> thread I understood that the main problem was that the cluster was
>> >> shuffling too much data around and thus becoming unresponsive and/or
>> >> returning unexpected results (like "not founds"). I'm attributing
>> >> the concerns more to an excessive I/O situation than to how long the
>> >> handoff takes. If the handoff can be made transparent (no or few
>> >> side effects) I don't think most people will really care (e.g. the
>> >> "fix the cluster tomorrow" anecdote).
>> >>
>> >> How about using a percentage of available I/O to throttle the vnode
>> >> handoff concurrency? Start with 1 and monitor the node's I/O (kinda
>> >> like 'atop' does, collecting CPU, disk, and network metrics); if it
>> >> is below the expected usage, increase the vnode handoff concurrency,
>> >> and vice-versa.
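>> >>
>> >> A rough sketch of that feedback loop, assuming some io_utilization/0
>> >> metric source (the whole module is hypothetical):
>> >>
>> >>     %% Hypothetical AIMD-style throttle: creep concurrency up while
>> >>     %% I/O sits below the target percentage, halve it when over.
>> >>     -module(handoff_throttle).
>> >>     -export([adjust/2]).
>> >>
>> >>     adjust(Concurrency, TargetPct) ->
>> >>         case io_utilization() < TargetPct of
>> >>             true  -> Concurrency + 1;           %% additive increase
>> >>             false -> max(1, Concurrency div 2)  %% multiplicative decrease
>> >>         end.
>> >>
>> >>     %% Stub: in practice, wire this to atop-style CPU/disk/network
>> >>     %% metrics.
>> >>     io_utilization() ->
>> >>         rand:uniform(100).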
>> >>
>> >> I for one would be perfectly happy if the handoff took several
>> >> hours (even days) if we could maintain the core riak_kv
>> >> characteristics intact during those events. We've all seen looooong
>> >> RAID rebuild times, and it's usually better to just sit tight and
>> >> keep the rebuild speed low (slower I/O) while keeping all of the
>> >> dependent systems running smoothly.
>> >>
>> >> cheers
>> >> -jd