'not found' after join

Ben Tilly btilly at gmail.com
Thu May 5 17:07:41 EDT 2011


There are solutions to that consistency issue.  You can set
allow_mult to true, give each object a link to a change history,
and give each change a record of what changed.  The change
history could be done as a singly linked list, where each change is
inserted into a bucket with a randomly generated key.

And then on reading an object, if you find siblings, you can go look
at the change histories, merge them, and come up with a resolved
object.

This is a *lot* of application logic, but it should be doable.
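
Roughly, the resolve step might look something like this in Erlang
(the change-record shape and apply_change/2 are made-up,
application-specific details; it only shows the merge-and-replay idea):

    %% Merge the change lists recovered from each sibling's history
    %% chain into one ordered, de-duplicated list, then replay them
    %% over a base value.
    -module(sibling_resolve).
    -export([resolve/2]).

    %% Each change is assumed to look like {UniqueChangeKey, Change}.
    resolve(Base, SiblingHistories) ->
        Merged = lists:ukeysort(1, lists:append(SiblingHistories)),
        lists:foldl(fun({_Key, Change}, Acc) -> apply_change(Change, Acc) end,
                    Base, Merged).

    %% Placeholder: a real application would interpret its own change
    %% records here, e.g. {append, Element} against a list value.
    apply_change({append, Elem}, List) when is_list(List) -> List ++ [Elem];
    apply_change(_Other, Acc) -> Acc.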

On Thu, May 5, 2011 at 1:14 PM, Greg Nelson <grourk at dropcam.com> wrote:
> The future I'd like to see is basically what I initially expected.  That is,
> I can add a single node to an online cluster and clients should not even see
> any effects of this or need to know that it's even happening -- except of
> course the side effects like the added load on the cluster incurred by
> gossiping new ring state, handing off data, etc.  But if no data has
> actually been lost, I don't believe data should ever be unavailable,
> temporarily or not.  And I'd like to be able to, as someone else mentioned,
> add a node and throttle the handoffs and let it trickle over hours or even
> days.
>
> Waving hands and saying that eventually the data will make it is true in
> principle, but in practice if you are following a read/modify/write pattern
> for some objects, you could easily lose data.  For example, my application writes
> JSON arrays to certain objects, and when it wishes to append something to
> the array, it will read/append/write back.  If that initial read returns
> 404, then a new empty array is created.  This is normal operation.  But if
> that 404 is not a "normal" 404, it will happily create a new empty array,
> append, and write back a single-element array to that key.  Of course there
> could have been a 100-element array in Riak that was just unavailable at the
> time which is now effectively lost.
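
To make that failure mode concrete, the risky pattern looks roughly
like this (fetch/1 and store/2 are hypothetical stand-ins for the
real client calls, not any actual API):

    -module(append_example).
    -export([append_element/2]).

    %% Hypothetical stand-ins for the real fetch/store calls.
    fetch(_Key)        -> {error, notfound}.
    store(_Key, Value) -> {ok, Value}.

    %% Read/append/write back.  If the initial read returns notfound
    %% only because the data is temporarily unavailable during handoff,
    %% an existing 100-element array gets overwritten by a
    %% single-element one.
    append_element(Key, Elem) ->
        Current = case fetch(Key) of
                      {ok, List}        -> List;
                      {error, notfound} -> []   %% notfound treated as "empty"
                  end,
        store(Key, Current ++ [Elem]).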
>
> Anyhow, I do understand the importance of knowing what will happen when
> doing something operationally like adding a node, and I understand that one
> can't naively expect everything to just work like magic.  But the current
> behavior is pretty poorly documented and surprising.  I don't think it was
> even mentioned in the operations webinar!  (Ok, I'll stop beating a dead
> horse.  :))
>
> On Thursday, May 5, 2011 at 12:22 PM, Alexander Sicular wrote:
>
> I'm really loving this thread. Generating great ideas for the way
> things should be... in the future. It seems to me that "the ring
> changes immediately" is actually the problem as Ryan astutely
> mentions. One way the future could look is:
>
> - a new node comes online
> - introductions are made
> - candidate vnodes are selected for migration (<- insert pixie dust magic
> here)
> - the number of simultaneous migrations is configurable, fewer for
> limited interruption or more for quicker completion
> - vnodes are migrated
> - once migration is completed, ownership is claimed
>
> Selecting vnodes for migration is where the unicorn cavalry attacks the
> dragon's den. If done right(er), the algorithm could be swappable to
> optimize for different strategies. Don't ask me how to implement it,
> I'm only a yellow belt in erlang-fu.
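
Purely as an illustration of the proposed sequence (every function
below is hypothetical; each one just names a step):

    -module(join_sketch).
    -export([join_new_node/3]).

    join_new_node(Node, Ring, MaxConcurrent) ->
        ok = introduce(Node, Ring),                       %% gossip, introductions
        Candidates = select_candidate_vnodes(Node, Ring), %% the "pixie dust" step
        ok = migrate(Candidates, MaxConcurrent),          %% throttled handoff
        claim_ownership(Node, Candidates, Ring).          %% flip ownership last

    %% Stubs so the sketch compiles; a real implementation would live
    %% somewhere inside riak_core.
    introduce(_Node, _Ring) -> ok.
    select_candidate_vnodes(_Node, _Ring) -> [].
    migrate(_Vnodes, _MaxConcurrent) -> ok.
    claim_ownership(_Node, _Vnodes, _Ring) -> ok.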
>
> Cheers,
> Alexander
>
> On Thu, May 5, 2011 at 13:33, Ryan Zezeski <rzezeski at basho.com> wrote:
>
> John,
> All great points.  The problem is that the ring changes immediately when a
> node is added.  So now, all of a sudden, the preflist is potentially pointing
> to nodes that don't have the data, and they won't have that data until
> handoff occurs.  The faster that data gets transferred, the smaller the
> window in which your clients can hit 'notfound'.
> However, I agree completely with what you're saying.  This is just a side
> effect of how the system currently works.  In a perfect world we wouldn't
> care how long handoff takes and we would also do some sort of automatic
> congestion control akin to TCP Reno or something.  The preflist would still
> point to the "old" partitions until all data has been successfully handed
> off, and then and only then would we flip the switch for that vnode.  I'm
> pretty sure that's where we are heading (I say "pretty sure" b/c I just
> joined the team and haven't been heavily involved in these specific talks
> yet).
> It's all coming down the pipe...
> As for your specific I/O question re handoff_concurrency, you might be right.
>  I would think it depends on hardware/platform/etc.  I was offering it as a
> possible stopgap to minimize Greg's pain.  It's certainly a cure to a
> symptom, not the problem itself.
> -Ryan
>
> On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <me at jdrowell.com> wrote:
>
> Hi Ryan, Greg,
>
> 2011/5/5 Ryan Zezeski <rzezeski at basho.com>
>
> 1. For example, riak_core has a `handoff_concurrency` setting that
> determines how many vnodes can hand off concurrently on a given node.  By
> default this is set to 4.  That's going to take a while with your 2048
> vnodes and all :)
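
For reference, that setting lives in the riak_core stanza of
app.config; bumping it would look something like this, though exact
placement can vary by Riak version:

    {riak_core, [
        %% number of vnodes a node will hand off concurrently (default 4)
        {handoff_concurrency, 8}
    ]}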
>
> Won't that make the handoff situation potentially worse? From the thread I
> understood that the main problem was that the cluster was shuffling too much
> data around and thus becoming unresponsive and/or returning unexpected
> results (like "not founds"). I'm attributing the concerns more to an
> excessive I/O situation than to how long the handoff takes. If the handoff
> can be made transparent (with little or no side effects), I don't think most
> people will really care (e.g. the "fix the cluster tomorrow" anecdote).
>
> How about using a percentage of available I/O to throttle the vnode
> handoff concurrency? Start with 1, and monitor the node's I/O (kinda like
> 'atop' does, collecting CPU, disk and network metrics); if it is below the
> expected usage, then increase the vnode handoff concurrency, and vice-versa.
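
A crude sketch of that feedback loop (the thresholds and the
utilisation sample are made up; the point is just additive
increase/decrease around a target):

    -module(handoff_throttle).
    -export([adjust/2]).

    %% Adjust the handoff concurrency given an I/O utilisation sample
    %% in the range 0.0..1.0.  Thresholds are arbitrary placeholders.
    adjust(Concurrency, IoUtil) when IoUtil < 0.50 ->
        Concurrency + 1;                          %% headroom: speed up
    adjust(Concurrency, IoUtil) when IoUtil > 0.80, Concurrency > 1 ->
        Concurrency - 1;                          %% saturated: back off
    adjust(Concurrency, _IoUtil) ->
        Concurrency.                              %% comfort zone: hold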
>
> I for one would be perfectly happy if the handoff took several hours (even
> days) if we could maintain the core riak_kv characteristics intact during
> those events. We've all seen looooong RAID rebuild times, and it's usually
> better to just sit tight and keep the rebuild speed low (slower I/O) while
> keeping all of the dependent systems running smoothly.
>
> cheers
> -jd
>
>
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>



