'not found' after join

Greg Nelson grourk at dropcam.com
Thu May 5 17:26:39 EDT 2011


There's no concurrency for writes to these objects, which is what I was hoping would simplify the problem. But it sounds like I'll have to turn on allow_mult and resolve conflicts anyway.
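
For these append-only arrays the resolution step would be roughly a union of
the siblings. A minimal sketch in plain Python (no Riak client involved; it
assumes elements are only ever appended, never removed):

    import json

    def resolve_siblings(sibling_payloads):
        """Merge sibling JSON arrays by union, keeping first-seen order.

        Only safe because elements are appended and never removed; a
        deletion would be resurrected by this merge.
        """
        merged, seen = [], set()
        for payload in sibling_payloads:
            for item in json.loads(payload):
                if item not in seen:
                    seen.add(item)
                    merged.append(item)
        return json.dumps(merged)

    # One sibling written against a stale/empty read, one with the full history:
    print(resolve_siblings(['["event-99"]', '["event-1", "event-2"]']))
    # ["event-99", "event-1", "event-2"]
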
On Thursday, May 5, 2011 at 2:23 PM, Bob Ippolito wrote: 
> It's not necessarily as much application logic as you might think;
> you've just described what statebox [1] is an abstraction for (but it
> encapsulates change history in the value). It's all Erlang, but the
> technique could be applied in any language. That said, it's really
> frustrating that data is unavailable during hand-off, but at least you
> can mitigate it with a smart model (you should probably have this
> anyway). We're also really looking forward to having this issue
> resolved.
> 
> Greg's usage pattern sounds like it's fundamentally inconsistent even
> in the normal case when no handoff is occurring (assuming that there's
> any concurrency for writes).
> 
> [1] http://github.com/mochi/statebox
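> 
> To make that concrete, here is a small self-contained Python sketch of the
> statebox idea (a value carrying a timestamped op log, with siblings merged
> by replaying the union of their ops in order). The names are illustrative,
> not statebox's actual Erlang API:
> 
>     import time
> 
>     # Registry of named, replayable operations (they must be deterministic).
>     OPS = {"add": lambda value, x: sorted(set(value) | {x})}
> 
>     class StateBox(object):
>         """A value plus the timestamped ops that produced it."""
>         def __init__(self, value, ops=None):
>             self.value = value
>             self.ops = ops if ops is not None else []  # [(ts, op_name, arg)]
> 
>         def modify(self, op_name, arg):
>             """Apply a named op and record it so it can be replayed later."""
>             ops = self.ops + [(time.time(), op_name, arg)]
>             return StateBox(OPS[op_name](self.value, arg), ops)
> 
>     def merge(initial, boxes):
>         """Resolve siblings by replaying the union of their op logs in order."""
>         all_ops = sorted(set(op for box in boxes for op in box.ops))
>         value = initial
>         for _ts, op_name, arg in all_ops:
>             value = OPS[op_name](value, arg)
>         return StateBox(value, all_ops)
> 
>     # Two concurrent writers that both started from an empty value:
>     a = StateBox([]).modify("add", "camera-1")
>     b = StateBox([]).modify("add", "camera-2")
>     print(merge([], [a, b]).value)  # ['camera-1', 'camera-2']
> 
> statebox itself also bounds the op log so the value doesn't grow without
> limit.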
> 
> On Thu, May 5, 2011 at 2:07 PM, Ben Tilly <btilly at gmail.com> wrote:
> > There are solutions to that consistency issue. You can set
> > allow_mult to true, give each object a link to a change history,
> > and give each change a record of what changed. The change
> > history could be kept as a singly linked list, where each change is
> > inserted into a bucket with a randomly generated key.
> > 
> > And then on reading an object, if you find siblings, you can go look
> > at the change histories, merge them, and come up with a resolved
> > object.
> > 
> > This is a *lot* of application logic, but it should be doable.
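> > 
> > A rough, self-contained sketch of that layout (an in-memory dict stands in
> > for the Riak bucket of change records, and the names are made up):
> > 
> >     import uuid
> > 
> >     changes = {}  # stand-in for a Riak bucket of change records
> > 
> >     def record_change(prev_key, description):
> >         """Append one record to a singly linked history; returns its key."""
> >         key = uuid.uuid4().hex  # the randomly generated key
> >         changes[key] = {"prev": prev_key, "change": description}
> >         return key
> > 
> >     def walk_history(head_key):
> >         """Follow the links from an object's history head back to the start."""
> >         history = []
> >         while head_key is not None:
> >             record = changes[head_key]
> >             history.append(record["change"])
> >             head_key = record["prev"]
> >         return list(reversed(history))  # oldest change first
> > 
> >     def resolve(sibling_heads):
> >         """Merge siblings by taking the union of their histories, in order."""
> >         merged, seen = [], set()
> >         for head in sibling_heads:
> >             for change in walk_history(head):
> >                 if change not in seen:
> >                     seen.add(change)
> >                     merged.append(change)
> >         return merged
> > 
> >     # Two siblings that diverged after a shared first change:
> >     root = record_change(None, "add camera-1")
> >     a = record_change(root, "add camera-2")
> >     b = record_change(root, "add camera-3")
> >     print(resolve([a, b]))  # ['add camera-1', 'add camera-2', 'add camera-3']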
> > 
> > On Thu, May 5, 2011 at 1:14 PM, Greg Nelson <grourk at dropcam.com> wrote:
> > > The future I'd like to see is basically what I initially expected. That is,
> > > I can add a single node to an online cluster and clients should not even see
> > > any effects of this or need to know that it's even happening -- except of
> > > course the side effects like the added load on the cluster incurred by
> > > gossiping new ring state, handing off data, etc. But if no data has
> > > actually been lost, I don't believe data should ever be unavailable,
> > > even temporarily. And I'd like to be able to, as someone else mentioned,
> > > add a node and throttle the handoffs and let it trickle over hours or even
> > > days.
> > > 
> > > Waving hands and saying that eventually the data will make it is true in
> > > principle, but in practice, if you are following a read/modify/write pattern
> > > for some objects, you could easily lose data. For example, my application
> > > writes JSON arrays to certain objects, and when it wishes to append something
> > > to the array, it will read, append, and write back. If that initial read
> > > returns 404, a new empty array is created. This is normal operation. But if
> > > that 404 is not a "normal" 404, the application will happily create a new
> > > empty array, append, and write back a single-element array to that key. Of
> > > course there could have been a 100-element array in Riak that was just
> > > unavailable at the time, and that array is now effectively lost.
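> > > 
> > > In code, the failure mode looks something like this (a self-contained
> > > Python sketch; an in-memory dict stands in for Riak and the key name is
> > > made up):
> > > 
> > >     import json
> > > 
> > >     store = {"device-events": json.dumps(["event-1", "event-2"])}
> > > 
> > >     def append_event(key, event, read):
> > >         """Read/append/write-back; `read` models a GET that may 404 (None)."""
> > >         raw = read(key)
> > >         array = json.loads(raw) if raw is not None else []  # 404 => new array
> > >         array.append(event)
> > >         store[key] = json.dumps(array)
> > > 
> > >     # Normal case: the read sees the existing array, so appending is safe.
> > >     append_event("device-events", "event-3", lambda k: store.get(k))
> > >     print(json.loads(store["device-events"]))  # ['event-1', 'event-2', 'event-3']
> > > 
> > >     # Handoff case: the GET says notfound even though the data exists, and
> > >     # the writer clobbers the key with a single-element array.
> > >     append_event("device-events", "event-4", lambda k: None)
> > >     print(json.loads(store["device-events"]))  # ['event-4']  everything else is gone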
> > > 
> > > Anyhow, I do understand the importance of knowing what will happen when
> > > doing something operationally like adding a node, and I understand that one
> > > can't naively expect everything to just work like magic. But the current
> > > behavior is pretty poorly documented and surprising. I don't think it was
> > > even mentioned in the operations webinar! (Ok, I'll stop beating a dead
> > > horse. :))
> > > 
> > > On Thursday, May 5, 2011 at 12:22 PM, Alexander Sicular wrote:
> > > 
> > > I'm really loving this thread. Generating great ideas for the way
> > > things should be... in the future. It seems to me that "the ring
> > > changes immediately" is actually the problem as Ryan astutely
> > > mentions. One way the future could look is:
> > > 
> > > - a new node comes online
> > > - introductions are made
> > > - candidate vnodes are selected for migration (<- insert pixie dust magic
> > > here)
> > > - the number of simultaneous migrations is configurable, fewer for
> > > limited interruption or more for quicker completion
> > > - vnodes are migrated
> > > - once migration is completed, ownership is claimed
> > > 
> > > Selecting vnodes for migration is where the unicorn cavalry attacks the
> > > dragon's den. If done right(er), the algorithm could be swappable to
> > > optimize for different strategies. Don't ask me how to implement it,
> > > I'm only a yellow belt in erlang-fu.
> > > 
> > > Cheers,
> > > Alexander
> > > 
> > > On Thu, May 5, 2011 at 13:33, Ryan Zezeski <rzezeski at basho.com> wrote:
> > > 
> > > John,
> > > All great points. The problem is that the ring changes immediately when a
> > > node is added. So now, all of a sudden, the preflist is potentially pointing
> > > to nodes that don't have the data, and they won't have that data until
> > > handoff occurs. The faster that data gets transferred, the less time your
> > > clients have to hit 'notfound'.
> > > However, I agree completely with what you're saying. This is just a side
> > > effect of how the system currently works. In a perfect world we wouldn't
> > > care how long handoff takes and we would also do some sort of automatic
> > > congestion control akin to TCP Reno or something. The preflist would still
> > > point to the "old" partitions until all data has been successfully handed
> > > off, and then and only then would we flip the switch for that vnode. I'm
> > > pretty sure that's where we are heading (I say "pretty sure" b/c I just
> > > joined the team and haven't been heavily involved in these specific talks
> > > yet).
> > > It's all coming down the pipe...
> > > As for your specific I/O question re handoff_concurrency, you might be right.
> > > I would think it depends on hardware/platform/etc. I was offering it as a
> > > possible stopgap to minimize Greg's pain. It's certainly a cure to a
> > > symptom, not the problem itself.
> > > -Ryan
> > > 
> > > On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <me at jdrowell.com> wrote:
> > > 
> > > Hi Ryan, Greg,
> > > 
> > > 2011/5/5 Ryan Zezeski <rzezeski at basho.com>
> > > 
> > > 1. For example, riak_core has a `handoff_concurrency` setting that
> > > determines how many vnodes can concurrently hand off on a given node. By
> > > default this is set to 4. That's going to take a while with your 2048
> > > vnodes and all :)
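> > > 
> > > For anyone looking for the knob: if I remember right it's a riak_core
> > > application setting, so it can be raised in the riak_core section of
> > > app.config, roughly like this (4 being the default mentioned above):
> > > 
> > >     {riak_core, [
> > >         %% number of vnodes a node will hand off concurrently
> > >         {handoff_concurrency, 4}
> > >     ]}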
> > > 
> > > Won't that make the handoff situation potentially worse? From the thread I
> > > understood that the main problem was that the cluster was shuffling too much
> > > data around and thus becoming unresponsive and/or returning unexpected
> > > results (like "not founds"). I'm attributing the concerns more to an
> > > excessive I/O situation than to how long the handoff takes. If the handoff
> > > can be made transparent (no or little side effects) I don't think most
> > > people will really care (e.g. the "fix the cluster tomorrow" anecdote).
> > > 
> > > How about using a percentage of available I/O to throttle the vnode
> > > handoff concurrency? Start with 1 and monitor the node's I/O (kinda like
> > > 'atop' does, collecting CPU, disk, and network metrics); if it is below the
> > > expected usage, increase the vnode handoff concurrency, and vice versa.
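> > > 
> > > That control loop could look roughly like this sketch (the 60% target,
> > > the step sizes, and the stubbed I/O measurement are all made up; a real
> > > version would read actual disk/network counters and push the new value
> > > into Riak):
> > > 
> > >     import random
> > >     import time
> > > 
> > >     TARGET_IO = 0.60          # stay below 60% busy (illustrative number)
> > >     MIN_CONC, MAX_CONC = 1, 16
> > > 
> > >     def io_utilization():
> > >         """Stand-in for a real iostat/atop-style measurement, 0.0-1.0."""
> > >         return random.uniform(0.2, 0.9)
> > > 
> > >     def adjust(concurrency):
> > >         """AIMD-style: creep up while I/O is comfortable, back off fast."""
> > >         if io_utilization() < TARGET_IO:
> > >             return min(concurrency + 1, MAX_CONC)
> > >         return max(concurrency // 2, MIN_CONC)
> > > 
> > >     concurrency = MIN_CONC
> > >     for _ in range(10):
> > >         concurrency = adjust(concurrency)
> > >         print("handoff_concurrency ->", concurrency)
> > >         time.sleep(0.1)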
> > > 
> > > I for one would be perfectly happy if the handoff took several hours (even
> > > days) if we could maintain the core riak_kv characteristics intact during
> > > those events. We've all seen looooong RAID rebuild times, and it's usually
> > > better to just sit tight and keep the rebuild speed low (slower I/O) while
> > > keeping all of the dependent systems running smoothly.
> > > 
> > > cheers
> > > -jd
> > > 
> > > 
> 