'not found' after join

Bob Ippolito bob at redivi.com
Thu May 5 17:23:27 EDT 2011


It's not necessarily as much application logic as you might think;
you've just described what statebox [1] is an abstraction for (though
it encapsulates the change history in the value itself). It's all
Erlang, but the technique could be applied in any language. That said,
it's really frustrating that data is unavailable during handoff, but
at least you can mitigate it with a smart data model (which you should
probably have anyway). We're also really looking forward to having
this issue resolved.
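
To make that concrete, here is a minimal sketch of the idea (not
statebox's actual API): carry the value together with a timestamped
operation log, and merge siblings by replaying the union of the logs
in time order.

    %% Sketch of the statebox idea, not its real interface: a value
    %% plus the operations that produced it.
    -module(opbox).
    -export([new/1, modify/2, merge/2, value/1]).

    new(Value) -> {Value, []}.

    %% Apply an operation (a fun) and record it with a timestamp.
    modify(Op, {Value, Ops}) ->
        {Op(Value), [{erlang:now(), Op} | Ops]}.

    %% Merge siblings: replay the union of all recorded operations,
    %% oldest first, against a fresh initial value.
    merge(Boxes, Initial) ->
        AllOps = lists:usort(lists:append([Ops || {_, Ops} <- Boxes])),
        Value = lists:foldl(fun({_T, Op}, Acc) -> Op(Acc) end,
                            Initial, AllOps),
        {Value, AllOps}.

    value({Value, _Ops}) -> Value.

With something like this, two siblings that each appended an element
merge into an object containing both appends, instead of one write
clobbering the other.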

Greg's usage pattern sounds like it's fundamentally inconsistent even
in the normal case when no handoff is occurring (assuming that there's
any concurrency for writes).

[1] http://github.com/mochi/statebox

On Thu, May 5, 2011 at 2:07 PM, Ben Tilly <btilly at gmail.com> wrote:
> There are solutions to that consistency issue.  You can set
> allow_mult to true, give each object a link to a change history,
> and give each change a record of what changed.  The change history
> could be a singly linked list, where each change is inserted into a
> bucket under a randomly generated key.
>
> And then on reading an object, if you find siblings, you can go look
> at the change histories, merge them, and come up with a resolved
> object.
>
> This is a *lot* of application logic, but it should be doable.
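
A sketch of the write path Ben describes: each change record gets a
fresh random key and links back to the previous head of the history.
The riakc_* calls are from the Erlang protocol buffers client, but
treat the exact signatures as assumptions.

    %% Append one change record to an object's singly linked history.
    %% PrevKey is the key of the current newest change (or <<>>).
    append_change(Pid, Bucket, Change, PrevKey) ->
        %% Random key, so concurrent writers never collide on entries.
        Key = base64:encode(crypto:rand_bytes(16)),
        Record = term_to_binary({Change, PrevKey}),
        Obj = riakc_obj:new(Bucket, Key, Record),
        ok = riakc_pb_socket:put(Pid, Obj),
        Key.  %% the new head; store this pointer on the object itself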
>
> On Thu, May 5, 2011 at 1:14 PM, Greg Nelson <grourk at dropcam.com> wrote:
>> The future I'd like to see is basically what I initially expected.  That
>> is, I can add a single node to an online cluster and clients should see no
>> effects and have no need to know it's happening -- except, of course, side
>> effects like the added load on the cluster incurred by gossiping the new
>> ring state, handing off data, etc.  But if no data has actually been lost,
>> I don't believe data should ever be unavailable, even temporarily.  And
>> I'd like to be able to, as someone else mentioned, add a node, throttle
>> the handoffs, and let the data trickle over hours or even days.
>>
>> Waving hands and saying that eventually the data will make it is true in
>> principle, but in practice, if you are following a read/modify/write
>> pattern for some objects, you could easily lose data.  For example, my
>> application writes JSON arrays to certain objects, and when it wishes to
>> append something to an array, it will read, append, and write back.  If
>> that initial read returns a 404, a new empty array is created.  This is
>> normal operation.  But if that 404 is not a "normal" 404, the application
>> will happily create a new empty array, append, and write back a
>> single-element array to that key -- when there could have been a
>> 100-element array in Riak that was merely unavailable at the time, and
>> which is now effectively lost.
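
Concretely, the hazardous pattern Greg describes looks something like
this sketch (vclock handling elided; the riakc and mochijson2 calls
are illustrative):

    %% Read/append/write of a JSON array.  The problem: a
    %% handoff-induced notfound is indistinguishable from "never
    %% written", so a temporarily unavailable 100-element array is
    %% silently replaced by a single-element one on the next append.
    append(Pid, Bucket, Key, Elem) ->
        Array = case riakc_pb_socket:get(Pid, Bucket, Key) of
                    {ok, Obj} ->
                        mochijson2:decode(riakc_obj:get_value(Obj));
                    {error, notfound} ->
                        []   %% a "normal" 404... or is it?
                end,
        NewVal = iolist_to_binary(mochijson2:encode(Array ++ [Elem])),
        riakc_pb_socket:put(Pid, riakc_obj:new(Bucket, Key, NewVal)).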
>>
>> Anyhow, I do understand the importance of knowing what will happen when
>> doing something operationally like adding a node, and I understand that one
>> can't naively expect everything to just work like magic.  But the current
>> behavior is pretty poorly documented and surprising.  I don't think it was
>> even mentioned in the operations webinar!  (Ok, I'll stop beating a dead
>> horse.  :))
>>
>> On Thursday, May 5, 2011 at 12:22 PM, Alexander Sicular wrote:
>>
>> I'm really loving this thread. Generating great ideas for the way
>> things should be... in the future. It seems to me that "the ring
>> changes immediately" is actually the problem, as Ryan astutely
>> mentions. One way the future could look:
>>
>> - a new node comes online
>> - introductions are made
>> - candidate vnodes are selected for migration (<- insert pixie dust magic
>> here)
>> - the number of simultaneous migrations is configurable: fewer for
>> less interruption, more for quicker completion
>> - vnodes are migrated
>> - once migration is completed, ownership is claimed
>>
>> Selecting vnodes for migration is where the unicorn cavalry attacks the
>> dragon's den. If done right(er), the algorithm could be swappable to
>> optimize for different strategies. Don't ask me how to implement it;
>> I'm only a yellow belt in erlang-fu.
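
Nobody has the pixie dust yet, but the control loop around it might
look like this sketch, where ownership is claimed only after each
transfer finishes; every function named here is hypothetical.

    %% Hypothetical driver for the flow above: pick candidates, then
    %% migrate at most C vnodes at a time.
    rebalance(NewNode, C) ->
        Candidates = select_candidates(NewNode),   %% <- the pixie dust
        migrate(Candidates, NewNode, C).

    migrate([], _Node, _C) ->
        done;
    migrate(VNodes, Node, C) ->
        {Batch, Rest} = take(C, VNodes),
        [ok = handoff(V, Node) || V <- Batch],          %% copy data over
        [ok = claim_ownership(V, Node) || V <- Batch],  %% flip switch last
        migrate(Rest, Node, C).

    take(N, L) when length(L) =< N -> {L, []};
    take(N, L) -> lists:split(N, L).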
>>
>> Cheers,
>> Alexander
>>
>> On Thu, May 5, 2011 at 13:33, Ryan Zezeski <rzezeski at basho.com> wrote:
>>
>> John,
>> All great points.  The problem is that the ring changes immediately when
>> a node is added.  So, all of a sudden, the preflist potentially points to
>> nodes that don't have the data, and they won't have it until handoff
>> occurs.  The faster that data gets transferred, the less time your clients
>> have to hit 'notfound'.
>> However, I agree completely with what you're saying.  This is just a side
>> effect of how the system currently works.  In a perfect world we wouldn't
>> care how long handoff takes, and we would also do some sort of automatic
>> congestion control akin to TCP Reno.  The preflist would still point to
>> the "old" partitions until all data had been successfully handed off, and
>> then and only then would we flip the switch for that vnode.  I'm pretty
>> sure that's where we are heading (I say "pretty sure" because I just
>> joined the team and haven't been heavily involved in these specific talks
>> yet).
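
In other words, ownership becomes two-phase. A sketch of that routing
rule, with the ring representation and helper functions assumed rather
than taken from riak_core:

    %% Keep resolving to the old owner until the partition's handoff
    %% is marked complete; only then route to the new owner.
    owner(Partition, Ring) ->
        case handoff_complete(Partition, Ring) of
            true  -> new_owner(Partition, Ring);  %% data fully arrived
            false -> old_owner(Partition, Ring)   %% keep serving from old
        end.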
>> It's all coming down the pipe...
>> As for your specific I/O question re handoff_concurrency, you might be
>> right.  I would think it depends on hardware/platform/etc.  I was offering
>> it as a possible stopgap to minimize Greg's pain.  It's certainly a cure
>> for a symptom, not for the problem itself.
>> -Ryan
>>
>> On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <me at jdrowell.com> wrote:
>>
>> Hi Ryan, Greg,
>>
>> 2011/5/5 Ryan Zezeski <rzezeski at basho.com>
>>
>> 1. For example, riak_core has a `handoff_concurrency` setting that
>> determines how many vnodes can hand off concurrently on a given node.  By
>> default this is set to 4.  That's going to take a while with your 2048
>> vnodes and all :)
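
For reference, that knob lives in the riak_core section of
app.config; raising it would look roughly like this, though verify the
stanza against your own release before relying on it.

    %% etc/app.config -- allow more simultaneous vnode transfers per
    %% node (the default is 4).
    {riak_core, [
        {handoff_concurrency, 16}
    ]}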
>>
>> Won't that make the handoff situation potentially worse? From the thread
>> I understood that the main problem was that the cluster was shuffling too
>> much data around and thus becoming unresponsive and/or returning
>> unexpected results (like "not founds"). I'm attributing the concerns more
>> to an excessive I/O situation than to how long the handoff takes. If the
>> handoff can be made transparent (with little or no side effect), I don't
>> think most people will really care (e.g. the "fix the cluster tomorrow"
>> anecdote).
>>
>> How about using a percentage of available I/O to throttle the vnode
>> handoff concurrency? Start with 1 and monitor the node's I/O (kind of
>> like 'atop' does, collecting CPU, disk, and network metrics); if it is
>> below the expected usage, increase the vnode handoff concurrency, and
>> vice versa.
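
That feedback loop is essentially the AIMD rule behind the TCP Reno
congestion control Ryan mentioned. A sketch, with io_utilization/0 and
set_handoff_concurrency/1 standing in for real metric collection and a
real knob:

    %% Creep concurrency up while I/O is under the target; halve it
    %% when the node is saturated (additive increase, multiplicative
    %% decrease).
    throttle_loop(Concurrency, Target) ->
        timer:sleep(5000),   %% sample period
        Next = case io_utilization() of
                   U when U < Target -> Concurrency + 1;
                   _                 -> max(1, Concurrency div 2)
               end,
        set_handoff_concurrency(Next),
        throttle_loop(Next, Target).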
>>
>> I for one would be perfectly happy if the handoff took several hours (even
>> days) if we could maintain the core riak_kv characteristics intact during
>> those events. We've all seen looooong RAID rebuild times, and it's usually
>> better to just sit tight and keep the rebuild speed low (slower I/O) while
>> keeping all of the dependent systems running smoothly.
>>
>> cheers
>> -jd
>>