'not found' after join
btilly at gmail.com
Thu May 5 15:03:57 EDT 2011
OK, I've been sitting here watching this thread, and I'd really like
to understand what happens when a node leaves/joins. I can't find any
really good documents describing it. Based on the conversation as
I've followed it, here is a detailed description of my garbled
misunderstanding. Please correct this.
Suppose that we have a ring with 9 nodes, named 1 through 9. A new node joins.
1. The new node is placed as node 10.
2. The preflist for everyone gets updated. Modulo roundoff:
- 10% of node 1's vnodes are now pointed at node 2.
- 20% of node 2's vnodes are now pointed at node 3.
- 30% of node 3's vnodes are now pointed at node 4.
- 40% of node 4's vnodes are now pointed at node 5.
- 50% of node 5's vnodes are now pointed at node 6.
- 60% of node 6's vnodes are now pointed at node 7.
- 70% of node 7's vnodes are now pointed at node 8.
- 80% of node 8's vnodes are now pointed at node 9.
- 90% of node 9's vnodes are now pointed at node 10.
All told about half of the vnodes are pointed to the wrong place.
If you've replicated data 3 times, then keys whose first copy
appears in the ranges (1/10, 1/9), (1/5, 2/9), (3/10, 1/3) have
all 3 vnodes pointed to the wrong place. That's 1/15 of the
total data set.
3. With default settings, if 2 of your vnodes are on nodes that do
not know about your data, then you will be told that data does
not exist. With other settings that can be chosen (notfound ok,
basic_quorum false) you can make data that any node knows about
findable. Data that have all 3 vnodes in the wrong place, is
4. Nodes are now responding for vnodes that they don't have data
for, and start asking for vnodes that have not been
properly handed off. They will ask for, by default, a maximum
of 4 vnodes at a time. And they cannot ask for a vnode until
all activity on that vnode has stopped for a minimum of, by
default, 60 seconds.
5. Once all vnodes are handed over, all data is available and no
data was actually lost.
Please correct. Or if there is a document somewhere that is likely to
clear up some of my misunderstandings, point me there so that I can
understand better what happens. (I'm pretty sure that my
understanding has to be incomplete, because I know that target_n_val
exists, and it looks like it should be involved somehow.)
On Thu, May 5, 2011 at 10:33 AM, Ryan Zezeski <rzezeski at basho.com> wrote:
> All great points. The problem is that the ring changes immediately when a
> node is added. So now, all the sudden, the preflist is potentially pointing
> to nodes that don't have the data and they won't have that data until
> handoff occurs. The faster that data gets transferred, the less time your
> clients have to hit 'notfound'.
> However, I agree completely with what you're saying. This is just a side
> effect of how the system currently works. In a perfect world we wouldn't
> care how long handoff takes and we would also do some sort of automatic
> congestion control akin to TCP Reno or something. The preflist would still
> point to the "old" partitions until all data has been successfully handed
> off, and then and only then would we flip the switch for that vnode. I'm
> pretty sure that's where we are heading (I say "pretty sure" b/c I just
> joined the team and haven't been heavily involved in these specific talks
> It's all coming down the pipe...
> As for your specific I/O question re handoff_concurrecy, you might be right.
> I would think it depends on hardware/platform/etc. I was offering it as a
> possible stopgap to minimize Greg's pain. It's certainly a cure to a
> symptom, not the problem itself.
> On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <me at jdrowell.com> wrote:
>> Hi Ryan, Greg,
>> 2011/5/5 Ryan Zezeski <rzezeski at basho.com>
>>> 1. For example, riak_core has a `handoff_concurrency` setting that
>>> determines how many vnodes can concurrently handoff on a given node. By
>>> default this is set to 4. That's going to take a while with your 2048
>>> vnodes and all :)
>> Won't that make the handoff situation potentially worse? From the thread I
>> understood that the main problem was that the cluster was shuffling too much
>> data around and thus becoming unresponsive and/or returning unexpected
>> results (like "not founds"). I'm attributing the concerns more to an
>> excessive I/O situation than to how long the handoff takes. If the handoff
>> can be made transparent (no or little side effects) I don't think most
>> people will really care (e.g. the "fix the cluster tomorrow" anecdote).
>> How about using a percentage of available I/O to throttle the vnode
>> handoff concurrency? Start with 1, and monitor the node's I/O (kinda like
>> 'atop' does, collection CPU, disk and network metrics), if it is below the
>> expected usage, then increase the vnode handoff concurrency, and vice-versa.
>> I for one would be perfectly happy if the handoff took several hours (even
>> days) if we could maintain the core riak_kv characteristics intact during
>> those events. We've all seen looooong RAID rebuild times, and it's usually
>> better to just sit tight and keep the rebuild speed low (slower I/O) while
>> keeping all of the dependent systems running smoothly.
> riak-users mailing list
> riak-users at lists.basho.com
More information about the riak-users