'not found' after join
John D. Rowell
me at jdrowell.com
Thu May 5 13:10:21 EDT 2011
Hi Ryan, Greg,
2011/5/5 Ryan Zezeski <rzezeski at basho.com>
> 1. For example, riak_core has a `handoff_concurrency` setting that
> determines how many vnodes can concurrently handoff on a given node. By
> default this is set to 4. That's going to take a while with your 2048
> vnodes and all :)
Won't raising that setting make the handoff situation potentially worse? From the thread I
understood that the main problem was that the cluster was shuffling too much
data around and thus becoming unresponsive and/or returning unexpected
results (like "not found" responses). I'm attributing the concerns more to
excessive I/O than to how long the handoff takes. If the handoff can be made
transparent (few or no side effects), I don't think most people will really
care how long it runs (e.g. the "fix the cluster tomorrow" anecdote).
How about using a percentage of available I/O to throttle the vnode handoff
concurrency? Start with 1 and monitor the node's I/O (much as 'atop' does,
collecting CPU, disk and network metrics). If usage is below the expected
level, increase the vnode handoff concurrency; if it is above, decrease it.
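To make the idea a bit more concrete, here is a rough sketch in Python. It is
not Riak code: the function name `next_concurrency`, the target and slack
thresholds, the upper bound of 4, and the idea of a single "I/O utilization"
number between 0 and 1 are all assumptions made for illustration. It raises
concurrency one step at a time while I/O is below target and halves it when
I/O is above target.

```python
def next_concurrency(current, io_utilization, target=0.5, slack=0.1, lo=1, hi=4):
    """Return the handoff concurrency to use for the next interval.

    Hypothetical controller, not part of riak_core.
    io_utilization is an assumed 0.0-1.0 figure sampled from the node.
    """
    if io_utilization > target + slack:
        # Node is busier than we want: back off quickly, never below `lo`.
        return max(lo, current // 2)
    if io_utilization < target - slack:
        # Node has headroom: add one more concurrent handoff, up to `hi`.
        return min(hi, current + 1)
    # Within the tolerated band around the target: leave it alone.
    return current


# Simulated I/O samples, one per monitoring interval (made-up numbers).
samples = [0.2, 0.3, 0.35, 0.7, 0.9, 0.45, 0.3]

concurrency = 1  # start with 1, as suggested above
history = []
for utilization in samples:
    concurrency = next_concurrency(concurrency, utilization)
    history.append(concurrency)

print(history)  # [2, 3, 4, 2, 1, 1, 2]
```

Backing off faster than ramping up is a deliberate choice here: it favours
keeping the node responsive over finishing the handoff sooner.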
I for one would be perfectly happy if the handoff took several hours (even
days) if we could maintain the core riak_kv characteristics intact during
those events. We've all seen looooong RAID rebuild times, and it's usually
better to just sit tight and keep the rebuild speed low (slower I/O) while
keeping all of the dependent systems running smoothly.