'not found' after join

John D. Rowell me at jdrowell.com
Thu May 5 13:10:21 EDT 2011


Hi Ryan, Greg,

2011/5/5 Ryan Zezeski <rzezeski at basho.com>

> 1. For example, riak_core has a `handoff_concurrency` setting that
> determines how many vnodes can concurrently handoff on a given node.  By
> default this is set to 4.  That's going to take a while with your 2048
> vnodes and all :)
>

Won't raising it potentially make the handoff situation worse? From the thread I
understood that the main problem was that the cluster was shuffling too much
data around and thus becoming unresponsive and/or returning unexpected
results (like "not founds"). I'm attributing the concerns more to an
excessive I/O situation than to how long the handoff takes. If the handoff
can be made transparent (no or little side effects) I don't think most
people will really care (e.g. the "fix the cluster tomorrow" anecdote).

How about using a percentage of available I/O to throttle the vnode handoff
concurrency? Start with 1 and monitor the node's I/O (kinda like 'atop'
does, collecting CPU, disk and network metrics). If it is below the expected
usage, increase the vnode handoff concurrency; if it is above, decrease it.
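
To make the idea concrete, here is a rough sketch in Python of the feedback
loop I have in mind. It is not riak_core code and does not use any riak_core
API: the function names, the target of 50% I/O utilization, and the 10%
dead band are all assumptions picked for illustration. The upper limit of 4
simply mirrors the default `handoff_concurrency` mentioned above, and the
utilization figure is assumed to come from some external monitor as a
fraction between 0 and 1.

```python
def next_handoff_concurrency(current, io_utilization,
                             target=0.5, band=0.1, lo=1, hi=4):
    """Return the handoff concurrency to use for the next interval.

    All parameters are hypothetical: `io_utilization` is a fraction in
    [0, 1] supplied by an external monitor, `target` and `band` define the
    acceptable utilization window, and `lo`/`hi` bound the result.
    """
    if not 0.0 <= io_utilization <= 1.0:
        raise ValueError("io_utilization must be between 0 and 1")
    if io_utilization < target - band:
        # I/O headroom available: allow one more concurrent handoff.
        return min(current + 1, hi)
    if io_utilization > target + band:
        # Node is too busy: back off by one concurrent handoff.
        return max(current - 1, lo)
    # Inside the dead band: leave the setting alone to avoid flapping.
    return current


def simulate(samples, start=1, **kwargs):
    """Apply the rule to a series of utilization samples, starting at 1."""
    level = start
    history = []
    for utilization in samples:
        level = next_handoff_concurrency(level, utilization, **kwargs)
        history.append(level)
    return history


if __name__ == "__main__":
    # Quiet node, then a burst of load, then quiet again.
    print(simulate([0.2, 0.2, 0.9, 0.5, 0.1]))
```

Stepping by one per interval and keeping a dead band around the target is a
deliberate choice here: it trades slower handoff for stability, which is the
same trade-off as keeping a RAID rebuild speed low.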

I for one would be perfectly happy if the handoff took several hours (even
days) if we could maintain the core riak_kv characteristics intact during
those events. We've all seen looooong RAID rebuild times, and it's usually
better to just sit tight and keep the rebuild speed low (slower I/O) while
keeping all of the dependent systems running smoothly.

cheers
-jd