Riak Recap for May 18 - 19
grourk at dropcam.com
Sat May 21 21:44:58 EDT 2011
You mentioned you're planning on growing to around 100 nodes. I'm curious what ring_creation_size you used? Also, how much data capacity per node are you planning on?
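The reason ring_creation_size matters at that scale is simple arithmetic: the ring has a fixed number of partitions, so the average number each node owns shrinks as nodes are added. A minimal sketch follows; the ring sizes are illustrative values, not a recommendation, and `partitions_per_node` is a hypothetical helper.

```python
def partitions_per_node(ring_creation_size, node_count):
    """Average number of partitions each node would own."""
    return ring_creation_size / node_count


if __name__ == "__main__":
    # Illustrative ring sizes only; the node count of 100 comes from the thread.
    for ring_size in (64, 256, 1024):
        share = partitions_per_node(ring_size, 100)
        print(f"ring_creation_size={ring_size}: {share:.2f} partitions per node")
```

With fewer partitions than nodes, some nodes would own nothing at all, which is why the question is worth settling before the cluster grows.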
I've been spending a lot of time lately working through what happens when a node joins. There are a couple of big issues you will want to look out for in addition to what you've discovered, which boil down to essentially:
1. After a node joins, the data on partitions which change nodes in the new ring will be unavailable until the handoffs are complete. Currently this comes back as a 404 that's indistinguishable from a "true" 404.
2. At certain points in the progression of ring states from 1 to 100 nodes, a LOT more partitions move around than you'd expect from a consistent hashing scheme.
#2 obviously exacerbates #1, and if -- like us -- you plan to have a lot of data in the cluster, having most of it move around after a node joins is unrealistic.
I'm still trying to work through exactly what's happening with #2, but it seems like once you have more nodes than target_n_val, when adding a new node you usually get the consistent hashing property you want: that the new node takes some partitions from each of the other nodes, and that's it. But every once in a while (and really, not all that rarely), shit hits the fan and it decides to re-balance and completely change the ring. >95% of partitions will move, in certain cases!
I have some Erlang console code I've been using with riak_core to simulate our cluster, to get a deeper understanding of the rings at each phase. I might be able to clean that up and put it into a script to share.
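As a rough illustration of the difference between the two behaviours described above, here is a toy model in Python. It is not riak_core's claim algorithm and ignores target_n_val and replica spacing entirely; `striped_ring`, `join_by_stealing` and `moved_fraction` are hypothetical helpers, and the ring size of 1024 is an assumption. It compares a new node taking only its fair share from existing owners against the whole ring being re-striped across all nodes.

```python
from collections import Counter


def striped_ring(num_partitions, num_nodes):
    """Toy ring: partition i is owned by node i % num_nodes."""
    return [i % num_nodes for i in range(num_partitions)]


def join_by_stealing(ring, new_node):
    """Give new_node its fair share, one partition at a time,
    always taking from whichever existing node currently owns the most."""
    ring = list(ring)
    fair_share = len(ring) // (len(set(ring)) + 1)
    for _ in range(fair_share):
        counts = Counter(owner for owner in ring if owner != new_node)
        donor = counts.most_common(1)[0][0]
        ring[ring.index(donor)] = new_node
    return ring


def moved_fraction(old_ring, new_ring):
    """Fraction of partitions whose owner differs between two rings."""
    moved = sum(1 for a, b in zip(old_ring, new_ring) if a != b)
    return moved / len(old_ring)


if __name__ == "__main__":
    before = striped_ring(1024, 10)
    stolen = join_by_stealing(before, new_node=10)
    restriped = striped_ring(1024, 11)
    print(f"steal fair share: {moved_fraction(before, stolen):.1%} moved")
    print(f"full re-stripe:   {moved_fraction(before, restriped):.1%} moved")
```

In this toy model the first case moves roughly one eleventh of the partitions, while the re-stripe moves about nine in ten, which is the same order of magnitude as the full re-balance described above.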
On Saturday, May 21, 2011 at 9:31 AM, Anthony Molinaro wrote:
As I asked this question I thought I would pipe in with my experience (comments inline).
> On May 20, 2011, at 3:17 PM, Mark Phillips <mark at basho.com> wrote:
> > 4) Q --- Let's say I have several new nodes to add, is the recommended
> > procedure to add them one at a time and wait for all transfers to
> > finish, or can you actually add several?
> > A --- The current recommended procedure is to add one node at a time
> > and wait for the partition transfers to finish before proceeding to
> > the next node addition.
> I found that adding them one at a time would have taken about 4 hours per node, and as I was doubling the size I felt there would be less shuffling of data if I added them all at once (as suggested by aphyr on IRC). This proved to be exactly correct, as I was able to add 4 new nodes in about 4 hours instead of 16.
> > Specifically:
> > * Use the "riak-admin join" command to kick off the cluster expansion
> > * Run "riak-admin transfers" periodically to keep an eye on the nodes
> > awaiting or passing off partitions (this may take a bit to complete);
> > an alternate (and less expensive) way to keep an eye on this is to
> > just watch the logs.
> Running "riak-admin transfers" hardly ever works; I would say it times out 95% of the time when attempting to add a new node. I don't know why this is, and I hope it is fixed someday, but I would recommend never running it.
> Unfortunately, grepping logs is also tricky, as you have to deal with lots of false positives if you did something like I did, where you had a bunch of nodes crash, then brought them up, only to realize you need to add capacity, so you add nodes. But now the logs on the first nodes have messages for transfers from both the restart and the node addition.
> > * When "riak-admin ringready" prints "TRUE ..." to let you know that
> > all nodes agree on the ring state, you're good to go.
> This actually returned true before transfers were complete, IIRC, so I think this may not quite be right.
> > (It's worth noting that making this process smoother and more fluid
> > is high on our list of priorities.)
> Good to know. I look forward to this, as I expect to be growing my cluster to close to 100 nodes by the end of this year.
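The ringready step quoted above could be automated along these lines. This is a minimal sketch, assuming only what the thread states: that "riak-admin ringready" prints a line beginning with "TRUE" once all nodes agree on the ring. The function names, polling interval and retry limit are hypothetical. As noted in the reply above, ringready may report TRUE before transfers are complete, so treat this as a necessary check rather than proof that handoff has finished.

```python
import subprocess
import time


def _run_ringready():
    """Run `riak-admin ringready` and return whatever it printed to stdout."""
    result = subprocess.run(
        ["riak-admin", "ringready"], capture_output=True, text=True
    )
    return result.stdout


def ring_is_ready(run=_run_ringready):
    """True if the command output starts with TRUE (per the quoted procedure)."""
    return run().strip().upper().startswith("TRUE")


def wait_for_ring(run=_run_ringready, interval_s=30, max_checks=120,
                  sleep=time.sleep):
    """Poll until the ring is reported ready or max_checks is exhausted.

    `run` and `sleep` are injectable so the loop can be exercised without a
    cluster; the defaults are arbitrary values chosen for illustration.
    """
    for _ in range(max_checks):
        if ring_is_ready(run):
            return True
        sleep(interval_s)
    return False


if __name__ == "__main__":
    print("ring ready" if wait_for_ring() else "gave up waiting")
```

Passing a stub for `run` makes it possible to try the polling logic on a machine where riak-admin is not installed.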