Riak Recap for May 18 - 19

Anthony Molinaro anthonym at alumni.caltech.edu
Sun May 22 02:03:57 EDT 2011

Comments inline.

On Sat, May 21, 2011 at 06:44:58PM -0700, Greg Nelson wrote:
> You mentioned you're planning on growing to around 100 nodes.
> I'm curious what ring_creation_size you used?

I started with 1024 and figure I can probably get to around 100 nodes before
erlang distribution protocol becomes an issue.  At that point I envision us
investing some time to modify or augment riak_core to work with a different
distribution mechanism.  One of our developers actually did his master's
research in this area, so may be able to help on that end.  Also, 100 is
mostly an estimate the Cassandra cluster we hope to replace currently has
close to that many nodes in a couple data centers, and we'll be storing
more data than we currently do in Cassandra.

> Also, how much data capacity per node are you planning on?

Still working on that, although I expect it to actually not be that much
(probably a few hundred GB per node), most of the data we keep is small
in terms of keys and values.  Most of our keys will be 16 byte binaries
and most values are in the 16-200 byte range (when building systems of
the scale I need to build every byte counts, I fully expect to have to
replace bitcask at some point with something with less overhead, luckily
we've already done this once, we have a riak_core application with a
custom backend).  While our data size is smallish we tend to query it
a lot, with our current cluster looks like we average about 10K/sec gets
across the day for each of 8 nodes with about 6 billion gets per day,
the puts about only on the order of a few hundred a second, so many fewer.
I would expect that to grow at least 10x over the next year given the current
trajectory of the company.  Also, that only represents one of the buckets
that I currently use.  In the future we'll have quite at least 3 more
with similar amount of accesses.

> I've been spending a lot of time lately working through what happens when
> a node joins. There are a couple of big issues you will want to look out
> for in addition to what you've discovered, which boil down to essentially:
> After a node joins, the data on partitions which change nodes in the new
> ring will be unavailable until the handoffs are complete. Currently this
> comes back as a 404 that's indistinguishable from a "true" 404.

I remember reading the mails about that issue, but I thought it was possible
to get answers if your r value was high enough?  Maybe I'm misremembering, I
will go back and check it out.

> At certain points in the progression of ring states from 1 to 100 nodes,
> a LOT more partitions move around than you'd expect from a consistent
> hashing scheme.
> #2 obviously exacerbates #1, and if -- like us -- you plan to have a lot
> of data in the cluster, having most of it move around after a node joins
> is unrealistic.
> I'm still trying to work through exactly what's happening with #2, but
> it seems like once you have more nodes than target_n_val, when adding
> a new node you usually get the consistent hashing property you want:
> that the new node takes some partitions from each of the other nodes,
> and that's it. But every once in a while (and really, not all that
> rarely), shit hits the fan and it decides to re-balance and completely
> change the ring. >95% of partitions will move, in certain cases!

That's sort of unfortunate, and I would assume is some sort of bug, I can't
imagine why you would ever want that behavior.

> I have some erlang console code I've been using with riak_core to
> simulate our cluster, to get a deeper understanding of the rings
> at each phase. I might be able to clean that up and put it into a
> script to share.

Sounds like the sort of thing that might be useful for testing riak_core,
especially if you can manage to get the problem with excess data movement
to be reproduceable with that code.


Anthony Molinaro                           <anthonym at alumni.caltech.edu>

More information about the riak-users mailing list