Riak behavior under stress

Justin Sheehy justin at basho.com
Wed Nov 25 16:12:32 EST 2009


Hello Lev,

On Wed, Nov 25, 2009 at 3:29 PM, Lev Walkin <vlm at lionet.info> wrote:

> A typical Riak cluster is [presumably] configured with N=3. Sometimes
> it is sufficient for the Riak data reader to have a single
> response from the Riak cluster (R=1). An intuitive understanding
> being that if there's at least one alive node with the necessary
> ring segment, that node's answer should be enough to satisfy the
> reader.

This is correct.

> However, in a test case with N=3 and R=1, when we bring down one
> of the three nodes, the Riak cluster returns with a timeout,
> {error,timeout} instead of returning the answer available on the
> two nodes which are still alive.

That is not the usual or expected response.  If you are seeing this in
practice, I'd be interested to learn more about your configuration.

> Riak's source code uses N and R values to determine the number
> of nodes on which to store data (N) and which should be expected
> to return an answer when asked (R). The behavior that puzzles me
> is that it awaits (R) positive answers and (N-R)+1 negative ones
> from the cluster.

Note that it will always send those messages to "up" nodes, meaning
that if a node is down at the time the messages are sent, it will not
attempt to get a reply from it.

>        b) Since one Riak node is unavailable, there are not 3
>        nodes available which can confirm data unavailability,
>        therefore it returns with an {error, timeout}.

This should only occur if the node in question actually goes down
during (not before) the request.  In the usual case, {ok,Data} will be
returned for such a request.
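
To make that concrete, here is a minimal sketch of the accounting
described above.  It is only the arithmetic, not the actual Riak code,
and the module/function names are made up:

    %% Sketch of the read coordinator's counting: requests go only to
    %% vnodes believed to be up; the get succeeds as soon as R positive
    %% replies arrive, and is abandoned once (N - R) + 1 negative
    %% replies make R positive ones impossible.
    -module(quorum_sketch).
    -export([await/4]).

    %% await(N, R, OkCount, FailCount) -> ok | notfound | waiting
    await(_N, R, OkCount, _FailCount) when OkCount >= R ->
        ok;                          % enough replicas have answered
    await(N, R, _OkCount, FailCount) when FailCount >= (N - R) + 1 ->
        notfound;                    % R positives can no longer arrive
    await(_N, _R, _OkCount, _FailCount) ->
        waiting.                     % keep waiting (or eventually time out)

With N=3 and R=1 a single positive reply is enough to answer the
client, so a timeout should only show up when replies stop arriving
altogether, as when a node dies mid-request.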

> Question: is this expected behavior? I would presume that Riak
> should either allow N=3,R=1 requests to be satisfied even when
> one node dies (and, ideally, when two out of three nodes die),
> or the documentation needs to be updated to highlight the fact
> that R=1 is unusable in practice. Could someone clarify this?

I have just verified this by setting up a three-node cluster, storing
a document in a bucket with n_val of 3, then taking down one of the
three nodes.  A subsequent get of that document with different
R-values:

R=1 returned immediately with the document
R=2 returned immediately with the document
R=3 returned immediately with notfound, as the third replica was unavailable

In other words, the behavior you describe is neither what is expected
nor what I see in practice.  Is your question hypothetical, or is it
based on observation of a running cluster?  If the latter, can you
elaborate on exactly how you are causing that behavior?  I would like
to help you resolve any problems you are seeing.
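
If it helps in reproducing this, the same check can be scripted from an
attached Erlang shell.  This is a hedged sketch only: the node name,
bucket, and key are placeholders, and the client calls are written from
memory of the native Erlang client, so check the docs for your release:

    %% start a shell that shares the cluster's cookie, then:
    {ok, C} = riak:client_connect('riak@node1.example.com'),
    C:get(<<"bucket">>, <<"key">>, 1),   %% R=1
    C:get(<<"bucket">>, <<"key">>, 2),   %% R=2
    C:get(<<"bucket">>, <<"key">>, 3).   %% R=3

Each call returns {ok, Object}, {error, notfound}, or {error, timeout},
so you can see exactly which R value changes the outcome.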

> Riak's source code and documentation make references to
> Merkle trees, used to exchange information about the hash trees.
> The documentation and marketing material suggests that Riak can
> automatically synchronize the data in certain conditions.

There are two ways in which Riak uses merkle trees.

The first use is to reconcile documents stored under hinted-handoff.
If you store a document when some of the nodes are down, it will be
stored at a node other than the "ideal" one.  When the ideal node
comes back online, the nodes handling those documents that were stored
in the interim will exchange merkle trees with the returning node in
order to determine which documents to use in bringing it up to date.

The second use is in the Enterprise DS version of Riak, which also
uses merkle trees to assist in cluster-to-cluster replication.

> One might assume that the use of Merkle trees in Riak allows it
> to recover from a node failure by bringing up data from the nodes
> which are still alive and known to contain the data for the relevant
> segments of the key space.

One might assume that.  However, that would require much more than
just the use of merkle trees to determine which data needs to be
copied to such nodes, and would be very expensive.

> This makes the following realistic condition troublesome: suppose
> there are M nodes on Amazon EC2, working as a Riak cluster.
> Then it'd be just enough for the M nodes to be rebooted at different,
> and potentially very distant points in time (and start afresh, with no
> saved state) to lose all the data accumulated. That is, provided that
> there are no intermittent requests to the particular key/value pairs that
> happen to reside on the nodes which are brought back.

If you want your data to survive sequential outages of N or more
nodes, you should be running your cluster with a persistent backend,
not an in-memory backend or one that otherwise loses data on restart.
If you are running in such a precarious environment, read-repair will
help somewhat, but there is no global, continuous scan for missing
data such as the one you describe.

The short answer is that if you want persistent data, you should run
your datastore (Riak or otherwise) on computers with persistent local
storage.  If you need to be on EC2 this could be achieved with EBS,
for instance.
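
For reference, the backend is selected by a single line in the node's
config file.  The key and module names below are from memory, so treat
this as a sketch and check the example config shipped with your
release:

    %% in-memory backend: contents are lost when the node restarts
    {storage_backend, riak_ets_backend}.

    %% on-disk backend: contents survive a restart, provided the
    %% underlying storage (an EBS volume on EC2, for instance) does too
    {storage_backend, riak_dets_backend}.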

> Merkle trees, when used in the best context for this situation, could
> have solved the problem by automatically replicating the content onto
> the new nodes as they join in.

This is not quite right, as vnodes are not duplicates of each other
but rather sliding overlaps -- and so a multi-node scan and not just a
merkle tree would be needed to find the data that you might want to
store on a given node.
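
A small sketch of why that is, using only the consistent-hashing
arithmetic (made-up module name; Q is the number of ring partitions and
erlang:phash2 stands in for the real hash): a key's N replicas live on
the N partitions that follow its own position on the ring.

    -module(preflist_sketch).
    -export([preflist/3]).

    %% preflist(Key, N, Q) -> the indexes of the Q-partition ring that
    %% hold Key's N replicas.
    preflist(Key, N, Q) ->
        First = erlang:phash2(Key, Q),           % partition owning Key
        [(First + I) rem Q || I <- lists:seq(0, N - 1)].

Two keys that land on the same physical node will generally have
preference lists pointing at different sets of partitions, so no node
is a plain copy of another, hence the need for a wider scan.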

> Suppose there's a cluster of N Riak nodes. What is the best way to
> expand that into a cluster of N*2 Riak nodes?

Simply have the new nodes join the cluster the same way the original
ones did (using start-join, for instance).  If you wait a few seconds
between starts, you will reduce extra gossip chatter for ring state
resolution.  The data in question will rebalance over the next few
minutes.  As an aside, we will be pushing a change soon (probably
right after the holiday weekend) which has already been locally tested
and makes this rebalancing process much more efficient.

I hope that this helps to answer your questions and clarify how Riak works.

-Justin


