Riak behavior under stress

Lev Walkin vlm at lionet.info
Wed Nov 25 15:29:53 EST 2009


Hi,


I have a few questions regarding Riak's behavior under stress, in
realistic conditions.


First question.

A typical Riak cluster is [presumably] configured with N=3.
Sometimes it is sufficient for the Riak data reader to have a
single response from the Riak cluster (R=1). The intuitive
understanding is that if there is at least one alive node holding
the necessary ring segment, that node's answer should be enough
to satisfy the reader.

However, in a test case with N=3 and R=1, when we bring down one
of the three nodes, the Riak cluster returns {error, timeout}
instead of returning the answer available on the two nodes which
are still alive.

Riak's source code uses the N and R values to determine the number
of nodes on which to store data (N) and the number of nodes which
are expected to return an answer when asked (R). The behavior that
puzzles me is that it waits for either (R) positive answers or
(N-R)+1 negative ones from the cluster. That is, when
C:get(<<"Table">>, <<"Key">>, 1) is issued, then
	a) Riak will wait for 1 positive answer or 3 ((3-1)+1)
	negative answers (the ones which tell the caller that
	no data was found for the key);
	b) since one Riak node is unavailable, there are not 3
	nodes available to confirm the data's unavailability,
	so the call returns {error, timeout}.
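
For reference, here is roughly what the test looks like (a minimal
sketch; the node name, bucket and key are placeholders, and it
assumes the usual riak:client_connect/1 local client):

	%% With all three nodes up, this returns {ok, Object}.
	{ok, C} = riak:client_connect('riak@host1.example.com'),
	{ok, _Obj} = C:get(<<"Table">>, <<"Key">>, 1),

	%% After stopping one of the three nodes, the same call
	%% comes back as {error, timeout} instead of being answered
	%% from the two replicas that are still reachable.
	{error, timeout} = C:get(<<"Table">>, <<"Key">>, 1).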

Question: is this the expected behavior? I would presume that
either Riak should allow N=3, R=1 requests to be satisfied even
when one node dies (and, ideally, when two out of three nodes
die), or the documentation should be updated to highlight the
fact that R=1 is unusable in practice. Could someone clarify this?



Second question.

Riak's source code and documentation make references to Merkle
trees, which are used to exchange information about stored data
between nodes. The documentation and marketing material suggest
that Riak can automatically synchronize data under certain
conditions.

One might assume that the use of Merkle trees in Riak allows it
to recover from a node failure by copying data from the nodes
which are still alive and known to contain the data for the
relevant segments of the key space. However, experiments show
that Riak does not replicate data to the other, available nodes
when some node is brought down. Nor does Riak automatically bring
data back up to date when a node reboots or otherwise becomes
temporarily unavailable. The best Riak does is make sure a node
gets the latest copy of a particular key/value pair if that
key/value pair gets requested explicitly (read repair).
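
As far as I can tell, the only workaround today is to touch every
key explicitly and rely on that read repair, roughly like this (a
rough sketch; it assumes the client exposes list_keys/1 and that a
plain get with a quorum R is enough to trigger the repair):

	%% Force read repair across a whole bucket by reading every
	%% key once. list_keys/1 and the R value used here are my
	%% assumptions, not a documented procedure.
	{ok, C} = riak:client_connect('riak@host1.example.com'),
	{ok, Keys} = C:list_keys(<<"Table">>),
	[C:get(<<"Table">>, K, 2) || K <- Keys].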

This makes the following realistic scenario troublesome: suppose
there are M nodes on Amazon EC2, working as a Riak cluster. It
would then be enough for the M nodes to be rebooted at different,
potentially very distant points in time (each starting afresh,
with no saved state) to lose all the accumulated data. That is,
provided there are no intermittent requests for the particular
key/value pairs that happen to reside on the nodes which are
brought back.

Merkle trees, used to their full potential in this situation,
could have solved the problem by automatically replicating the
content onto new nodes as they join. But it seems that Merkle
trees aren't used that much in Riak, at least not in that
capacity. Is this a feature available in the EnterpriseDS
implementation?



Third question.

Suppose there's a cluster of N Riak nodes. What is the best way to
expand that into a cluster of N*2 Riak nodes?



-- 
vlm



