Riak behavior under stress

Kirill A. Korinskiy catap+riak@catap.ru
Thu Nov 26 09:10:21 EST 2009


Hello again, Justin.

Please excuse my English; I'll try to restate my response.

At Wed, 25 Nov 2009 16:12:32 -0500,
Justin Sheehy <justin@basho.com> wrote:
> On Wed, Nov 25, 2009 at 3:29 PM, Lev Walkin <vlm@lionet.info> wrote:
> 
> > However, in a test case with N=3 and R=1, when we bring down
> > one of the three nodes, the Riak cluster returns with a timeout,
> > {error,timeout} instead of returning the answer available on the
> > two nodes which are still alive.
> 
> That is not the usual or expected response.  If you are seeing this in
> practice, I'd be interested to see more about your configuration.
> 

The configuration: 10 nodes, no data in the cluster.
The config on all nodes is as follows:

    {cluster_name, "default"}.
    {ring_state_dir, "/data/riak/priv/ringstate"}.
    {ring_creation_size, 100}.
    {gossip_interval, 60000}.
    {doorbell_port, 9000}.
    {storage_backend, riak_fs_backend}.
    {riak_fs_backend_root, "/data/riak-data"}.
    %{riak_cookie, riak_jskit_cookie}.
    {riak_heart_command, "(cd /data/riak; ./start-restart.sh /data/riak/config/riak.erlenv)"}.
    {riak_nodename, "riak"}.
    %{riak_hostname, "127.0.0.1"}.
    {riak_web_ip, "127.0.0.1"}.
    {riak_web_port, 9980}.
    {jiak_name, "jiak"}.


> > Riak's source code uses N and R values to determine the number
> > of nodes on which to store data (N) and which should be expected
> > to return an answer when asked (R). The behavior that puzzles me
> > is that it awaits (R) positive answers and (N-R)+1 negative ones
> > from the cluster.
> 
> Note that it will always send those messages to "up" nodes, meaning
> that if a node is down at the time of message sending it will not
> attempt to get a reply from it.
> 

In the open source edition, Riak has two sets of nodes to send a
request to: the Targets and the Fallbacks. If some Target nodes are
down, Riak sends the data to the other set of nodes, the Fallbacks.
Is that correct?
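
To be sure I have the idea right, here is how I picture the
selection; this is my own sketch, not Riak's actual functions or
names:

    %% My sketch of target/fallback selection -- the names are mine,
    %% not Riak's API.  Preflist is the ring-ordered node list for
    %% the key: the first N entries are the Targets, the rest are
    %% the candidates for Fallbacks.
    route(Preflist, UpNodes, N) ->
        {Targets, Rest} = lists:split(N, Preflist),
        Up   = [T || T <- Targets, lists:member(T, UpNodes)],
        Down = Targets -- Up,
        Fallbacks = lists:sublist(
                      [F || F <- Rest, lists:member(F, UpNodes)],
                      length(Down)),
        Up ++ Fallbacks.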

> >        b) Since one Riak node is unavailable, there are not 3
> >        nodes available which can confirm data unavailability,
> >        therefore it returns with an {error, timeout}.
> 
> This should only occur if the node in question actually goes down
> during (not before) the request.  In a usual case {ok,Data} will be
> returned to such a reply.
> 
> > Question: is this expected behavior? I would presume that Riak
> > should either allow N=3,R=1 requests to be satisfied even when
> > one node dies (and, ideally, when two out of three nodes die),
> > or the documentation needs to be updated to highlight the fact
> > that R=1 is unusable in practice. Could someone clarify this?
> 
> I have just verified this by setting up a three-node cluster, storing
> a document in a bucket with n_val of 3, then taking down one of the
> three nodes.  A subsequent get of that document with different
> R-values:
> 
> R=1 returned immediately with the document
> R=2 returned immediately with the document
> R=3 returned immediately with notfound, as the third replica was unavailable
> 
> In other words, the behavior you describe is neither what is expected
> nor what I see in practice.  Is your question based on a running
> cluster?  If the latter, can you elaborate on exactly how you are
> causing that behavior?  I would like to help you to resolve any
> problems you are seeing.
> 

Could you check the same case with a key for which Riak answers
{error, notfound}, i.e. a missing key?

Simple test case:

[catap@satellite] cat config/riak-A.erlenv
{cluster_name, "default"}.
{ring_state_dir, "priv/riak-A/ringstate/"}.
{ring_creation_size, 16}.
{gossip_interval, 60000}.
{doorbell_port, 9000}.
{storage_backend, riak_ets_backend}.
{riak_cookie, riak_demo_cookie}.
{riak_heart_command, "(cd /home/catap/src/riak; ./start-restart.sh /home/catap/src/riak/config/riak-A.erlenv)"}.
{riak_nodename, "riak-A"}.
{riak_hostname, "127.0.0.1"}.
{riak_web_ip, "127.0.0.1"}.
{riak_web_port, 8098}.
{jiak_name, "jiak"}.
{default_bucket_props, [{n_val,2},
                        {allow_mult,false},
                        {linkfun,{modfun, jiak_object, mapreduce_linkfun}},
                        {chash_keyfun, {riak_util, chash_std_keyfun}},
                        {old_vclock, 86400},
                        {young_vclock, 21600},
                        {big_vclock, 50},
                        {small_vclock, 10}]}.
[catap@satellite] cat config/riak-B.erlenv
{cluster_name, "default"}.
{ring_state_dir, "priv/riak-B/ringstate/"}.
{ring_creation_size, 16}.
{gossip_interval, 60000}.
{doorbell_port, 9000}.
{storage_backend, riak_ets_backend}.
{riak_cookie, riak_demo_cookie}.
{riak_heart_command, "(cd /home/catap/src/riak; ./start-restart.sh /home/catap/src/riak/config/riak-B.erlenv)"}.
{riak_nodename, "riak-B"}.
{riak_hostname, "127.0.0.1"}.
{riak_web_ip, "127.0.0.1"}.
{riak_web_port, 8098}.
{jiak_name, "jiak"}.
{default_bucket_props, [{n_val,2},
                        {allow_mult,false},
                        {linkfun,{modfun, jiak_object, mapreduce_linkfun}},
                        {chash_keyfun, {riak_util, chash_std_keyfun}},
                        {old_vclock, 86400},
                        {young_vclock, 21600},
                        {big_vclock, 50},
                        {small_vclock, 10}]}.
[catap@satellite] ./start-fresh.sh config/riak-A.erlenv
[catap@satellite] ./start-join.sh config/riak-B.erlenv riak-A@127.0.0.1
[catap@satellite] erl -pa ebin -name riak-test@127.0.0.1 -setcookie riak_cookie
Erlang R13B02 (erts-5.7.3) [source] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]

Eshell V5.7.3  (abort with ^G)
(riak-test@127.0.0.1)1> {ok, A} = riak:client_connect('riak-A@127.0.0.1').
{ok,{riak_client,'riak-A@127.0.0.1',<<1,235,57,194>>}}
(riak-test@127.0.0.1)2> A:get(<<"Table">>, <<"Key">>, 1).
{error,notfound}
(riak-test@127.0.0.1)3> rpc:call('riak-B@127.0.0.1', init, stop, []).
ok
(riak-test@127.0.0.1)4> A:get(<<"Table">>, <<"Key">>, 1).
{error,timeout}
(riak-test@127.0.0.1)6>
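
If I model the (N-R)+1 accounting quoted above, the timeout matches
my earlier guess (b): with n_val=2 and R=1, returning notfound needs
N-R+1 = 2 negative replies, but if the request still waits on the
dead riak-B, only one reply ever arrives, so the get can only time
out. A toy model of the wait loop (my illustration, not Riak's
actual code; the timeout value is arbitrary):

    %% Toy model of the get quorum wait -- my illustration, not
    %% Riak's code.  Success needs R ok replies; notfound needs
    %% N-R+1 notfound replies.  A dead node never replies, so with
    %% N=2, R=1 and one node down, a missing key can only time out.
    wait_replies(N, R, Oks, NotFounds) ->
        receive
            {r, {ok, Obj}} when Oks + 1 >= R ->
                {ok, Obj};
            {r, {ok, _Obj}} ->
                wait_replies(N, R, Oks + 1, NotFounds);
            {r, notfound} when NotFounds + 1 >= N - R + 1 ->
                {error, notfound};
            {r, notfound} ->
                wait_replies(N, R, Oks, NotFounds + 1)
        after 60000 ->
            {error, timeout}
        end.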


> > Riak's source code and documentation make references to the
> > Merkle trees used to exchange information about the hash trees.
> > The documentation and marketing material suggest that Riak can
> > automatically synchronize the data in certain conditions.
> 
> There are two ways in which Riak uses merkle trees.
> 
> The first use is to reconcile documents stored under hinted-handoff.
> If you store a document when some of the nodes are down, it will be
> stored at a node other than the "ideal" one.  When the ideal node
> comes back online, the nodes handling those documents that were stored
> in the interim will exchange merkle trees with the returning node in
> order to determine which documents to use in bringing it up to date.
> 
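
So, if I read you correctly, the exchange boils down to comparing
hashes and handing off the keys whose values differ. A flat-hash
sketch of that idea (illustrative only: real Riak uses merkle trees,
and these names are mine):

    %% Illustrative only -- a flat hash comparison where Riak really
    %% uses merkle trees.  Keys that are missing on the returning
    %% node, or whose hashes differ, are the ones needing handoff.
    stale_keys(InterimObjs, ReturningHashes) ->
        [K || {K, V} <- InterimObjs,
              case lists:keyfind(K, 1, ReturningHashes) of
                  {K, H} -> H =/= erlang:phash2(V);
                  false  -> true
              end].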

I ran a simple experiment on the 10 nodes, using the fs backend. I
put an object into the cluster, looked in /data/riak-data on every
node, and saw that the data appeared on node1, node2 and node6. I
saved a copy of the object's data file to my home directory. Next, I
killed node6 and updated the object. Looking at the nodes, I did not
see the data appear on any other node; is that expected? Next, I
compared the changed data file on node2/node3 against the copy in my
home directory; diff says the files are different. Then I rejoined
node6 and compared its data file; diff finds no difference from the
copy in my home directory, so node6 still holds the old version.
Then I did a get of the object and compared the data on node6 again;
now diff says they are different.

Accordingly, two questions:
 1) Where is the data stored while one of the "ideal" nodes is
    unavailable?
 2) Is the data on a returning node expected to be updated only when
    the object is next fetched with a get (see the sketch below)?
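
For question 2, the get that triggered the repair on node6 was
essentially the same call as in the test case above. A sketch of the
check; the node name, bucket and key are placeholders for my setup,
not real names:

    %% Sketch of the read-repair check -- 'riak@node1', the bucket
    %% and the key are placeholders; use your own cluster's values.
    {ok, C} = riak:client_connect('riak@node1').
    C:get(<<"Bucket">>, <<"Key">>, 2).
    %% afterwards node6's data file no longer matches the old copy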

-- 
wbr, Kirill



