Riak behavior

Justin Sheehy justin at basho.com
Tue Dec 1 12:45:19 EST 2009


Hi Kirill,

I hope that I am able to help you with the issues you describe.

On Mon, Nov 30, 2009 at 4:24 PM, Kirill A. Korinskiy
<catap+riak at catap.ru> wrote:

> I have a test Riak cluster of 10 nodes. For storage backend I use
> riak_fs_backend.

I would start by suggesting a different backend.  Of the open source
backends currently available, riak_dets_backend is the best choice
for many people.  The riak_fs_backend is really intended as a first
demonstration of how to write a backend, and is also useful for
certain testing scenarios, but it is not the right thing for
production use.  We should document this fact about riak_fs_backend
better, and I will make sure that we do so.
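
For example, with the erlenv-style configuration files used by the
0.x releases, switching is a small change along these lines (a sketch
only; the exact key names may differ in your release, so check the
shipped example config):

{storage_backend, riak_dets_backend}.     %% was: riak_fs_backend
{riak_dets_backend_root, "data/dets"}.    %% directory for the dets files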

On to each of your issues...

> 1) When riak_fs_backend is in use, data is not written atomically
> to the file system. That is, a file is written directly to its
> final location, which can leave partially written data behind if
> the system crashes.  Consequently, after a restart, the file system
> will appear inconsistent.

Yes, this is a large part of what I meant above about the fs backend.
The other big downside of fs_backend is that it is quite slow when it
contains a large amount of data.  I recommend switching backends.
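
Purely for illustration (and not what riak_fs_backend does, which is
exactly the problem), the usual crash-safe pattern is to write to a
temporary file and then rename it into place, since rename is atomic
on POSIX filesystems:

atomic_write(Path, Bin) ->
    %% Write the new contents beside the target, then atomically
    %% swap the file in; readers see the old file or the new file,
    %% never a partially written one.
    TmpPath = Path ++ ".tmp",
    ok = file:write_file(TmpPath, Bin),
    ok = file:rename(TmpPath, Path).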

> 2) Under active use, Riak very quickly becomes blocked on IO. If
> you add +A 32 to erl's command line options, it gets better. Have
> you tried riak_fs_backend in a high-load setting? Do you have any
> additional recommendations?

We don't use the fs backend in any high load settings, but the dets
backend is currently in use at some customer locations that see a
reasonable amount of load and it performs satisfactorily for them.

> 3) riak_vnode_sidekick changes its state from not_home to home.
> Under what conditions does this happen?

The notion there is that the sidekick process was checking whether
the content stored for a given partition on a given node is in a
partition that the node actually owns ("home") or whether it should
instead hand that content off to a different node ("not home").

This separate sidekick process was removed recently, as riak_vnode.erl
was changed to an FSM that encompasses this check and therefore no
longer needs a separate process.  In the tip of the bitbucket
repository you can see this change.
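
The check itself is conceptually tiny.  As a rough sketch only (the
module and function names here are assumptions about the internals,
not the exact API):

home_or_not(Ring, Idx) ->
    %% A partition's data is "home" when the local node owns that
    %% index in the current ring.
    case riak_ring:index_owner(Ring, Idx) =:= node() of
        true  -> home;
        false -> not_home
    end.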

> 4) If I understand the idea correctly, before you insert data into
> a cluster, you need to make sure that no such data already exists,
> or else update the existing data.

That is the right general practice, yes.  You do not strictly have to,
but if you do not adhere to that pattern you will create "sibling"
objects that subsequently need to be merged.
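
In client terms that pattern is a read-modify-write.  A sketch using
the native client interface (the same C:get/3 shown later in this
message; the quorum values are arbitrary and error handling is
elided):

update_or_insert(C, Bucket, Key, NewVal) ->
    case C:get(Bucket, Key, 2) of
        {ok, Obj} ->
            %% The key exists: update the fetched object so its
            %% vector clock is carried forward and no sibling is
            %% created.
            C:put(riak_object:update_value(Obj, NewVal), 2);
        {error, notfound} ->
            %% The key does not exist: store a fresh object.
            C:put(riak_object:new(Bucket, Key, NewVal), 2)
    end.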

> But for Riak to return the {error, notfound} answer, it awaits
> such a response from N-R+1 nodes.

If you meant that as (N-R)+1, then yes.  Riak sends N messages to
vnodes and will reply to the client as soon as either R "ok" messages
are received, a timeout occurs, or enough "error" or "notfound"
messages (here is the (N-R)+1 part) are received that it is
impossible for R "ok" messages to ever arrive.
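
In other words, the reply decision is just counting.  A minimal
sketch of the rule (not Riak's actual FSM code):

%% Reply ok after R successes; reply notfound as soon as (N-R)+1
%% failures make R successes impossible; otherwise keep waiting.
decide(_N, R, Oks, _Fails) when Oks >= R          -> reply_ok;
decide(N, R, _Oks, Fails) when Fails >= N - R + 1 -> reply_notfound;
decide(_N, _R, _Oks, _Fails)                      -> keep_waiting.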

However, this does not necessarily block on that many physical nodes.
In cases like your test (where fewer than N physical nodes are running)
some of the vnodes in question will exist on the same physical node.
Thus, Riak does not block on any particular number of physical nodes
when it can already tell through reachability failure that it could
not possibly hear back from that many hosts.

> If we have N=3, R=1 and one
> of the nodes goes offline, it results in {error, timeout}. What is the
> expected behavior?

This is not the expected behavior, and is also not what I observe.  I
am unable to reproduce the test case you describe.  Your exact
supplied configuration files do not work, as they both specify the
same web port on localhost -- this will cause one of those instances
to crash.  With a different set of configuration files, I can
demonstrate after starting riak1, riak2, and riak3 locally:

(k at 127.0.0.1)1> {ok,C} = riak:client_connect('riak1 at 127.0.0.1').
{ok,{riak_client,'riak1 at 127.0.0.1',<<3,12,131,114>>}}
(k at 127.0.0.1)2> C:get(<<"b">>,<<"k1">>,1).
{error,notfound}
(k at 127.0.0.1)3> rpc:call('riak3 at 127.0.0.1',init,stop,[]).
ok
(k at 127.0.0.1)4> C:get(<<"b">>,<<"k1">>,1).
{error,notfound}
(k at 127.0.0.1)5> rpc:call('riak2 at 127.0.0.1',init,stop,[]).
ok
(k at 127.0.0.1)6> C:get(<<"b">>,<<"k1">>,1).
{error,notfound}

None of these requests timed out, even when 2/3 of the nodes were shut down.

> 5) I started a simple experiment on the 10 nodes, using the fs
> backend.

I am not sure what was going on in your experiment, but we have
regularly verified that this variety of hinted handoff will in fact
store a replica of the updated document on another node and will also
(after some delay following the ideal node rejoining) transfer that
document to the ideal node.

>  5.1) Where the data is getting saved when one of the "ideal" nodes is
>  not available?

Typically, on the next vnode on a unique node along the ring after the
ideal ones.  In the case where fewer than N physical nodes are
present, it will not be a unique node.
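
As a rough illustration (this is not the actual preference-list
code), you can think of the choice as walking the ring-ordered list
of {Index, Node} candidates and keeping the first N whose nodes are
reachable:

first_reachable(Candidates, UpNodes, N) ->
    %% Candidates is the ring-ordered list of {Index, Node} pairs,
    %% ideal owners first, their successors after them.
    Up = [C || {_Idx, Node} = C <- Candidates,
               lists:member(Node, UpNodes)],
    lists:sublist(Up, N).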

>  5.2) According to the experiments, the data gets updated only when
>  it is accessed directly via the API. The data folders are not
>  synchronized automatically when a previously-down "ideal" node
>  comes back up.

I'm not sure what to say without having been involved in that
experiment, but we do observe that data updated while an ideal node
is unreachable is correctly stored on a fallback node and later
transferred.  This is accomplished via the vnode-to-vnode merkle
exchange a little while after the ideal node becomes visible again.

> 6) In riak_vnode_sidekick there are two things that will produce a
> very high load on the nodes at unexpected intervals. I'm referring
> to gen_server2:call(VNode, list, 60000) and make_merk(VNode, Idx,
> ObjList).  How do you control this load?

These calls occur when a node is in the process of handing off data
to the ideal node for that data.  The documents in question are those
that were stored on the fallback node during an outage.  These calls
are not executed over the entire dataset in the cluster, but only
over the subset of a given partition that changed during a node
outage.  This has been further improved in the newer vnode
implementation, which does away with the sidekick.
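
The merkle exchange mentioned above is what keeps the transfer cost
proportional to the difference rather than to the whole partition.
A sketch of the idea using the merkerl module that ships with Riak
(treat the exact function names here as assumptions):

changed_keys(LocalKeyHashes, RemoteTree) ->
    %% Build a merkle tree over this side's {Key, Hash} pairs and
    %% return only the keys whose hashes differ from the remote
    %% side, so only those objects need to be sent.
    LocalTree = merkerl:build_tree(LocalKeyHashes),
    merkerl:diff(LocalTree, RemoteTree).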

Also, if you use nearly any backend besides fs_backend, the
performance will tend to be quite different.

I hope that this message helps with your understanding and use of Riak.

Best regards,

-Justin