'not found' after join

Kresten Krab Thorup krab at trifork.com
Thu May 5 04:05:14 EDT 2011

I've also been asking for this, and the current master has code to remedy these things, but it's not in an official release yet.

At Erlang level, you can specify options to a RiakClient:get as follows

-type option() :: {r, pos_integer()} |         %% Minimum number of successful responses
                  {pr, non_neg_integer()} |    %% Minimum number of primary vnodes participating
                  {basic_quorum, boolean()} |  %% Whether to use basic quorum (return early
                                               %% in some failure cases.
                  {notfound_ok, boolean()}  |  %% Count notfound reponses as successful.
                  {timeout, pos_integer() | infinity}. %% Timeout for vnode responses

And so to get the semantics I think you're asking for, do GET (assuming N=3) with

[{r,2},{pr, 2}, {basic_quorum, false}, {notfound_ok, true}]

So this will work as you want as long as there is only one node down.

During handoff you may see a new kind of error

HTTP 503 / {error r_value_unsatisfied ...}

which  is the behavior when basic quorum is disabled, i.e. the alternative to getting a notfound just because there was some node which did not have the value.

Each of those are also available as query parameters when doing a HTTP get.


I'm also looking forward to a release which has this, and I'm hoping that the defaults can somehow be simplified / strengthened so people new to this don't need to be so surprised about these things.


On May 5, 2011, at 8:18 AM, Greg Nelson wrote:

I just added node #5 to our cluster, and once again the experience during the subsequent 60-minute handoff period was pretty awful!  I just don't understand why this would be expected behavior while adding a node.  There doesn't seem to be any realistic way to join a node to an online cluster.  As far as I'm concerned this is a huge defect in Riak.

Read-repair didn't seem to kick in immediately for data.  My application was configured to retry GETs (with a few seconds of backoff), and still got 404s.  I manually requested an object repeatedly for over 20 minutes until finally getting a result.

I think bug #992 (https://issues.basho.com/show_bug.cgi?id=992) describes the defect, but I'm wondering if there is more to it than this?  Especially since read-repair didn't quite seem to work.

Could what Daniel describes on that bug ("Only return not found when all vnodes have reported not found (or error)") be implemented as a configurable option?  Maybe something one could kick in when a node joins until all handoffs are complete?

What we can do to remedy this before I add node #6, #7, etc.  We're storing huge amounts of data, which means that a) we'll be adding nodes often, and b) the amount of data handoff will be large, which means long periods of handoff where we don't want to have downtime.


On Tuesday, May 3, 2011 at 2:30 AM, Nico Meyer wrote:

Hi everyone,

I just want to note that I observed similar behaviour with a somewhat
larger clusters of 10 or so nodes. I first noticed that handoff activity
after node join (or leave for that matter) involved a lot more
partitions than I would have expected. By comparing the old and the new
ring file, I found out that more than 80 percent of partitions had to be
moved to another node.
My naive expectation was that joining a node to a cluster of size X
would result in roughly ring_creation_size/(X+1) partitions to be handed
off, which would also be the minimum if one expects a balanced cluster
Furthermore it would in theory be possible to move partitions in such a
way that at least one partition from each preflist stays on the same
node. Maybe for X>N it should even be possible to guarantee this for a
basic quorum of each preflist, eliminating the notfound problem
completely, but I am not sure about that.

I may be able to provide some ring files to analyze this behaviour if
someone from basho is interested.

Cheer Nico

Am Montag, den 02.05.2011, 23:14 -0400 schrieb Ryan Zezeski:

Your expectations are fair, just because you added a node doesn't mean
Riak should return notfounds. Unfortunately, we aren't quite there
yet. This is a side effect of how Riak currently implements handoff
in that it immediately updates/gossips the ring causing
many partitions to handoff immediately. If a request comes in that
relies on these partitions then it will get a notfound and perform
read repair. You're situation is multiplied by the fact that you are
going from 3 nodes to 4. More vnode shuffling occurs because of the
small cluster size.

We're well aware of this and have it on our radar for improvement in a
future release.

All this said, you data will be eventually consistent. That is, all
your data will eventually be handed off and things will work as
normal. It's only during the handoff that you _may_ encounter
notfounds. In this case it would be best to add a new node to your
cluster at lowest load times and if you can spare additional hardware
a few more nodes to start with is an even easier option.


On Mon, May 2, 2011 at 9:48 PM, Greg Nelson <grourk at dropcam.com<mailto:grourk at dropcam.com>>
Hello riak users!

I have a 4 node cluster that started out as 3 nodes.
ring_creation_size = 2048, target_n_val is default (4), and
all buckets have n_val = 3.

When I joined the 4th node, for a few minutes some GETs were
returning 'not found' for data that was already in riak.
Eventually the data was returned, due to read repair I would
assume. Is this expected? It seems that 'not found' and read
repairs should only happen when something goes wrong, like a
node goes down. Not when adding a node to the cluster, which
is supposed to be part of normal operation!

Any help or insight is appreciated!


riak-users mailing list
riak-users at lists.basho.com<mailto:riak-users at lists.basho.com>

riak-users mailing list
riak-users at lists.basho.com<mailto:riak-users at lists.basho.com>

riak-users mailing list
riak-users at lists.basho.com<mailto:riak-users at lists.basho.com>


More information about the riak-users mailing list