'not found' after join

Ryan Zezeski rzezeski at basho.com
Thu May 5 10:40:44 EDT 2011


Hi Greg,

You're right, the current situation for handoff is not optimal.  We are
aware of these shortcomings and are working on solutions.  That said, there
may be some knobs you can turn to minimize the amount of time you spend in
handoff.

Since you are using a ring size of 2048, you may be testing the limits of
certain subsystems inside Riak.  I'm not saying it shouldn't work, just
that many things in Riak were probably coded with an expectation of a
smaller ring size.

1. For example, riak_core has a `handoff_concurrency` setting that
determines how many vnodes can hand off concurrently on a given node.  By
default this is set to 4.  That's going to take a while with your 2048
vnodes and all :)  See the rough math after this list.

2. Also, by default, riak_core sets the `vnode_inactivity_timeout` to 60s.
That means a vnode must be inactive for 60 seconds before it will begin
handoff.  You could try lowering this to expedite handoff.

3. _Do not_ make constant calls to `riak-admin transfers`.  Every time you
run it you reset the vnode activity timer and stall handoff.  I know, I
know, how could anyone possibly be expected to know that?  Along with
everything else, it's something we plan to address.
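
To put rough numbers on it, here is some back-of-the-envelope arithmetic in
an Erlang shell.  This is only a sketch: it assumes the new node ends up
with roughly its fair share of partitions and that only the four existing
nodes are sending at the default concurrency.  In practice the ring can
shuffle far more than this, so read it as a lower bound, not a prediction.

%% rough lower-bound arithmetic; assumptions, not measurements
1> PartitionsToNewNode = 2048 div 5.   % fair share for node #5
409
2> ConcurrentTransfers = 4 * 4.        % 4 sending nodes * handoff_concurrency of 4
16
3> PartitionsToNewNode div ConcurrentTransfers.
25
%% ~25 "waves" of transfers, and each vnode also waits out the 60s
%% inactivity timeout before it starts handing off.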

In summary, you could try adding the following to your app.config and
adjusting the values to taste.  And keep in mind not to hit `riak-admin
transfers` too often, so the vnodes get a chance to hand off.

[
  %% Riak Core Config
  {riak_core, [
               ...
               %% settings to adjust handoff latency/throughput
               {handoff_concurrency, N},       %% default is 4
               {vnode_inactivity_timeout, M},  %% in milliseconds; default is 60000 (60s)
               ...
              ]},
  ...
].
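
If you want to sanity-check what a running node is actually using, something
like the following from `riak attach` should do it.  This is a hypothetical
session: the node name is made up, `get_env` will return 'undefined' if the
value was never set and the built-in default applies, and I wouldn't count
on a runtime `set_env` affecting transfers already in flight -- the
app.config change (plus a restart) is the authoritative route.

$ riak attach
(riak@127.0.0.1)1> application:get_env(riak_core, handoff_concurrency).
{ok,4}
(riak@127.0.0.1)2> application:set_env(riak_core, handoff_concurrency, 8).
ok
(riak@127.0.0.1)3> application:get_env(riak_core, vnode_inactivity_timeout).
{ok,60000}

Detach with Ctrl-D when you're done; `q().` would stop the node.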

HTH,
-Ryan

On Thu, May 5, 2011 at 2:18 AM, Greg Nelson <grourk at dropcam.com> wrote:

> I just added node #5 to our cluster, and once again the experience during
> the subsequent 60-minute handoff period was pretty awful!  I just don't
> understand why this would be expected behavior while adding a node.  There
> doesn't seem to be any realistic way to join a node to an online cluster.
>  As far as I'm concerned this is a *huge* defect in Riak.
>
> Read-repair didn't seem to kick in immediately for data.  My application
> was configured to retry GETs (with a few seconds of backoff), and still got
> 404s.  I manually requested an object repeatedly for over 20 minutes until
> finally getting a result.
>
> I think bug #992 (https://issues.basho.com/show_bug.cgi?id=992) describes
> the defect, but I'm wondering if there is more to it than this?  Especially
> since read-repair didn't quite seem to work.
>
> Could what Daniel describes on that bug ("Only return not found when all
> vnodes have reported not found (or error)") be implemented as a configurable
> option?  Maybe something one could kick in when a node joins until all
> handoffs are complete?
>
> What can we do to remedy this before I add node #6, #7, etc.?  We're storing
> huge amounts of data, which means that a) we'll be adding nodes often, and
> b) the amount of data handoff will be large, which means long periods of
> handoff where we don't want to have downtime.
>
> Greg
>
> On Tuesday, May 3, 2011 at 2:30 AM, Nico Meyer wrote:
>
> Hi everyone,
>
> I just want to note that I observed similar behaviour with somewhat
> larger clusters of 10 or so nodes. I first noticed that handoff activity
> after node join (or leave for that matter) involved a lot more
> partitions than I would have expected. By comparing the old and the new
> ring file, I found out that more than 80 percent of partitions had to be
> moved to another node.
> My naive expectation was that joining a node to a cluster of size X
> would result in roughly ring_creation_size/(X+1) partitions being handed
> off, which would also be the minimum if one expects a balanced cluster
> afterwards.
> Furthermore it would in theory be possible to move partitions in such a
> way that at least one partition from each preflist stays on the same
> node. Maybe for X>N it should even be possible to guarantee this for a
> basic quorum of each preflist, eliminating the notfound problem
> completely, but I am not sure about that.
>
> I may be able to provide some ring files to analyze this behaviour if
> someone from basho is interested.
>
> Cheers, Nico
>
> Am Montag, den 02.05.2011, 23:14 -0400 schrieb Ryan Zezeski:
>
> Greg,
>
>
> Your expectations are fair; just because you added a node doesn't mean
> Riak should return notfounds. Unfortunately, we aren't quite there
> yet. This is a side effect of how Riak currently implements handoff:
> it immediately updates/gossips the new ring, causing
> many partitions to hand off immediately. If a request comes in that
> relies on these partitions then it will get a notfound and perform
> read repair. Your situation is made worse by the fact that you are
> going from 3 nodes to 4. More vnode shuffling occurs because of the
> small cluster size.
>
>
> We're well aware of this and have it on our radar for improvement in a
> future release.
>
>
> All this said, your data will be eventually consistent. That is, all
> your data will eventually be handed off and things will work as
> normal. It's only during the handoff that you _may_ encounter
> notfounds. In this case it would be best to add new nodes to your
> cluster during your lowest-load times, and if you can spare the
> additional hardware, starting with a few more nodes is an even easier option.
>
>
> -Ryan
>
> On Mon, May 2, 2011 at 9:48 PM, Greg Nelson <grourk at dropcam.com>
> wrote:
> Hello riak users!
>
>
> I have a 4 node cluster that started out as 3 nodes.
> ring_creation_size = 2048, target_n_val is default (4), and
> all buckets have n_val = 3.
>
>
> When I joined the 4th node, for a few minutes some GETs were
> returning 'not found' for data that was already in Riak.
> Eventually the data was returned, due to read repair I would
> assume. Is this expected? It seems that 'not found' and read
> repairs should only happen when something goes wrong, like a
> node goes down. Not when adding a node to the cluster, which
> is supposed to be part of normal operation!
>
>
> Any help or insight is appreciated!
>
>
> Greg
>