question on replicas distribution

Justin Sheehy justin at
Wed Oct 14 22:27:47 EDT 2009

Hi Stewart,

On Wed, Oct 14, 2009 at 10:04 PM, stewart mackenzie <setori88 at> wrote:

> You guys have instead opted for a scenario where the addressing of the
> replica to a node is hashed/baked into the actual key. ....

This is the essence of consistent hashing: use of the same addressing
domain for objects and their locations, and a well-known hash function
into that domain.
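To make that concrete, here is a minimal Python sketch of the idea (illustrative only; Riak's real implementation hashes with SHA-1 into a much larger ring space, and the names here are made up):

```python
import hashlib

RING_SIZE = 2 ** 32  # hypothetical ring size for the sketch

def chash(term):
    """Map any term -- an object's bucket/key or a node's name --
    into the same shared ring address space."""
    digest = hashlib.sha1(term.encode()).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE

# Objects and storage locations land in the same domain,
# so an object's hash directly implies where it lives on the ring:
obj_position = chash("mybucket/mykey")
node_position = chash("node1@host")
```

Because the hash function is well known to every node, any node can compute an object's ring position locally, with no lookup table to consult.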

> What happens if a first/fresh replica xyz stored on node 1. (and later
> replicated to nodes 2, 3 and 4.) Now the addressing in the key points to
> node 1. If node 1 goes down. How does the addressing get the value from 2, 3
> or 4.

The first thing to think about is N (a bucket's configurable n_val
parameter), which defaults to 3.  This is the number of replicas of a
given document that should be stored.  So, in the typical case, a
value is stored on 3 different nodes immediately upon the first put.
(There is no "later replicated" step, as all replicas are of equal
interest, importance, and timing.)

In addition to a well-known hash function, we have a well-known and
simple function for finding the N preferred storage locations
corresponding to a given hash value.  This can be seen in
riak_ring:filtered_preflist/3, and in most cases it boils down to the
next N ring positions that are owned by distinct nodes.
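That "next N ring positions owned by distinct nodes" logic can be sketched in a few lines of Python (a simplified stand-in for the Erlang in riak_ring:filtered_preflist/3; the ring layout and node names below are invented for illustration):

```python
def preflist(key_hash, ring, n):
    """Walk the ring clockwise from key_hash, collecting partitions
    until partitions owned by N distinct nodes have been found.
    `ring` is a list of (position, owner_node) pairs sorted by position."""
    # Find the first partition at or after the key's hash (wrapping to 0).
    start = 0
    for i, (pos, _node) in enumerate(ring):
        if pos >= key_hash:
            start = i
            break
    chosen, owners = [], set()
    for step in range(len(ring)):  # bounded walk: at most once around the ring
        pos, node = ring[(start + step) % len(ring)]
        if node not in owners:  # skip partitions on already-chosen nodes
            owners.add(node)
            chosen.append((pos, node))
            if len(chosen) == n:
                break
    return chosen

# A toy ring: 8 partitions claimed round-robin by 4 nodes.
ring = [(i * (2 ** 32 // 8), "node%d" % (i % 4 + 1)) for i in range(8)]
replicas = preflist(1, ring, 3)  # 3 partitions on 3 distinct nodes
```

Any node holding the current ring state can run this same computation and arrive at the same answer, which is why no routing step is needed.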

As a result, all N replicas of a given document are findable based
simply on the document's bucket and key, plus the current state of the
ring in terms of partition ownership -- without any "routing" or need
to ask any other node where the replicas are.

(There are some interesting failover cases when a node is down or when
ownership is in flux, but this email message is already long enough.)

> I need to hit the books again! now where did I put that amazon dynamo paper
> again? ;)

The Dynamo paper is actually a great place to start for this topic,
yes.  Go for it.


More information about the riak-users mailing list