Absolute consistency

Joseph Blomstedt joe at basho.com
Wed Jan 11 22:14:54 EST 2012

If ElasticSearch is a better fit, then using ElasticSearch is the
right thing to do. The whole "NoSQL movement" is really about choice.
At scale there will never be a single solution that is best for
everyone.  Riak is intentionally focused on high availability,
reliability, and fault-tolerance. If you have critical data and/or
care most about optimizing for "it just works" (even during a server
failure, even at 2 AM), then Riak is likely the best choice. If other
priorities dominate, then you may want to look elsewhere.

Riak is based on Dynamo. Dynamo is an eventually consistent system
that embraces sloppy quorums and hinted handoff. In terms of CAP, it
is an AP system rather than a CP system. And, by embracing sloppy
quorums, Riak/Dynamo takes AP to its limit by optimizing for "always
writable" over "read-your-own writes" consistency. In comparison,
Project Voldemort is another Dynamo inspired system that chooses to go
with strict quorums. This provides RYOW eventual consistency, but
leads to lower availability guarantees during failures/partitions.

For Riak 1.0, we introduced PW/PR. The intention behind PW/PR was to
add support for strict quorums. Requests with R/W settings would use
the standard sloppy quorum logic, requests with PR/PW would use strict
quorum logic. As Jon mentioned in his earlier email, the current
implementation that went into 1.0 turns out to have a corner case were
strict quorums aren't enforced. Addressing this corner case and
strengthening the guarantees of PW/PR are "on the list".

To illustrate sloppy vs strict, consider a 5-node cluster: A/B/C/D/E.
And a W=Quorum/N=3 write request is issued on a key that is owned by
nodes A/B/C. However, nodes B and C are currently offline.

With sloppy quorums/hinted handoff, Riak will re-route the requests
meant for B/C to nodes D/E. This allows the write to succeed with the
guarantee of having 3-replicas in the cluster. In the future, D will
eventually handoff it's replica back to B whenever it comes online,
and E will do the same to C. This is hinted handoff.

With strict quorums (ie. PW=Quorum), the desired behavior is to only
send requests to the primary nodes. Since B/C are down, only A would
respond, and therefore the write would fail because it did not fulfill
the requested replica requirement. Again, in the current
implementation, there are cases where this is not guaranteed and a
request may be sent to a fallback node. This leads to the behavior
discussed earlier.

Of course, even if we had perfect PW/PR semantics, Riak still only
gives you a limited form of "read your own writes" consistency. The
labels "absolute consistency", "strong consistency" or even "atomic
operation" are vague when discussing distributed systems with multiple

Some thoughts to ponder:

1. Do you allow multiple clients to write to Riak at the same time?
With concurrent writers, "atomic" can mean multiple things. Do you
want linearizability? Do you want one writer to fail? Is optimistic
concurrency control or MVCC your solution? You could route everything
though a single writer, but then you introduce a
single-point-of-failure (SPOF). Adding an SPOF in-front of a highly
available system is non-ideal. Or, perhaps a single writer with leader
election / failover?

2. What about write failure? In Riak, a write failure does not mean
the value won't later show-up in a read. If you issue a PW=3 write, it
may fail because it succeeded to write to 1 replica, but not the other
2. However, the 1-replica that does have the value will eventually
propagate it to the other 2 replicas through read repair. Thus, you
may eventually read the failed-to-write value. Of course, the write
could have failed to all 3 replicas in which case you won't ever read
it. Which type of failure was it? You don't know, because actual
failure and "didn't reply before I responded to the client" are
indistinguishable without something similar to a 2-phase commit. Does
your client handle this case? Perhaps just re-issues writes until they
succeed? What if your client dies while re-issuing requests, is the
value lost? What consistency guarantees do you want to provide in this

In general, distributed consistency is non-trivial. Even master/slave
systems have choices to make. Synchronous vs asynchronous replication.
ElasticSearch is synchronous (at least to secondary RAM), while
MongoDB is asynchronous. If you have multiple slaves, what consistency
guarantees are there between all of them? If the master crashes during
a write that was replicated to some but not all slaves, is it possible
to get different values on a read if different subset of slaves crash
as well? For the strongest replication guarantees, you end up with
protocols with higher latency and lower availability guarantees. It's
always a tradeoff game.

As an aside, around March 2010, I started to investigate strong
consistency in Riak. Part of that work lead to an implementation of
riak_zab (http://github.com/jtuple/riak_zab), an atomic, 2-phase
commit protocol built on riak_core that I released last year. I have
unreleased code that provided a strongly consistent riak_kv layer
(riakual) on top of riak_zab. This was one of many possible ways to
add stronger guarantees to Riak.

As Jon mentioned, stronger consistency is a research area for 2012.
While CAP dictates that you can't have C/A/P at once, there's no
reason you can't have a product that provides both AP requests and CP
requests. Perhaps there will be more to discuss on that point later on
this year.

The main take-away of this long email is that providing different
guarantees in the presence of node failures and network partitions is
a non-trivial problem.  If the goal is high-availability and no SPOF,
the problem is even more challenging.  I would recommend against
anyone trying to implementing client-side strong consistency on-top of
Riak, unless you understand the scope of the problem and are
intentionally limiting yourself to a subset of strong consistency (eg.
"assume writes always succeed", "assume only one writer always", etc).
Or, if you understand how you would do so, leverage something like
Zookeeper to provide consistent replicated state that is backed by
Riak. (That's essentially what riakual/riak_zab did). Or, better yet,
look at how you can reformulate your problem to work in an eventually
consistent system. Less coordination will always provide faster, more
predictable performance. Options like statebox, knockbox, and
meangirls can help with certain problem domains:

Finally, if you truly need atomic, strong consistency, today and not
tomorrow then consider other options. Seriously, if Riak isn't the
right fit, "Don't use my database":

If you do look at other options, take the time to truly understand the
guarantees provided. What does atomic, multi-master replication mean
for a given product?  What failure conditions can it tolerate? Can you
ever lose data under rare, but not necessarily uncommon scenarios? If
you're unsure, ask questions. It's not about good or bad. Different
products have different guarantees.  The goal is to ensure that
whatever features/guarantees you need for your project are provided by
the choice you decide upon.


On Wed, Jan 11, 2012 at 11:07 AM, Les Mikesell <lesmikesell at gmail.com> wrote:
> On Wed, Jan 11, 2012 at 10:56 AM, Vishal Shah <goldenv at gmail.com> wrote:
>> To add to Ian's comment, for me personally, this specific characteristic is
>> in fact a very important distinguishing feature of Riak vs other scalable KV
>> systems. To me, this is what separates
> Yes, but it makes it unusable for anything that requires atomic
> operations.  A group here just went with elasticsearch - partly
> because of the full lucene indexer and range queries, but mostly
> because they wanted redundant data feeds and an atomic operation to
> reject duplicates.  And as a nice side effect, you can run a
> client-only node that doesn't store data on the same box with the
> application so you don't need to go through a separate load balancer
> to deal with failures of the node the app is configured to use.
> --
>  Les Mikesell
>    lesmikesell at gmail.com
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Joseph Blomstedt <joe at basho.com>
Software Engineer
Basho Technologies, Inc.

More information about the riak-users mailing list