Jon Meredith jmeredith at basho.com
Mon Aug 1 23:46:54 EDT 2011

Hi Soren,

I forgot to reply-all, so I'm resending to keep the discussion on the mailing list.

First reply:

A few months back we added options on the master branch to the get/put FSM
to control how many *primary* nodes (partition owners) are required for each
get and put operation (PR and PW for get/put respectively).  If you set
PR+PW > N then you should get the behavior you desire in point 1.  Otherwise
there's a chance that your write could go to (possibly different) fallback
nodes, which would *eventually* deliver the result back to the current owner.
In the case of a locking mechanism, eventual consistency is undesirable.
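
The quorum arithmetic can be sketched like this. It's a minimal sketch assuming N = 3; the pr/pw query-parameter names follow the Riak HTTP convention but are an assumption here, so verify them against your build:

```python
# Sketch of the PR + PW > N check, assuming N = 3 replicas.
# The pr/pw query-parameter names mirror the Riak HTTP convention but are
# an assumption here - verify them against your build.
from urllib.parse import urlencode

N = 3

def lock_quorum_params(pr, pw, n=N):
    """Build get/put query strings, refusing PR/PW combinations that
    would let fallback vnodes satisfy the operation (PR + PW <= N)."""
    if pr + pw <= n:
        raise ValueError("need PR + PW > N to guarantee primary overlap")
    return urlencode({"pr": pr}), urlencode({"pw": pw})

get_qs, put_qs = lock_quorum_params(pr=2, pw=2)  # 2 + 2 > 3: accepted
print(get_qs, put_qs)  # pr=2 pw=2
```

With PR + PW > N, every successful read quorum overlaps every successful write quorum on at least one primary, which is what rules out the fallback-node window described above.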

For (2) nothing protects you from concurrent writes - two puts could be
coordinated at the same time, and because multiple nodes (well, vnodes) are
involved there is no guarantee of the ordering.  Being prepared to handle
conflicts is a much safer strategy.

You may be able to use the returnbody option in combination with PR/PW to
check if you got what you wanted.  If the result you receive after put only
contains the client that locked it, treat it as success.  Otherwise the lock
failed.  I'm not quite sure how you would clean up on collisions, as it's
possible that both clients could lock.  Even a solution of 'lowest sorted
client wins' wouldn't necessarily work if a client with a higher id
successfully locked and then a client with a lower id created a conflict.
If you can work out a way to reliably clean up, you may be onto something.
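
The returnbody check can be sketched as follows; the returned_values list stands in for whatever (possibly sibling) object values the client library hands back after the put, and the names are illustrative:

```python
# Sketch of the returnbody-based lock check described above. The
# returned_values list stands in for the (possibly sibling) values Riak
# returns when returnbody is set; names here are illustrative.

def lock_acquired(my_client_id, returned_values):
    """The lock is held only if the returned object contains exactly
    our identifier - any sibling means the lock attempt conflicted."""
    return returned_values == [my_client_id]

print(lock_acquired("client1", ["client1"]))             # True: clean write
print(lock_acquired("client1", ["client1", "client2"]))  # False: conflict
```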

Follow ups from Soren:

Referencing (2) - That was what I was trying to achieve by simply doing the
PUT and reading back the value afterwards.

Thinking about this some more, I guess I'm still a bit stuck in my SQL
thinking and figured that setting dw=n would somehow synchronise the
write across the different nodes, but I gather that's not actually the
case (since that would imply some kind of transaction rollback mechanism).
As you correctly observe - there are no transactions.  DW guarantees that
the storage backend has accepted the write on that many nodes - depending on
how you've configured the backend, operating system and disk
controllers/disks it may mean the data has been written to the disk.  On
failure, no transactional rollbacks occur. A consequence of this is that a
write can fail if it does not meet the PW/W/DW requirements but still be
written to some vnodes.  A later read can return the 'failed' value.  I wish
we had made the returns be 'ok' or 'maybe', instead of {'error', Blah}.
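
The 'ok or maybe' semantics can be made explicit in client code. This is a sketch; put_fn and get_fn are hypothetical stand-ins for a Riak client's put and get operations:

```python
# Sketch of treating a failed write as "maybe written" rather than
# "not written". put_fn and get_fn are hypothetical stand-ins for a
# Riak client's put/get operations.

def put_with_maybe(put_fn, get_fn, key, value):
    try:
        put_fn(key, value)
        return "ok"
    except IOError:
        # The put failed to meet W/DW/PW, but may still have landed on
        # some vnodes - re-read before concluding anything.
        return "maybe" if value in get_fn(key) else "unwritten"

store = {}
def flaky_put(k, v):
    store[k] = [v]                      # the write lands on a vnode...
    raise IOError("dw not satisfied")   # ...but the quorum reply fails

print(put_with_maybe(flaky_put, lambda k: store.get(k, []), "lock", "c1"))
# maybe
```

The point of the re-read is exactly the behavior described above: a 'failed' put is not evidence the value is absent.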

> I seem to have not understood how returnbody works under the hood.
> Does it essentially perform a GET operation after finishing the PUT? I
> was afraid it would return whatever I had passed in my PUT request, so
> effectively masking it if there were conflicts when attempting the
> write.

Whenever Riak does a put it retrieves the current object on each vnode,
updates it with the incoming value - either replacing it or creating
siblings depending on the vclock - then writes the object.  Returnbody
returns that written value to the put FSM, saving a get call.  It's just an
optimization really; you could issue a separate get.
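
A toy model of that per-vnode flow, where the boolean descends stands in for the real vclock-descent check:

```python
# Toy model of the per-vnode put flow: read the current object, replace it
# or add a sibling depending on the vclock, write, and hand the written
# object back (which is what returnbody delivers). The boolean `descends`
# stands in for the real vclock-descent check.

def vnode_put(store, key, incoming, descends):
    current = store.get(key)
    if current is None or descends:
        store[key] = [incoming]              # clean replace
    else:
        store[key] = current + [incoming]    # concurrent write: sibling
    return store[key]                        # returnbody's payload

store = {}
print(vnode_put(store, "lock", "client1 lock", descends=False))
# ['client1 lock']
print(vnode_put(store, "lock", "client2 lock", descends=False))
# ['client1 lock', 'client2 lock']
```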

> Did you miss the last part of my algorithm where I read back the value
> after the supposed successful write for verification? If not, how is
> what you're suggesting different?
> I suggested writing the client id with allow_mult=false and w=dw=n,
> and then reading it back again afterwards (with r=n). If I didn't find
> the value I expect in there, start over (with the PUT with
> If-None-Match: * etc.) until I successfully PUT and then READ back my
> own ID. Once I'm done with the lock, DELETE it.

The clean up problem I was talking about was in the case where two clients
try and lock at the same time and neither gets the lock. This can happen
depending on the order the vnodes process the put requests.

Client1: get returns not found
Client2: get returns not found
Client1: put(<<"b">>, <<"lock">>, <<"client1 lock">>) while Client2:
put(<<"b">>, <<"lock">>, <<"client2 lock">>)
Client1: get(<<"b">>, <<"lock">>) returns [<<"client1 lock">>]
Client2: get(<<"b">>, <<"lock">>) returns siblings [<<"client1 lock">>,
<<"client2 lock">>]

So client1 gets the lock, and client2 sees it failed. If the ordering of
node message delivery is different, both clients may succeed. If-None-Match
is implemented as a get to check the vclock followed by a put in the
HTTP/PBC interfaces, so there is still a chance of creating a conflict when
both If-None-Match gets complete before the puts.

Client1: get returns not found
Client2: get returns not found
Client1: put(<<"b">>, <<"lock">>, <<"client1 lock">>) while Client2:
put(<<"b">>, <<"lock">>, <<"client2 lock">>)
Client1: get(<<"b">>, <<"lock">>) returns siblings [<<"client1 lock">>,
<<"client2 lock">>]
Client2: get(<<"b">>, <<"lock">>) returns siblings [<<"client1 lock">>,
<<"client2 lock">>]

In this case neither client gets the lock.  Notice that client2 reads the
same value as in the case where client1 succeeds. This makes it very hard
to know when to delete and retry the lock.

Setting allow_mult=false makes things worse: it will just choose the
sibling with the latest timestamp, and because the If-None-Match is not
atomic it will hide what really happened.
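
Last-write-wins resolution can be sketched as follows; it shows why the surviving value carries no trace of the conflict:

```python
# Sketch of last-write-wins resolution (what allow_mult=false does):
# only the sibling with the newest timestamp survives, so the surviving
# value carries no trace that a conflict ever happened.

def resolve_lww(siblings):
    """siblings: list of (timestamp, value) pairs; returns the survivor."""
    return max(siblings, key=lambda s: s[0])[1]

conflicted = [(1.0, "client1 lock"), (1.1, "client2 lock")]
print(resolve_lww(conflicted))  # client2 lock - client1's attempt vanishes
```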

The If-None-Match behavior isn't ideal. I've been thinking about pushing the
If-None-Match option down to the vnode level, but I don't think it resolves
all cases; it needs some thought.

Best Regards,

Jon Meredith
Senior Software Engineer
Basho Technologies

On Sun, Jul 31, 2011 at 2:55 PM, Soren Hansen <soren at linux2go.dk> wrote:

> I've seen a couple of posts here and there on the subject of a locking
> mechanism for Riak, most notably:
>   http://riak-users.197444.n3.nabble.com/Riak-and-Locks-td1866960.html
> While it would only serve as an advisory locking mechanism, wouldn't a
> bucket with a reasonably high n, w and dw set equal to n, a
> deterministic naming scheme for the object being locked, and a locking
> algorithm such as:
> 1. PUT /locks/object_id
>   If-None-Match: *
>   Body: <some unique identifier for the client thread>
> 1a. If this fails, wait for a while, then try again.
> 1b. If it succeeds, proceed to 2.
> 2. The doc for If-None-Match says "this does not prevent concurrent
> writes; it is possible for the condition to evaluate to true for
> multiple requests if the requests occur at the same time."  I'm not
> completely sure if n=w=dw protects me from concurrent writes (I'm not
> familiar with the locking semantics of a single riak instance).
> Anyway, if I'm in fact not protected, the next step is to read the
> value back to make sure we're actually the ones holding the key. If
> not, go back to step 1. If yes, proceed as planned.
> 3. Once you're done with the lock, just DELETE it.
> If this were really that simple, someone would have suggested it. So,
> what is this Riak rookie (i.e. I) missing?
> --
> Soren Hansen        | http://linux2go.dk/
> Ubuntu Developer    | http://www.ubuntu.com/
> OpenStack Developer | http://www.openstack.org/
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com