Details on new delete_mode setting introduced in 1.0.0

Jon Meredith jmeredith at
Wed Oct 12 19:01:19 EDT 2011


  Riak 1.0.0 has introduced more control over deletion with the delete_mode
setting.

  If you plan to delete and recreate objects under the same key rapidly, and
there is enough disk available to store tombstones, it is safest to set
delete_mode to keep.

  The default 3s delay before removing tombstones is a compromise: it keeps
the tombstone around long enough to cover most rapid delete/recreate
sequences but, unlike keep mode, it does eventually remove the data.

Riak keeps your objects available during failures by storing multiple copies
of the data.  This redundancy makes deletion more complex than a single node
database.  For example, Riak needs to ensure deletes issued while nodes are
down get applied when the nodes recover, or resolve what happens if the
network is partitioned and an object is deleted on one side but updated on
the other side.

Deletes in Riak are a two step process.  First Riak writes a tombstone
object to the N replicas, and only once all replicas have stored the
tombstone is it removed.  If fallback nodes are in use, the tombstone will
not be removed.  Riak will wait until the next time the object is accessed
and it successfully reads the same tombstone object from all primaries.  This
scheme works well most of the time, but is not perfect and Basho still has
some improvements planned.  As part of Riak 1.0 we've added a delete_mode
setting to give more control over the deletion process while we complete
that work.

Explaining what delete_mode does requires more gory details.  When a client
requests deletion of an object, internally Riak performs a get/put against
the object to write the tombstone and acknowledges the delete to the client
allowing it to continue.  In the background Riak issues a second get
operation to trigger tombstone removal.

Whenever Riak gets an object, as part of a delete or not, it requests all
replicas of the object from the current preference list (made up of owner
nodes responsible for storing replicas of the object and fallback nodes if
the owners are unavailable).  If any vnodes return out of date or
conflicting objects it will issue read repairs and stop.  If all
replicas have the same object and it is a tombstone object and there are no
fallbacks in the preference list then the get FSM issues the request to
remove the tombstone permanently.
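
To make the decision above concrete, here is a minimal Erlang sketch.  It is
not the riak_kv get FSM code: the module name, the decide/2 function, the
shape of the replies and the result atoms are all invented for illustration,
and "the same object" is simplified to plain term equality.

```erlang
-module(get_decision_sketch).
-export([decide/2]).

%% Replies      :: one entry per replica, {ok, Obj} | {error, notfound}
%% Obj          :: {tombstone, VClock} | {value, VClock, Data}  (hypothetical)
%% HasFallbacks :: true if the preference list contains fallback nodes
decide(Replies, HasFallbacks) ->
    case lists:usort(Replies) of
        [{ok, {tombstone, _VClock}}] when not HasFallbacks ->
            %% Every replica returned the same tombstone, all from primaries.
            reap_tombstone;
        [_Single] ->
            %% Replicas agree, but it is a live object or fallbacks are in use.
            done;
        _Differing ->
            %% Out of date or conflicting replicas: repair and stop.
            read_repair
    end.
```

Note that in this sketch a tombstone is only reaped when both conditions
hold; agreement alone is not enough if a fallback is in the preference list.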

Removing the tombstone object is the hard part of deletion.  To maintain
its availability properties, Riak relies on eventual consistency and
deliberately does not synchronize updates to objects across replicas as one
or more could fail during a request.  This means that there can be small
variations in when the remove tombstone request is processed.  If the object
is being updated (a get/put) by a client during the window between the
tombstone-checking get deciding to remove the tombstones and the tombstone
being removed on the nodes then one of two things will happen.  Get uses R
to decide when it has enough objects, so if any of the first R responses
include the tombstone object the response can include a vector clock to base
the new object on that will supersede the tombstones.  If none of the first
R of N requests contain the tombstone object, there is no versioning
information to return to the client so a new object will be created with
empty versioning information.
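
The two outcomes can be sketched as follows.  This is only an illustration
under assumed names: client_vclock/2, the response tuples and the returned
atoms are hypothetical and do not come from riak_kv.

```erlang
-module(get_reply_sketch).
-export([client_vclock/2]).

%% Responses are listed in arrival order; only the first R are considered.
%% Each one is {ok, {tombstone, VClock}} or {error, notfound} (assumed shapes).
client_vclock(Responses, R) ->
    FirstR = lists:sublist(Responses, R),
    case [VC || {ok, {tombstone, VC}} <- FirstR] of
        [VClock | _] -> {vclock, VClock};  %% a new put can supersede the tombstone
        []           -> no_vclock          %% a new put starts with empty versioning
    end.
```

With N=3 and R=2, the result depends purely on which two replicas answer
first, which is why the race window matters.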

The new delete_mode setting in the riak_kv section of app.config is for
controlling what happens during the window between the get FSM deciding it
can remove the object and it taking place.  If a delay is set, when the
tombstone removal request is received, the tombstone is hashed and then
checked after the delay to see if it has changed.  If it has not, the
tombstone is removed.  If the hash has changed due to an update, the new
object is left alone.  This is the default delete_mode with a delay of 3s.
Delayed deletes are implemented using timers on the vnodes, so setting long
delays on systems with heavy delete activity will increase memory usage.
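
A rough sketch of the delayed check, assuming erlang:phash2/1 as the hash
(the hash actually used is not specified here) and with invented function
and message names.  Each pending removal holds a timer for the length of
the delay.

```erlang
-module(delayed_reap_sketch).
-export([on_reap_request/2, on_timer/2]).

%% Called when the tombstone removal request arrives: hash the tombstone
%% and arm a timer that delivers {final_delete, Hash} after DelayMsecs.
on_reap_request(Tombstone, DelayMsecs) ->
    Hash = erlang:phash2(Tombstone),
    _TimerRef = erlang:send_after(DelayMsecs, self(), {final_delete, Hash}),
    Hash.

%% Called when the timer fires; CurrentObj is whatever is stored now.
on_timer(Hash, CurrentObj) ->
    case erlang:phash2(CurrentObj) of
        Hash -> remove;       %% unchanged tombstone: safe to remove
        _    -> leave_alone   %% updated in the meantime: keep the new object
    end.
```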

Setting delete_mode to 'keep' disables tombstone removal.  This is useful
for applications where clients may be disconnected for extended periods
and keep local copies of objects that they will update on reconnection (for
example mobile clients or multi-data center replication when the two sites
will be disconnected for long periods of time).  It also protects against an
edge case where an object is deleted and recreated on the owning nodes while
a fallback is either down or awaiting handoff.

There is also an 'immediate' delete_mode which preserves the old 0.14.2
behavior of removing the tombstone as soon as the request is received.  Some
unit tests for clients rely on the old behavior (e.g. the Python client) and
expect the versioning information to be reset after the delete.  The 1.0
HTTP, PBC and riak_client interfaces can now provide vclocks when tombstone
objects are present so that a put will supersede them.

Wrapping it up - the default delete_mode will work for most use cases, with
the option to override it when needed.  To change the setting, add
{delete_mode, keep}, {delete_mode, immediate} or {delete_mode, DelayMsecs}
to the riak_kv section of app.config (in /etc/riak, /opt/riak or etc/
depending on your platform).
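
For example, the relevant part of app.config might look like the fragment
below.  All other sections and riak_kv settings are omitted, and the 3000
value simply spells out the 3s default in milliseconds.

```erlang
%% app.config (fragment)
[
 {riak_kv, [
            %% Pick exactly one of the following:
            {delete_mode, 3000}          %% delay in milliseconds
            %% {delete_mode, keep}       %% never remove tombstones
            %% {delete_mode, immediate}  %% remove as soon as requested (0.14.2 behavior)
           ]}
].
```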

Jon Meredith
Senior Software Engineer
Basho Technologies