key garbage collection

Greg Stein gstein at
Thu Nov 3 03:49:06 EDT 2011

On Thu, Nov 3, 2011 at 02:39, Justin Karneges <justin at> wrote:
> Say you have an operation that requires creating two keys, A and B, and you
> succeed in creating A but fail in creating B.  How do you delete A after the
> fact?  I have two ideas:
> 1) Run periodic MapReduce operations that do full db scans looking for garbage
> keys and deleting them (this seems really horrible, but I'll admit I'm new to
> distributed DBs and MapReduce).

I believe that you will *always* need to do this. Without
transactions, you can always end up with cruft. The best you can do is
minimize how often you need to run the scavenge process.

> 2) Maintain cleanup logs that explicitly identify possibly offending keys, for
> optimized cleanup processing.

These logs need to be stored *somewhere*, but that storage could also
fail. That is why I believe you'll need a periodic full scan for

(and note this applies whether "storage" is memory, disk, Riak, or
whatever else)

> So far so good.  Now for handling cleanup.  Periodically, we scan the
> "cleanup" bucket for keys to process.  Since keys only exist in this bucket at
> the moment of a write (they are deleted immediately afterwards), in practice
> there should hardly be any keys in here at any single point in time.  We're
> talking single digits here.  Much better than a full db scan to find garbage
> keys.  Also, the keys to process can be narrowed down by time (e.g. > 5
> minutes ago) based on the key name.
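
To make the time-based narrowing in the quoted text concrete, here is a
minimal sketch. The key layout "<unix-seconds>-<id>" and the function name
stale_cleanup_keys are assumptions for illustration only; the original
message does not say how the timestamp is encoded in the key name.

```python
import time


def stale_cleanup_keys(keys, max_age_seconds=300, now=None):
    """Return the cleanup keys whose embedded timestamp is older than max_age_seconds.

    Assumes (hypothetically) that each key looks like "<unix-seconds>-<id>".
    """
    now = time.time() if now is None else now
    stale = []
    for key in keys:
        ts_text, _, _ = key.partition("-")
        try:
            created = int(ts_text)
        except ValueError:
            # Not in the assumed format; leave it for the periodic full scan.
            continue
        if now - created > max_age_seconds:
            stale.append(key)
    return stale
```

Only the keys returned here would be handed to the cleanup step; anything
younger is assumed to belong to a write that is still in flight.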

This will minimize your scans, but not eliminate them. You may not be
able to write to the "cleanup" bucket because you've lost all network
connectivity to the Riak cluster. Not a bad assumption, given that you
could not write out B (what makes you think you could write to

Personally, rather than attempting to write something else to a
failing Riak cluster, I'd suggest keeping these keys in memory along
with a background thread that periodically attempts to clean them up.
You're gonna lose the keys if the client dies, but hey... as I said:
best you can do is to minimize the full scans.
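
A rough sketch of the in-memory approach, under stated assumptions: the
class name PendingCleanup is made up for illustration, and delete_key is a
hypothetical callable supplied by the caller that deletes one key and raises
on failure. No real Riak client call is shown here.

```python
import threading


class PendingCleanup:
    """In-memory set of orphaned keys, retried by a background thread."""

    def __init__(self, delete_key, interval=30.0):
        self._delete_key = delete_key  # hypothetical: raises if the delete fails
        self._interval = interval      # seconds between cleanup attempts
        self._pending = set()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        # Only valid after start(); wakes the thread and waits for it to exit.
        self._stop.set()
        self._thread.join()

    def add(self, key):
        """Remember a key (e.g. A, after B failed) for later deletion."""
        with self._lock:
            self._pending.add(key)

    def pending(self):
        with self._lock:
            return set(self._pending)

    def sweep(self):
        """Try to delete every pending key once; keep the ones that fail."""
        for key in self.pending():
            try:
                self._delete_key(key)
            except Exception:
                continue  # cluster still unreachable; retry on the next sweep
            with self._lock:
                self._pending.discard(key)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self.sweep()
```

Usage would be roughly: create one instance per client process, call
start(), and call add("A") whenever the write of B fails. The pending set
lives only in memory, so it is lost if the client dies, which is why the
periodic full scan is still needed as a backstop.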



More information about the riak-users mailing list