Deleting data in a map-reduce job

Daniel Abrahamsson daniel.abrahamsson at
Fri Oct 18 02:11:22 EDT 2013


Does anyone have any experience with a similar setup?

We have resolved questions 4 and 5 - they were caused by a firewall
misconfiguration - but I would still very much like to hear whether there are
any drawbacks to deleting data in the map-reduce job itself, compared to just
collecting the keys and then deleting the data with the client.
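For reference, the client-side alternative we are weighing would look roughly
like this (a sketch only; it assumes the Erlang PB client, riakc_pb_socket,
and a hypothetical map phase my_mod:collect_keys that returns {Bucket, Key}
pairs instead of deleting):

    %% Run the map-reduce job to collect keys only, then delete
    %% each object from the client side.
    {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),
    {ok, [{_Phase, BucketKeys}]} =
        riakc_pb_socket:mapred(Pid, <<"my_bucket">>,
                               [{map, {modfun, my_mod, collect_keys},
                                 none, true}]),
    [ok = riakc_pb_socket:delete(Pid, B, K) || {B, K} <- BucketKeys].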

Daniel Abrahamsson

On Thu, Oct 10, 2013 at 11:14 AM, Daniel Abrahamsson <
daniel.abrahamsson at> wrote:

> Hi, I've some questions regarding map-reduce jobs. The main one regards
> deleting data in a map-reduce job.
> I have a map-reduce job that traverses an entire bucket to clean up
> old/unusable data once a day. The deletion of objects is done in the
> map-reduce job itself. Here is an example map-reduce job expressed as a
> qfun explaining what I am doing:
> fun({error, notfound}, _, _) -> []; % Skip objects that no longer exist
>    (O, _, Condition) ->
>      Obj = case hd(riak_object:get_values(O)) of
>        <<>> -> % Ignore tombstones
>          default_object_not_to_be_deleted();
>        Bin ->
>          binary_to_term(Bin)
>      end,
>      case should_be_deleted(Obj, Condition) of
>        false -> [{total, 1}, {removed, 0}];
>        true ->
>          Key = riak_object:key(O),
>          Bucket = riak_object:bucket(O),
>          {ok, Client} = riak:local_client(),
>          Client:delete(Bucket, Key),
>          [{total, 1}, {removed, 1}]
>      end
> end.
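> For completeness, we invoke it with the local client roughly like this
> (a sketch; the bucket name is made up, and Fun/Condition are as above):
>
>   {ok, C} = riak:local_client(),
>   {ok, Results} =
>       C:mapred(<<"my_bucket">>,
>                [{map, {qfun, Fun}, Condition, true}]).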
> And now to the questions:
> 1. I have noted that deleting data this way leaves the keys around if I do
> a subsequent
>    list_keys() operation. They are pruned when I try to get the objects
> and get {error, notfound}.
>    With this approach, will the keys ever be removed unless someone tries
> to get them first?
> 2. Are there any other drawbacks to deleting data in the map-reduce job
> itself, rather than
>    collecting the keys with the job and then using the regular Riak client
> to delete the objects?
> 3. Handling of tombstones in map-reduce jobs is very poorly documented.
> The approach above has worked for us. However, it feels very
> awkward with both an {error, notfound} clause and a check for an empty
> binary as the value. I know you can also check for the "X-Riak-Deleted" flag
> in the metadata. Under what circumstances do the different values appear,
> and, most importantly, which is the recommended way of dealing with
> tombstones in map-reduce jobs?
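> For comparison, the metadata-based check would look something like this
> (a sketch; it assumes the dict-based object metadata):
>
>   MD = riak_object:get_metadata(O),
>   case dict:is_key(<<"X-Riak-Deleted">>, MD) of
>     true  -> tombstone; % Object has been deleted; skip it
>     false -> binary_to_term(hd(riak_object:get_values(O)))
>   end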
> Since we will have quite a lot of data in our bucket, we run the
> job during off-hours so as not to
> disturb regular traffic. However, when we run the job we often get an
> "error, disconnected" error after approximately 15 minutes, even though our
> timeout is greater than that. Running the job manually afterwards
> usually takes only ~30 seconds.
> 4. Has anyone else experienced this with a "cold" database? We have not
> yet configured all the tuning parameters reported by "riak diag", but will
> do so soon. Might that have an effect here?
> 5. What does the "disconnected" message mean, considering that the timeout
> value has not yet been reached?
> Regards,
> Daniel Abrahamsson

More information about the riak-users mailing list