Deleting data in a map-reduce job

Daniel Abrahamsson daniel.abrahamsson at klarna.com
Thu Oct 10 05:14:14 EDT 2013


Hi, I've some questions regarding map-reduce jobs. The main one regards
deleting data in a map-reduce job.

I have a map-reduce job that traverses an entire bucket to clean up
old/unusable data once a day. The deletion of objects is done in the
map-reduce job itself. Here is an example map-reduce job expressed as a
qfun explaining what I am doing:

fun({error, notfound}, _, _) -> [];
   (O, _, Condition) ->
     Obj = case hd(riak_object:get_values(O)) of
             <<>> -> % Ignore tombstones
               default_object_not_to_be_deleted();
             Bin -> binary_to_term(Bin)
           end,
     case should_be_deleted(Obj, Condition) of
       false -> [{total, 1}, {removed, 0}];
       true ->
         % Delete the object from inside the map phase itself
         Key = riak_object:key(O),
         Bucket = riak_object:bucket(O),
         {ok, Client} = riak:local_client(),
         Client:delete(Bucket, Key),
         [{total, 1}, {removed, 1}]
     end
end.

And now to the questions:
1. I have noticed that deleting data this way leaves the keys around when I
   do a subsequent list_keys() operation. They are only pruned when I try to
   get the objects and receive {error, notfound}. With this approach, will
   the keys ever be removed unless someone tries to get them first?

2. Are there any other drawbacks to deleting data in the map-reduce job
   itself, rather than collecting the keys with the job and then using the
   regular Riak client to delete the objects?

3. Handling of tombstones in map-reduce jobs is very poorly documented. The
   approach above has worked for us. However, it feels very awkward to have
   both an {error, notfound} clause and a check for an empty binary as the
   value. I know you can also check for the "X-Riak-Deleted" flag in the
   metadata. Under what circumstances do the different values appear and,
   most importantly, which is the recommended way of dealing with tombstones
   in map-reduce jobs?
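
For reference, the metadata check mentioned in question 3 could look roughly
like the sketch below. It is only an illustration, not a confirmed
recommendation: the module and function names (tombstone_check, is_tombstone,
all_tombstones) are made up for this example, and it assumes that a tombstone
is marked by the presence of the "X-Riak-Deleted" key in a metadata dict,
whatever value is stored under it.

```erlang
-module(tombstone_check).
-export([is_tombstone/1, all_tombstones/1]).

%% Metadata key assumed to mark a tombstone (the flag named in question 3).
-define(MD_DELETED, <<"X-Riak-Deleted">>).

%% True if a single metadata dict carries the deleted marker.
%% Only the presence of the key is checked, not its value.
is_tombstone(MD) ->
    dict:is_key(?MD_DELETED, MD).

%% True only if every sibling's metadata carries the deleted marker.
%% Takes a non-empty list of metadata dicts, one per sibling.
all_tombstones(MDs) when is_list(MDs), MDs =/= [] ->
    lists:all(fun is_tombstone/1, MDs).
```

Inside the map function this would presumably be called as
tombstone_check:all_tombstones(riak_object:get_metadatas(O)), which would
also cover objects with siblings instead of looking only at the first value.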

Considering that we will have quite a lot of data in our bucket, we run the
job during off-hours so as not to disturb regular traffic. However, when we
run the job we often get an "error, disconnected" error after approximately
15 minutes, even though our timeout is greater than that. Running the job
manually afterwards usually takes just ~30 seconds.

4. Has anyone else experienced this with a "cold" database? We have not
   yet configured all the tuning parameters reported by "riak diag", but will
   do so soon. Might this have an effect in this case?

5. What does the "disconnected" message mean, considering that the timeout
value has not yet been reached?

Regards,
Daniel Abrahamsson