Object not found after successful PUT on S3 API

Daniel Miller dmiller at dimagi.com
Mon Mar 6 10:07:41 EST 2017


I recently had another case of a disappearing object. This time the object
was successfully PUT, and (unlike the previous cases reported in this
thread) for a period of time GETs were also successful. Then GETs started
404ing for no apparent reason. There are no errors in the logs to indicate
that anything unusual happened. This is quite disconcerting. Is it normal
that Riak CS just loses track of objects? At this point we are using CS as
primary object storage, meaning we do not have the data stored in another
database, so it's critical that the data is not randomly lost.

In the CS access logs I see

# All prior GET requests for this object succeeded like this one. This is
# the last successful GET request:
[28/Feb/2017:14:42:35 +0000] "GET /buckets/blobdb/objects/commcarehq__apps%2F3d2b... HTTP/1.0" 200 14923 "" "Boto3/1.4.0 Python/2.7.6 Linux/3.13.0-86-generic Botocore/1.4.53 Resource"
...
# All GET requests for this object now fail like this one (the first 404):
[02/Mar/2017:08:36:11 +0000] "GET /buckets/blobdb/objects/commcarehq__apps%2F3d2b... HTTP/1.0" 404 240 "" "Boto3/1.4.0 Python/2.7.6 Linux/3.13.0-86-generic Botocore/1.4.53 Resource"

The object name has been elided for readability. I do not know when this
object was PUT into the cluster because I only have logs for the past
month. Is there any way to dig further into Riak or Riak CS data to
determine if the object content is actually completely lost or if there are
any other details that might explain why it is now missing? Could I
increase some logging parameters to get more information about what is
going wrong when something like this happens?

I have searched the logs for other 404 responses but found none (other than
the two reported earlier), so this is the 3rd known missing object in the
cluster. We retain logs for one month only (I'm increasing this now because
of this issue), so it is possible that other objects have also gone
missing, but I cannot see them since the logs have been truncated.
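
The search for objects whose GETs went from 200 to 404 can be sketched in Python. This is only a minimal sketch, not a Riak CS tool: it assumes each access-log entry occupies one physical line in the log file, in the shape shown in the excerpt above, and that lines are read in chronological order. The function name `find_vanished` is hypothetical.

```python
import re

# Assumed shape of one access-log entry, modeled on the excerpt above:
# [timestamp] "METHOD path HTTP/x.y" status ...
ENTRY = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})\b'
)


def find_vanished(lines):
    """Return {path: (last_200_ts, first_404_ts)} for GETs that went 200 -> 404.

    Paths that were successfully DELETEd after their last 200 are skipped,
    because a 404 is the expected response for them.
    """
    last_ok = {}   # path -> timestamp of the most recent successful GET
    vanished = {}  # path -> (last successful GET, first 404 after it)
    for line in lines:
        m = ENTRY.search(line)
        if not m:
            continue
        method, path = m.group("method"), m.group("path")
        status, ts = m.group("status"), m.group("ts")
        if method == "DELETE" and status.startswith("2"):
            last_ok.pop(path, None)
        elif method == "GET" and status == "200":
            last_ok[path] = ts
            vanished.pop(path, None)  # the object is readable again
        elif (method == "GET" and status == "404"
              and path in last_ok and path not in vanished):
            vanished[path] = (last_ok[path], ts)
    return vanished
```

Passing an open log file (for example `find_vanished(open(path_to_access_log))`, where the path is whatever the deployment uses) would list every key that was readable earlier in the retained logs and later returned 404 without an intervening DELETE.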

The cluster now has 7 nodes instead of 9 (see earlier emails in this
thread), and the Riak storage backend is now leveldb instead of multi. I
have attached config file templates for riak, riak-cs and stanchion (these
are deployed with ansible).

Bucket properties:
{
  "props": {
    "notfound_ok": true,
    "n_val": 3,
    "last_write_wins": false,
    "allow_mult": true,
    "dvv_enabled": false,
    "name": "blobdb",
    "r": "quorum",
    "precommit": [],
    "old_vclock": 86400,
    "dw": "quorum",
    "rw": "quorum",
    "small_vclock": 50,
    "write_once": false,
    "basic_quorum": false,
    "big_vclock": 50,
    "chash_keyfun": {
      "fun": "chash_std_keyfun",
      "mod": "riak_core_util"
    },
    "postcommit": [],
    "pw": 0,
    "w": "quorum",
    "young_vclock": 20,
    "pr": 0,
    "linkfun": {
      "fun": "mapreduce_linkfun",
      "mod": "riak_kv_wm_link_walker"
    }
  }
}
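
For readers less familiar with the symbolic values above, here is a minimal sketch of how they resolve to replica counts for an n_val of 3. It assumes Riak's usual definition of "quorum" as a majority of n_val, that is floor(n_val / 2) + 1. The helper name `resolve_quorum` is hypothetical and is not part of any Riak client.

```python
def resolve_quorum(value, n_val):
    """Translate a Riak quorum setting into a replica count.

    Integers are returned unchanged; the symbolic names are resolved
    against n_val ("quorum" is assumed to mean a simple majority).
    """
    if isinstance(value, int):
        return value
    symbolic = {"one": 1, "quorum": n_val // 2 + 1, "all": n_val}
    return symbolic[value]


# Quorum-related values copied from the bucket properties above.
props = {"n_val": 3, "r": "quorum", "w": "quorum", "dw": "quorum",
         "rw": "quorum", "pr": 0, "pw": 0}

effective = {
    name: resolve_quorum(value, props["n_val"])
    for name, value in props.items()
    if name != "n_val"
}
print(effective)
```

Under that assumption, r, w, dw and rw each resolve to 2 of the 3 replicas, while pr and pw stay at 0, meaning none of the replies is required to come from a primary vnode.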

I'll be happy to provide more context to help troubleshoot this issue.

Thanks in advance for any help you can provide.

Daniel


On Tue, Feb 14, 2017 at 11:52 AM, Daniel Miller <dmiller at dimagi.com> wrote:

> Hi Luke,
>
> Sorry for the late response and thanks for following up. I haven't seen it
> happen since. At this point I'm going to wait and see if it happens again
> and hopefully get more details about what might be causing it.
>
> Daniel
>
> On Thu, Feb 9, 2017 at 1:02 PM, Luke Bakken <lbakken at basho.com> wrote:
>
>> Hi Daniel -
>>
>> I don't have any ideas at this point. Has this scenario happened again?
>>
>> --
>> Luke Bakken
>> Engineer
>> lbakken at basho.com
>>
>>
>> On Wed, Jan 25, 2017 at 2:11 PM, Daniel Miller <dmiller at dimagi.com> wrote:
>> > Thanks for the quick response, Luke.
>> >
>> > There is nothing unusual about the keys. The format is a name + UUID + some
>> > other random URL-encoded characters, like most other keys in our cluster.
>> >
>> > There are no errors near the time of the incident in any of the logs (the
>> > last [error] is from over a month before). I see lots of messages like this
>> > in console.log:
>> >
>> > /var/log/riak/console.log
>> > 2017-01-20 15:38:10.184 [info]
>> > <0.22902.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 2 keys during
>> > active anti-entropy exchange of
>> > {776422744832042175295707567380525354192214163456,3} between
>> > {776422744832042175295707567380525354192214163456,'riak-fake3 at fake3.fake.com'}
>> > and
>> > {822094670998632891489572718402909198556462055424,'riak-fake9 at fake9.fake.com'}
>> > 2017-01-20 15:40:39.640 [info]
>> > <0.21789.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 1 keys during
>> > active anti-entropy exchange of
>> > {936274486415109681974235595958868809467081785344,3} between
>> > {959110449498405040071168171470060731649205731328,'riak-fake3 at fake3.fake.com'}
>> > and
>> > {981946412581700398168100746981252653831329677312,'riak-fake5 at fake5.fake.com'}
>> > 2017-01-20 15:46:40.918 [info]
>> > <0.13986.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 2 keys during
>> > active anti-entropy exchange of
>> > {662242929415565384811044689824565743281594433536,3} between
>> > {685078892498860742907977265335757665463718379520,'riak-fake3 at fake3.fake.com'}
>> > and
>> > {707914855582156101004909840846949587645842325504,'riak-fake6 at fake6.fake.com'}
>> > 2017-01-20 15:48:25.597 [info]
>> > <0.29943.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 2 keys during
>> > active anti-entropy exchange of
>> > {776422744832042175295707567380525354192214163456,3} between
>> > {776422744832042175295707567380525354192214163456,'riak-fake3 at fake3.fake.com'}
>> > and
>> > {799258707915337533392640142891717276374338109440,'riak-fake0 at fake0.fake.com'}
>> >
>> > Thanks!
>> > Daniel
>> >
>> >
>> >
>> > On Wed, Jan 25, 2017 at 9:45 AM, Luke Bakken <lbakken at basho.com> wrote:
>> >>
>> >> Hi Daniel -
>> >>
>> >> This is a strange scenario. I recommend looking at all of the log
>> >> files for "[error]" or other entries at about the same time as these
>> >> PUTs or 404 responses.
>> >>
>> >> Is there anything unusual about the key being used?
>> >> --
>> >> Luke Bakken
>> >> Engineer
>> >> lbakken at basho.com
>> >>
>> >>
>> >> On Wed, Jan 25, 2017 at 6:40 AM, Daniel Miller <dmiller at dimagi.com> wrote:
>> >> > I have a 9-node Riak CS cluster that has been working flawlessly for
>> >> > about 3 months. The cluster configuration, including the backend and
>> >> > bucket parameters such as N-value, uses default settings. I'm using
>> >> > the S3 API to communicate with the cluster.
>> >> >
>> >> > Within the past week I had an issue where two objects were PUT,
>> >> > resulting in a 200 (success) response, but all subsequent GET requests
>> >> > for those two keys return a status of 404 (not found). Other than the
>> >> > fact that they are now missing, there was nothing out of the ordinary
>> >> > with these particular two PUTs. Maybe I'm missing something, but this
>> >> > seems like a scenario that should never happen. All information
>> >> > included here about PUTs and GETs comes from reviewing the CS access
>> >> > logs. Both objects were PUT on the same node; however, GET requests
>> >> > returning 404 have been observed on all nodes. There is plenty of
>> >> > other traffic on the cluster involving GETs and PUTs that are not
>> >> > failing. I'm unsure of how to troubleshoot further to find out what
>> >> > may have happened to those objects and why they are now missing. What
>> >> > is the best approach to figure out why an object that was successfully
>> >> > PUT seems to be missing?
>> >> >
>> >> > Thanks!
>> >> > Daniel Miller
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config-files.zip
Type: application/zip
Size: 11842 bytes
Desc: not available
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20170306/c5ffc5c1/attachment.zip>

