Object not found after successful PUT on S3 API

Russell Brown russell.brown at icloud.com
Mon Mar 6 12:28:08 EST 2017


I’m genuinely stumped, then.

I’m surprised that dvv_enabled=false is the default, since sibling explosion is bad.

I don’t know the CS code very well, but I assume a not_found means that either the manifest or one of the chunks is missing. I wonder if you can get the manifest and then see whether any/all of the chunks are present?
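
Something like this would be one way to check from Python (very much a sketch: the internal bucket naming is from memory, so please verify it against riak_cs_utils:to_bucket_name/2 in your CS version first, and the host/port are just the PB defaults):

    # Rough sketch only: checks whether the CS manifest object for a key still
    # exists in the underlying Riak cluster. The "0o:" + md5(cs_bucket) naming
    # is my recollection of riak_cs_utils:to_bucket_name/2 -- verify before
    # relying on it. Python 2 (matching the client in the access logs),
    # official riak Python client.
    import hashlib
    import riak

    CS_BUCKET = 'blobdb'
    OBJECT_KEY = 'commcarehq__apps/3d2b...'  # the real, URL-decoded key

    client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)

    manifest_bucket = client.bucket('0o:' + hashlib.md5(CS_BUCKET).digest())
    manifest = manifest_bucket.get(OBJECT_KEY)
    print('manifest exists: %s' % manifest.exists)
    print('siblings: %d' % len(manifest.siblings))

    # The manifest value itself is an Erlang term, so extracting the block
    # UUIDs from it is easier via `riak attach` on a CS node; with a UUID in
    # hand you could then look for the individual blocks in the
    # "0b:" + md5(cs_bucket) bucket (same caveat: naming from memory).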

On 6 Mar 2017, at 17:21, Daniel Miller <dmiller at dimagi.com> wrote:

> > Would be good to know the riak version
> 
> Riak 2.1.1
> Riak CS 2.1.0
> Stanchion 2.1.0
> 
> > why the dvv_enabled bucket property is set to false, please?
> 
> Looks like that's the default. I haven't changed it.
> 
> > Also, is there multi-datacentre replication involved?
> 
> no
> 
> > Do you re-use your keys, for example, have the keys in question been created, deleted, and then re-created?
> 
> no
> 
> Thank you for the prompt follow-up.
> 
> Daniel
> 
> 
> On Mon, Mar 6, 2017 at 10:38 AM, Russell Brown <russell.brown at icloud.com> wrote:
> Hi,
> Would be good to know the riak version, and why the dvv_enabled bucket property is set to false, please? Also, is there multi-datacentre replication involved? Do you re-use your keys, for example, have the keys in question been created, deleted, and then re-created?
> 
> Cheers
> 
> Russell
> 
> On 6 Mar 2017, at 15:07, Daniel Miller <dmiller at dimagi.com> wrote:
> 
> > I recently had another case of a disappearing object. This time the object was successfully PUT, and (unlike the previous cases reported in this thread) for a period of time GETs were also successful. Then GETs started 404ing for no apparent reason. There are no errors in the logs to indicate that anything unusual happened. This is quite disconcerting. Is it normal that Riak CS just loses track of objects? At this point we are using CS as primary object storage, meaning we do not have the data stored in another database, so it's critical that the data is not randomly lost.
> >
> > In the CS access logs I see
> >
> > # all prior GET requests for this object succeeding like this one. This is the last successful GET request:
> > [28/Feb/2017:14:42:35 +0000] "GET /buckets/blobdb/objects/commcarehq__apps%2F3d2b... HTTP/1.0" 200 14923 "" "Boto3/1.4.0 Python/2.7.6 Linux/3.13.0-86-generic Botocore/1.4.53 Resource"
> > ...
> > # all GET requests for this object are now failing like this one (the first 404):
> > [02/Mar/2017:08:36:11 +0000] "GET /buckets/blobdb/objects/commcarehq__apps%2F3d2b... HTTP/1.0" 404 240 "" "Boto3/1.4.0 Python/2.7.6 Linux/3.13.0-86-generic Botocore/1.4.53 Resource"
> >
> > The object name has been elided for readability. I do not know when this object was PUT into the cluster because I only have logs for the past month. Is there any way to dig further into Riak or Riak CS data to determine if the object content is actually completely lost or if there are any other details that might explain why it is now missing? Could I increase some logging parameters to get more information about what is going wrong when something like this happens?
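> >
> > For what it's worth, a quick client-side way to capture the full error response (including the HTTP status, headers, and any request id that could be matched against the corresponding CS access log line) looks roughly like this; the endpoint URL and credentials below are placeholders:
> >
> >     # Sketch: capture the full S3 error payload for a failing key so it can
> >     # be matched against the Riak CS access log. Endpoint and credentials
> >     # are placeholders.
> >     import boto3
> >     from botocore.exceptions import ClientError
> >
> >     s3 = boto3.client(
> >         's3',
> >         endpoint_url='http://riak-cs.example.com:8080',  # placeholder
> >         aws_access_key_id='...',
> >         aws_secret_access_key='...',
> >     )
> >
> >     try:
> >         s3.head_object(Bucket='blobdb', Key='commcarehq__apps/3d2b...')
> >     except ClientError as exc:
> >         print(exc.response['Error'])             # error code, e.g. '404'
> >         print(exc.response['ResponseMetadata'])  # HTTP status and headers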
> >
> > I have searched the logs for other 404 responses but found none (other than the two reported earlier), so this is the 3rd known missing object in the cluster. We retain logs for one month only (I'm increasing this now because of this issue), so it is possible that other objects have also gone missing, but I cannot see them since the logs have been truncated.
> >
> > The cluster now has 7 nodes instead of 9 (see earlier emails in this thread), and the riak storage backend is now leveldb instead of multi. I have attached config file templates for riak, riak-cs and stanchion (these are deployed with Ansible).
> >
> > Bucket properties:
> > {
> >   "props": {
> >     "notfound_ok": true,
> >     "n_val": 3,
> >     "last_write_wins": false,
> >     "allow_mult": true,
> >     "dvv_enabled": false,
> >     "name": "blobdb",
> >     "r": "quorum",
> >     "precommit": [],
> >     "old_vclock": 86400,
> >     "dw": "quorum",
> >     "rw": "quorum",
> >     "small_vclock": 50,
> >     "write_once": false,
> >     "basic_quorum": false,
> >     "big_vclock": 50,
> >     "chash_keyfun": {
> >       "fun": "chash_std_keyfun",
> >       "mod": "riak_core_util"
> >     },
> >     "postcommit": [],
> >     "pw": 0,
> >     "w": "quorum",
> >     "young_vclock": 20,
> >     "pr": 0,
> >     "linkfun": {
> >       "fun": "mapreduce_linkfun",
> >       "mod": "riak_kv_wm_link_walker"
> >     }
> >   }
> > }
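> >
> > For reference, properties like these can be read (and, if ever needed, changed) with the official riak Python client; a minimal sketch, assuming the default PB host/port. The commented-out call shows how dvv_enabled could be flipped, noted only for reference:
> >
> >     # Sketch: read bucket properties like the JSON above via the official
> >     # riak Python client. Host/port are the PB defaults and may differ.
> >     import riak
> >
> >     client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
> >     bucket = client.bucket('blobdb')
> >
> >     print(bucket.get_properties())   # a dict much like the JSON above
> >
> >     # Enabling dotted version vectors would be (not applied here):
> >     # bucket.set_properties({'dvv_enabled': True})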
> >
> > I'll be happy to provide more context to help troubleshoot this issue.
> >
> > Thanks in advance for any help you can provide.
> >
> > Daniel
> >
> >
> > On Tue, Feb 14, 2017 at 11:52 AM, Daniel Miller <dmiller at dimagi.com> wrote:
> > Hi Luke,
> >
> > Sorry for the late response and thanks for following up. I haven't seen it happen since. At this point I'm going to wait and see if it happens again and hopefully get more details about what might be causing it.
> >
> > Daniel
> >
> > On Thu, Feb 9, 2017 at 1:02 PM, Luke Bakken <lbakken at basho.com> wrote:
> > Hi Daniel -
> >
> > I don't have any ideas at this point. Has this scenario happened again?
> >
> > --
> > Luke Bakken
> > Engineer
> > lbakken at basho.com
> >
> >
> > On Wed, Jan 25, 2017 at 2:11 PM, Daniel Miller <dmiller at dimagi.com> wrote:
> > > Thanks for the quick response, Luke.
> > >
> > > There is nothing unusual about the keys. The format is a name + UUID + some
> > > other random URL-encoded characters, like most other keys in our cluster.
> > >
> > > There are no errors near the time of the incident in any of the logs (the
> > > last [error] is from over a month before). I see lots of messages like this
> > > in console.log:
> > >
> > > /var/log/riak/console.log
> > > 2017-01-20 15:38:10.184 [info]
> > > <0.22902.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 2 keys during
> > > active anti-entropy exchange of
> > > {776422744832042175295707567380525354192214163456,3} between
> > > {776422744832042175295707567380525354192214163456,'riak-fake3 at fake3.fake.com'}
> > > and
> > > {822094670998632891489572718402909198556462055424,'riak-fake9 at fake9.fake.com'}
> > > 2017-01-20 15:40:39.640 [info]
> > > <0.21789.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 1 keys during
> > > active anti-entropy exchange of
> > > {936274486415109681974235595958868809467081785344,3} between
> > > {959110449498405040071168171470060731649205731328,'riak-fake3 at fake3.fake.com'}
> > > and
> > > {981946412581700398168100746981252653831329677312,'riak-fake5 at fake5.fake.com'}
> > > 2017-01-20 15:46:40.918 [info]
> > > <0.13986.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 2 keys during
> > > active anti-entropy exchange of
> > > {662242929415565384811044689824565743281594433536,3} between
> > > {685078892498860742907977265335757665463718379520,'riak-fake3 at fake3.fake.com'}
> > > and
> > > {707914855582156101004909840846949587645842325504,'riak-fake6 at fake6.fake.com'}
> > > 2017-01-20 15:48:25.597 [info]
> > > <0.29943.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 2 keys during
> > > active anti-entropy exchange of
> > > {776422744832042175295707567380525354192214163456,3} between
> > > {776422744832042175295707567380525354192214163456,'riak-fake3 at fake3.fake.com'}
> > > and
> > > {799258707915337533392640142891717276374338109440,'riak-fake0 at fake0.fake.com'}
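> > >
> > > A quick way to see whether these repairs spiked around the time the object disappeared is to tally them per day; a small sketch (log path from above; it assumes each event is on a single line in the real log rather than wrapped as it is in this email):
> > >
> > >     # Sketch: count AAE "Repaired N keys" events per day in console.log.
> > >     import re
> > >     from collections import Counter
> > >
> > >     counts = Counter()
> > >     with open('/var/log/riak/console.log') as f:
> > >         for line in f:
> > >             m = re.match(r'(\d{4}-\d{2}-\d{2}) .*Repaired (\d+) keys', line)
> > >             if m:
> > >                 counts[m.group(1)] += int(m.group(2))
> > >
> > >     for day in sorted(counts):
> > >         print('%s  %d keys repaired' % (day, counts[day]))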
> > >
> > > Thanks!
> > > Daniel
> > >
> > >
> > >
> > > On Wed, Jan 25, 2017 at 9:45 AM, Luke Bakken <lbakken at basho.com> wrote:
> > >>
> > >> Hi Daniel -
> > >>
> > >> This is a strange scenario. I recommend looking at all of the log
> > >> files for "[error]" or other entries at about the same time as these
> > >> PUTs or 404 responses.
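> > >>
> > >> A small sketch of that sweep in Python (log paths assumed from the default riak / riak-cs layout; adjust as needed):
> > >>
> > >>     # Sketch: print every "[error]" line from the Riak and Riak CS logs,
> > >>     # with file name and line number, for correlation with the PUT/404
> > >>     # timestamps. Paths are the default locations and may differ.
> > >>     import glob
> > >>
> > >>     logs = glob.glob('/var/log/riak/*.log') + glob.glob('/var/log/riak-cs/*.log')
> > >>     for path in logs:
> > >>         with open(path) as f:
> > >>             for n, line in enumerate(f, 1):
> > >>                 if '[error]' in line:
> > >>                     print('%s:%d: %s' % (path, n, line.rstrip()))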
> > >>
> > >> Is there anything unusual about the key being used?
> > >> --
> > >> Luke Bakken
> > >> Engineer
> > >> lbakken at basho.com
> > >>
> > >>
> > >> On Wed, Jan 25, 2017 at 6:40 AM, Daniel Miller <dmiller at dimagi.com> wrote:
> > >> > I have a 9-node Riak CS cluster that has been working flawlessly for
> > >> > about 3
> > >> > months. The cluster configuration, including backend and bucket
> > >> > parameters
> > >> > such as N-value, uses default settings. I'm using the S3 API to
> > >> > communicate with the cluster.
> > >> >
> > >> > Within the past week I had an issue where two objects were PUT resulting
> > >> > in
> > >> > a 200 (success) response, but all subsequent GET requests for those two
> > >> > keys
> > >> > return status of 404 (not found). Other than the fact that they are now
> > >> > missing, there was nothing out of the ordinary about these particular
> > >> > PUTs. Maybe I'm missing something, but this seems like a scenario that
> > >> > should never happen. All information included here about PUTs and GETs
> > >> > comes
> > >> > from reviewing the CS access logs. Both objects were PUT on the same
> > >> > node; however, GET requests returning 404 have been observed on all
> > >> > nodes.
> > >> > There is
> > >> > plenty of other traffic on the cluster involving GETs and PUTs that are
> > >> > not
> > >> > failing. I'm unsure of how to troubleshoot further to find out what may
> > >> > have
> > >> > happened to those objects and why they are now missing. What is the best
> > >> > approach to figure out why an object that was successfully PUT seems to
> > >> > be
> > >> > missing?
> > >> >
> > >> > Thanks!
> > >> > Daniel Miller
> > >> >
> > >> >
> > >
> > >
> >
> >
> > <config-files.zip>
> 
> 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com




