Put failure: too many siblings

Vladyslav Zakhozhai v.zakhozhai at smartweb.com.ua
Thu Jun 1 08:25:23 EDT 2017


Hi Russell,

I am reading about the "retry_put_coordinator_failure" option and I do not
understand it completely.

http://docs.basho.com/riak/kv/2.2.3/configuring/reference/#miscellaneous

Here is what I understand so far.

*Statement 1.* If we have N=3, W=3 then a PUT operation (write) will be
successful if at least one vnode write was successful (i.e. 2 vnodes failed
due to high load). If none of the vnodes was able to write the data, the PUT
request fails.

*Statement 2.* In the case of such a "successful" PUT we have only one copy
of the data, and the missing copies will be restored by read repair or AAE
(if the latter is enabled).

This is how I understand "the risk of potentially increasing the likelihood
of write failure" from the link above:
"Setting it to off will speed response times on PUT requests in general,
but at the risk of potentially increasing the likelihood of write failure."

Russell, or anybody on the list: are my statements correct or not?
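
To make the question concrete, here is a small Python sketch of the two
readings I can imagine. It is only a model of my question, not Riak code:
the function names and the `acks` parameter (the number of vnodes that
acknowledged the write) are my own assumptions.

```python
# Toy model of two possible readings of a PUT with N=3, W=3.
# Nothing here is Riak code; all names are hypothetical.

N = 3  # replicas per key (n_val)
W = 3  # write quorum requested for the PUT


def ok_under_statement_1(acks: int) -> bool:
    """Reading from Statement 1: one successful vnode write is enough."""
    return acks >= 1


def ok_under_strict_quorum(acks: int, w: int = W) -> bool:
    """Alternative reading: the PUT is reported successful only when
    at least W vnodes have acknowledged it."""
    return acks >= w


if __name__ == "__main__":
    # Print both verdicts for every possible number of acknowledgements.
    for acks in range(N + 1):
        print(acks, ok_under_statement_1(acks), ok_under_strict_quorum(acks))
```

The two readings differ only when 1 or 2 of the 3 vnodes acknowledge the
write. Which of them describes Riak, with the option on or off, is exactly
what I am asking.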

Thank you in advance.

On Wed, May 24, 2017 at 11:36 AM Russell Brown <russell.brown at icloud.com>
wrote:

> Also, this issue https://github.com/basho/riak_kv/issues/1188 suggests
> that adding the property `riak_kv.retry_put_coordinator_failure=false` may
> help in future. But it won’t help with your keys that already have too
> many siblings.
>
> On 24 May 2017, at 09:22, Russell Brown <russell.brown at icloud.com> wrote:
>
> >
> > On 24 May 2017, at 09:11, Vladyslav Zakhozhai <
> v.zakhozhai at smartweb.com.ua> wrote:
> >
> >> Hello,
> >>
> >> My Riak cluster still experiences "too many siblings", and hinted
> >> handoffs are not able to finish completely. So "siblings will be
> >> resolved after hinted handoffs are finished" is unfortunately not my case.
> >>
> >> According to Basho's docs (
> >> http://docs.basho.com/riak/kv/2.2.3/learn/concepts/causal-context/#sibling-explosion)
> >> I need to enable the DVV conflict resolution mechanism. So here is a
> >> question:
> >>
> >> Is it safe to enable DVV on the default bucket type, and how does it
> >> affect existing data?
> >
> > It might not affect existing data enough. All the existing siblings are
> “undotted” and would need a read-put cycle to resolve.
> >
> >> It could be a solution, couldn't it?
> >
> > You may require further action. I remember basho support helping someone
> with a similar issue, and there was some manual intervention/scripted
> solution, but I can’t remember what it was right now. I think those objects
> (as logged) with the sibling issues need to be read and resolved. Maybe one
> of the ex-basho support people remembers? I’ll prod one in a back channel
> and see if they can help.
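
If I understand "read and resolved" correctly, it means something like the
toy model below. This is an in-memory sketch only, not a Riak client:
`ToyStore`, its methods, and the integer used as a causal context are all
hypothetical simplifications of mine.

```python
# Toy in-memory model of a read-put cycle that collapses siblings.
# This is NOT a Riak client; ToyStore and its methods are hypothetical.
from typing import Callable, Dict, List, Tuple


class ToyStore:
    def __init__(self) -> None:
        # key -> (context version, list of sibling values)
        self._data: Dict[str, Tuple[int, List[str]]] = {}

    def get(self, key: str) -> Tuple[int, List[str]]:
        """Return the causal context (here just a counter) and all siblings."""
        return self._data.get(key, (0, []))

    def put(self, key: str, value: str, context: int = -1) -> None:
        """A write carrying the current context replaces every sibling it
        has seen; a write with a stale or missing context adds a sibling."""
        version, siblings = self._data.get(key, (0, []))
        if context == version:
            self._data[key] = (version + 1, [value])
        else:
            self._data[key] = (version + 1, siblings + [value])


def read_and_resolve(store: ToyStore, key: str,
                     resolve: Callable[[List[str]], str]) -> int:
    """Read all siblings, pick one value, write it back with the context.
    Returns the number of siblings left afterwards."""
    context, siblings = store.get(key)
    if len(siblings) > 1:
        store.put(key, resolve(siblings), context)
    return len(store.get(key)[1])


if __name__ == "__main__":
    store = ToyStore()
    for i in range(5):
        store.put("manifest", f"v{i}")   # context-free writes pile up
    print(len(store.get("manifest")[1]))              # siblings before
    print(read_and_resolve(store, "manifest", max))   # siblings after
```

In this model the write that carries the context it has just read is what
removes the siblings; how the right value is chosen (here simply `max`) is
a separate decision that would depend on the data.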
> >
> >>
> >> Why do I talk about the default bucket type? Because there is only one
> >> Riak client, Riak CS, and it does not manage the bucket types of the
> >> objects it PUTs (so the default bucket type is always used for PUTs).
> >> Is that correct?
> >
> > Yes.
> >
> >>
> >> Thank you in advance.
> >>
> >> On Fri, Jun 17, 2016 at 11:45 AM Vladyslav Zakhozhai <
> v.zakhozhai at smartweb.com.ua> wrote:
> >> Hi Russell,
> >>
> >> thank you for your answer. I really appreciate your help.
> >>
> >> 2.1.3 is not actually the riak_kv version. It is the version of Basho's
> >> riak package. You can see the versions of the Riak subsystems below.
> >>
> >> Bucket properties:
> >> # riak-admin bucket-type list
> >> default (active)
> >>
> >> # riak-admin bucket-type status default
> >> default is active
> >>
> >> allow_mult: true
> >> basic_quorum: false
> >> big_vclock: 50
> >> chash_keyfun: {riak_core_util,chash_std_keyfun}
> >> dvv_enabled: false
> >> dw: quorum
> >> last_write_wins: false
> >> linkfun: {modfun,riak_kv_wm_link_walker,mapreduce_linkfun}
> >> n_val: 3
> >> notfound_ok: true
> >> old_vclock: 86400
> >> postcommit: []
> >> pr: 0
> >> precommit: []
> >> pw: 0
> >> r: quorum
> >> rw: quorum
> >> small_vclock: 50
> >> w: quorum
> >> write_once: false
> >> young_vclock: 20
> >>
> >> I did not mention that the upgrade from riak 1.5.4 took place some
> >> months ago (about 6 months). As I understand it, DVV is disabled. Is it
> >> safe to migrate from vector clocks to DVV?
> >>
> >> Package versions:
> >> # dpkg -l | grep riak
> >> ii  riak      2.1.3-1   amd64   Riak is a distributed data store
> >> ii  riak-cs   2.1.0-1   amd64   Riak CS
> >>
> >> Subsystem versions:
> >> "clique_version" : "0.3.2-0-ge332c8f",
> >> "bitcask_version" : "1.7.2",
> >> "sys_driver_version" : "2.2",
> >> "riak_core_version" : "2.1.5-0-gb02ab53",
> >> "riak_kv_version" : "2.1.2-0-gf969bba",
> >> "riak_pipe_version" : "2.1.1-0-gb1ac2cf",
> >> "cluster_info_version" : "2.0.3-0-g76c73fc",
> >> "riak_auth_mods_version" : "2.1.0-0-g31b8b30",
> >> "erlydtl_version" : "0.7.0",
> >> "os_mon_version" : "2.2.13",
> >> "inets_version" : "5.9.6",
> >> "erlang_js_version" : "1.3.0-0-g07467d8",
> >> "riak_control_version" : "2.1.2-0-gab3f924",
> >> "xmerl_version" : "1.3.4",
> >> "protobuffs_version" : "0.8.1p5-0-gf88fc3c",
> >> "riak_sysmon_version" : "2.0.0",
> >> "compiler_version" : "4.9.3",
> >> "eleveldb_version" : "2.1.10-0-g0537ca9",
> >> "lager_version" : "2.1.1",
> >> "sasl_version" : "2.3.3",
> >> "riak_dt_version" : "2.1.1-0-ga2986bc",
> >> "runtime_tools_version" : "1.8.12",
> >> "yokozuna_version" : "2.1.2-0-g3520d11",
> >> "riak_search_version" : "2.1.1-0-gffe2113",
> >> "sys_system_version" : "Erlang R16B02_basho8 (erts-5.10.3) [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true] [frame-pointer]",
> >> "basho_stats_version" : "1.0.3",
> >> "crypto_version" : "3.1",
> >> "merge_index_version" : "2.0.1-0-g0c8f77c",
> >> "kernel_version" : "2.16.3",
> >> "stdlib_version" : "1.19.3",
> >> "riak_pb_version" : "2.1.0.2-0-g620bc70",
> >> "syntax_tools_version" : "1.6.11",
> >> "goldrush_version" : "0.1.7",
> >> "ibrowse_version" : "4.0.2",
> >> "mochiweb_version" : "2.9.0",
> >> "exometer_core_version" : "1.0.0-basho2-0-gb47a5d6",
> >> "ssl_version" : "5.3.1",
> >> "public_key_version" : "0.20",
> >> "pbkdf2_version" : "2.0.0-0-g7076584",
> >> "sidejob_version" : "2.0.0-0-gc5aabba",
> >> "webmachine_version" : "1.10.8-0-g7677c24",
> >> "poolboy_version" : "0.8.1p3-0-g8bb45fb",
> >> "riak_api_version" : "2.1.2-0-gd8d510f",
> >> "asn1_version" : "2.0.3",
> >>
> >>
> >> On Fri, Jun 17, 2016 at 10:45 AM Russell Brown <russell.brown at me.com>
> wrote:
> >> What version of riak_kv is behind this riak_cs install, please? Is it
> really 2.1.3 as stated below? This looks and sounds like sibling explosion,
> which is fixed in riak 2.0 and above. Are you sure that you are using the
> DVV enabled setting for riak_cs bucket properties? Can you post your bucket
> properties please?
> >>
> >> On 16 Jun 2016, at 23:54, Vladyslav Zakhozhai <
> v.zakhozhai at smartweb.com.ua> wrote:
> >>
> >>> Hello.
> >>>
> >>> I see a very interesting and confusing thing.
> >>>
> >>> From my previous letter you can see that the sibling count on manifest
> >>> objects is about 100 (actually it is in the range 100-300).
> >>> Unfortunately my problem is that almost all PUT requests are failing
> >>> with 500 Internal Server Error.
> >>>
> >>> Today I tried setting the max_siblings Riak option to 500. There were
> >>> successful PUT requests, but not for long. Now I see "max siblings"
> >>> errors in the Riak logs again, but the actual sibling count is 500+
> >>> (earlier it was 100-300, as I've mentioned).
> >>>
> >>> The period of time between setting max_siblings=500 and the errors
> >>> appearing in the log is about 30 minutes. And I want to draw your
> >>> attention to the fact that I have forbidden the PUT method on haproxy,
> >>> the frontend for Riak CS.
> >>>
> >>>
> >>>
> >>> On Mon, Jun 6, 2016 at 1:17 AM Vladyslav Zakhozhai <
> v.zakhozhai at smartweb.com.ua> wrote:
> >>> Hi, Luke.
> >>>
> >>> Thank you for your answer. I did not completely understand your point
> >>> about transfer-limit. How does it relate to my problem? The transfer
> >>> limit is a limit on concurrent data transfers from different nodes. Am I
> >>> right? Do you mean that Riak can hand off one partition from several
> >>> nodes concurrently?
> >>>
> >>> Now I have transfer-limit 1 on all riak nodes.
> >>>
> >>> But I am not sure that my cluster will ever converge. All nodes are
> >>> low on memory and are periodically killed by the OOM killer. I am
> >>> trying to add new nodes to the cluster, but due to the problem with the
> >>> OOM killer this process is very, very slow.
> >>>
> >>> In the official docs I've read:
> >>>
> >>> "Sibling explosion occurs when an object rapidly collects siblings
> that are not reconciled. This can lead to a variety of problems, including
> degraded performance, especially if many objects in a cluster suffer from
> siblings explosion. At the extreme, having an enormous object in a node can
> cause reads of that object to crash the entire node. Other issues include
> undue latency and out-of-memory errors."
> >>>
> >>> I noticed that new nodes in the cluster do not experience such
> >>> problems (I mean running out of RAM).
> >>>
> >>> Regarding siblings, maybe you are right and this is a manifest object. I
> >>> can recognize the key name but not the bucket name. But more than 100
> >>> siblings on many keys really confuses me. Each time I try to PUT an
> >>> object to Riak via the Riak CS S3 interface I get errors about siblings.
> >>>
> >>> On Fri, Jun 3, 2016 at 6:43 PM Luke Bakken <lbakken at basho.com> wrote:
> >>> Hi Vladyslav,
> >>>
> >>> If you recognize the full name of the object raising the sibling
> >>> warning, it is most likely a manifest object. Sometimes, during hinted
> >>> handoff, you can see these messages. They should resolve after handoff
> >>> completes.
> >>>
> >>> Please see the documentation for the transfer-limit command as well:
> >>>
> >>>
> http://docs.basho.com/riak/kv/2.1.4/using/admin/riak-admin/#transfer-limit
> >>>
> >>> --
> >>> Luke Bakken
> >>> Engineer
> >>> lbakken at basho.com
> >>>
> >>>
> >>> On Fri, Jun 3, 2016 at 2:55 AM, Vladyslav Zakhozhai
> >>> <v.zakhozhai at smartweb.com.ua> wrote:
> >>>> Hi.
> >>>>
> >>>> I have trouble with PUTs to a Riak CS cluster. During this process I
> >>>> periodically see the following message in the Riak error.log:
> >>>>
> >>>> 2016-06-03 11:15:55.201 [error]
> >>>> <0.15536.142>@riak_kv_vnode:encode_and_put:2253 Put failure: too many
> >>>> siblings for object OBJECT_NAME (101)
> >>>>
> >>>> and also
> >>>>
> >>>> 2016-06-03 12:41:50.678 [error]
> >>>> <0.20448.515>@riak_api_pb_server:handle_info:331 Unrecognized message
> >>>> {7345880,{error,{too_many_siblings,101}}}
> >>>>
> >>>> Here OBJECT_NAME is the name of the object in Riak which has too many
> >>>> siblings.
> >>>>
> >>>> I am definitely sure that these objects are static. Nobody deletes
> >>>> them, nobody rewrites them. I have no idea why more than 100 siblings
> >>>> of such an object occur.
> >>>>
> >>>> This issue has the following effects:
> >>>>
> >>>> A great number of keys are loaded into RAM, and I am almost out of RAM
> >>>>   (does each sibling have its own key, or a duplicate of the key?).
> >>>> Nodes are slow, and adding new nodes is too slow.
> >>>> The presence of "too many siblings" affects ownership handoffs.
> >>>>
> >>>> So I have several questions:
> >>>>
> >>>> Can hinted or ownership handoffs affect the sibling count (I mean, can
> >>>>   siblings be created during ownership or hinted handoffs)?
> >>>> Is there any workaround for this issue? Do I need to remove siblings
> >>>>   manually, or are they removed during merges, read repairs and so on?
> >>>>
> >>>>
> >>>> My configuration:
> >>>>
> >>>> riak from basho's packages - 2.1.3-1
> >>>> riak cs from basho's packages - 2.1.0-1
> >>>> 24 riak/riak-cs nodes
> >>>> 32 GB RAM per node
> >>>> AAE is disabled
> >>>>
> >>>>
> >>>> I appreciate your help.
> >>>>
> >>>> _______________________________________________
> >>>> riak-users mailing list
> >>>> riak-users at lists.basho.com
> >>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
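
For the "read and resolve the logged objects" idea in the quoted thread, a
possible first step is to collect the affected object names from error.log.
Below is a minimal Python sketch. It assumes the log lines look like the one
I pasted in June 2016 and that object names contain no whitespace; the
function name is my own.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the vnode error quoted earlier in this thread. The exact layout
# of error.log may differ between versions; treat it as an assumption.
SIBLING_ERROR = re.compile(
    r"Put failure: too many\s+siblings for object (?P<name>\S+) \((?P<count>\d+)\)"
)


def worst_sibling_counts(lines: Iterable[str]) -> Counter:
    """Map each logged object name to the largest sibling count seen."""
    worst: Counter = Counter()
    for line in lines:
        match = SIBLING_ERROR.search(line)
        if match:
            name, count = match.group("name"), int(match.group("count"))
            worst[name] = max(worst[name], count)
    return worst


if __name__ == "__main__":
    sample = [
        "2016-06-03 11:15:55.201 [error] <0.15536.142>@riak_kv_vnode:"
        "encode_and_put:2253 Put failure: too many siblings for object "
        "OBJECT_NAME (101)",
    ]
    print(worst_sibling_counts(sample).most_common())
```

The resulting list only says which keys to look at first; reading and
resolving them is still a separate step.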