Ensembles failing to reach "Leader ready" state

Andrew Stone astone at basho.com
Mon Apr 20 18:43:26 EDT 2015


A couple of things stand out here. If a node is stuck in the leaving state,
it's likely that the system can't get quorum for the ensembles it's a part
of. Nodes that leave wait until their peer membership is transferred via
joint consensus and they are removed from the ensembles in question, so that
future operations don't stall. It's possible that the other removed nodes
never completed this membership transition, which is why the ensemble states
are stuck. I don't know why they don't show up in riak-admin member-status,
though. Unfortunately, I'm not sure I have a better suggestion for you
than to migrate your data right now. It's possible there is some trickery
we could do to fix the ensembles manually, but I don't have a specific
recipe for that.
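
To make the quorum arithmetic concrete, here is a minimal Python sketch of a
strict-majority check applied to the root ensemble peers from the
`riak-admin ensemble-status root` output quoted further down. It is a
simplified model, not riak_ensemble's actual code: the names has_quorum and
joint_quorum are hypothetical, the "new view" is an assumed target
membership, and the sketch assumes joint consensus needs a majority of both
the old and the new view.

```python
# Simplified model of majority quorum; NOT riak_ensemble's implementation.

def has_quorum(reachable, members):
    """Return True if a strict majority of `members` is in `reachable`."""
    live = len(set(members) & set(reachable))
    return live > len(members) // 2


def joint_quorum(reachable, old_members, new_members):
    """Assumed joint-consensus rule: majorities of both views are required."""
    return has_quorum(reachable, old_members) and has_quorum(reachable, new_members)


# Peers of the root ensemble as listed by `riak-admin ensemble-status root`.
root_peers = [
    "104.131.45.32",    # offline
    "104.131.130.237",  # probe
    "104.131.141.237",  # offline
    "104.131.199.79",   # offline
    "104.236.79.78",    # probe
    "162.243.5.87",     # probe
]

# Nodes that are actually up, per `riak-admin member-status`.
live_nodes = ["104.131.130.237", "104.131.39.61", "104.236.79.78", "162.243.5.87"]

# Hypothetical target view: the root peers that are still alive.
new_view = [p for p in root_peers if p in live_nodes]

print(has_quorum(live_nodes, root_peers))              # prints False (3 of 6)
print(joint_quorum(live_nodes, root_peers, new_view))  # prints False
```

Three of six peers is not a strict majority, so under this model the old view
can never approve the change that would drop the dead peers, which would be
consistent with the 0 / 6 quorum reported for root.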

Also, just to reiterate what Alexander said, we don't explicitly test
running a Riak cluster across data centers and don't support it. Riak
clustering relies on distributed Erlang, which has problems when used in a
WAN scenario. We offer multi-datacenter replication (MDC) in Riak EE to
handle cross-datacenter replication.

-Andrew

On Fri, Apr 17, 2015 at 11:40 PM, Jonathan Koff <jonathan at projexity.com>
wrote:

> Hi Alexander and Andrew,
>
> Thanks for the follow-up!
>
> Although I would expect to have used `riak-admin cluster leave`, it’s been
> months at this point and I can’t be sure. Perhaps I did something weird
> when I was getting started…
>
> Given the uncertain state of the system, it may make sense for me to
> migrate everything to a fresh cluster, unless a simple solution exists.
> It’s small enough that this would be practical, albeit inconvenient.
>
> Your timing in following up is interesting—I just today attempted to
> `riak-admin cluster leave` a node (104.131.130.237) and it’s still in state
> “leaving” with 0.0% of ring and the logs filling up with messages like:
> 2015-04-18 02:45:30.927 [warning]
> <0.9069.0>@riak_kv_ensemble_backend:handle_down:173 Vnode for Idx:
> 548063113999088594326381812268606132370974703616 crashed with reason:
> normal.
>
> Output of `riak-admin member-status`:
> ================================= Membership ==================================
> Status     Ring    Pending    Node
> -------------------------------------------------------------------------------
> leaving     0.0%      --      'riak at 104.131.130.237'
> valid      34.4%      --      'riak at 104.131.39.61'
> valid      32.8%      --      'riak at 104.236.79.78'
> valid      32.8%      --      'riak at 162.243.5.87'
> -------------------------------------------------------------------------------
> Valid:3 / Leaving:1 / Exiting:0 / Joining:0 / Down:0
>
> Output of `riak-admin ring-status`:
> ================================== Claimant ===================================
> Claimant:  'riak at 104.131.130.237'
> Status:     up
> Ring Ready: true
>
> ============================== Ownership Handoff ==============================
> No pending changes.
>
> ============================== Unreachable Nodes ==============================
> All nodes are up and reachable
>
>
>
> With regard to staging being spread out across NA, my thinking was that
> staging under extreme conditions would serve as a canary as well as help me
> familiarize myself with the performance characteristics of Riak. However it
> ended up working perfectly (including strong consistency), so I never ended
> up moving the servers to be in the same geographical area.
>
> I'd be reluctant to put everything in one LAN when the key requirement
> that led us to pick Riak was high availability, and network issues at a
> single datacenter seem to be our most frequent mode of failure. I
> benchmarked under various network configurations and all seemed to work
> flawlessly and with acceptable performance. Do you think this is reasonable?
>
>
> Thanks again!
>
> *Jonathan Koff* B.CS.
> co-founder of Projexity
> www.projexity.com
>
> follow us on facebook at: www.facebook.com/projexity
> follow us on twitter at: twitter.com/projexity
>
> On Apr 17, 2015, at 7:49 PM, Alexander Sicular <siculars at gmail.com> wrote:
>
> Hi Jonathan,
>
> "staging (3 servers across NA)"
>
> If this means you're spreading your cluster across North America I would
> suggest you reconsider. A Riak cluster is meant to be deployed in one data
> center, more specifically in one LAN. Connecting Riak nodes over a WAN
> introduces network latencies. Riak's approach to multi datacenter
> replication is as a cluster of clusters. That said, I don't believe strong
> consistency is supported yet in an mdc environment.
>
> -Alexander
>
> @siculars
> http://siculars.posthaven.com
>
> Sent from my iRotaryPhone
>
> On Apr 17, 2015, at 16:19, Andrew Stone <astone at basho.com> wrote:
>
> Hi Jonathan,
>
> Sorry for the late reply. It looks like riak_ensemble still thinks that
> those old nodes are part of the cluster. Did you remove them with
> 'riak-admin cluster leave' ? If so they should have been removed from the
> root ensemble also, and the machines shouldn't have actually left the
> cluster until all the ensembles were reconfigured via joint consensus. Can
> you paste the results from the following commands:
>
> riak-admin member-status
> riak-admin ring-status
>
> Thanks,
> Andrew
>
>
> On Mon, Mar 23, 2015 at 11:25 AM, Jonathan Koff <jonathan at projexity.com>
> wrote:
>
>> Hi all,
>>
>> I recently used Riak’s Strong Consistency functionality to get
>> auto-incrementing IDs for a feature of an application I’m working on, and
>> although this worked great in dev (5 nodes in 1 VM) and staging (3 servers
>> across NA) environments, I’ve run into some odd behaviour in production
>> (originally 3 servers, now 4) that prevents it from working.
>>
>> I initially noticed that consistent requests were immediately failing as
>> timeouts, and upon checking `riak-admin ensemble-status` saw that many
>> ensembles were at 0 / 3, from the vantage point of the box I was SSH’d
>> into. Interestingly, SSH-ing into different boxes showed different results.
>> Here’s a brief snippet of what I see now, after adding a fourth server in a
>> troubleshooting attempt:
>>
>> *Machine 1* (104.131.39.61)
>>
>> ============================== Consensus System ===============================
>> Enabled:     true
>> Active:      true
>> Ring Ready:  true
>> Validation:  strong (trusted majority required)
>> Metadata:    best-effort replication (asynchronous)
>>
>> ================================== Ensembles ==================================
>>  Ensemble     Quorum        Nodes      Leader
>> -------------------------------------------------------------------------------
>>    root       0 / 6         3 / 6      --
>>     2         0 / 3         3 / 3      --
>>     3         3 / 3         3 / 3      riak at 104.131.130.237
>>     4         3 / 3         3 / 3      riak at 104.131.130.237
>>     5         3 / 3         3 / 3      riak at 104.131.130.237
>>     6         0 / 3         3 / 3      --
>>     7         0 / 3         3 / 3      --
>>     8         0 / 3         3 / 3      --
>>     9         3 / 3         3 / 3      riak at 104.131.130.237
>>     10        3 / 3         3 / 3      riak at 104.131.130.237
>>     11        0 / 3         3 / 3      --
>>
>> *Machine 2* (104.236.79.78)
>>
>> ============================== Consensus System ===============================
>> Enabled:     true
>> Active:      true
>> Ring Ready:  true
>> Validation:  strong (trusted majority required)
>> Metadata:    best-effort replication (asynchronous)
>>
>> ================================== Ensembles ==================================
>>  Ensemble     Quorum        Nodes      Leader
>> -------------------------------------------------------------------------------
>>    root       0 / 6         3 / 6      --
>>     2         3 / 3         3 / 3      riak at 104.236.79.78
>>     3         3 / 3         3 / 3      riak at 104.131.130.237
>>     4         3 / 3         3 / 3      riak at 104.131.130.237
>>     5         3 / 3         3 / 3      riak at 104.131.130.237
>>     6         3 / 3         3 / 3      riak at 104.236.79.78
>>     7         0 / 3         3 / 3      --
>>     8         0 / 3         3 / 3      --
>>     9         3 / 3         3 / 3      riak at 104.131.130.237
>>     10        3 / 3         3 / 3      riak at 104.131.130.237
>>     11        3 / 3         3 / 3      riak at 104.236.79.78
>>
>> *Machine 3* (104.131.130.237)
>>
>> ============================== Consensus System ===============================
>> Enabled:     true
>> Active:      true
>> Ring Ready:  true
>> Validation:  strong (trusted majority required)
>> Metadata:    best-effort replication (asynchronous)
>>
>> ================================== Ensembles ==================================
>>  Ensemble     Quorum        Nodes      Leader
>> -------------------------------------------------------------------------------
>>    root       0 / 6         3 / 6      --
>>     2         0 / 3         3 / 3      --
>>     3         3 / 3         3 / 3      riak at 104.131.130.237
>>     4         3 / 3         3 / 3      riak at 104.131.130.237
>>     5         3 / 3         3 / 3      riak at 104.131.130.237
>>     6         0 / 3         3 / 3      --
>>     7         0 / 3         3 / 3      --
>>     8         0 / 3         3 / 3      --
>>     9         3 / 3         3 / 3      riak at 104.131.130.237
>>     10        3 / 3         3 / 3      riak at 104.131.130.237
>>     11        0 / 3         3 / 3      --
>>
>> *Machine 4* (162.243.5.87)
>>
>> ============================== Consensus System ===============================
>> Enabled:     true
>> Active:      true
>> Ring Ready:  true
>> Validation:  strong (trusted majority required)
>> Metadata:    best-effort replication (asynchronous)
>>
>> ================================== Ensembles ==================================
>>  Ensemble     Quorum        Nodes      Leader
>> -------------------------------------------------------------------------------
>>    root       0 / 6         3 / 6      --
>>     2         3 / 3         3 / 3      riak at 104.236.79.78
>>     3         3 / 3         3 / 3      riak at 104.131.130.237
>>     4         3 / 3         3 / 3      riak at 104.131.130.237
>>     5         3 / 3         3 / 3      riak at 104.131.130.237
>>     6         3 / 3         3 / 3      riak at 104.236.79.78
>>     7         3 / 3         3 / 3      riak at 162.243.5.87
>>     8         3 / 3         3 / 3      riak at 162.243.5.87
>>     9         3 / 3         3 / 3      riak at 104.131.130.237
>>     10        3 / 3         3 / 3      riak at 104.131.130.237
>>     11        3 / 3         3 / 3      riak at 104.236.79.78
>>
>>
>> Interestingly, Machine 4 has full quora for all ensembles except for
>> root, while Machine 3 only sees itself as a leader.
>>
>> Another interesting point is the output of `riak-admin ensemble-status
>> root`:
>>
>> ================================= Ensemble #1 =================================
>> Id:           root
>> Leader:       --
>> Leader ready: false
>>
>> ==================================== Peers ====================================
>>  Peer  Status     Trusted          Epoch         Node
>> -------------------------------------------------------------------------------
>>   1    (offline)    --              --           riak at 104.131.45.32
>>   2      probe      no              8            riak at 104.131.130.237
>>   3    (offline)    --              --           riak at 104.131.141.237
>>   4    (offline)    --              --           riak at 104.131.199.79
>>   5      probe      no              8            riak at 104.236.79.78
>>   6      probe      no              8            riak at 162.243.5.87
>>
>> This is consistent across all 4 machines, and seems to include some old
>> IPs from machines that left the cluster quite a while back, almost
>> definitely before I’d used Riak's Strong Consistency. Note that the reason
>> I added the fourth machine (104.131.39.61) was to see if this output would
>> change, perhaps resulting in a quorum for the root ensemble.
>>
>> For reference, here’s the status of a sample ensemble that isn’t “Leader
>> ready”, from the perspective of Machine 2:
>> ================================ Ensemble #62 =================================
>> Id:           {kv,1370157784997721485815954530671515330927436759040,3}
>> Leader:       --
>> Leader ready: false
>>
>> ==================================== Peers ====================================
>>  Peer  Status     Trusted          Epoch         Node
>> -------------------------------------------------------------------------------
>>   1    following    yes             43           riak at 104.131.130.237
>>   2    following    yes             43           riak at 104.236.79.78
>>   3     leading     yes             43           riak at 162.243.5.87
>>
>>
>> My config consists of riak.conf with:
>>
>> strong_consistency = on
>>
>> and advanced.config with:
>>
>> [
>>   {riak_core,
>>     [
>>       {target_n_val, 5}
>>       ]},
>>   {riak_ensemble,
>>     [
>>       {ensemble_tick, 5000}
>>     ]}
>> ].
>>
>> though I’ve experimented with the latter in an attempt to get this
>> resolved.
>>
>> I didn’t see any relevant-looking log output on any of the servers.
>>
>> Has anyone come across this before?
>>
>> Thanks!
>>
>> *Jonathan Koff* B.CS.
>> co-founder of Projexity
>> www.projexity.com
>>
>> follow us on facebook at: www.facebook.com/projexity
>> follow us on twitter at: twitter.com/projexity
>>
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>