Ensembles failing to reach "Leader ready" state

Jonathan Koff jonathan at projexity.com
Tue Apr 21 04:34:04 EDT 2015


Ok, thanks Andrew! I’ll go ahead and migrate the data to a fresh cluster.

Jonathan Koff B.CS.
co-founder of Projexity
www.projexity.com

follow us on facebook at: www.facebook.com/projexity
follow us on twitter at: twitter.com/projexity
> On Apr 20, 2015, at 6:43 PM, Andrew Stone <astone at basho.com> wrote:
> 
> A couple of things stand out here. If a node is left in the leaving state, it's likely that the system can't get quorum for the ensembles it's a part of. Nodes that leave wait until their peer membership is transferred via joint consensus and they are removed from the ensembles in question, so that future operations don't stall. It's possible that the other removed nodes never completed this membership transition, which is why the ensemble states are stuck. I don't know why they don't show up in riak-admin member-status as well, though. Unfortunately, I'm not sure I have a better suggestion for you than to migrate your data right now. It's possible there is some trickery we could do to fix the ensembles manually, but I don't have a specific recipe for that.
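
To make the quorum point concrete, here is a minimal Python sketch. It assumes that an ensemble needs a strict majority of its peers (the ensemble-status output further down this thread reports "trusted majority required") and that a joint-consensus change needs a majority in both the old and the new peer view. The helper names are hypothetical and this is not riak_ensemble's code; the addresses are the root ensemble peers quoted later in this thread.

```python
def majority(n: int) -> int:
    """Smallest number of peers that forms a strict majority of n."""
    return n // 2 + 1


def has_quorum(view: set[str], reachable: set[str]) -> bool:
    """True if the reachable peers form a strict majority of one view."""
    return len(view & reachable) >= majority(len(view))


def joint_quorum(views: list[set[str]], reachable: set[str]) -> bool:
    """Joint consensus (assumed): a majority is needed in every view."""
    return all(has_quorum(view, reachable) for view in views)


# Root ensemble peers as reported by `riak-admin ensemble-status root`.
root_view = {
    "104.131.45.32", "104.131.130.237", "104.131.141.237",
    "104.131.199.79", "104.236.79.78", "162.243.5.87",
}
# Peers that are not marked (offline) in that output.
live = {"104.131.130.237", "104.236.79.78", "162.243.5.87"}

print(majority(len(root_view)))                      # 4
print(has_quorum(root_view, live))                   # False
print(joint_quorum([root_view, set(live)], live))    # False
```

Under these assumptions, three live peers out of six can never reach the four votes needed, so the membership change that would drop the old peers cannot commit either.
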
> 
> Also, just to reiterate what Alexander said, we don't explicitly test running a Riak cluster across data centers and don't support it. Riak clustering relies on distributed Erlang, which has problems when used in a WAN scenario. We offer multi-datacenter replication (MDC) in Riak EE to deal with replication across data centers.
> 
> -Andrew
> 
> On Fri, Apr 17, 2015 at 11:40 PM, Jonathan Koff <jonathan at projexity.com> wrote:
> Hi Alexander and Andrew,
> 
> Thanks for the follow-up!
> 
> Although I would expect to have used `riak-admin cluster leave`, it’s been months at this point and I can’t be sure. Perhaps I did something weird when I was getting started…
> 
> Given the uncertain state of the system, it may make sense for me to migrate everything to a fresh cluster, unless a simple solution exists. It’s small enough that this would be practical, albeit inconvenient.
> 
> Your timing in following up is interesting: just today I attempted to `riak-admin cluster leave` a node (104.131.130.237), and it’s still in state “leaving” with 0.0% of the ring, and the logs are filling up with messages like:
> 2015-04-18 02:45:30.927 [warning] <0.9069.0>@riak_kv_ensemble_backend:handle_down:173 Vnode for Idx: 548063113999088594326381812268606132370974703616 crashed with reason: normal.
> 
> Output of `riak-admin member-status`:
> ================================= Membership ==================================
> Status     Ring    Pending    Node
> -------------------------------------------------------------------------------
> leaving     0.0%      --      'riak@104.131.130.237'
> valid      34.4%      --      'riak@104.131.39.61'
> valid      32.8%      --      'riak@104.236.79.78'
> valid      32.8%      --      'riak@162.243.5.87'
> -------------------------------------------------------------------------------
> Valid:3 / Leaving:1 / Exiting:0 / Joining:0 / Down:0
> 
> Output of `riak-admin ring-status`:
> ================================== Claimant ===================================
> Claimant:  'riak@104.131.130.237'
> Status:     up
> Ring Ready: true
> 
> ============================== Ownership Handoff ==============================
> No pending changes.
> 
> ============================== Unreachable Nodes ==============================
> All nodes are up and reachable
> 
> 
> 
> With regard to staging being spread out across NA, my thinking was that staging under extreme conditions would serve as a canary as well as help me familiarize myself with the performance characteristics of Riak. However, it ended up working perfectly (including strong consistency), so I never moved the servers into the same geographical area.
> 
> I'd be reluctant to put everything in one LAN when the key requirement that led us to pick Riak was high availability, and network issues at a single datacenter seem to be our most frequent mode of failure. I benchmarked under various network configurations and all seemed to work flawlessly and with acceptable performance. Do you think this is reasonable?
> 
> 
> Thanks again!
> 
>> On Apr 17, 2015, at 7:49 PM, Alexander Sicular <siculars at gmail.com> wrote:
>> 
>> Hi Jonathan,
>> 
>> "staging (3 servers across NA)"
>> 
>> If this means you're spreading your cluster across North America, I would suggest you reconsider. A Riak cluster is meant to be deployed in one data center, more specifically in one LAN. Connecting Riak nodes over a WAN introduces network latencies. Riak's approach to multi-datacenter replication is as a cluster of clusters. That said, I don't believe strong consistency is supported yet in an MDC environment.
>> 
>> -Alexander 
>> 
>> @siculars
>> http://siculars.posthaven.com
>> 
>> Sent from my iRotaryPhone
>> 
>> On Apr 17, 2015, at 16:19, Andrew Stone <astone at basho.com> wrote:
>> 
>>> Hi Jonathan,
>>>  
>>> Sorry for the late reply. It looks like riak_ensemble still thinks that those old nodes are part of the cluster. Did you remove them with 'riak-admin cluster leave'? If so, they should have been removed from the root ensemble as well, and the machines shouldn't have actually left the cluster until all the ensembles were reconfigured via joint consensus. Can you paste the results from the following commands:
>>> 
>>> riak-admin member-status
>>> riak-admin ring-status
>>> 
>>> Thanks,
>>> Andrew
>>> 
>>> 
>>> On Mon, Mar 23, 2015 at 11:25 AM, Jonathan Koff <jonathan at projexity.com> wrote:
>>> Hi all,
>>> 
>>> I recently used Riak’s Strong Consistency functionality to get auto-incrementing IDs for a feature of an application I’m working on, and although this worked great in dev (5 nodes in 1 VM) and staging (3 servers across NA) environments, I’ve run into some odd behaviour in production (originally 3 servers, now 4) that prevents it from working.
>>> 
>>> I initially noticed that consistent requests were immediately failing as timeouts, and upon checking `riak-admin ensemble-status` saw that many ensembles were at 0 / 3, from the vantage point of the box I was SSH’d into. Interestingly, SSH-ing into different boxes showed different results. Here’s a brief snippet of what I see now, after adding a fourth server in a troubleshooting attempt:
>>> 
>>> *Machine 1* (104.131.39.61)
>>> 
>>> ============================== Consensus System ===============================
>>> Enabled:     true
>>> Active:      true
>>> Ring Ready:  true
>>> Validation:  strong (trusted majority required)
>>> Metadata:    best-effort replication (asynchronous)
>>> 
>>> ================================== Ensembles ==================================
>>>  Ensemble     Quorum        Nodes      Leader
>>> -------------------------------------------------------------------------------
>>>    root       0 / 6         3 / 6      --
>>>     2         0 / 3         3 / 3      --
>>>     3         3 / 3         3 / 3      riak@104.131.130.237
>>>     4         3 / 3         3 / 3      riak@104.131.130.237
>>>     5         3 / 3         3 / 3      riak@104.131.130.237
>>>     6         0 / 3         3 / 3      --
>>>     7         0 / 3         3 / 3      --
>>>     8         0 / 3         3 / 3      --
>>>     9         3 / 3         3 / 3      riak@104.131.130.237
>>>     10        3 / 3         3 / 3      riak@104.131.130.237
>>>     11        0 / 3         3 / 3      --
>>> 
>>> *Machine 2* (104.236.79.78)
>>> 
>>> ============================== Consensus System ===============================
>>> Enabled:     true
>>> Active:      true
>>> Ring Ready:  true
>>> Validation:  strong (trusted majority required)
>>> Metadata:    best-effort replication (asynchronous)
>>> 
>>> ================================== Ensembles ==================================
>>>  Ensemble     Quorum        Nodes      Leader
>>> -------------------------------------------------------------------------------
>>>    root       0 / 6         3 / 6      --
>>>     2         3 / 3         3 / 3      riak@104.236.79.78
>>>     3         3 / 3         3 / 3      riak@104.131.130.237
>>>     4         3 / 3         3 / 3      riak@104.131.130.237
>>>     5         3 / 3         3 / 3      riak@104.131.130.237
>>>     6         3 / 3         3 / 3      riak@104.236.79.78
>>>     7         0 / 3         3 / 3      --
>>>     8         0 / 3         3 / 3      --
>>>     9         3 / 3         3 / 3      riak@104.131.130.237
>>>     10        3 / 3         3 / 3      riak@104.131.130.237
>>>     11        3 / 3         3 / 3      riak@104.236.79.78
>>> 
>>> *Machine 3* (104.131.130.237)
>>> 
>>> ============================== Consensus System ===============================
>>> Enabled:     true
>>> Active:      true
>>> Ring Ready:  true
>>> Validation:  strong (trusted majority required)
>>> Metadata:    best-effort replication (asynchronous)
>>> 
>>> ================================== Ensembles ==================================
>>>  Ensemble     Quorum        Nodes      Leader
>>> -------------------------------------------------------------------------------
>>>    root       0 / 6         3 / 6      --
>>>     2         0 / 3         3 / 3      --
>>>     3         3 / 3         3 / 3      riak@104.131.130.237
>>>     4         3 / 3         3 / 3      riak@104.131.130.237
>>>     5         3 / 3         3 / 3      riak@104.131.130.237
>>>     6         0 / 3         3 / 3      --
>>>     7         0 / 3         3 / 3      --
>>>     8         0 / 3         3 / 3      --
>>>     9         3 / 3         3 / 3      riak@104.131.130.237
>>>     10        3 / 3         3 / 3      riak@104.131.130.237
>>>     11        0 / 3         3 / 3      --
>>> 
>>> *Machine 4* (162.243.5.87)
>>> 
>>> ============================== Consensus System ===============================
>>> Enabled:     true
>>> Active:      true
>>> Ring Ready:  true
>>> Validation:  strong (trusted majority required)
>>> Metadata:    best-effort replication (asynchronous)
>>> 
>>> ================================== Ensembles ==================================
>>>  Ensemble     Quorum        Nodes      Leader
>>> -------------------------------------------------------------------------------
>>>    root       0 / 6         3 / 6      --
>>>     2         3 / 3         3 / 3      riak@104.236.79.78
>>>     3         3 / 3         3 / 3      riak@104.131.130.237
>>>     4         3 / 3         3 / 3      riak@104.131.130.237
>>>     5         3 / 3         3 / 3      riak@104.131.130.237
>>>     6         3 / 3         3 / 3      riak@104.236.79.78
>>>     7         3 / 3         3 / 3      riak@162.243.5.87
>>>     8         3 / 3         3 / 3      riak@162.243.5.87
>>>     9         3 / 3         3 / 3      riak@104.131.130.237
>>>     10        3 / 3         3 / 3      riak@104.131.130.237
>>>     11        3 / 3         3 / 3      riak@104.236.79.78
>>> 
>>> 
>>> Interestingly, Machine 4 has full quora for all ensembles except for root, while Machine 3 only sees itself as a leader.
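
Comparing these tables by eye is error-prone, so here is a rough Python sketch of how the per-machine views could be diffed. It assumes the column layout shown above (Ensemble, Quorum, Nodes, Leader), with `--` meaning no leader and node names written in `riak@host` form. The regular expression and function name are hypothetical, and the sample rows are a short excerpt from Machine 1 and Machine 4.

```python
import re

# One summary row: ensemble id, "have / total" quorum, "have / total" nodes, leader.
ROW = re.compile(
    r"^\s*(?P<ensemble>\S+)\s+(?P<q_have>\d+)\s*/\s*(?P<q_total>\d+)"
    r"\s+(?P<n_have>\d+)\s*/\s*(?P<n_total>\d+)\s+(?P<leader>.+?)\s*$"
)


def leaderless(status_text: str) -> list[str]:
    """Return the ids of ensembles whose Leader column is '--'."""
    ids = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match and match.group("leader") == "--":
            ids.append(match.group("ensemble"))
    return ids


machine_1 = """
 Ensemble     Quorum        Nodes      Leader
   root       0 / 6         3 / 6      --
    2         0 / 3         3 / 3      --
    3         3 / 3         3 / 3      riak@104.131.130.237
    7         0 / 3         3 / 3      --
"""
machine_4 = """
 Ensemble     Quorum        Nodes      Leader
   root       0 / 6         3 / 6      --
    2         3 / 3         3 / 3      riak@104.236.79.78
    3         3 / 3         3 / 3      riak@104.131.130.237
    7         3 / 3         3 / 3      riak@162.243.5.87
"""

view_1 = set(leaderless(machine_1))
view_4 = set(leaderless(machine_4))
print(sorted(view_1 - view_4))  # ensembles only Machine 1 sees as leaderless
```

For this excerpt the difference is ensembles 2 and 7, while root is leaderless from both vantage points.
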
>>> 
>>> Another interesting point is the output of `riak-admin ensemble-status root`:
>>> 
>>> ================================= Ensemble #1 =================================
>>> Id:           root
>>> Leader:       --
>>> Leader ready: false
>>> 
>>> ==================================== Peers ====================================
>>>  Peer  Status     Trusted          Epoch         Node
>>> -------------------------------------------------------------------------------
>>>   1    (offline)    --              --           riak@104.131.45.32
>>>   2      probe      no              8            riak@104.131.130.237
>>>   3    (offline)    --              --           riak@104.131.141.237
>>>   4    (offline)    --              --           riak@104.131.199.79
>>>   5      probe      no              8            riak@104.236.79.78
>>>   6      probe      no              8            riak@162.243.5.87
>>> 
>>> This is consistent across all 4 machines, and seems to include some old IPs from machines that left the cluster quite a while back, almost definitely before I’d used Riak's Strong Consistency. Note that the reason I added the fourth machine (104.131.39.61) was to see if this output would change, perhaps resulting in a quorum for the root ensemble.
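
The mismatch between the root ensemble's peer list and the machines actually in the cluster can be spelled out with two set differences. This is only an illustrative Python sketch using the addresses quoted in this message; the variable names are hypothetical.

```python
# Peers listed by `riak-admin ensemble-status root`.
root_peers = {
    "104.131.45.32", "104.131.130.237", "104.131.141.237",
    "104.131.199.79", "104.236.79.78", "162.243.5.87",
}
# The four machines currently in the cluster (Machine 1 to Machine 4).
cluster_members = {
    "104.131.39.61", "104.236.79.78", "104.131.130.237", "162.243.5.87",
}

# Peers the root ensemble still counts that are no longer cluster members.
stale_peers = root_peers - cluster_members
# Cluster members that the root ensemble does not list as peers.
unlisted_members = cluster_members - root_peers

print(sorted(stale_peers))
print(sorted(unlisted_members))
```

The first set contains the three old machines, and the second contains only 104.131.39.61, the newly added node, which matches the observation that adding it did not change the root ensemble's output.
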
>>> 
>>> For reference, here’s the status of a sample ensemble that isn’t “Leader ready”, from the perspective of Machine 2:
>>> ================================ Ensemble #62 =================================
>>> Id:           {kv,1370157784997721485815954530671515330927436759040,3}
>>> Leader:       --
>>> Leader ready: false
>>> 
>>> ==================================== Peers ====================================
>>>  Peer  Status     Trusted          Epoch         Node
>>> -------------------------------------------------------------------------------
>>>   1    following    yes             43           riak@104.131.130.237
>>>   2    following    yes             43           riak@104.236.79.78
>>>   3     leading     yes             43           riak@162.243.5.87
>>> 
>>> 
>>> My config consists of riak.conf with:
>>> 
>>> strong_consistency = on
>>> 
>>> and advanced.config with:
>>> 
>>> [
>>>   {riak_core,
>>>     [
>>>       {target_n_val, 5}
>>>     ]},
>>>   {riak_ensemble,
>>>     [
>>>       {ensemble_tick, 5000}
>>>     ]}
>>> ].
>>> 
>>> though I’ve experimented with the latter in an attempt to get this resolved.
>>> 
>>> I didn’t see any relevant-looking log output on any of the servers.
>>> 
>>> Has anyone come across this before?
>>> 
>>> Thanks!
>>> 
>>> 
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users at lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>> 
> 
> 
