Ensembles failing to reach "Leader ready" state

Andrew Stone astone at basho.com
Fri Apr 17 16:19:41 EDT 2015


Hi Jonathan,

Sorry for the late reply. It looks like riak_ensemble still thinks that
those old nodes are part of the cluster. Did you remove them with
'riak-admin cluster leave'? If so, they should also have been removed from
the root ensemble, and the machines shouldn't have actually left the
cluster until all the ensembles had been reconfigured via joint consensus.
Can you paste the output of the following commands:

riak-admin member-status
riak-admin ring-status
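
In the meantime, here is a rough sketch of why the root ensemble in your
output can't elect a leader. It assumes a leader needs a strict majority of
the peers listed in the ensemble, which is how I read the "trusted majority
required" line in your status output. The peer rows are copied from your
`riak-admin ensemble-status root` paste as they appear below; the function
names are just illustrative, not part of Riak.

```python
import re

# Peer rows copied from the quoted `riak-admin ensemble-status root` output.
ROOT_PEERS = """\
  1    (offline)    --              --           riak at 104.131.45.32
  2      probe      no              8            riak at 104.131.130.237
  3    (offline)    --              --           riak at 104.131.141.237
  4    (offline)    --              --           riak at 104.131.199.79
  5      probe      no              8            riak at 104.236.79.78
  6      probe      no              8            riak at 162.243.5.87
"""

# Columns: peer id, status, trusted, epoch, node name.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s*$")


def parse_peers(text):
    """Turn the peer rows of an ensemble-status table into dicts."""
    peers = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            peer_id, status, trusted, epoch, node = match.groups()
            peers.append(
                {"id": int(peer_id), "status": status,
                 "trusted": trusted, "epoch": epoch, "node": node}
            )
    return peers


def majority(peer_count):
    """Smallest number of peers that forms a strict majority (assumption)."""
    return peer_count // 2 + 1


def quorum_report(peers):
    """Compare the peers that are not offline with the majority needed."""
    online = [p for p in peers if p["status"] != "(offline)"]
    needed = majority(len(peers))
    return {
        "peers": len(peers),
        "online": len(online),
        "needed": needed,
        "reachable": len(online) >= needed,
        "offline_nodes": [p["node"] for p in peers
                          if p["status"] == "(offline)"],
    }


if __name__ == "__main__":
    report = quorum_report(parse_peers(ROOT_PEERS))
    print(report)
```

With six peers listed and three of them offline, only 3 of the 4 peers
needed for a majority are available, so adding one more live node would not
be enough while the three old nodes are still members of the root ensemble.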

Thanks,
Andrew


On Mon, Mar 23, 2015 at 11:25 AM, Jonathan Koff <jonathan at projexity.com>
wrote:

> Hi all,
>
> I recently used Riak’s Strong Consistency functionality to get
> auto-incrementing IDs for a feature of an application I’m working on, and
> although this worked great in dev (5 nodes in 1 VM) and staging (3 servers
> across NA) environments, I’ve run into some odd behaviour in production
> (originally 3 servers, now 4) that prevents it from working.
>
> I initially noticed that consistent requests were immediately failing as
> timeouts, and upon checking `riak-admin ensemble-status` saw that many
> ensembles were at 0 / 3, from the vantage point of the box I was SSH’d
> into. Interestingly, SSH-ing into different boxes showed different results.
> Here’s a brief snippet of what I see now, after adding a fourth server in a
> troubleshooting attempt:
>
> *Machine 1* (104.131.39.61)
>
> ============================== Consensus System ===============================
> Enabled:     true
> Active:      true
> Ring Ready:  true
> Validation:  strong (trusted majority required)
> Metadata:    best-effort replication (asynchronous)
>
> ================================== Ensembles ==================================
>  Ensemble     Quorum        Nodes      Leader
> -------------------------------------------------------------------------------
>    root       0 / 6         3 / 6      --
>     2         0 / 3         3 / 3      --
>     3         3 / 3         3 / 3      riak at 104.131.130.237
>     4         3 / 3         3 / 3      riak at 104.131.130.237
>     5         3 / 3         3 / 3      riak at 104.131.130.237
>     6         0 / 3         3 / 3      --
>     7         0 / 3         3 / 3      --
>     8         0 / 3         3 / 3      --
>     9         3 / 3         3 / 3      riak at 104.131.130.237
>     10        3 / 3         3 / 3      riak at 104.131.130.237
>     11        0 / 3         3 / 3      --
>
> *Machine 2* (104.236.79.78)
>
> ============================== Consensus System ===============================
> Enabled:     true
> Active:      true
> Ring Ready:  true
> Validation:  strong (trusted majority required)
> Metadata:    best-effort replication (asynchronous)
>
> ================================== Ensembles ==================================
>  Ensemble     Quorum        Nodes      Leader
> -------------------------------------------------------------------------------
>    root       0 / 6         3 / 6      --
>     2         3 / 3         3 / 3      riak at 104.236.79.78
>     3         3 / 3         3 / 3      riak at 104.131.130.237
>     4         3 / 3         3 / 3      riak at 104.131.130.237
>     5         3 / 3         3 / 3      riak at 104.131.130.237
>     6         3 / 3         3 / 3      riak at 104.236.79.78
>     7         0 / 3         3 / 3      --
>     8         0 / 3         3 / 3      --
>     9         3 / 3         3 / 3      riak at 104.131.130.237
>     10        3 / 3         3 / 3      riak at 104.131.130.237
>     11        3 / 3         3 / 3      riak at 104.236.79.78
>
> *Machine 3* (104.131.130.237)
>
> ============================== Consensus System ===============================
> Enabled:     true
> Active:      true
> Ring Ready:  true
> Validation:  strong (trusted majority required)
> Metadata:    best-effort replication (asynchronous)
>
> ================================== Ensembles ==================================
>  Ensemble     Quorum        Nodes      Leader
> -------------------------------------------------------------------------------
>    root       0 / 6         3 / 6      --
>     2         0 / 3         3 / 3      --
>     3         3 / 3         3 / 3      riak at 104.131.130.237
>     4         3 / 3         3 / 3      riak at 104.131.130.237
>     5         3 / 3         3 / 3      riak at 104.131.130.237
>     6         0 / 3         3 / 3      --
>     7         0 / 3         3 / 3      --
>     8         0 / 3         3 / 3      --
>     9         3 / 3         3 / 3      riak at 104.131.130.237
>     10        3 / 3         3 / 3      riak at 104.131.130.237
>     11        0 / 3         3 / 3      --
>
> *Machine 4* (162.243.5.87)
>
> ============================== Consensus System ===============================
> Enabled:     true
> Active:      true
> Ring Ready:  true
> Validation:  strong (trusted majority required)
> Metadata:    best-effort replication (asynchronous)
>
> ================================== Ensembles ==================================
>  Ensemble     Quorum        Nodes      Leader
> -------------------------------------------------------------------------------
>    root       0 / 6         3 / 6      --
>     2         3 / 3         3 / 3      riak at 104.236.79.78
>     3         3 / 3         3 / 3      riak at 104.131.130.237
>     4         3 / 3         3 / 3      riak at 104.131.130.237
>     5         3 / 3         3 / 3      riak at 104.131.130.237
>     6         3 / 3         3 / 3      riak at 104.236.79.78
>     7         3 / 3         3 / 3      riak at 162.243.5.87
>     8         3 / 3         3 / 3      riak at 162.243.5.87
>     9         3 / 3         3 / 3      riak at 104.131.130.237
>     10        3 / 3         3 / 3      riak at 104.131.130.237
>     11        3 / 3         3 / 3      riak at 104.236.79.78
>
>
> Interestingly, Machine 4 has full quora for all ensembles except for root,
> while Machine 3 only sees itself as a leader.
>
> Another interesting point is the output of `riak-admin ensemble-status
> root`:
>
> ================================= Ensemble #1 =================================
> Id:           root
> Leader:       --
> Leader ready: false
>
> ==================================== Peers ====================================
>  Peer  Status     Trusted          Epoch         Node
> -------------------------------------------------------------------------------
>   1    (offline)    --              --           riak at 104.131.45.32
>   2      probe      no              8            riak at 104.131.130.237
>   3    (offline)    --              --           riak at 104.131.141.237
>   4    (offline)    --              --           riak at 104.131.199.79
>   5      probe      no              8            riak at 104.236.79.78
>   6      probe      no              8            riak at 162.243.5.87
>
> This is consistent across all 4 machines, and seems to include some old
> IPs from machines that left the cluster quite a while back, almost
> definitely before I’d used Riak's Strong Consistency. Note that the reason
> I added the fourth machine (104.131.39.61) was to see if this output would
> change, perhaps resulting in a quorum for the root ensemble.
>
> For reference, here’s the status of a sample ensemble that isn’t “Leader
> ready”, from the perspective of Machine 2:
> ================================ Ensemble #62 =================================
> Id:           {kv,1370157784997721485815954530671515330927436759040,3}
> Leader:       --
> Leader ready: false
>
> ==================================== Peers ====================================
>  Peer  Status     Trusted          Epoch         Node
> -------------------------------------------------------------------------------
>   1    following    yes             43           riak at 104.131.130.237
>   2    following    yes             43           riak at 104.236.79.78
>   3     leading     yes             43           riak at 162.243.5.87
>
>
> My config consists of riak.conf with:
>
> strong_consistency = on
>
> and advanced.config with:
>
> [
>   {riak_core,
>     [
>       {target_n_val, 5}
>     ]},
>   {riak_ensemble,
>     [
>       {ensemble_tick, 5000}
>     ]}
> ].
>
> though I’ve experimented with the latter in an attempt to get this
> resolved.
>
> I didn’t see any relevant-looking log output on any of the servers.
>
> Has anyone come across this before?
>
> Thanks!
>
> *Jonathan Koff* B.CS.
> co-founder of Projexity
> www.projexity.com
>
> follow us on facebook at: www.facebook.com/projexity
> follow us on twitter at: twitter.com/projexity