cluster health check using riak-java-client

David Byron dbyron at dbyron.com
Sun Dec 13 20:32:05 EST 2015


Thanks for this.  I think in the end I'm going to assume there's 
sufficient traffic that the node state that riak-java-client keeps track 
of is up to date enough.

Of course I have yet another question.  Even if I assume the state of 
each node is correct, how do I know if the cluster overall is considered 
healthy?  This may not be a valid question, but I hope it is.  For 
example, if the cluster configuration requires 3 nodes to write, I can 
write some fairly detailed code against riak-java-client to detect that 
configuration and count whether there are enough healthy nodes. 
However, if I'm using something like haproxy, I'm not sure there's a 
great spot to put that logic.
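
For the direct-client case, the counting idea above might look roughly like the sketch below. It is only an illustration: NodeState is a hypothetical stand-in for the client's node state enum so the example compiles on its own, hasEnoughRunningNodes and the threshold of 3 are made-up names and values, and the result only reflects the client's view of each node, not Riak's own view of where replicas live.

```java
import java.util.Arrays;
import java.util.List;

public class QuorumCheck {
    // Hypothetical stand-in for the client's node state enum, so this sketch
    // compiles without riak-java-client. Only RUNNING and HEALTH_CHECKING
    // are mentioned in this thread; OTHER is a placeholder.
    public enum NodeState { RUNNING, HEALTH_CHECKING, OTHER }

    /** Returns true when at least `required` nodes currently report RUNNING. */
    public static boolean hasEnoughRunningNodes(List<NodeState> states, int required) {
        long running = states.stream()
                .filter(s -> s == NodeState.RUNNING)
                .count();
        return running >= required;
    }

    public static void main(String[] args) {
        List<NodeState> states = Arrays.asList(
                NodeState.RUNNING, NodeState.RUNNING, NodeState.HEALTH_CHECKING);
        // Assumed requirement for illustration: 3 nodes needed to write.
        System.out.println(hasEnoughRunningNodes(states, 3)); // false
        System.out.println(hasEnoughRunningNodes(states, 2)); // true
    }
}
```

In real code the list would presumably be built from cluster.getNodes() and node.getNodeState(), as in the loop quoted further down.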

Is there a way to query the cluster overall to ask a health question 
like this?

-DB

On 12/8/15 1:09 PM, Alexander Sicular wrote:
> Besides just plainly writing a key, you could also do something like (pseudo code):
>
> Riak.put(canaryKey, pw=n_val){
>    If ok -> cool!
>    If borked -> sad face
> }
>
> The important bit is that pw (primary write) equals your replication value (n_val). This means that all copies in the virtual node replica set need to go to virtual nodes allocated to their primary physical machines. This is a way to check cluster status from the app level, as in: is the cluster in some kind of borked state?
>
> -Alexander
>
> @siculars
> http://siculars.posthaven.com
>
> Sent from my iRotaryPhone
>
>> On Dec 8, 2015, at 14:13, David Byron <dbyron at dbyron.com> wrote:
>>
>> I'm still curious what people think here.  As I stare at this longer, I'd like to be able to call RiakNode.checkHealth(), but it's private.
>>
>> HealthMonitorTask.run only calls checkHealth some of the time, so without the ability to call it directly, I think I'm getting a stale notion of health in circumstances like I outlined below -- when the last operation was successful, but the node has since gone down.
>>
>> Thanks for your input.
>>
>> -DB
>>
>>> On 12/2/15 10:25 PM, David Byron wrote:
>>> I'm implementing a health check for a service of mine that uses riak.
>>> I've seen this code from
>>> https://github.com/basho/riak-java-client/issues/456:
>>>
>>> RiakCluster cluster = clientInstance.getRiakCluster();
>>> List<RiakNode> nodes = cluster.getNodes();
>>> for (RiakNode node : nodes)
>>> {
>>>    State state = node.getNodeState();
>>> }
>>>
>>> and it's great.  From what I can tell, it depends on some background
>>> processing that keeps track of the state of the nodes.  I did a quick
>>> test though, and if I run 'riak stop' from the command line and then
>>> this loop with no intervening operations, the nodes report RUNNING. Even
>>> after some time passes (more than three minutes), still RUNNING.
>>>
>>> However, if I do run an intervening operation (some actual query of
>>> data for example) that fails, the nodes then report HEALTH_CHECKING.
>>> Then, after 'riak start', the nodes report RUNNING again.  I suppose
>>> that's good.
>>>
>>> So, I'm trying to decide how to implement the health check.  The above
>>> loop doesn't seem to be enough, but do I really need to do something like:
>>>
>>> final RiakFuture<Void, Void> future = cluster.execute(new PingOperation());
>>>
>>> try {
>>>    future.await();
>>>    future.get();
>>> } catch (ExecutionException | InterruptedException e) {
>>>    // bad
>>> }
>>> // good
>>>
>>> Maybe it's sufficient to only do this if all the nodes report RUNNING? I
>>> suppose there's always a small window in time where a node could report
>>> bad, but via a ping I'd learn it was up...so I'm torn.  Any suggestions
>>> for whether pinging every time is correct, or there's something more
>>> efficient (and safe)?
>>>
>>> Thanks for your help.
>>>
>>> -DB
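
One way to picture the "only ping if all the nodes report RUNNING" idea from the original message is the sketch below. It is a rough illustration, not the client's API: isHealthy and probeSucceeds are hypothetical names, the probe is passed in as a plain Callable so the example runs without riak-java-client, and the timeout is an added assumption (the snippet in the original message waits on the future with no time limit).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ActiveHealthCheck {
    /** Runs the probe on its own thread; true only if it returns without throwing in time. */
    public static boolean probeSucceeds(Callable<Void> probe, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Void> result = executor.submit(probe);
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (ExecutionException | TimeoutException e) {
            return false; // the probe threw, or did not answer in time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        } finally {
            executor.shutdownNow(); // interrupts a probe that is still running
        }
    }

    /** Cheap cached check first; the active probe runs only if that check passes. */
    public static boolean isHealthy(boolean allNodesRunning, Callable<Void> probe, long timeoutMillis) {
        return allNodesRunning && probeSucceeds(probe, timeoutMillis);
    }

    public static void main(String[] args) {
        // A probe that returns normally stands in for a successful ping.
        System.out.println(isHealthy(true, () -> null, 1000));
        // A probe that throws stands in for a ping against a stopped node.
        System.out.println(isHealthy(true, () -> {
            throw new IllegalStateException("node down");
        }, 1000));
    }
}
```

In a real service the probe would presumably wrap the cluster.execute(new PingOperation()) call from the original message, or the canary write with pw equal to n_val suggested in the reply. The trade-off raised in the thread still applies: skipping the probe whenever a node reports a non-RUNNING state can briefly report unhealthy for a node that has already recovered.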




More information about the riak-users mailing list