cluster health check using riak-java-client

David Byron dbyron at dbyron.com
Tue Dec 8 15:13:33 EST 2015


I'm still curious what people think here.  As I stare at this longer, 
I'd like to be able to call RiakNode.checkHealth(), but it's private.

HealthMonitorTask.run only calls checkHealth some of the time, so 
without the ability to call it directly, I think I'm getting a stale 
notion of health in circumstances like I outlined below -- when the last 
operation was successful, but the node has since gone down.
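
To make that concrete, here's a rough sketch of the check I have in
mind: trust a cached "not RUNNING" state, but confirm a cached RUNNING
state with a live round trip. The State enum and the livePing supplier
are stand-ins I made up so the example compiles on its own; they are not
the real riak-java-client types. In practice the supplier would wrap
something like the PingOperation call quoted below.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BooleanSupplier;

public class HealthCheckSketch {
    // Hypothetical stand-in for the client's cached node state.
    public enum State { RUNNING, HEALTH_CHECKING }

    /**
     * Healthy only if every cached state is RUNNING and a live probe
     * also succeeds. The probe is skipped when the cached state is
     * already bad, since there is nothing left to confirm.
     */
    public static boolean isHealthy(List<State> cachedStates, BooleanSupplier livePing) {
        for (State s : cachedStates) {
            if (s != State.RUNNING) {
                return false; // cached state already reports a problem
            }
        }
        // Cached RUNNING may be stale, so confirm with a real round trip.
        return livePing.getAsBoolean();
    }

    public static void main(String[] args) {
        List<State> allRunning = Arrays.asList(State.RUNNING, State.RUNNING);
        List<State> oneChecking = Arrays.asList(State.RUNNING, State.HEALTH_CHECKING);

        System.out.println(isHealthy(allRunning, () -> true));   // nodes up, ping ok
        System.out.println(isHealthy(allRunning, () -> false));  // stale RUNNING, ping fails
        System.out.println(isHealthy(oneChecking, () -> true));  // ping never runs
    }
}
```

The trade-off is the one I mentioned before: skipping the ping when a
node reports bad saves a round trip, but it can report unhealthy for a
little while after the node has actually come back.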

Thanks for your input.

-DB

On 12/2/15 10:25 PM, David Byron wrote:
> I'm implementing a health check for a service of mine that uses riak.
> I've seen this code from
> https://github.com/basho/riak-java-client/issues/456:
>
> RiakCluster cluster = clientInstance.getRiakCluster();
> List<RiakNode> nodes = cluster.getNodes();
> for (RiakNode node : nodes)
> {
>    State state = node.getNodeState();
> }
>
> and it's great.  From what I can tell, it depends on some background
> processing that keeps track of the state of the nodes.  I did a quick
> test though, and if I run 'riak stop' from the command line and then
> this loop with no intervening operations, the nodes report RUNNING. Even
> after some time passes (more than three minutes), still RUNNING.
>
> However, if I do run an intervening operation (some actual query of
> data for example) that fails, the nodes then report HEALTH_CHECKING.
> Then, after 'riak start', the nodes report RUNNING again.  I suppose
> that's good.
>
> So, I'm trying to decide how to implement the health check.  The above
> loop doesn't seem to be enough, but do I really need to do something like:
>
> final RiakFuture<Void, Void> future = cluster.execute(new PingOperation());
>
> try {
>    future.await();
>    future.get();
> } catch (ExecutionException | InterruptedException e) {
>    // bad
> }
> // good
>
> Maybe it's sufficient to only do this if all the nodes report RUNNING? I
> suppose there's always a small window in time where a node could report
> bad, but via a ping I'd learn it was up...so I'm torn.  Any suggestions
> for whether pinging every time is correct, or there's something more
> efficient (and safe)?
>
> Thanks for your help.
>
> -DB



