Jim Raney jim.raney at
Mon Apr 11 16:11:22 EDT 2016


We're seeing the following error in riak/yokozuna:

2016-04-11 19:36:18.803 [error] 
<0.23120.8>@yz_pb_search:maybe_process:84 "Failed to determine Solr port 
for all nodes in search plan" 

This is a 7-node cluster running the RPM of 2.1.3 on CentOS 7, in Google 
cloud, with 16-CPU/60GB RAM VMs.  They are configured with levelDB, with 
a 500G SSD disk for the first four tiers and a 2TB magnetic disk for the 
remainder.  IOPS/throughput are not an issue with our application.

There is a UWSGI-based REST service that sits in front of riak that 
contains all of the application logic.  The testing suite (locust) loads 
binary data files that the uwsgi service processes and inserts into 
riak.  As part of that processing, yokozuna indexes get searched.

We find that ~40 minutes to an hour into load testing we start seeing 
the above error logged (leading to 500s from locust's perspective).  It 
corresponds with Search Query Fail Count, which we graph with zabbix.  
Over time the number gets larger and larger, and after about an hour of 
load testing it starts to curve upwards sharply.
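
For anyone who wants to check a similar curve from raw samples rather 
than by eye, here is a minimal Python sketch. It assumes the fail count 
is a cumulative counter sampled at a fixed interval; the function names 
and the readings are made up for illustration and are not from our 
cluster.

```python
def counter_deltas(samples):
    """Turn cumulative counter readings into per-interval increases."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def is_accelerating(samples):
    """True if the per-interval increase never shrinks and ends higher than it began."""
    deltas = counter_deltas(samples)
    if len(deltas) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(deltas, deltas[1:]))
    return never_shrinks and deltas[-1] > deltas[0]


# Hypothetical readings, NOT measurements from our cluster.
readings = [0, 2, 5, 11, 30]
print(counter_deltas(readings))   # [2, 3, 6, 19]
print(is_accelerating(readings))  # True
```

A steady failure rate would give roughly constant deltas; growing deltas 
match the upward curve described above.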

In riak.conf we have:

search = on
search.solr.start_timeout = 120s
search.solr.port = 8093
search.solr.jmx_port = 8985
search.solr.jvm_options = -d64 -Xms2g -Xmx16g -XX:+UseStringCache 

and we are using java-1.7.0-openjdk- from 
the CentOS repos.  I've been graphing JMX stats with zabbix and nothing 
looks untoward: the heap gradually climbs in size but never 
skyrockets and certainly doesn't come close to the 16GB cap (it barely gets 
above 3GB before things really go south).  With jconsole I see the same 
numbers, with a gradually increasing time spent in garbage collection (last 
recorded was "23.751 seconds on PS Scavenge (640 collections)"), 
although it's hard to tell whether there are any long pauses from gc.
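
As a rough sanity check on that jconsole figure, the cumulative numbers 
can be turned into an average per collection. This is only a sketch: the 
function name is mine, and an average says nothing about the longest 
individual pause, which is the part we can't see.

```python
def average_pause_ms(total_seconds, collections):
    """Average time per collection, in milliseconds, from cumulative totals."""
    return total_seconds / collections * 1000


# Figures quoted from jconsole above: 23.751 s over 640 PS Scavenge collections.
print(f"{average_pause_ms(23.751, 640):.1f} ms per collection")  # 37.1 ms per collection
```

So the average young-generation collection is short, but a handful of 
long pauses could still be hidden inside that total.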

We graph a bunch of additional stats in zabbix, and the boxes in the 
cluster never get close to capping out CPU or running out of RAM.

I googled around and couldn't find any reference to the logged error.  
Does it have to do with solr having a problem contacting other nodes in 
the cluster?  Or is it some kind of node lookup issue?

Jim Raney

