Issues with search (2.0)

Jordan West jwest at basho.com
Mon Aug 11 10:47:35 EDT 2014


Chaim,

Some comments inline:

On Mon, Aug 11, 2014 at 4:14 AM, Chaim Solomon <chaim at itcentralstation.com>
wrote:

> Hi,
>
> I've been running into an issue with the yz search acting up.
>
> I've been getting a lot of these:
>
> 2014-08-11 06:45:22.005 [error] <0.913.0>@yz_kv:index:206 failed to index object {<<"bucketname">>,<<"123">>} with error {"Failed to index docs",{error,req_timedout}} because
> [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,192}]},
>  {yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,258}]},
>  {yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,193}]},
>  {riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1416}]},
>  {riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1404}]},
>  {riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1199}]},
>  {riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,485}]},
>  {riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,345}]}]
>
> and the Java process uses a lot of CPU and eventually runs out of memory
> or something like that and gets stuck. Killing the process gets the cluster
> back up and running.
>
> I am guessing that it may be data corruption on the yz data on one node.
>
> Clearing away the yz data on that node and restarting riak makes the
> system work again - and I guess AAE will rebuild the index.
>
>
This sounds very similar to the issue last week. I would certainly like to
rule out any sort of data corruption (are you thinking your disks are
corrupting the data or are you assuming Solr is?).

However, it is also possible, as with the last issue, that the node/cluster
simply does not have enough memory. When you delete the data, Solr no longer
has anything to cache in memory and so uses significantly less. As
discussed, the recommended minimum


> But I'm wondering why a crashing Java on one node practically takes down
> the search on the cluster. Shouldn't Riak be more resilient than that?
>

The hard part here is that, at least initially, the Java process doesn't
crash; it just starts to time out. In distributed systems a slow node is
often worse than a down node. Riak, prior to 1.4, had something called
"health check" that would mark a node down in this situation. Unfortunately,
in some workloads (and I believe, given your cluster's limited resources, it
would happen here), this often results in excessive work being offloaded to
another node, which also does not have sufficient resources, and around we
go until the entire cluster falls over. A capacity problem, typically, can
only be solved by adding more capacity.
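
To make the cascade concrete, here is a toy model in Python. It is only a
sketch under simplifying assumptions (every node has the same capacity, and
the load of a node marked down is split evenly across the survivors); the
function name and the numbers are made up for illustration and are not part
of Riak.

```python
def simulate_offload(capacity, load):
    """Toy model of a health check that marks overloaded nodes down.

    capacity: work units a single node can handle (assumed equal per node)
    load:     list with the work units currently assigned to each node
    Returns (rounds, nodes_still_up) once no node is overloaded or no
    node is left.
    """
    up = list(load)
    rounds = 0
    while up:
        overloaded = [units for units in up if units > capacity]
        if not overloaded:
            break  # every remaining node can cope; the cluster is stable
        survivors = [units for units in up if units <= capacity]
        rounds += 1
        if not survivors:
            up = []  # nothing left to absorb the work
            break
        # Work from the nodes marked down is spread over the survivors.
        extra = sum(overloaded) / len(survivors)
        up = [units + extra for units in survivors]
    return rounds, len(up)
```

With a capacity of 100, a cluster loaded at [120, 50, 50, 50] absorbs the
failed node and settles with three nodes up after one round. Loaded at
[120, 90, 90, 90], the same single failure pushes every survivor over
capacity and the whole cluster is gone after two rounds.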


>
> Is there an explicit reindex command for the full text search subsystem?
>
> Could Riak keep an eye on the java process and restart it if it crashes or
> runs away?
>
>
Riak does manage the JVM process (starting/stopping/restarting). I agree
that if we could also handle a runaway process, like in your case, that
would be even better. I would have to think a bit more about how this would
work (to prevent the same problems mentioned above with the old-style health
check).
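
One possible shape for such a check, sketched in Python purely as an
illustration: restart only after sustained high CPU, and cap the number of
restarts per time window so that a capacity problem does not turn into a
restart loop. This is not how Riak behaves today; the class name, the
thresholds, and the idea of sampling CPU are all assumptions.

```python
from collections import deque


class RunawayGuard:
    """Hypothetical restart policy for a supervised process."""

    def __init__(self, cpu_threshold=90.0, sustained_samples=5,
                 max_restarts=2, window_seconds=3600.0):
        self.cpu_threshold = cpu_threshold
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._recent = deque(maxlen=sustained_samples)  # latest CPU samples
        self._restarts = deque()  # timestamps of recommended restarts

    def observe(self, now, cpu_percent):
        """Record one CPU sample; return True if a restart is recommended."""
        self._recent.append(cpu_percent)
        # Forget restarts that fall outside the accounting window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        sustained = (len(self._recent) == self._recent.maxlen and
                     all(s >= self.cpu_threshold for s in self._recent))
        if not sustained:
            return False  # a short spike is not treated as a runaway
        if len(self._restarts) >= self.max_restarts:
            return False  # budget used up; restarting again will not help
        self._restarts.append(now)
        self._recent.clear()  # start measuring afresh after a restart
        return True
```

The restart budget is the part that addresses the health-check concern: once
it is exhausted, the guard stops recommending restarts and the situation has
to be treated as a capacity problem instead.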

Jordan



> Chaim Solomon
>
>
>