Issues with search (2.0)

Eric Redmond eredmond at basho.com
Mon Aug 11 11:14:26 EDT 2014


If Solr is stumbling over bad data, your node's solr.log should be filling up with errors. If Yokozuna is stuck in a loop trying to send bad data to Solr, the console.log should be full of errors instead. If Yokozuna is indexing bad values (such as unparsable JSON), it will index a blank object with a _yz_err field (just search for the existence of that field). If you have a case of sibling explosion, you'll have many duplicates of the same object with different _yz_vtag fields (again, search for existence).
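
The two existence searches mentioned above could be built roughly like this. This is only a sketch: the host, port, and index name ("my_index") are placeholders I made up, and it assumes the standard Riak 2.0 HTTP search endpoint (/search/query/<index>) with Solr's field:* syntax for "field exists". It only builds the URLs; fetch them with curl or any HTTP client.

```python
from urllib.parse import urlencode


def existence_query_url(base_url, index, field, rows=10):
    """Build a Riak Search URL that matches documents where `field` exists."""
    # "field:*" is the Solr query for "this field has any value".
    params = urlencode({"wt": "json", "q": f"{field}:*", "rows": rows})
    return f"{base_url}/search/query/{index}?{params}"


# Hypothetical values: replace with your own node address and index name.
BASE = "http://localhost:8098"
INDEX = "my_index"

# Documents that failed extraction (e.g. unparsable JSON).
err_url = existence_query_url(BASE, INDEX, "_yz_err")
# Documents indexed as siblings (each sibling carries a _yz_vtag).
vtag_url = existence_query_url(BASE, INDEX, "_yz_vtag")

print(err_url)
print(vtag_url)
```

If either query returns a large numFound, that points at bad values or sibling explosion respectively.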

You said it's not a resource issue, but just to rule that out, how much RAM does each node have? Also, how much of it is made available to Solr? You can raise the max heap size given to Solr in riak.conf by changing the -Xmx value in search.solr.jvm_options from -Xmx1g to -Xmx2g or more.
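
To make the heap change concrete, here is a small sketch that rewrites the -Xmx flag in a jvm_options string. The option string used here is an assumed example, not necessarily what is in your riak.conf, and the helper name set_max_heap is my own; check the actual search.solr.jvm_options line on your nodes before changing anything.

```python
import re


def set_max_heap(jvm_options, new_size):
    """Return jvm_options with its -Xmx value replaced by new_size (e.g. '2g')."""
    flag = f"-Xmx{new_size}"
    # Replace an existing -Xmx flag; a lambda avoids escape handling in the replacement.
    updated, count = re.subn(r"-Xmx\S+", lambda _m: flag, jvm_options)
    if count == 0:
        # No -Xmx flag present, so append one.
        updated = f"{jvm_options} {flag}".strip()
    return updated


# Assumed example value; your riak.conf may differ.
current = "-d64 -Xms1g -Xmx1g -XX:+UseStringCache -XX:+UseCompressedOops"
new_line = "search.solr.jvm_options = " + set_max_heap(current, "2g")
print(new_line)
```

The printed line is what would go into riak.conf; the node needs a restart for the new JVM options to take effect.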

Eric


On Aug 11, 2014, at 8:03 AM, Chaim Solomon <chaim at itcentralstation.com> wrote:

> Hi,
> 
> I don't think that it is a resource issue now.
> 
> After removing the data, the other nodes had low load and are handling the workload just fine.
> And the Java process, when it crashed, was really dead: on shutting down Riak it stayed around and needed a kill -9 to go away.
> 
> I don't think the disks are a problem but rather suspect that a crash may have caused Solr to stumble over bad data and then crash.
> 
> Chaim Solomon
> 
> 
> 
> On Mon, Aug 11, 2014 at 5:47 PM, Jordan West <jwest at basho.com> wrote:
> Chaim,
> 
> Some comments inline:
> 
> On Mon, Aug 11, 2014 at 4:14 AM, Chaim Solomon <chaim at itcentralstation.com> wrote:
> Hi,
> 
> I've been running into an issue with the yz search acting up.
> 
> I've been getting a lot of these: 
> 
> 2014-08-11 06:45:22.005 [error] <0.913.0>@yz_kv:index:206 failed to index object {<<"bucketname">>,<<"123">>} with error {"Failed to index docs",{error,req_timedout}} because [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,192}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,258}]},{yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,193}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1416}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1404}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1199}]},{riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,485}]},{riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,345}]}]
> 
> and the Java process uses a lot of CPU and eventually runs out of memory or something like that and gets stuck. Killing the process gets the cluster back up and running.
> 
> I am guessing that it may be data corruption on the yz data on one node. 
> 
> Clearing away the yz data on that node and restarting riak makes the system work again - and I guess AAE will rebuild the index.
> 
> 
> This sounds very similar to the issue last week. I would certainly like to rule out any sort of data corruption (are you thinking your disks are corrupting the data or are you assuming Solr is?).
> 
> However, it is also possible, as with the last issue, that the node/cluster simply does not have enough memory. When you delete the data, Solr no longer has anything to cache in memory and so uses significantly less. As discussed, the recommended minimum 
>  
> But I'm wondering why a crashing Java on one node practically takes down the search on the cluster. Shouldn't Riak be more resilient than that?
> 
> The hard part here is that, at least initially, the Java process doesn't crash; it just starts to time out. In distributed systems a slow node is often worse than a down node. Riak, prior to 1.4, had something called "health check" that would mark a node down in this situation. Unfortunately, in some workloads (and I believe, given your cluster's limited resources, it would happen here) this often results in excessive work being offloaded to another node, which also does not have sufficient resources, and around we go until the entire cluster falls over. A capacity problem can typically only be solved by adding more capacity. 
>  
> 
> Is there an explicit reindex command for the full text search subsystem?
> 
> Could Riak keep an eye on the java process and restart it if it crashes or runs away?
> 
> 
> Riak does manage the JVM process (starting/stopping/restarting). I agree that if we could also handle runaway processes, like in your case, that would be even better. I would have to think a bit more about how this would work (to prevent the same problems mentioned above with the old-style health check).
> 
> Jordan
> 
>  
> Chaim Solomon
> 
> 
> 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> 
> 
