Issues with search (2.0)

Chaim Solomon chaim at itcentralstation.com
Mon Aug 11 23:46:30 EDT 2014


The nodes have 8G each - so well more than the recommended value.
The configuration was at the default of 1G - I have now changed it to 2G.
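
For anyone making the same change, here is a minimal sketch of the edit in Python. It only rewrites the -Xmx flag inside an options string. The helper name set_max_heap and the sample options value are assumptions for illustration, so check the actual search.solr.jvm_options line in your own riak.conf.

```python
import re

def set_max_heap(jvm_options: str, size: str) -> str:
    """Return jvm_options with its -Xmx flag set to `size` (e.g. "2g")."""
    if not re.fullmatch(r"\d+[kKmMgG]?", size):
        raise ValueError(f"not a JVM heap size: {size!r}")
    new_flag = f"-Xmx{size}"
    updated, count = re.subn(r"-Xmx\S+", new_flag, jvm_options)
    # Append the flag if the options string had no -Xmx at all.
    return updated if count else f"{jvm_options} {new_flag}".strip()

# Assumed example value, not copied from a real node.
before = "-d64 -Xms1g -Xmx1g -XX:+UseStringCache -XX:+UseCompressedOops"
print("search.solr.jvm_options = " + set_max_heap(before, "2g"))
```

The initial heap (-Xms) is left untouched here; it must not exceed the new maximum or the JVM will refuse to start.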

Chaim Solomon



On Mon, Aug 11, 2014 at 6:14 PM, Eric Redmond <eredmond at basho.com> wrote:

> If Solr is stumbling over bad data, your node's solr.log should be filled
> up. If Yokozuna is stumbling over bad data that it keeps trying to send to
> Solr in a loop, the console.log should be full. If Yokozuna is going ahead
> and indexing bad values (such as unparsable JSON), it will index a blank
> object with a _yz_err field (just search for its existence). If you have a
> case of sibling explosion, you'll have many duplicates of the same object
> with different _yz_vtag fields (again, search for existence).
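
The two existence searches suggested above could be built like this. This is a sketch only: it assumes the Riak 2.0 HTTP search endpoint (/search/query/<index>) on the default HTTP port 8098, and the index name "my_index" and the helper name are hypothetical.

```python
from urllib.parse import urlencode

def existence_query_url(base_url, index, field, rows=10):
    """Build a search URL matching every document that has `field` set."""
    params = urlencode({"wt": "json", "q": f"{field}:*", "rows": rows})
    return f"{base_url.rstrip('/')}/search/query/{index}?{params}"

# "my_index" is a placeholder; use the index attached to your bucket.
for field in ("_yz_err", "_yz_vtag"):
    print(existence_query_url("http://localhost:8098", "my_index", field))
```

Fetching either URL (with curl, for example) returns a Solr JSON response; its numFound value shows how many documents carry the field.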
>
> You said it's not a resource issue, but just to rule that out, how much
> RAM does each node have? Also, how much is made available to Solr? You can
> adjust the max heap size given to Solr in riak.conf, by
> changing search.solr.jvm_options max heap size values from -Xmx1g to -Xmx2g
> or more.
>
> Eric
>
>
> On Aug 11, 2014, at 8:03 AM, Chaim Solomon <chaim at itcentralstation.com>
> wrote:
>
> Hi,
>
> I don't think that it is a resource issue now.
>
> After removing the data, the other nodes had low load and are handling the
> workload just fine.
> And the Java process - when it crashed - was really dead: on shutting down
> Riak it stayed around and needed a kill -9 to go away.
>
> I don't think the disks are a problem but rather suspect that a crash may
> have caused Solr to stumble over bad data and then crash.
>
> Chaim Solomon
>
>
>
> On Mon, Aug 11, 2014 at 5:47 PM, Jordan West <jwest at basho.com> wrote:
>
>> Chaim,
>>
>> Some comments inline:
>>
>> On Mon, Aug 11, 2014 at 4:14 AM, Chaim Solomon <
>> chaim at itcentralstation.com> wrote:
>>
>>> Hi,
>>>
>>> I've been running into an issue with the yz search acting up.
>>>
>>>  I've been getting a lot of these:
>>>
>>> 2014-08-11 06:45:22.005 [error] <0.913.0>@yz_kv:index:206 failed to
>>> index object {<<"bucketname">>,<<"123">>} with error {"Failed to index
>>> docs",{error,req_timedout}} because
>>> [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,192}]},
>>>  {yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,258}]},
>>>  {yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,193}]},
>>>  {riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1416}]},
>>>  {riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1404}]},
>>>  {riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1199}]},
>>>  {riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,485}]},
>>>  {riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,345}]}]
>>>
>>> and the Java process uses a lot of CPU and eventually runs out of memory
>>> or something like that and gets stuck. Killing the process gets the cluster
>>> back up and running.
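
To see how often these failures occur and which buckets they hit, something like the following could be run over a node's console.log. It is a rough sketch: the regular expression is derived only from the single error entry quoted above (as it would appear unwrapped in the log file), and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the one sample entry above; other Yokozuna
# versions may format the message differently.
FAILED = re.compile(
    r'failed to\s+index object \{<<"(?P<bucket>[^"]*)">>,<<"(?P<key>[^"]*)">>\}'
    r'\s+with error \{"[^"]*",\{error,(?P<reason>\w+)\}\}'
)

def count_index_failures(log_text):
    """Count index failures per (bucket, reason) in console.log text."""
    counts = Counter()
    for m in FAILED.finditer(log_text):
        counts[(m["bucket"], m["reason"])] += 1
    return counts
```

Called as count_index_failures(open("console.log").read()), it returns a Counter keyed by (bucket, reason). A large req_timedout count on one node would point at Solr on that node being slow rather than at a particular object.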
>>>
>>> I am guessing that it may be corruption of the yz data on one node.
>>>
>>> Clearing away the yz data on that node and restarting riak makes the
>>> system work again - and I guess AAE will rebuild the index.
>>>
>>>
>> This sounds very similar to the issue last week. I would certainly like
>> to rule out any sort of data corruption (are you thinking your disks are
>> corrupting the data or are you assuming Solr is?).
>>
>> However, it is also possible, like the last issue, that the node/cluster
>> simply does not have enough memory. When you delete the data, Solr no
>> longer has anything to cache in memory and thus uses significantly less. As
>> discussed, the recommended minimum
>>
>>
>>> But I'm wondering why a crashing Java on one node practically takes down
>>> the search on the cluster. Shouldn't Riak be more resilient than that?
>>>
>>
>> The hard part here is that, at least initially, the Java process doesn't
>> crash; it just starts to time out. In distributed systems a slow node is
>> often worse than a down node. Riak, prior to 1.4, had something called
>> "health check" that would mark a node down in this situation. Unfortunately,
>> in some workloads (and I believe, given your cluster's limited resources, it
>> would happen here) this often results in excessive work being offloaded to
>> another node, which also does not have sufficient resources, and around we
>> go until the entire cluster falls over. A capacity problem, typically, can
>> only be solved by adding more capacity.
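
The cascade described in the paragraph above can be illustrated with a toy model. This is not how Riak's health check worked internally. It assumes identical nodes, a fixed total load spread evenly over the nodes still up, and made-up numbers throughout.

```python
def surviving_nodes(nodes, capacity, total_load):
    """Toy model: nodes left serving after one slow node is marked down."""
    alive = nodes - 1  # the health check removes the first slow node
    # Each overloaded round pushes one more node past its capacity.
    while alive > 0 and total_load / alive > capacity:
        alive -= 1
    return alive

print(surviving_nodes(5, 100, 350))  # cluster with headroom
print(surviving_nodes(5, 100, 450))  # cluster already near capacity
```

In the first case the remaining four nodes absorb the extra work (87.5 units each against a capacity of 100). In the second, every redistribution overloads the next node and none survive, which is why adding capacity is the only real fix.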
>>
>>
>>>
>>> Is there an explicit reindex command for the full text search subsystem?
>>>
>>> Could Riak keep an eye on the java process and restart it if it crashes
>>> or runs away?
>>>
>>>
>> Riak does manage the JVM process (starting/stopping/restarting). I agree
>> that if we could also handle runaway processes, like in your case, that
>> would be even better. I would have to think a bit more about how this would
>> work (to prevent the same problems mentioned above with the old-style
>> health check).
>>
>> Jordan
>>
>>
>>
>>> Chaim Solomon
>>>
>>>
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users at lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>