Random Block Not Found Issues with Riak/Riak CS

Charles Bijon bijon.charles at gmail.com
Tue Jul 22 07:54:54 EDT 2014


Hi Dave,

Hmm, it's not really the same issue.

[error] <0.13320.0>@riak_cs_get_fsm:waiting_chunks:311 riak_cs_get_fsm: 
Cannot get S3 <<"independent-print-limited">> 
<<"independent/independent/2014-07-22/cover/cover.ppm">> block# 
{<<94,144,214,192,123,131,68,132,142,55,30,108,189,81,242,106>>,0}: 
{error,notfound}

We have this issue, and we have 32 GB of RAM on each node.

I disabled AAE, because when we put a new file it is available 
immediately, but after a few hours it goes wrong and we get these 
errors. I will try your suggestion, but if you have another idea....
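
For reference, this is roughly how I disabled it (a sketch of the 
app.config fragment; the placement inside the riak_kv section is from 
memory, so double-check against your own files):

{riak_kv, [
    %% turn active anti-entropy off entirely
    {anti_entropy, {off, []}}
]}

As far as I know, the nodes need a restart for this change to take effect.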

Regards,

Charles

On 22/07/2014 13:17, Dave Finster wrote:
> Hi Charles
>
> We have a slightly different issue to yours, in that the majority of our 
> requests succeed and only the odd one fails - or is that what you are 
> observing on your cluster?
>
> I've been talking about the issue off-list with Luke and Kelly. Luke 
> took a look at some of our debug logs for Riak and suspects that we 
> had over-committed the resources of our cluster. I've modified the 
> configuration of our cluster as per his recommendations:
>
> {multi_backend, [
>    {be_default, riak_kv_eleveldb_backend, [
>        {max_open_files, 14},
>        {cache_size, 4194304},
>        {data_root, "/var/db/riak/leveldb"}
>    ]},
>    {be_blocks, riak_kv_bitcask_backend, [
>        {data_root, "/var/db/riak/bitcask"}
>    ]}
> ]},
>
> {anti_entropy, {off, []}},
>
> Our environment is a 4-node cluster with 4GB of RAM each, running on 
> SmartOS (from Joyent). I applied these changes, but while things 
> improved, I still encountered the odd failure. I also deactivated 
> n_val_1_get_requests and I haven't been able to reproduce the issues 
> that I was encountering previously.
>
> Thanks,
> Dave
>
>> On 22 Jul 2014, at 9:01 pm, Charles Bijon <bijon.charles at gmail.com> wrote:
>>
>> Hi,
>>
>> We have the same issue here, but with 45 Riak/Riak CS nodes in 
>> production. Do you have any idea how to correct it?
>>
>> Regards,
>>
>> Charles
>>
>>
>> On 17/07/2014 23:21, Dave Finster wrote:
>>> Hi Kelly
>>>
>>> 1.4.5 - Riak CS
>>> 1.4.8 - Riak
>>> Anti Entropy is on (all nodes)
>>>
>>> Deactivating n_val_1_get_requests still allows me to cause the issue 
>>> (though less frequently); however, a different error has cropped up now:
>>>
>>> 2014-07-17 21:15:38 =ERROR REPORT====
>>> webmachine error: path="/buckets/<bucket 
>>> name>/objects/bf15f98c-eaa1-4ff9-83ff-24c1e7e1380f%2F847c340cfe2f44028d6fd5606f696796%2FAttachment-1.png"
>>> {exit,{{{{case_clause,{error,timeout}},[{riak_cs_manifest_fsm,handle_get_manifests,1,[{file,"src/riak_cs_manifest_fsm.erl"},{line,265}]},{riak_cs_manifest_fsm,waiting_command,3,[{file,"src/riak_cs_manifest_fsm.erl"},{line,201}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,494}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},{gen_fsm,sync_send_event,[<0.1383.0>,get_manifests,infinity]}},{gen_fsm,sync_send_event,[<0.1382.0>,get_manifest,infinity]}},[{gen_fsm,sync_send_event,3,[{file,"gen_fsm.erl"},{line,214}]},{riak_cs_wm_utils,ensure_doc,2,[{file,"src/riak_cs_wm_utils.erl"},{line,236}]},{riak_cs_wm_object,authorize,2,[{file,"src/riak_cs_wm_object.erl"},{line,64}]},{riak_cs_wm_common,authorize,2,[{file,"src/riak_cs_wm_common.erl"},{line,396}]},{riak_cs_wm_common,forbidden,2,[{file,"src/riak_cs_wm_common.erl"},{line,182}]},{webmachine_resource,resource_call,3,[{file,"src/webmachine_resource.erl"},{line,186}]},{webmachine_resource,do,3, 
>>> [{file,"src/webmachine_resource.erl"},{line,142}]},{webmachine_decision_core,resource_call,1,[{file,"src/webmachine_decision_core.erl"},{line,48}]}]}
>>> [{gen_fsm,sync_send_event,3,[{file,"gen_fsm.erl"},{line,214}]},{riak_cs_wm_utils,ensure_doc,2,[{file,"src/riak_cs_wm_utils.erl"},{line,236}]},{riak_cs_wm_object,authorize,2,[{file,"src/riak_cs_wm_object.erl"},{line,64}]},{riak_cs_wm_common,authorize,2,[{file,"src/riak_cs_wm_common.erl"},{line,396}]},{riak_cs_wm_common,forbidden,2,[{file,"src/riak_cs_wm_common.erl"},{line,182}]},{webmachine_resource,resource_call,3,[{file,"src/webmachine_resource.erl"},{line,186}]},{webmachine_resource,do,3,[{file,"src/webmachine_resource.erl"},{line,142}]},{webmachine_decision_core,resource_call,1,[{file,"src/webmachine_decision_core.erl"},{line,48}]}]
>>>
>>> Thanks,
>>> Dave
>>>
>>>> On 18 Jul 2014, at 2:19 am, Kelly McLaughlin <kelly at basho.com> wrote:
>>>>
>>>> Dave,
>>>>
>>>> Can you tell me what versions of Riak and Riak CS you have 
>>>> installed? Do you have AAE enabled or disabled? It's tough to come 
>>>> up with an explanation without more information, but I would try 
>>>> setting n_val_1_get_requests to false and see if you continue to 
>>>> experience the problem. My guess is that will resolve the issue, 
>>>> but let me know what happens.
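>>>>
>>>> For reference, that would look something like this in the Riak CS 
>>>> app.config (a sketch; the section placement is my assumption, so 
>>>> check your own files):
>>>>
>>>> {riak_cs, [
>>>>     {n_val_1_get_requests, false}
>>>> ]}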
>>>>
>>>> Kelly
>>>>
>>>> On July 17, 2014 at 1:00:19 AM, Dave Finster (davefinster at icloud.com) wrote:
>>>>
>>>>> Hi Everyone
>>>>>
>>>>> Spent a bit of time trying to debug this one and not sure where to 
>>>>> go from here. The use case that appears to cause this breakage is a 
>>>>> web page that links to 8 x 10MB images and it attempts to fetch 
>>>>> them simultaneously.
>>>>>
>>>>> Occasionally, one or two of the images will just fail to load, 
>>>>> while other times they all work fine. I've tracked it down to the 
>>>>> crash below. It isn't always the same image. To make the problem 
>>>>> more repeatable, I forced our load balancer into only using a 
>>>>> single Riak-CS node, so it will be getting hit with all the 
>>>>> requests. We are using HAProxy out the front and are running 
>>>>> SmartOS 64-bit images across the board.
>>>>>
>>>>> arekinath helped me look into it and one thought was that I was 
>>>>> hit by the AAE bug prior to 1.4.8, but even clearing the AAE made 
>>>>> no difference. The n_val on the buckets is 3 and it's a 4-node 
>>>>> cluster. All 4 nodes have both a Riak and a Riak-CS node on it. I 
>>>>> also have pb_backlog turned up to 256, n_val_1_get_requests set to 
>>>>> true and fold_objects_for_list_keys set to true. 'ring-status' 
>>>>> shows that the whole ring is reachable.
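>>>>>
>>>>> For reference, those settings live in the app.config files roughly 
>>>>> as follows (a sketch; which file and section each key belongs in is 
>>>>> my assumption, so check your own configs):
>>>>>
>>>>> %% Riak's app.config
>>>>> {riak_api, [
>>>>>     {pb_backlog, 256}
>>>>> ]},
>>>>>
>>>>> %% Riak CS's app.config
>>>>> {riak_cs, [
>>>>>     {n_val_1_get_requests, true},
>>>>>     {fold_objects_for_list_keys, true}
>>>>> ]}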
>>>>>
>>>>> Any idea on how to diagnose this one further?
>>>>>
>>>>> 2014-07-17 06:38:54 =CRASH REPORT====
>>>>> crasher:
>>>>> initial call: mochiweb_acceptor:init/3
>>>>> pid: <0.26119.1>
>>>>> registered_name: []
>>>>> exception exit: 
>>>>> {{normal,{gen_fsm,sync_send_event,[<0.27617.1>,get_next_chunk,infinity]}},[{gen_fsm,sync_send_event,3,[{file,"gen_fsm.erl"},{line,214}]},{riak_cs_wm_utils,streaming_get,4,[{file,"src/riak_cs_wm_utils.erl"},{line,272}]},{webmachine_decision_core,'-make_encoder_stream/3-fun-0-',3,[{file,"src/webmachine_decision_core.erl"},{line,667}]},{webmachine_request,send_stream_body_no_chunk,2,[{file,"src/webmachine_request.erl"},{line,334}]},{webmachine_request,send_response,3,[{file,"src/webmachine_request.erl"},{line,398}]},{webmachine_request,call,2,[{file,"src/webmachine_request.erl"},{line,251}]},{webmachine_decision_core,wrcall,1,[{file,"src/webmachine_decision_core.erl"},{line,42}]},{webmachine_decision_core,finish_response,3,[{file,"src/webmachine_decision_core.erl"},{line,92}]}]}
>>>>> ancestors: [object_web_mochiweb,riak_cs_sup,<0.143.0>]
>>>>> messages: []
>>>>> links: [<0.298.0>,#Port<0.12015>]
>>>>> dictionary: 
>>>>> [{reqstate,{wm_reqstate,#Port<0.12015>,[{'content-encoding',"identity"},{'content-type',"application/octet-stream"},{resource_module,riak_cs_wm_object}],undefined,"10.4.242.1",{wm_reqdata,'GET',http,{1,1},"10.4.242.1",undefined,[],"/buckets/<the 
>>>>> bucket 
>>>>> name>/objects/bf15f98c-eaa1-4ff9-83ff-24c1e7e1380f%2F847c340cfe2f44028d6fd5606f696796%2FAttachment-1.png","/buckets/<the 
>>>>> bucket 
>>>>> name>/objects/bf15f98c-eaa1-4ff9-83ff-24c1e7e1380f%2F847c340cfe2f44028d6fd5606f696796%2FAttachment-1.png?Signature=U0By3mIwaRIVBHNcYhSt6r5QgPk%3D&Expires=1405580057&AWSAccessKeyId=DGTXHHWIEDF4XUBSBYVI",[{bucket,"<the 
>>>>> bucket 
>>>>> name>"},{object,"bf15f98c-eaa1-4ff9-83ff-24c1e7e1380f%2F847c340cfe2f44028d6fd5606f696796%2FAttachment-1.png"}],[],"../../../..",{200,undefined},1073741824,67108864,[{"_ga","GA1.3.643660316.1404789703"}],[{"Signature","U0By3mIwaRIVBHNcYhSt6r5QgPk="},{"Expires","1405580057"},{"AWSAccessKeyId","DGTXHHWIEDF4XUBSBYVI"}],{9,{"cookie",{'Cookie',"_ga=GA1.3.643660316.1404789703"},{"accept-language",{'Accept-Language',"en-US,en;q=0.8"},{"accept-encoding",{'Accept-Encoding',"gzip,deflate,sdch"},{"accept",{'Accept',"image/webp,*/*;q=0.8"},nil,nil},nil},{"connection",{'Connection',"keep-alive"},nil,nil}},{"referer",{'Referer',"<the 
>>>>> referrer>"},{"host",{'Host',"<our riak-cs host 
>>>>> name>"},nil,nil},{"user-agent",{'User-Agent',"Mozilla/5.0 
>>>>> (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, 
>>>>> like Gecko) Chrome/35.0.1916.153 
>>>>> Safari/537.36"},nil,{"x-rcs-rewrite-path",{"x-rcs-rewrite-path","/<the 
>>>>> bucket 
>>>>> name>/bf15f98c-eaa1-4ff9-83ff-24c1e7e1380f/847c340cfe2f44028d6fd5606f696796/Attachment-1.png?AWSAccessKeyId=DGTXHHWIEDF4XUBSBYVI&Expires=1405580057&Signature=U0By3mIwaRIVBHNcYhSt6r5QgPk%3D"},nil,nil}}}}},not_fetched_yet,false,{3,{"content-type",{"Content-Type","application/octet-stream"},nil,{"etag",{"ETag","\"a3a32cf5d8f502d7e8d35fd8412a6878\""},nil,
>>>>> trap_exit: false
>>>>> status: running
>>>>> heap_size: 28657
>>>>> stack_size: 24
>>>>> reductions: 80773
>>>>> neighbours:
>>>>>
>>>>> Thanks,
>>>>> Dave Finster
>>>>> _______________________________________________
>>>>> riak-users mailing list
>>>>> riak-users at lists.basho.com
>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>>
>>
>
>
>


