preflist_exhausted with map-reduce and riak-js

Matthew Tovbin matthew at tovbin.com
Tue Apr 3 19:10:09 EDT 2012


Jon,

We don't have any MR jobs running at all :)) Here you go:

1> rpc:multicall(supervisor, count_children, [riak_pipe_builder_sup]).
{[[{specs,1},{active,0},{supervisors,0},{workers,0}],
  [{specs,1},{active,0},{supervisors,0},{workers,0}],
  [{specs,1},{active,0},{supervisors,0},{workers,0}]],
 []}
2> rpc:multicall(supervisor, count_children, [riak_pipe_fitting_sup]).
{[[{specs,1},{active,0},{supervisors,0},{workers,0}],
  [{specs,1},{active,0},{supervisors,0},{workers,0}],
  [{specs,1},{active,0},{supervisors,0},{workers,0}]],
 []}
3> rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_map]).
{[0,0,0],[]}
4> rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_reduce]).
{[0,0,0],[]}

-Matthew



On Tue, Apr 3, 2012 at 14:00, Jon Meredith <jmeredith at basho.com> wrote:

> Hi Matthew,
>
> Thanks for the logs.  Do you know how many map/reduce jobs you run
> concurrently?  Is this a load-based issue, or do you run into problems
> regardless of load?
>
> Running these from the Erlang console should give an idea of how many
> pipes, fittings, and JS VMs you are running concurrently ('riak attach' to
> start, ^D to exit):
>
>   rpc:multicall(supervisor, count_children, [riak_pipe_builder_sup]).
>   rpc:multicall(supervisor, count_children, [riak_pipe_fitting_sup]).
>   rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_map]).
>   rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_reduce]).
>
> For the two supervisors, add up the active counts; for the two managers,
> the returned values give how many JS VMs are running per node.
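>
> For example, you could total the active pipe builders across the cluster
> with something like this (just a sketch, assuming every node responds,
> i.e. the bad-node list from multicall is empty):
>
>   %% Sum the per-node 'active' counts returned by count_children/1.
>   {Replies, []} = rpc:multicall(supervisor, count_children,
>                                 [riak_pipe_builder_sup]),
>   lists:sum([proplists:get_value(active, R) || R <- Replies]).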
>
> If the number of concurrent requests is low, maybe we can find a higher
> bandwidth way of investigating.
>
> Jon
>
> On Mon, Apr 2, 2012 at 4:21 PM, Matthew Tovbin <matthew at tovbin.com> wrote:
>
>> Unfortunately, yes, Jon.
>>
>> Here are my log files after setting 'map_js_vm_count' and
>> 'reduce_js_vm_count' to 128 each and restarting all the nodes (see the
>> attached app.config; the stanza I changed is sketched below).
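>>
>> The change is of this form (a sketch reconstructed from the values above;
>> the attached app.config is the authoritative copy):
>>
>> {riak_kv, [
>>            %% sketch only -- see the attached app.config for the real file
>>            {map_js_vm_count, 128},
>>            {reduce_js_vm_count, 128},
>>            ...]}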
>>
>> If you need any other information or tests performed, please let me
>> know.
>>
>> *-Console.log:*
>> 2012-04-02 15:16:56.848 [notice]
>> <0.22138.162>@riak_kv_js_manager:blocking_dispatch:247 JS call failed: All
>> VMs are busy.
>> 2012-04-02 15:16:56.853 [error] <0.22121.162> gen_fsm <0.22121.162> in
>> state wait_pipeline_shutdown terminated with reason: {sink_died,shutdown}
>> 2012-04-02 15:16:56.860 [error] <0.22121.162> CRASH REPORT Process
>> <0.22121.162> with 0 neighbours crashed with reason: {sink_died,shutdown}
>> 2012-04-02 15:16:56.862 [error] <0.1451.0> Supervisor
>> riak_pipe_builder_sup had child undefined started with
>> {riak_pipe_builder,start_link,undefined} at <0.22121.162> exit with reason
>> {sink_died,shutdown} in context child_terminated
>>
>> *-Error.log:*
>> ** State machine <0.22121.162> terminating
>> ** Last message in was
>> {'DOWN',#Ref<0.0.33.48784>,process,<0.22119.162>,shutdown}
>> ** When State == wait_pipeline_shutdown
>> **      Data  ==
>> {state,[{sink,{fitting,<0.22119.162>,#Ref<0.0.33.48768>,{riak_kv_pipe_get,bkey_chash},{riak_kv_pipe_get,bkey_nval}}},{trace,{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}},{log,{sink,{fitting,<0.21554.162>,#Ref<0.0.33.48768>,sink,undefined}}}],{pipe,<0.22121.162>,[{listkeys,{fitting,<0.22122.162>,#Ref<0.0.33.48768>,{chash,key_of},1}}],{fitting,<0.22119.162>,#Ref<0.0.33.48768>,{riak_kv_pipe_get,bkey_chash},{riak_kv_pipe_get,bkey_nval}}},[{{fitting,<0.22122.162>,#Ref<0.0.33.48768>,{chash,key_of},1},#Ref<0.0.33.48787>}],#Ref<0.0.33.48784>}
>> ** Reason for termination =
>> ** {sink_died,shutdown}
>> 2012-04-02 15:16:56 =CRASH REPORT====
>>   crasher:
>>     initial call: riak_pipe_builder:init/1
>>     pid: <0.22121.162>
>>     registered_name: []
>>     exception exit: {sink_died,shutdown}
>>       in function  gen_fsm:terminate/7
>>        in call from proc_lib:init_p_do_apply/3
>>     ancestors: [riak_pipe_builder_sup,riak_pipe_sup,<0.1448.0>]
>>     messages: [{'DOWN',#Ref<0.0.33.48787>,process,<0.22122.162>,shutdown}]
>>     links: [<0.1451.0>]
>>     dictionary:
>> [{eunit,[{module,riak_pipe_builder},{ref,#Ref<0.0.33.48768>},{spec,[{fitting_spec,listkeys,riak_kv_pipe_listkeys,undefined,{chash,key_of},1,64}]},{options,[{sink,{fitting,<0.22119.162>,#Ref<0.0.33.48768>,{riak_kv_pipe_get,bkey_chash},{riak_kv_pipe_get,bkey_nval}}},{trace,{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}},{log,{sink,{fitting,<0.21554.162>,#Ref<0.0.33.48768>,sink,undefined}}}]},{fittings,[{{fitting,<0.22122.162>,#Ref<0.0.33.48768>,{chash,key_of},1},#Ref<0.0.33.48787>}]}]}]
>>      trap_exit: false
>>     status: running
>>     heap_size: 987
>>     stack_size: 24
>>     reductions: 251
>>   neighbours:
>> 2012-04-02 15:16:56 =SUPERVISOR REPORT====
>>      Supervisor: {local,riak_pipe_builder_sup}
>>      Context:    child_terminated
>>      Reason:     {sink_died,shutdown}
>>      Offender:
>> [{pid,<0.22121.162>},{name,undefined},{mfargs,{riak_pipe_builder,start_link,undefined}},{restart_type,temporary},{shutdown,brutal_kill},{child_type,worker}]
>>
>> -Matthew
>>
>>
>>
>> On Mon, Apr 2, 2012 at 12:27, Jon Meredith <jmeredith at basho.com> wrote:
>>
>>> Very strange.
>>>
>>> Are you seeing the
>>>
>>> 16:40:50.037 [notice] JS call failed: All VMs are busy.
>>>
>>> message in your console.log files on any of the nodes in your cluster?
>>>
>>> Jon
>>>
>>>
>>>> On Mon, Apr 2, 2012 at 11:48 AM, Matthew Tovbin <matthew at tovbin.com> wrote:
>>>
>>>> Jon,
>>>>
>>>> Thanks for the reply. I tried increasing 'map_js_vm_count' and
>>>> 'reduce_js_vm_count' on the cluster to the values you suggested (24/18)
>>>> and even larger ones (32/24 and 128/128), but with no success; I still
>>>> get the same error in the node's console log.
>>>>
>>>> What are we missing?!
>>>>
>>>> -Matthew
>>>>
>>>>
>>>>
>>>> On Fri, Mar 30, 2012 at 16:19, Jon Meredith <jmeredith at basho.com> wrote:
>>>>
>>>>> Hi list,
>>>>>
>>>>> I spent some time looking at this issue today.  I suspect it is caused
>>>>> by all the JavaScript VMs inside Riak being busy serving many parallel
>>>>> MapReduce jobs.
>>>>>
>>>>> If you're seeing this issue, please check for this message in the
>>>>> console.log file:
>>>>>
>>>>> 16:40:50.037 [notice] JS call failed: All VMs are busy.
>>>>>
>>>>> I've filed https://github.com/basho/riak_pipe/issues/44 to track
>>>>> improvements to error reporting.
>>>>>
>>>>> If you are seeing the log message, the number of VMs allocated for map
>>>>> and reduce tasks can be configured in app.config.  The defaults are 8
>>>>> for map and 6 for reduce; increasing them will use more memory but
>>>>> should reduce the incidence of the error message.
>>>>>
>>>>> {riak_kv, [
>>>>>             ...
>>>>>             {map_js_vm_count, 24},
>>>>>             {reduce_js_vm_count, 18},
>>>>>             ...
>>>>>            ]}
>>>>>
>>>>> If you were affected by it and changing this does not resolve your
>>>>> issue, I'll keep digging.
>>>>>
>>>>> Cheers, Jon.
>>>>>
>>>>> On Thu, Mar 29, 2012 at 10:29 AM, Matthew Tovbin <matthew at tovbin.com> wrote:
>>>>>
>>>>>> Guys,
>>>>>>
>>>>>> Any updates on the issue?!
>>>>>>
>>>>>> -Matthew
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 13, 2012 at 18:29, Matthew Tovbin <matthew at tovbin.com> wrote:
>>>>>>
>>>>>>> Here is a log from one of the servers:
>>>>>>>
>>>>>>> ==> /mnt/dataraid/log/crash.log <==
>>>>>>> 2012-03-13 18:24:44 =CRASH REPORT====
>>>>>>>   crasher:
>>>>>>>     initial call: riak_pipe_builder:init/1
>>>>>>>     pid: <0.20949.24>
>>>>>>>     registered_name: []
>>>>>>>     exception exit: {sink_died,shutdown}
>>>>>>>       in function  gen_fsm:terminate/7
>>>>>>>       in call from proc_lib:init_p_do_apply/3
>>>>>>>     ancestors: [riak_pipe_builder_sup,riak_pipe_sup,<0.172.0>]
>>>>>>>     messages:
>>>>>>> [{'DOWN',#Ref<0.0.5.38339>,process,<0.20950.24>,shutdown}]
>>>>>>>     links: [<0.175.0>]
>>>>>>>     dictionary:
>>>>>>> [{eunit,[{module,riak_pipe_builder},{ref,#Ref<0.0.5.38318>},{spec,[{fitting_spec,listkeys,riak_kv_pipe_listkeys,undefined,{chash,key_of},1,64}]},{options,[{sink,{fitting,<0.20947.24>,#Ref<0.0.5.38318>,{riak_kv_pipe_get,bkey_chash},{riak_kv_pipe_get,bkey_nval}}},{trace,{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}},{log,{sink,{fitting,<0.20060.24>,#Ref<0.0.5.38318>,sink,undefined}}}]},{fittings,[{{fitting,<0.20950.24>,#Ref<0.0.5.38318>,{chash,key_of},1},#Ref<0.0.5.38339>}]}]}]
>>>>>>>     trap_exit: false
>>>>>>>     status: running
>>>>>>>     heap_size: 987
>>>>>>>     stack_size: 24
>>>>>>>     reductions: 251
>>>>>>>   neighbours:
>>>>>>> 2012-03-13 18:24:44 =SUPERVISOR REPORT====
>>>>>>>      Supervisor: {local,riak_pipe_builder_sup}
>>>>>>>      Context:    child_terminated
>>>>>>>      Reason:     {sink_died,shutdown}
>>>>>>>      Offender:
>>>>>>> [{pid,<0.20949.24>},{name,undefined},{mfargs,{riak_pipe_builder,start_link,undefined}},{restart_type,temporary},{shutdown,brutal_kill},{child_type,worker}]
>>>>>>>
>>>>>>>
>>>>>>> ==> /mnt/dataraid/log/console.log <==
>>>>>>> 2012-03-13 18:24:44.645 [error] <0.20949.24> CRASH REPORT Process
>>>>>>> <0.20949.24> with 0 neighbours crashed with reason: {sink_died,shutdown}
>>>>>>> 2012-03-13 18:24:44.652 [error] <0.175.0> Supervisor
>>>>>>> riak_pipe_builder_sup had child undefined started with
>>>>>>> {riak_pipe_builder,start_link,undefined} at <0.20949.24> exit with reason
>>>>>>> {sink_died,shutdown} in context child_terminated
>>>>>>> 2012-03-13 18:24:56.420 [info]
>>>>>>> <0.71.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc
>>>>>>> <0.32603.24>
>>>>>>> [{initial_call,{mochiweb_acceptor,init,3}},{almost_current_function,{gen,do_call,4}},{message_queue_len,0}]
>>>>>>> [{timeout,120},{old_heap_block_size,0},{heap_block_size,4181},{mbuf_size,0},{stack_size,40},{old_heap_size,0},{heap_size,1955}]
>>>>>>>
>>>>>>>
>>>>>>> ==> /mnt/dataraid/log/erlang.log.1 <==
>>>>>>> 18:24:44.634 [error] gen_fsm <0.20949.24> in state
>>>>>>> wait_pipeline_shutdown terminated with reason: {sink_died,shutdown}
>>>>>>> (riak at 1)1> 18:24:44.641 [error] Pipe worker startup failed:fitting
>>>>>>> was gone before startup
>>>>>>> (riak at 1)1>
>>>>>>> 18:24:44.645 [error] CRASH REPORT Process <0.20949.24> with 0
>>>>>>> neighbours crashed with reason: {sink_died,shutdown}
>>>>>>> (riak at 1)1> 18:24:44.652 [error] Supervisor riak_pipe_builder_sup
>>>>>>> had child undefined started with {riak_pipe_builder,start_link,undefined}
>>>>>>> at <0.20949.24> exit with reason {sink_died,shutdown} in context
>>>>>>> child_terminated
>>>>>>> (riak at 1)1> 18:24:56.420 [info] monitor long_gc <0.32603.24>
>>>>>>> [{initial_call,{mochiweb_acceptor,init,3}},{almost_current_function,{gen,do_call,4}},{message_queue_len,0}]
>>>>>>> [{timeout,120},{old_heap_block_size,0},{heap_block_size,4181},{mbuf_size,0},{stack_size,40},{old_heap_size,0},{heap_size,1955}]
>>>>>>> (riak at 1)1>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ==> /mnt/dataraid/log/error.log <==
>>>>>>> 2012-03-13 18:24:44.645 [error] <0.20949.24> CRASH REPORT Process
>>>>>>> <0.20949.24> with 0 neighbours crashed with reason: {sink_died,shutdown}
>>>>>>> 2012-03-13 18:24:44.652 [error] <0.175.0> Supervisor
>>>>>>> riak_pipe_builder_sup had child undefined started with
>>>>>>> {riak_pipe_builder,start_link,undefined} at <0.20949.24> exit with reason
>>>>>>> {sink_died,shutdown} in context child_terminated
>>>>>>>
>>>>>>> -Matthew
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 13, 2012 at 18:17, Matthew Tovbin <matthew at tovbin.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I got the same problem today on 1.1.0.
>>>>>>>>
>>>>>>>> As suggested, I updated all the nodes to 1.1.1. The error remains
>>>>>>>> the same:
>>>>>>>>
>>>>>>>> { stack: [Getter/Setter],
>>>>>>>>    arguments: undefined,
>>>>>>>>   type: undefined,
>>>>>>>>   message: 'HTTP error 500:
>>>>>>>> {"phase":0,"error":"[preflist_exhausted]","input":"{ok,{r_object ....."}}}
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthew
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 13, 2012 at 06:48, Jon Meredith <jmeredith at basho.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Arne,
>>>>>>>>>
>>>>>>>>> Could you try with 1.1.1, please?  We've fixed a number of bugs
>>>>>>>>> with how MapReduce handles errors that could explain some of what you are
>>>>>>>>> seeing.
>>>>>>>>>
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Mar 13, 2012 at 12:55 AM, Arne Claassen <
>>>>>>>>> arne+riak at claassen.net> wrote:
>>>>>>>>>
>>>>>>>>>> I'm running 1.1.0 built from source on CentOS 6. I set up two
>>>>>>>>>> nodes on localhost following the instructions on the Basho site for
>>>>>>>>>> testing, and everything was going fine until I needed to fetch all
>>>>>>>>>> items in a bucket.
>>>>>>>>>>
>>>>>>>>>> For this I was using node.js with riak-js and its db.getAll call,
>>>>>>>>>> which from the debug logs looks like it uses /mapred.
>>>>>>>>>>
>>>>>>>>>> The bucket only has about 1200 items in it, each a JSON payload
>>>>>>>>>> of about 3KB. Initially I kept getting inconsistent results:
>>>>>>>>>> sometimes the data would come back, sometimes it would be
>>>>>>>>>> undefined. So I kept running it, trying to see if there was a
>>>>>>>>>> pattern, adding more debugging code, etc., but still only making
>>>>>>>>>> about one request every couple of seconds. Then it just stopped
>>>>>>>>>> working altogether, and now I always get:
>>>>>>>>>>
>>>>>>>>>> { [Error: HTTP error 500:
>>>>>>>>>> {"phase":0,"error":"[preflist_exhausted]","input":"{ok,{r_object,<<\"tweets_15368536\">>,<<\"178528155714330620\">>,[{r_content,{dict,6,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<\"Links\">>]],[],[],[],[],[],[],[],[[<<\"content-type\">>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110],[<<\"X-Riak-VTag\">>,49,76,115,54,70,75,74,71,119,65,111,70,88,73,117,84,98,88,79,69,52,116]],[[<<\"index\">>]],[],[[<<\"X-Riak-Last-Modified\">>|{1331,487523,218968}]],[],[[<<\"X-Riak-Meta\">>]]}}},<<\"{\"coordinates\":null,\"create...\">>}],...},...}","type":"forward_preflist","stack":"[]"}]
>>>>>>>>>> statusCode: 500 }
>>>>>>>>>>
>>>>>>>>>> in the console. I haven't tried restarting Riak, just because I'd
>>>>>>>>>> rather find out what's happening than have it go away and come back
>>>>>>>>>> later when I don't expect it.
>>>>>>>>>>
>>>>>>>>>> Any suggestions on how to troubleshoot this? I can still access
>>>>>>>>>> items individually, and I can even list the keys via ?keys=true;
>>>>>>>>>> it's just the MapReduce that seems to be busted.
>>>>>>>>>>
>>>>>>>>>> cheers,
>>>>>>>>>> arne
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jon Meredith
>>>>>>>>> Platform Engineering Manager
>>>>>>>>> Basho Technologies, Inc.
>>>>>>>>> jmeredith at basho.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jon Meredith
>>>>> Platform Engineering Manager
>>>>> Basho Technologies, Inc.
>>>>> jmeredith at basho.com
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Jon Meredith
>>> Platform Engineering Manager
>>> Basho Technologies, Inc.
>>> jmeredith at basho.com
>>>
>>>
>>
>
>
> --
> Jon Meredith
> Platform Engineering Manager
> Basho Technologies, Inc.
> jmeredith at basho.com
>
>