preflist_exhausted with map-reduce and riak-js

Jon Meredith jmeredith at basho.com
Tue Apr 3 19:56:22 EDT 2012


That's very interesting: you don't have any Javascript VMs running. Here's
an example on a single-node system after a restart - I've configured it with
2 map VMs and 3 reduce VMs just to get easy numbers.

(dev1 at 127.0.0.1)1>
rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_map]).
{[2],[]}
(dev1 at 127.0.0.1)2>
(dev1 at 127.0.0.1)2>
rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_reduce]).
{[3],[]}
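
For reference, those pool counts come from the JS VM settings in app.config.
A minimal sketch of just the relevant riak_kv keys (assuming the 2/3 values
used above; everything else omitted):

  {riak_kv, [
             %% Javascript VMs available for map phases
             {map_js_vm_count, 2},
             %% Javascript VMs available for reduce phases
             {reduce_js_vm_count, 3}
            ]}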

Not having any VMs available would explain your preflist_exhausted issue.

Could you restart one of the nodes and check the counts, preferably before
running any JS MapReduce jobs? Then, if they are still zero, we'll have to
work out why they aren't starting up.
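
For example, something along these lines (a sketch; node names vary with
your setup, and with the 128/128 settings from your app.config the counts
should match - zeros here mean the VMs never started):

  $ riak stop && riak start
  $ riak attach
  (dev1 at 127.0.0.1)1> riak_kv_js_manager:pool_size(riak_kv_js_map).
  128
  (dev1 at 127.0.0.1)2> riak_kv_js_manager:pool_size(riak_kv_js_reduce).
  128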

Cheers, Jon.


On Tue, Apr 3, 2012 at 5:10 PM, Matthew Tovbin <matthew at tovbin.com> wrote:

> Jon,
>
> We don't have any MR jobs running at all :)) Here you go:
>
> 1>rpc:multicall(supervisor, count_children, [riak_pipe_builder_sup]).
> {[[{specs,1},{active,0},{supervisors,0},{workers,0}],
>   [{specs,1},{active,0},{supervisors,0},{workers,0}],
>   [{specs,1},{active,0},{supervisors,0},{workers,0}]],
>  []}
> 2>rpc:multicall(supervisor, count_children, [riak_pipe_fitting_sup]).
> {[[{specs,1},{active,0},{supervisors,0},{workers,0}],
>   [{specs,1},{active,0},{supervisors,0},{workers,0}],
>   [{specs,1},{active,0},{supervisors,0},{workers,0}]],
>  []}
> 3>rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_map]).
> {[0,0,0],[]}
> 4>rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_reduce]).
> {[0,0,0],[]}
>
> -Matthew
>
>
>
> On Tue, Apr 3, 2012 at 14:00, Jon Meredith <jmeredith at basho.com> wrote:
>
>> Hi Matthew,
>>
>> Thanks for the logs.  Do you know how many map/reduce jobs you run
>> concurrently?  Is this a load-based issue, or do you run into problems
>> regardless of load?
>>
>> Running this from the Erlang console should give an idea of how many
>> pipes, fittings, and JS VMs you are running concurrently ('riak attach' to
>> start, ^D to exit):
>>
>>   rpc:multicall(supervisor, count_children, [riak_pipe_builder_sup]).
>>   rpc:multicall(supervisor, count_children, [riak_pipe_fitting_sup]).
>>   rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_map]).
>>   rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_reduce]).
>>
>> For the two supervisors, add up the active counts; for the two managers,
>> the results give how many JS VMs are running per node.
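>>
>> For instance, to total the active builder processes across all nodes in
>> one paste at the attach console (a sketch built on the same calls):
>>
>>   {Builders, _} = rpc:multicall(supervisor, count_children,
>>                                 [riak_pipe_builder_sup]),
>>   lists:sum([proplists:get_value(active, B) || B <- Builders]).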
>>
>> If the number of concurrent requests is low, maybe we can find a
>> higher-bandwidth way of investigating.
>>
>> Jon
>>
>> On Mon, Apr 2, 2012 at 4:21 PM, Matthew Tovbin <matthew at tovbin.com> wrote:
>>
>>> Unfortunately, yes, Jon.
>>>
>>> Here are my log files after setting 128/128 for 'map_js_vm_count' and
>>> 'reduce_js_vm_count' accordingly and restarting all the nodes (see
>>> attached app.config).
>>>
>>> If you need any other information provided or tests performed, please
>>> let me know.
>>>
>>> *-Console.log:*
>>> 2012-04-02 15:16:56.848 [notice]
>>> <0.22138.162>@riak_kv_js_manager:blocking_dispatch:247 JS call failed: All
>>> VMs are busy.
>>> 2012-04-02 15:16:56.853 [error] <0.22121.162> gen_fsm <0.22121.162> in
>>> state wait_pipeline_shutdown terminated with reason: {sink_died,shutdown}
>>> 2012-04-02 15:16:56.860 [error] <0.22121.162> CRASH REPORT Process
>>> <0.22121.162> with 0 neighbours crashed with reason: {sink_died,shutdown}
>>> 2012-04-02 15:16:56.862 [error] <0.1451.0> Supervisor
>>> riak_pipe_builder_sup had child undefined started with
>>> {riak_pipe_builder,start_link,undefined} at <0.22121.162> exit with reason
>>> {sink_died,shutdown} in context child_terminated
>>>
>>> *-Error.log:*
>>> ** State machine <0.22121.162> terminating
>>> ** Last message in was
>>> {'DOWN',#Ref<0.0.33.48784>,process,<0.22119.162>,shutdown}
>>> ** When State == wait_pipeline_shutdown
>>> **      Data  ==
>>> {state,[{sink,{fitting,<0.22119.162>,#Ref<0.0.33.48768>,{riak_kv_pipe_get,bkey_chash},{riak_kv_pipe_get,bkey_nval}}},{trace,{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}},{log,{sink,{fitting,<0.21554.162>,#Ref<0.0.33.48768>,sink,undefined}}}],{pipe,<0.22121.162>,[{listkeys,{fitting,<0.22122.162>,#Ref<0.0.33.48768>,{chash,key_of},1}}],{fitting,<0.22119.162>,#Ref<0.0.33.48768>,{riak_kv_pipe_get,bkey_chash},{riak_kv_pipe_get,bkey_nval}}},[{{fitting,<0.22122.162>,#Ref<0.0.33.48768>,{chash,key_of},1},#Ref<0.0.33.48787>}],#Ref<0.0.33.48784>}
>>> ** Reason for termination =
>>> ** {sink_died,shutdown}
>>> 2012-04-02 15:16:56 =CRASH REPORT====
>>>   crasher:
>>>     initial call: riak_pipe_builder:init/1
>>>     pid: <0.22121.162>
>>>     registered_name: []
>>>     exception exit: {sink_died,shutdown}
>>>       in function  gen_fsm:terminate/7
>>>        in call from proc_lib:init_p_do_apply/3
>>>     ancestors: [riak_pipe_builder_sup,riak_pipe_sup,<0.1448.0>]
>>>     messages:
>>> [{'DOWN',#Ref<0.0.33.48787>,process,<0.22122.162>,shutdown}]
>>>     links: [<0.1451.0>]
>>>     dictionary:
>>> [{eunit,[{module,riak_pipe_builder},{ref,#Ref<0.0.33.48768>},{spec,[{fitting_spec,listkeys,riak_kv_pipe_listkeys,undefined,{chash,key_of},1,64}]},{options,[{sink,{fitting,<0.22119.162>,#Ref<0.0.33.48768>,{riak_kv_pipe_get,bkey_chash},{riak_kv_pipe_get,bkey_nval}}},{trace,{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}},{log,{sink,{fitting,<0.21554.162>,#Ref<0.0.33.48768>,sink,undefined}}}]},{fittings,[{{fitting,<0.22122.162>,#Ref<0.0.33.48768>,{chash,key_of},1},#Ref<0.0.33.48787>}]}]}]
>>>      trap_exit: false
>>>     status: running
>>>     heap_size: 987
>>>     stack_size: 24
>>>     reductions: 251
>>>   neighbours:
>>> 2012-04-02 15:16:56 =SUPERVISOR REPORT====
>>>      Supervisor: {local,riak_pipe_builder_sup}
>>>      Context:    child_terminated
>>>      Reason:     {sink_died,shutdown}
>>>      Offender:
>>> [{pid,<0.22121.162>},{name,undefined},{mfargs,{riak_pipe_builder,start_link,undefined}},{restart_type,temporary},{shutdown,brutal_kill},{child_type,worker}]
>>>
>>>
>>>
>>>
>>>
>>>
>>> -Matthew
>>>
>>>
>>>
>>> On Mon, Apr 2, 2012 at 12:27, Jon Meredith <jmeredith at basho.com> wrote:
>>>
>>>> Very strange.
>>>>
>>>> Are you seeing the
>>>>
>>>> 16:40:50.037 [notice] JS call failed: All VMs are busy.
>>>>
>>>> message in your console.log files on any of the nodes in your cluster?
>>>>
>>>> Jon
>>>>
>>>>
>>>> On Mon, Apr 2, 2012 at 11:48 AM, Matthew Tovbin <matthew at tovbin.com> wrote:
>>>>
>>>>> Jon,
>>>>>
>>>>> Thanks for the reply. I tried increasing 'map_js_vm_count' and
>>>>> 'reduce_js_vm_count' on the cluster to the values you suggested (24/18)
>>>>> and even larger ones (32/24 and 128/128), but with no success - I still
>>>>> get the same error in the console log of the node.
>>>>>
>>>>> What are we missing?!
>>>>>
>>>>> -Matthew
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 30, 2012 at 16:19, Jon Meredith <jmeredith at basho.com> wrote:
>>>>>
>>>>>> Hi list,
>>>>>>
>>>>>> I spent some time looking at this issue today.  I suspect it is due
>>>>>> to all the Javascript VMs inside Riak being busy with many parallel
>>>>>> MapReduce jobs.
>>>>>>
>>>>>> If you're seeing this issue, please check for this message in the
>>>>>> console.log file:
>>>>>>
>>>>>> 16:40:50.037 [notice] JS call failed: All VMs are busy.
>>>>>>
>>>>>> I've filed https://github.com/basho/riak_pipe/issues/44 to track
>>>>>> improvements to error reporting.
>>>>>>
>>>>>> If you are seeing the log message, the number of VMs allocated for
>>>>>> map and reduce tasks can be configured in app.config - the defaults
>>>>>> are 8 for map and 6 for reduce. Increasing them will use more memory
>>>>>> but should reduce the incidence of the error message.
>>>>>>
>>>>>> {riak_kv, [
>>>>>>             ...
>>>>>>             {map_js_vm_count, 24},
>>>>>>             {reduce_js_vm_count, 18},
>>>>>>             ...]}
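>>>>>>
>>>>>> After changing the values and restarting, one way to confirm the new
>>>>>> pool sizes took effect is from 'riak attach' (a sketch):
>>>>>>
>>>>>>   rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_map]).
>>>>>>   rpc:multicall(riak_kv_js_manager,pool_size,[riak_kv_js_reduce]).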
>>>>>>
>>>>>> If you were affected by it and changing this does not resolve your
>>>>>> issue, I'll keep digging.
>>>>>>
>>>>>> Cheers, Jon.
>>>>>>
>>>>>> On Thu, Mar 29, 2012 at 10:29 AM, Matthew Tovbin <matthew at tovbin.com> wrote:
>>>>>>
>>>>>>> Guys,
>>>>>>>
>>>>>>> Any updates on the issue?!
>>>>>>>
>>>>>>> -Matthew
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 13, 2012 at 18:29, Matthew Tovbin <matthew at tovbin.com> wrote:
>>>>>>>
>>>>>>>> Here is a log from one of the servers:
>>>>>>>>
>>>>>>>> ==> /mnt/dataraid/log/crash.log <==
>>>>>>>> 2012-03-13 18:24:44 =CRASH REPORT====
>>>>>>>>   crasher:
>>>>>>>>     initial call: riak_pipe_builder:init/1
>>>>>>>>     pid: <0.20949.24>
>>>>>>>>     registered_name: []
>>>>>>>>     exception exit: {sink_died,shutdown}
>>>>>>>>       in function  gen_fsm:terminate/7
>>>>>>>>       in call from proc_lib:init_p_do_apply/3
>>>>>>>>     ancestors: [riak_pipe_builder_sup,riak_pipe_sup,<0.172.0>]
>>>>>>>>     messages:
>>>>>>>> [{'DOWN',#Ref<0.0.5.38339>,process,<0.20950.24>,shutdown}]
>>>>>>>>     links: [<0.175.0>]
>>>>>>>>     dictionary:
>>>>>>>> [{eunit,[{module,riak_pipe_builder},{ref,#Ref<0.0.5.38318>},{spec,[{fitting_spec,listkeys,riak_kv_pipe_listkeys,undefined,{chash,key_of},1,64}]},{options,[{sink,{fitting,<0.20947.24>,#Ref<0.0.5.38318>,{riak_kv_pipe_get,bkey_chash},{riak_kv_pipe_get,bkey_nval}}},{trace,{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}},{log,{sink,{fitting,<0.20060.24>,#Ref<0.0.5.38318>,sink,undefined}}}]},{fittings,[{{fitting,<0.20950.24>,#Ref<0.0.5.38318>,{chash,key_of},1},#Ref<0.0.5.38339>}]}]}]
>>>>>>>>     trap_exit: false
>>>>>>>>     status: running
>>>>>>>>     heap_size: 987
>>>>>>>>     stack_size: 24
>>>>>>>>     reductions: 251
>>>>>>>>   neighbours:
>>>>>>>> 2012-03-13 18:24:44 =SUPERVISOR REPORT====
>>>>>>>>      Supervisor: {local,riak_pipe_builder_sup}
>>>>>>>>      Context:    child_terminated
>>>>>>>>      Reason:     {sink_died,shutdown}
>>>>>>>>      Offender:
>>>>>>>> [{pid,<0.20949.24>},{name,undefined},{mfargs,{riak_pipe_builder,start_link,undefined}},{restart_type,temporary},{shutdown,brutal_kill},{child_type,worker}]
>>>>>>>>
>>>>>>>>
>>>>>>>> ==> /mnt/dataraid/log/console.log <==
>>>>>>>> 2012-03-13 18:24:44.645 [error] <0.20949.24> CRASH REPORT Process
>>>>>>>> <0.20949.24> with 0 neighbours crashed with reason: {sink_died,shutdown}
>>>>>>>> 2012-03-13 18:24:44.652 [error] <0.175.0> Supervisor
>>>>>>>> riak_pipe_builder_sup had child undefined started with
>>>>>>>> {riak_pipe_builder,start_link,undefined} at <0.20949.24> exit with reason
>>>>>>>> {sink_died,shutdown} in context child_terminated
>>>>>>>> 2012-03-13 18:24:56.420 [info]
>>>>>>>> <0.71.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc
>>>>>>>> <0.32603.24>
>>>>>>>> [{initial_call,{mochiweb_acceptor,init,3}},{almost_current_function,{gen,do_call,4}},{message_queue_len,0}]
>>>>>>>> [{timeout,120},{old_heap_block_size,0},{heap_block_size,4181},{mbuf_size,0},{stack_size,40},{old_heap_size,0},{heap_size,1955}]
>>>>>>>>
>>>>>>>>
>>>>>>>> ==> /mnt/dataraid/log/erlang.log.1 <==
>>>>>>>> 18:24:44.634 [error] gen_fsm <0.20949.24> in state
>>>>>>>> wait_pipeline_shutdown terminated with reason: {sink_died,shutdown}
>>>>>>>> (riak at 1)1> 18:24:44.641 [error] Pipe worker startup failed:fitting
>>>>>>>> was gone before startup
>>>>>>>> (riak at 1)1>
>>>>>>>> 18:24:44.645 [error] CRASH REPORT Process <0.20949.24> with 0
>>>>>>>> neighbours crashed with reason: {sink_died,shutdown}
>>>>>>>> (riak at 1)1> 18:24:44.652 [error] Supervisor riak_pipe_builder_sup
>>>>>>>> had child undefined started with {riak_pipe_builder,start_link,undefined}
>>>>>>>> at <0.20949.24> exit with reason {sink_died,shutdown} in context
>>>>>>>> child_terminated
>>>>>>>> (riak at 1)1> 18:24:56.420 [info] monitor long_gc <0.32603.24>
>>>>>>>> [{initial_call,{mochiweb_acceptor,init,3}},{almost_current_function,{gen,do_call,4}},{message_queue_len,0}]
>>>>>>>> [{timeout,120},{old_heap_block_size,0},{heap_block_size,4181},{mbuf_size,0},{stack_size,40},{old_heap_size,0},{heap_size,1955}]
>>>>>>>> (riak at 1)1>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ==> /mnt/dataraid/log/error.log <==
>>>>>>>> 2012-03-13 18:24:44.645 [error] <0.20949.24> CRASH REPORT Process
>>>>>>>> <0.20949.24> with 0 neighbours crashed with reason: {sink_died,shutdown}
>>>>>>>> 2012-03-13 18:24:44.652 [error] <0.175.0> Supervisor
>>>>>>>> riak_pipe_builder_sup had child undefined started with
>>>>>>>> {riak_pipe_builder,start_link,undefined} at <0.20949.24> exit with reason
>>>>>>>> {sink_died,shutdown} in context child_terminated
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthew
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 13, 2012 at 18:17, Matthew Tovbin <matthew at tovbin.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I got the same problem today on 1.1.0.
>>>>>>>>>
>>>>>>>>> As suggested, I updated all the nodes to 1.1.1. The error remains
>>>>>>>>> the same:
>>>>>>>>>
>>>>>>>>> { stack: [Getter/Setter],
>>>>>>>>>    arguments: undefined,
>>>>>>>>>   type: undefined,
>>>>>>>>>   message: 'HTTP error 500:
>>>>>>>>> {"phase":0,"error":"[preflist_exhausted]","input":"{ok,{r_object ....."}}}
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthew
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Mar 13, 2012 at 06:48, Jon Meredith <jmeredith at basho.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Arne,
>>>>>>>>>>
>>>>>>>>>> Could you try with 1.1.1, please?  We've fixed a number of bugs
>>>>>>>>>> with how MapReduce handles errors that could explain some of what
>>>>>>>>>> you are seeing.
>>>>>>>>>>
>>>>>>>>>> Jon
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 13, 2012 at 12:55 AM, Arne Claassen <
>>>>>>>>>> arne+riak at claassen.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm running 1.1.0 built from source on CentOS 6. I set up two
>>>>>>>>>>> nodes on localhost following the instructions on the Basho site for
>>>>>>>>>>> testing, and everything was going fine until I needed to fetch all
>>>>>>>>>>> items in a bucket.
>>>>>>>>>>>
>>>>>>>>>>> For this I was using node.js with riak-js and its db.getAll
>>>>>>>>>>> call, which from the debug logs looks like it uses /mapred.
>>>>>>>>>>>
>>>>>>>>>>> The bucket only has about 1200 items in it, each a JSON payload
>>>>>>>>>>> of about 3KB. Initially, I'd keep getting inconsistent results:
>>>>>>>>>>> sometimes the data would come back, sometimes it would be
>>>>>>>>>>> undefined. So I kept running it, trying to see if there was a
>>>>>>>>>>> pattern, adding more debugging code, etc., but it was still only
>>>>>>>>>>> about a request every couple of seconds. Then it just stopped
>>>>>>>>>>> working altogether and now I always get:
>>>>>>>>>>>
>>>>>>>>>>> { [Error: HTTP error 500:
>>>>>>>>>>> {"phase":0,"error":"[preflist_exhausted]","input":"{ok,{r_object,<<\"tweets_15368536\">>,<<\"178528155714330620\">>,[{r_content,{dict,6,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<\"Links\">>]],[],[],[],[],[],[],[],[[<<\"content-type\">>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110],[<<\"X-Riak-VTag\">>,49,76,115,54,70,75,74,71,119,65,111,70,88,73,117,84,98,88,79,69,52,116]],[[<<\"index\">>]],[],[[<<\"X-Riak-Last-Modified\">>|{1331,487523,218968}]],[],[[<<\"X-Riak-Meta\">>]]}}},<<\"{\"coordinates\":null,\"create...\">>}],...},...}","type":"forward_preflist","stack":"[]"}]
>>>>>>>>>>> statusCode: 500 }
>>>>>>>>>>>
>>>>>>>>>>> in the console. I haven't tried restarting Riak, just because
>>>>>>>>>>> I'd rather find out what's happening than have it go away and come
>>>>>>>>>>> back later when I don't expect it.
>>>>>>>>>>>
>>>>>>>>>>> Any suggestions on how to troubleshoot this? I can still access
>>>>>>>>>>> items individually, and I can even list the keys via ?keys=true;
>>>>>>>>>>> it's just the map reduce that seems to be busted.
>>>>>>>>>>>
>>>>>>>>>>> cheers,
>>>>>>>>>>> arne
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> riak-users mailing list
>>>>>>>>>>> riak-users at lists.basho.com
>>>>>>>>>>>
>>>>>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jon Meredith
>>>>>>>>>> Platform Engineering Manager
>>>>>>>>>> Basho Technologies, Inc.
>>>>>>>>>> jmeredith at basho.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> riak-users mailing list
>>>>>>>>>> riak-users at lists.basho.com
>>>>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jon Meredith
>>>>>> Platform Engineering Manager
>>>>>> Basho Technologies, Inc.
>>>>>> jmeredith at basho.com
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jon Meredith
>>>> Platform Engineering Manager
>>>> Basho Technologies, Inc.
>>>> jmeredith at basho.com
>>>>
>>>>
>>>
>>
>>
>> --
>> Jon Meredith
>> Platform Engineering Manager
>> Basho Technologies, Inc.
>> jmeredith at basho.com
>>
>>
>


-- 
Jon Meredith
Platform Engineering Manager
Basho Technologies, Inc.
jmeredith at basho.com