Warning "Can not start proc_lib:init_p"

Evan Vigil-McClanahan emcclanahan at basho.com
Wed Apr 3 12:33:16 EDT 2013


Again, all of these things are signs of large objects, so if you could
track the object_size stats on the cluster, I think that we might see
something.  Even if you have no monitoring, a simple shell script
curling /stats/ on each node once a minute should do the job for a day
or two.
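
Something along these lines would do the job (a rough sketch: it assumes the
HTTP listener is on localhost:8098 and that the interesting fields are the
node_get_fsm_objsize_* entries in /stats; adjust host, port, and field names
to whatever your app.config actually uses):

    #!/bin/sh
    # Poll Riak's HTTP stats endpoint once a minute and keep only the
    # object-size fields, one timestamped sample per minute.
    while true; do
        date -u +%FT%TZ >> /tmp/riak_objsize.log
        curl -s http://localhost:8098/stats \
          | tr ',' '\n' \
          | grep objsize >> /tmp/riak_objsize.log
        sleep 60
    done

Left running on each node for a day or two, that should be enough to line up
the 95th/100th percentile object sizes against the times the drops happen.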

On Wed, Apr 3, 2013 at 9:29 AM, Ingo Rockel
<ingo.rockel at bluelionmobile.com> wrote:
> We just had it again (around this time of the day we have our highest user
> activity).
>
> I will set +P to 131072 tomorrow; is there anything else I should check or change?
>
> What about this memory-high-watermark warning, which I get sporadically?
>
> Ingo
>
> On 03.04.2013 17:57, Evan Vigil-McClanahan wrote:
>
>> As for +P, it's been raised in R16 (which is what the current man page
>> documents); on R15 it's only 32k.
>>
>> The behavior you're describing does sound like a very large object
>> getting put into the cluster (which may cause backups and push you up
>> against the process limit, could have caused scheduler collapse on 1.2,
>> etc.).
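
For reference, raising the limit is a one-line change in vm.args (a sketch,
using for example the 131072 value Ingo plans to set; the node has to be
restarted to pick it up):

    ## Raise the Erlang process limit (the R15 default is 32768)
    +P 131072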
>>
>> On Wed, Apr 3, 2013 at 8:39 AM, Ingo Rockel
>> <ingo.rockel at bluelionmobile.com> wrote:
>>>
>>> Evan,
>>>
>>> sys_process_count is somewhere between 5k and 11k on the nodes right now.
>>> Concerning your suggested +P config: according to the Erlang docs, the
>>> default for this param is already 262144, so setting it to 65536 would in
>>> fact lower it?
>>>
>>> We chose the ring size to be able to handle growth, which was the main
>>> reason to switch from MySQL to NoSQL/Riak. We have 12 nodes, so about 86
>>> vnodes per node.
>>>
>>> No, we don't monitor object sizes. The majority of objects are very small
>>> (below 200 bytes), but we have objects storing references to these small
>>> objects, which might grow to a few megabytes in size; most of these are
>>> paged though and should not exceed one megabyte. Only one type is not
>>> paged (for implementation reasons).
>>>
>>> The outgoing/incoming traffic is constantly around 100 Mbit; when the
>>> performance drops happen, we suddenly see spikes of up to 1 GBit. These
>>> spikes consistently occur on three nodes for as long as the performance
>>> drop lasts.
>>>
>>> Ingo
>>>
>>> On 03.04.2013 17:12, Evan Vigil-McClanahan wrote:
>>>
>>>> Ingo,
>>>>
>>>> riak-admin status | grep sys_process_count
>>>>
>>>> will tell you how many processes are running.  The default process
>>>> limit in Erlang is a little low, and we'd suggest raising it
>>>> (especially with your extra-large ring_size).  Erlang processes are
>>>> cheap, so 65535 or even double that will be fine.
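
If you want to double-check the limit actually in effect, you can also ask
the VM directly from riak attach (a quick sketch; detach with Ctrl-D rather
than q(), since q() would stop the node):

    %% inside riak attach
    erlang:system_info(process_limit).   %% configured maximum (+P)
    erlang:system_info(process_count).   %% processes currently alive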
>>>>
>>>> Busy dist ports are still worrying.  Are you monitoring object sizes?
>>>> Are there any spikes there associated with performance drops?
>>>>
>>>> On Wed, Apr 3, 2013 at 8:03 AM, Ingo Rockel
>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>
>>>>>
>>>>> Hi Evan,
>>>>>
>>>>> I set +swt to very_low and +zdbbl to 64MB; setting these params helped a
>>>>> lot in reducing the busy_dist_port and "Monitor got {suppressed,..."
>>>>> messages. But when the performance of the cluster suddenly drops, we
>>>>> still see these messages.
>>>>>
>>>>> The cluster was updated to 1.3 in the meantime.
>>>>>
>>>>> The eleveldb section:
>>>>>
>>>>>    %% eLevelDB Config
>>>>>    {eleveldb, [
>>>>>                {data_root, "/var/lib/riak/leveldb"},
>>>>>                {cache_size, 33554432},
>>>>>                {write_buffer_size_min, 67108864}, %% 64 MB in bytes
>>>>>                {write_buffer_size_max, 134217728}, %% 128 MB in bytes
>>>>>                {max_open_files, 4000}
>>>>>               ]},
>>>>>
>>>>> The ring size is 1024 and the machines have 48GB of memory. Concerning
>>>>> the params from vm.args:
>>>>>
>>>>> -env ERL_MAX_PORTS 4096
>>>>> -env ERL_MAX_ETS_TABLES 8192
>>>>>
>>>>> +P isn't set
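
A rough back-of-the-envelope from the numbers above, assuming each vnode
keeps its own leveldb cache and write buffer: with a ring size of 1024 on
12 nodes that is about 86 vnodes per node, each with a 32MB block cache
plus a write buffer somewhere between 64 and 128MB. That works out to
roughly 86 * (32 + ~96) MB, or around 11GB of cache/buffer memory per node
before file handles and the OS page cache, so the 48GB machines have
headroom but not an unlimited amount.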
>>>>>
>>>>> Ingo
>>>>>
>>>>> On 03.04.2013 16:53, Evan Vigil-McClanahan wrote:
>>>>>
>>>>>> For your prior mail, I thought that someone had answered.  Our initial
>>>>>> suggestion was to add +swt very_low to your vm.args, as well as the
>>>>>> +zdbbl setting that Jon recommended in the list post you pointed to.
>>>>>> If those help, moving to 1.3 should help more.
>>>>>>
>>>>>> Other limits in vm.args that can cause problems are +P, ERL_MAX_PORTS,
>>>>>> and  ERL_MAX_ETS_TABLES.  Are any of these set?  If so, to what?
>>>>>>
>>>>>> Can you also paste the eleveldb section of your app.config?
>>>>>>
>>>>>> On Wed, Apr 3, 2013 at 7:41 AM, Ingo Rockel
>>>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>>>
>>>>>>> Hi Evan,
>>>>>>>
>>>>>>> I'm not sure, I find a lot of these:
>>>>>>>
>>>>>>> 2013-03-30 23:27:52.992 [error] <0.8036.323>@riak_api_pb_server:handle_info:141
>>>>>>> Unrecognized message {22243034,{error,timeout}}
>>>>>>>
>>>>>>> and, at around the same time, some messages of the kind below get
>>>>>>> logged (although this one has a different timestamp):
>>>>>>>
>>>>>>> 2013-03-30 23:27:53.056 [error] <0.9457.323>@riak_kv_console:status:178
>>>>>>> Status failed error:terminated
>>>>>>>
>>>>>>> Ingo
>>>>>>>
>>>>>>> On 03.04.2013 16:24, Evan Vigil-McClanahan wrote:
>>>>>>>
>>>>>>>> Resending to the list:
>>>>>>>>
>>>>>>>> Ingo,
>>>>>>>>
>>>>>>>> That is an indication that the protocol buffers server can't spawn a
>>>>>>>> put fsm, which means that a put cannot be done for some reason or
>>>>>>>> another.  Are there any other messages that appear around this time
>>>>>>>> that might indicate why?
>>>>>>>>
>>>>>>>> On Wed, Apr 3, 2013 at 12:09 AM, Ingo Rockel
>>>>>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> we have some performance issues with our Riak cluster: from time to
>>>>>>>>> time we see a sudden drop in performance (I already asked the list
>>>>>>>>> about this, but no one had an idea). Not at the same time, but on the
>>>>>>>>> problematic nodes, we also see a lot of these messages from time to
>>>>>>>>> time:
>>>>>>>>>
>>>>>>>>> 2013-04-02 21:41:11.173 [warning] <0.25646.475> ** Can not start
>>>>>>>>> proc_lib:init_p
>>>>>>>>> ,[<0.14556.474>,[<0.9519.474>,riak_api_pb_sup,riak_api_sup,<0.1291.0>],riak_kv_p
>>>>>>>>> ut_fsm,start_link,[{raw,65032165,<0.9519.474>},{r_object,<<109>>,<<77,115,124,49
>>>>>>>>> ,53,55,57,56,57,56,50,124,49,51,54,52,57,51,49,54,49,49,53,49,50,52,53,54>>,[{r_
>>>>>>>>> content,{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>>>>>>>>> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},<<>>}],[],{dict,2,16,16,8,8
>>>>>>>>> 0,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[]
>>>>>>>>> ,[],[],[[<<99,111,110,116,101,110,116,45,116,121,112,101>>,97,112,112,108,105,99
>>>>>>>>> ,97,116,105,111,110,47,106,115,111,110]],[],[],[],[],[[<<99,104,97,114,115,101,1
>>>>>>>>> 16>>,85,84,70,45,56]]}}},<<123,34,115,116,34,58,50,44,34,116,34,58,49,44,34,99,3
>>>>>>>>> 4,58,34,66,117,116,32,115,104,101,32,105,115,32,103,111,110,101,44,32,110,32,101
>>>>>>>>> ,118,101,110,32,116,104,111,117,103,104,32,105,109,32,110,111,116,32,105,110,32,
>>>>>>>>> 117,114,32,99,105,116,121,32,105,32,108,111,118,101,32,117,32,110,100,32,105,32,
>>>>>>>>> 109,101,97,110,32,105,116,32,58,39,40,34,44,34,114,34,58,49,52,51,52,54,52,51,57
>>>>>>>>> ,44,34,115,34,58,49,53,55,57,56,57,56,50,44,34,99,116,34,58,49,51,54,52,57,51,49
>>>>>>>>> ,54,49,49,53,49,50,44,34,97,110,34,58,102,97,108,115,101,44,34,115,107,34,58,49,
>>>>>>>>> 51,54,52,57,51,49,54,49,49,53,49,50,52,53,54,44,34,115,117,34,58,48,125>>},[{tim
>>>>>>>>> eout,60000}]]] on 'riak at 172.22.3.12' **
>>>>>>>>>
>>>>>>>>> Can anyone explain to me what these messages mean and if I need to do
>>>>>>>>> something about it? Could these messages be in any way related to the
>>>>>>>>> performance issues?
>>>>>>>>>
>>>>>>>>> Ingo
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> riak-users mailing list
>>>>>>>>> riak-users at lists.basho.com
>>>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
>
> --
> Software Architect
>
> Blue Lion mobile GmbH
> Tel. +49 (0) 221 788 797 14
> Fax. +49 (0) 221 788 797 19
> Mob. +49 (0) 176 24 87 30 89
>
> ingo.rockel at bluelionmobile.com
> qeep: Hefferwolf
>
> www.bluelionmobile.com
> www.qeep.net



