Warning "Can not start proc_lib:init_p"

Ingo Rockel ingo.rockel at bluelionmobile.com
Thu Apr 4 04:58:36 EDT 2013


Hi Evan,

Thanks for all the information! I adjusted the leveldb config as suggested,
except for the cache, which I reduced to 16 MB; keeping it above the
default helped a lot, at least during load testing. I also added +P 130072
to vm.args. The changes will be applied to the Riak nodes within the next
few hours.
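
For reference, here is a rough sketch of the arithmetic behind Evan's warning
that per-vnode settings add up beyond physical memory. It uses the values from
the eleveldb section quoted below (ring size 1024, 12 nodes, 48 GB per
machine). The 4 MiB per open file figure is an assumption (a rule of thumb),
and the function names are made up for illustration, so treat the result as an
order-of-magnitude estimate only.

```python
# Rough per-node memory estimate for per-vnode eleveldb settings.
# Config values come from the app.config quoted in this thread; the 4 MiB
# per open file figure is an ASSUMPTION (rule of thumb), not a measurement.

MIB = 1024 * 1024
GIB = 1024 * MIB


def vnodes_per_node(ring_size: int, nodes: int) -> int:
    """Upper bound on vnodes per node when partitions are spread evenly."""
    return -(-ring_size // nodes)  # ceiling division


def eleveldb_worst_case_bytes(
    vnodes: int,
    cache_size: int,
    write_buffer_size_max: int,
    max_open_files: int,
    bytes_per_open_file: int = 4 * MIB,  # assumed rule of thumb
) -> int:
    """Worst case if every vnode fills its cache, write buffer and file cache."""
    per_vnode = (
        cache_size
        + write_buffer_size_max
        + max_open_files * bytes_per_open_file
    )
    return vnodes * per_vnode


if __name__ == "__main__":
    vnodes = vnodes_per_node(ring_size=1024, nodes=12)
    worst = eleveldb_worst_case_bytes(vnodes, 33554432, 134217728, 4000)
    print(f"{vnodes} vnodes/node, worst case {worst / GIB:.0f} GiB vs 48 GiB RAM")
```

Re-running the same function with a lower max_open_files and a smaller cache
shows how quickly the worst case shrinks; the open-file term dominates under
the assumption above.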

We have monitoring in place using Zabbix, but it doesn't include the object
sizes so far; they will be added today.
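
Until the monitoring covers it, the once-a-minute polling of /stats that Evan
suggested could look roughly like the sketch below. The stat names are the
object-size and process-count keys as I understand Riak's /stats output; port
8098 is assumed to be the HTTP port, and the host list is a placeholder.

```python
import json
import time
import urllib.request

# Stat names as exposed by Riak's /stats endpoint (assumed; adjust as needed).
OBJSIZE_KEYS = (
    "node_get_fsm_objsize_mean",
    "node_get_fsm_objsize_95",
    "node_get_fsm_objsize_99",
    "node_get_fsm_objsize_100",
)


def pick_stats(stats: dict, keys=OBJSIZE_KEYS + ("sys_process_count",)) -> dict:
    """Keep only the stats of interest; missing keys are reported as None."""
    return {key: stats.get(key) for key in keys}


def fetch_stats(host: str, port: int = 8098, timeout: float = 5.0) -> dict:
    """Fetch and decode the JSON document served at http://host:port/stats."""
    url = f"http://{host}:{port}/stats"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def poll(hosts, interval_seconds: float = 60.0, rounds: int = 1) -> None:
    """Print one line per host per round; a failed fetch is logged, not fatal."""
    for round_index in range(rounds):
        for host in hosts:
            try:
                picked = pick_stats(fetch_stats(host))
            except (OSError, ValueError) as error:
                picked = {"error": str(error)}
            print(time.strftime("%Y-%m-%d %H:%M:%S"), host, json.dumps(picked))
        if round_index + 1 < rounds:
            time.sleep(interval_seconds)
```

Calling poll(["riak-node-1", "riak-node-2"], rounds=1440) (hypothetical host
names) would cover about a day at one sample per minute; redirecting the
output to a file makes it easy to line up spikes with the performance drops.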

We double-checked the Linux performance tuning doc to make sure everything
is applied to the nodes, especially since the problems are always caused by
the same three nodes. But everything looks fine.

Ingo

On 03.04.2013 18:42, Evan Vigil-McClanahan wrote:
> Another engineer mentions that you posted your eleveldb section and I
> totally missed it:
>
> The eleveldb section:
>
>   %% eLevelDB Config
>   {eleveldb, [
>               {data_root, "/var/lib/riak/leveldb"},
>               {cache_size, 33554432},
>               {write_buffer_size_min, 67108864}, %% 64 MB in bytes
>               {write_buffer_size_max, 134217728}, %% 128 MB in bytes
>               {max_open_files, 4000}
>              ]},
>
> This is likely going to make you unhappy as time goes on; since all of
> those settings are per-vnode, your max memory utilization is well
> beyond your physical memory.  I'd remove the tunings for the caches
> and buffers and drop max open files to 500, perhaps.  Make sure that
> you've followed everything in:
> http://docs.basho.com/riak/latest/cookbooks/Linux-Performance-Tuning/,
> etc.
>
> On Wed, Apr 3, 2013 at 9:33 AM, Evan Vigil-McClanahan
> <emcclanahan at basho.com> wrote:
>> Again, all of these things are signs of large objects, so if you could
>> track the object_size stats on the cluster, I think that we might see
>> something.  Even if you have no monitoring, a simple shell script
>> curling /stats/ on each node once a minute should do the job for a day
>> or two.
>>
>> On Wed, Apr 3, 2013 at 9:29 AM, Ingo Rockel
>> <ingo.rockel at bluelionmobile.com> wrote:
>>> We just had it again (around this time of the day we have our highest user
>>> activity).
>>>
>>> I will set +P to 131072 tomorrow, anything else I should check or change?
>>>
>>> What about this memory-high-watermark which I get sporadically?
>>>
>>> Ingo
>>>
>>> On 03.04.2013 17:57, Evan Vigil-McClanahan wrote:
>>>
>>>> As for +P, the default has been raised in R16 (which is what the current
>>>> man page describes); on R15 it's only 32k.
>>>>
>>>> The behavior that you're describing does sound like a very large
>>>> object getting put into the cluster (which may cause backups and push
>>>> you up against the process limit, could have caused scheduler collapse
>>>> on 1.2, etc.).
>>>>
>>>> On Wed, Apr 3, 2013 at 8:39 AM, Ingo Rockel
>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>
>>>>> Evan,
>>>>>
>>>>> sys_process_count is somewhere between 5k and 11k on the nodes right now.
>>>>> Concerning your suggested +P config: according to the Erlang docs, the
>>>>> default for this param is already 262144, so setting it to 65536 would
>>>>> in fact lower it?
>>>>>
>>>>> We chose the ring size to be able to handle growth, which was the main
>>>>> reason to switch from MySQL to NoSQL/Riak. We have 12 nodes, so about
>>>>> 86 vnodes per node.
>>>>>
>>>>> No, we don't monitor object sizes. The majority of objects are very small
>>>>> (below 200 bytes), but we have objects storing references to these small
>>>>> objects which might grow to a few megabytes in size. Most of these are
>>>>> paged, though, and should not exceed one megabyte. Only one type is not
>>>>> paged (for implementation reasons).
>>>>>
>>>>> The outgoing/incoming traffic is constantly around 100 Mbit; when the
>>>>> performance drops happen, we suddenly see spikes of up to 1 Gbit. These
>>>>> spikes keep occurring on three nodes for as long as the performance drop
>>>>> lasts.
>>>>>
>>>>> Ingo
>>>>>
>>>>> On 03.04.2013 17:12, Evan Vigil-McClanahan wrote:
>>>>>
>>>>>> Ingo,
>>>>>>
>>>>>> riak-admin status | grep sys_process_count
>>>>>>
>>>>>> will tell you how many processes are running.  The default process
>>>>>> limit on Erlang is a little low, and we'd suggest raising it
>>>>>> (especially with your extra-large ring_size).   Erlang processes are
>>>>>> cheap, so 65535 or even double that will be fine.
>>>>>>
>>>>>> Busy dist ports are still worrying.  Are you monitoring object sizes?
>>>>>> Are there any spikes there associated with performance drops?
>>>>>>
>>>>>> On Wed, Apr 3, 2013 at 8:03 AM, Ingo Rockel
>>>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi Evan,
>>>>>>>
>>>>>>> I set +swt very_low and +zdbbl to 64 MB. Setting these params greatly
>>>>>>> reduced the busy_dist_port and "Monitor got {suppressed,..." messages,
>>>>>>> but when the performance of the cluster suddenly drops we still see
>>>>>>> them.
>>>>>>>
>>>>>>> The cluster was updated to 1.3 in the meantime.
>>>>>>>
>>>>>>> The eleveldb section:
>>>>>>>
>>>>>>>     %% eLevelDB Config
>>>>>>>     {eleveldb, [
>>>>>>>                 {data_root, "/var/lib/riak/leveldb"},
>>>>>>>                 {cache_size, 33554432},
>>>>>>>                 {write_buffer_size_min, 67108864}, %% 64 MB in bytes
>>>>>>>                 {write_buffer_size_max, 134217728}, %% 128 MB in bytes
>>>>>>>                 {max_open_files, 4000}
>>>>>>>                ]},
>>>>>>>
>>>>>>> The ring size is 1024 and the machines have 48 GB of memory. Concerning
>>>>>>> the params from vm.args:
>>>>>>>
>>>>>>> -env ERL_MAX_PORTS 4096
>>>>>>> -env ERL_MAX_ETS_TABLES 8192
>>>>>>>
>>>>>>> +P isn't set
>>>>>>>
>>>>>>> Ingo
>>>>>>>
>>>>>>> On 03.04.2013 16:53, Evan Vigil-McClanahan wrote:
>>>>>>>
>>>>>>>> For your prior mail, I thought that someone had answered.  Our initial
>>>>>>>> suggestion was to add +swt very_low to your vm.args, as well as
>>>>>>>> setting the +zdbbl setting that Jon recommended in the list post you
>>>>>>>> pointed to.  If those help, moving to 1.3 should help more.
>>>>>>>>
>>>>>>>> Other limits in vm.args that can cause problems are +P, ERL_MAX_PORTS,
>>>>>>>> and  ERL_MAX_ETS_TABLES.  Are any of these set?  If so, to what?
>>>>>>>>
>>>>>>>> Can you also paste the eleveldb section of your app.config?
>>>>>>>>
>>>>>>>> On Wed, Apr 3, 2013 at 7:41 AM, Ingo Rockel
>>>>>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Evan,
>>>>>>>>>
>>>>>>>>> I'm not sure, I find a lot of these:
>>>>>>>>>
>>>>>>>>> 2013-03-30 23:27:52.992 [error]
>>>>>>>>> <0.8036.323>@riak_api_pb_server:handle_info:141 Unrecognized message
>>>>>>>>> {22243034,{error,timeout}}
>>>>>>>>>
>>>>>>>>> and some of these, logged at about the same time as one of the kind
>>>>>>>>> quoted below (although this one has a different timestamp):
>>>>>>>>>
>>>>>>>>> 2013-03-30 23:27:53.056 [error]
>>>>>>>>> <0.9457.323>@riak_kv_console:status:178
>>>>>>>>> Status failed error:terminated
>>>>>>>>>
>>>>>>>>> Ingo
>>>>>>>>>
>>>>>>>>> On 03.04.2013 16:24, Evan Vigil-McClanahan wrote:
>>>>>>>>>
>>>>>>>>>> Resending to the list:
>>>>>>>>>>
>>>>>>>>>> Ingo,
>>>>>>>>>>
>>>>>>>>>> That is an indication that the protocol buffers server can't spawn a
>>>>>>>>>> put fsm, which means that a put cannot be done for some reason or
>>>>>>>>>> another.  Are there any other messages that appear around this time
>>>>>>>>>> that might indicate why?
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 3, 2013 at 12:09 AM, Ingo Rockel
>>>>>>>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We have some performance issues with our Riak cluster: from time
>>>>>>>>>>> to time we see a sudden drop in performance (I already asked the
>>>>>>>>>>> list about this, but no one had an idea). On the problematic
>>>>>>>>>>> nodes, although not at the same time, we get a lot of these
>>>>>>>>>>> messages from time to time:
>>>>>>>>>>>
>>>>>>>>>>> 2013-04-02 21:41:11.173 [warning] <0.25646.475> ** Can not start
>>>>>>>>>>> proc_lib:init_p
>>>>>>>>>>> ,[<0.14556.474>,[<0.9519.474>,riak_api_pb_sup,riak_api_sup,<0.1291.0>],riak_kv_p
>>>>>>>>>>> ut_fsm,start_link,[{raw,65032165,<0.9519.474>},{r_object,<<109>>,<<77,115,124,49
>>>>>>>>>>> ,53,55,57,56,57,56,50,124,49,51,54,52,57,51,49,54,49,49,53,49,50,52,53,54>>,[{r_
>>>>>>>>>>> content,{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>>>>>>>>>>> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},<<>>}],[],{dict,2,16,16,8,8
>>>>>>>>>>> 0,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[]
>>>>>>>>>>> ,[],[],[[<<99,111,110,116,101,110,116,45,116,121,112,101>>,97,112,112,108,105,99
>>>>>>>>>>> ,97,116,105,111,110,47,106,115,111,110]],[],[],[],[],[[<<99,104,97,114,115,101,1
>>>>>>>>>>> 16>>,85,84,70,45,56]]}}},<<123,34,115,116,34,58,50,44,34,116,34,58,49,44,34,99,3
>>>>>>>>>>> 4,58,34,66,117,116,32,115,104,101,32,105,115,32,103,111,110,101,44,32,110,32,101
>>>>>>>>>>> ,118,101,110,32,116,104,111,117,103,104,32,105,109,32,110,111,116,32,105,110,32,
>>>>>>>>>>> 117,114,32,99,105,116,121,32,105,32,108,111,118,101,32,117,32,110,100,32,105,32,
>>>>>>>>>>> 109,101,97,110,32,105,116,32,58,39,40,34,44,34,114,34,58,49,52,51,52,54,52,51,57
>>>>>>>>>>> ,44,34,115,34,58,49,53,55,57,56,57,56,50,44,34,99,116,34,58,49,51,54,52,57,51,49
>>>>>>>>>>> ,54,49,49,53,49,50,44,34,97,110,34,58,102,97,108,115,101,44,34,115,107,34,58,49,
>>>>>>>>>>> 51,54,52,57,51,49,54,49,49,53,49,50,52,53,54,44,34,115,117,34,58,48,125>>},[{tim
>>>>>>>>>>> eout,60000}]]] on 'riak at 172.22.3.12' **
>>>>>>>>>>>
>>>>>>>>>>> Can anyone explain to me what these messages mean and whether I need
>>>>>>>>>>> to do something about them? Could these messages be in any way
>>>>>>>>>>> related to the performance issues?
>>>>>>>>>>>
>>>>>>>>>>> Ingo
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> riak-users mailing list
>>>>>>>>>>> riak-users at lists.basho.com
>>>>>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Software Architect
>>>>>>>>>
>>>>>>>>> Blue Lion mobile GmbH
>>>>>>>>> Tel. +49 (0) 221 788 797 14
>>>>>>>>> Fax. +49 (0) 221 788 797 19
>>>>>>>>> Mob. +49 (0) 176 24 87 30 89
>>>>>>>>>
>>>>>>>>> ingo.rockel at bluelionmobile.com
>>>>>>>>> qeep: Hefferwolf
>>>>>>>>>
>>>>>>>>> www.bluelionmobile.com
>>>>>>>>> www.qeep.net


-- 
Software Architect

Blue Lion mobile GmbH
Tel. +49 (0) 221 788 797 14
Fax. +49 (0) 221 788 797 19
Mob. +49 (0) 176 24 87 30 89

ingo.rockel at bluelionmobile.com
qeep: Hefferwolf

www.bluelionmobile.com
www.qeep.net



