Warning "Can not start proc_lib:init_p"

Ingo Rockel ingo.rockel at bluelionmobile.com
Thu Apr 4 09:43:50 EDT 2013


Hi Evan,

we added monitoring of the object sizes, and on one of the three nodes 
mentioned there was one object which was > 2 GB!

We just changed the application code to capture the id of this object so 
we can delete it. But it only happens about once a day.

Right now we have another node repeatedly crashing with OOM about 12 
minutes after start (always the same time frame); could this be related 
to the big-object issue? It is not one of the three nodes. The node logs 
that a lot of handoff receiving is going on.

Again, thanks for the help!

Regards,

	Ingo

On 04.04.2013 15:30, Evan Vigil-McClanahan wrote:
> If it's always the same three nodes it could well be the same very large
> object being updated each day.  Is there anything else that looks
> suspicious in your logs?  Another sign of large objects is large_heap
> (or long_gc) messages from riak_sysmon.
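A quick way to check for those sysmon warnings is to count them in the node's log; this is a sketch, and /var/log/riak/console.log is an assumed path, so adjust it to wherever your platform writes riak's console log:

```shell
#!/bin/sh
# Count riak_sysmon large_heap / long_gc warnings in a node's log.
# The log path below is an assumption -- adjust for your install.
grep -cE 'large_heap|long_gc' /var/log/riak/console.log
```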
>
> On Thu, Apr 4, 2013 at 3:58 AM, Ingo Rockel
> <ingo.rockel at bluelionmobile.com> wrote:
>> Hi Evan,
>>
>> thanks for all the info! I adjusted the leveldb config as suggested, except
>> for the cache, which I reduced to 16 MB; keeping this above the default
>> helped a lot, at least during load testing. And I added +P 130072 to
>> vm.args. It will be applied to the riak nodes in the next few hours.
>>
>> We have monitoring using Zabbix, but haven't included the object sizes so
>> far; they will be added today.
>>
>> We double-checked the Linux performance tuning doc to be sure everything is
>> applied to the nodes, especially as the problems are always caused by the
>> same three nodes. But everything looks fine.
>>
>> Ingo
>>
>> On 03.04.2013 18:42, Evan Vigil-McClanahan wrote:
>>
>>> Another engineer mentions that you posted your eleveldb section and I
>>> totally missed it:
>>>
>>> The eleveldb section:
>>>
>>>    %% eLevelDB Config
>>>    {eleveldb, [
>>>                {data_root, "/var/lib/riak/leveldb"},
>>>                {cache_size, 33554432},
>>>                {write_buffer_size_min, 67108864}, %% 64 MB in bytes
>>>                {write_buffer_size_max, 134217728}, %% 128 MB in bytes
>>>                {max_open_files, 4000}
>>>               ]},
>>>
>>> This is likely going to make you unhappy as time goes on: since all of
>>> those settings are per-vnode, your maximum memory utilization is well
>>> beyond your physical memory.  I'd remove the tunings for the caches
>>> and buffers and drop max_open_files to 500, perhaps.  Make sure that
>>> you've followed everything in:
>>> http://docs.basho.com/riak/latest/cookbooks/Linux-Performance-Tuning/,
>>> etc.
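To see why those per-vnode settings add up, here is a back-of-the-envelope check using the figures from this thread (ring_size 1024 across 12 nodes, so roughly 85 vnodes per node); the comment about per-open-file overhead is a rough rule of thumb, not a measured value:

```shell
#!/bin/sh
# Rough per-node memory estimate for the posted eleveldb settings.
# Every vnode gets its own block cache and write buffers, so the
# per-vnode figures multiply by the vnode count.
vnodes=$((1024 / 12))            # ring_size / nodes ~= 85 vnodes per node
cache_mb=32                      # cache_size 33554432
wbuf_mb=128                      # write_buffer_size_max 134217728
mb=$((vnodes * (cache_mb + wbuf_mb)))
echo "caches+buffers: ${mb} MB"  # about 13 GB before counting file handles
# max_open_files 4000 is also per vnode; each open .sst file keeps index
# and filter blocks resident, which pushes the total far past 48 GB of RAM.
```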
>>>
>>> On Wed, Apr 3, 2013 at 9:33 AM, Evan Vigil-McClanahan
>>> <emcclanahan at basho.com> wrote:
>>>>
>>>> Again, all of these things are signs of large objects, so if you could
>>>> track the object_size stats on the cluster, I think that we might see
>>>> something.  Even if you have no monitoring, a simple shell script
>>>> curling /stats/ on each node once a minute should do the job for a day
>>>> or two.
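A minimal version of that polling script might look like the sketch below; the node names, HTTP port, and output path are assumptions for illustration, and it relies on the objsize stats (e.g. node_get_fsm_objsize_95) that riak exposes under /stats:

```shell
#!/bin/sh
# Sketch only: sweep each node's /stats endpoint and append its object-size
# stats to a log.  Run from cron once a minute for a day or two.
# NODES, the port (8098), and OUT are assumptions -- adjust for your cluster.
NODES="riak01 riak02 riak03"
OUT=/tmp/riak_objsize.log

for node in $NODES; do
    # /stats returns one JSON object; splitting on commas and grepping for
    # "objsize" pulls out the object-size mean/median/percentile fields.
    sizes=$(curl -s "http://$node:8098/stats" | tr ',' '\n' | grep objsize)
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $node $sizes" >> "$OUT"
done
```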
>>>>
>>>> On Wed, Apr 3, 2013 at 9:29 AM, Ingo Rockel
>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>
>>>>> We just had it again (around this time of the day we have our highest
>>>>> user
>>>>> activity).
>>>>>
>>>>> I will set +P to 131072 tomorrow; is there anything else I should check
>>>>> or change?
>>>>>
>>>>> What about this memory-high-watermark which I get sporadically?
>>>>>
>>>>> Ingo
>>>>>
>>>>> On 03.04.2013 17:57, Evan Vigil-McClanahan wrote:
>>>>>
>>>>>> As for +P, the default has been raised in R16 (which is what the
>>>>>> current man page documents); on R15 it's only 32k.
>>>>>>
>>>>>> The behavior that you're describing does sound like a very large
>>>>>> object getting put into the cluster (which may cause backups and push
>>>>>> you up against the process limit, could have caused scheduler collapse
>>>>>> on 1.2, etc.).
>>>>>>
>>>>>> On Wed, Apr 3, 2013 at 8:39 AM, Ingo Rockel
>>>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>>>
>>>>>>> Evan,
>>>>>>>
>>>>>>> sys_process_count is somewhere between 5k and 11k on the nodes right
>>>>>>> now.  Concerning your suggested +P config: according to the Erlang
>>>>>>> docs, the default for this param is already 262144, so setting it to
>>>>>>> 65536 would in fact lower it?
>>>>>>>
>>>>>>> We chose the ring size to be able to handle growth, which was the
>>>>>>> main reason to switch from MySQL to NoSQL/Riak.  We have 12 nodes,
>>>>>>> so about 86 vnodes per node.
>>>>>>>
>>>>>>> No, we don't monitor object sizes.  The majority of objects are very
>>>>>>> small (below 200 bytes), but we have objects storing references to
>>>>>>> these small objects which might grow to a few megabytes in size; most
>>>>>>> of these are paged, though, and should not exceed one megabyte.  Only
>>>>>>> one type is not paged (for implementation reasons).
>>>>>>>
>>>>>>> The outgoing/incoming traffic is normally around 100 MBit; when the
>>>>>>> performance drops happen, we suddenly see spikes up to 1 GBit.  And
>>>>>>> these spikes consistently happen on three nodes as long as the
>>>>>>> performance drop persists.
>>>>>>>
>>>>>>> Ingo
>>>>>>>
>>>>>>> On 03.04.2013 17:12, Evan Vigil-McClanahan wrote:
>>>>>>>
>>>>>>>> Ingo,
>>>>>>>>
>>>>>>>> riak-admin status | grep sys_process_count
>>>>>>>>
>>>>>>>> will tell you how many processes are running.  The default process
>>>>>>>> limit on Erlang is a little low, and we'd suggest raising it
>>>>>>>> (especially with your extra-large ring_size).   Erlang processes are
>>>>>>>> cheap, so 65535 or even double that will be fine.
>>>>>>>>
>>>>>>>> Busy dist ports are still worrying.  Are you monitoring object sizes?
>>>>>>>> Are there any spikes there associated with performance drops?
>>>>>>>>
>>>>>>>> On Wed, Apr 3, 2013 at 8:03 AM, Ingo Rockel
>>>>>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Evan,
>>>>>>>>>
>>>>>>>>> I set +swt very_low and +zdbbl to 64 MB; setting these params
>>>>>>>>> helped reduce the busy_dist_port and "Monitor got {suppressed,..."
>>>>>>>>> messages a lot.  But when the performance of the cluster suddenly
>>>>>>>>> drops we still see these messages.
>>>>>>>>>
>>>>>>>>> The cluster was updated to 1.3 in the meantime.
>>>>>>>>>
>>>>>>>>> The eleveldb section:
>>>>>>>>>
>>>>>>>>>      %% eLevelDB Config
>>>>>>>>>      {eleveldb, [
>>>>>>>>>                  {data_root, "/var/lib/riak/leveldb"},
>>>>>>>>>                  {cache_size, 33554432},
>>>>>>>>>                  {write_buffer_size_min, 67108864}, %% 64 MB in bytes
>>>>>>>>>                  {write_buffer_size_max, 134217728}, %% 128 MB in
>>>>>>>>> bytes
>>>>>>>>>                  {max_open_files, 4000}
>>>>>>>>>                 ]},
>>>>>>>>>
>>>>>>>>> the ring size is 1024 and the machines have 48GB of memory.
>>>>>>>>> Concerning
>>>>>>>>> the
>>>>>>>>> params from vm.args:
>>>>>>>>>
>>>>>>>>> -env ERL_MAX_PORTS 4096
>>>>>>>>> -env ERL_MAX_ETS_TABLES 8192
>>>>>>>>>
>>>>>>>>> +P isn't set
>>>>>>>>>
>>>>>>>>> Ingo
>>>>>>>>>
>>>>>>>>> On 03.04.2013 16:53, Evan Vigil-McClanahan wrote:
>>>>>>>>>
>>>>>>>>>> For your prior mail, I thought that someone had answered.  Our
>>>>>>>>>> initial
>>>>>>>>>> suggestion was to add +swt very_low to your vm.args, as well as
>>>>>>>>>> setting the +zdbbl setting that Jon recommended in the list post
>>>>>>>>>> you
>>>>>>>>>> pointed to.  If those help, moving to 1.3 should help more.
>>>>>>>>>>
>>>>>>>>>> Other limits in vm.args that can cause problems are +P,
>>>>>>>>>> ERL_MAX_PORTS,
>>>>>>>>>> and  ERL_MAX_ETS_TABLES.  Are any of these set?  If so, to what?
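For reference, a vm.args sketch combining the flags discussed in this thread (the values are the ones mentioned here, for Riak 1.3 on R15B; they are not universal recommendations):

```
## Scheduler wakeup threshold (suggested earlier in this thread)
+swt very_low

## Distribution buffer busy limit, in KB; 65536 KB = 64 MB (busy_dist_port)
+zdbbl 65536

## Max Erlang processes; R15's default is only 32768
+P 131072

-env ERL_MAX_PORTS 4096
-env ERL_MAX_ETS_TABLES 8192
```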
>>>>>>>>>>
>>>>>>>>>> Can you also paste the eleveldb section of your app.config?
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 3, 2013 at 7:41 AM, Ingo Rockel
>>>>>>>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Evan,
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure, I find a lot of these:
>>>>>>>>>>>
>>>>>>>>>>> 2013-03-30 23:27:52.992 [error]
>>>>>>>>>>> <0.8036.323>@riak_api_pb_server:handle_info:141 Unrecognized
>>>>>>>>>>> message
>>>>>>>>>>> {22243034,{error,timeout}}
>>>>>>>>>>>
>>>>>>>>>>> and some of these at the same time as one of the kind below gets
>>>>>>>>>>> logged (although that one has a different timestamp):
>>>>>>>>>>>
>>>>>>>>>>> 2013-03-30 23:27:53.056 [error]
>>>>>>>>>>> <0.9457.323>@riak_kv_console:status:178
>>>>>>>>>>> Status failed error:terminated
>>>>>>>>>>>
>>>>>>>>>>> Ingo
>>>>>>>>>>>
>>>>>>>>>>> On 03.04.2013 16:24, Evan Vigil-McClanahan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Resending to the list:
>>>>>>>>>>>>
>>>>>>>>>>>> Ingo,
>>>>>>>>>>>>
>>>>>>>>>>>> That is an indication that the protocol buffers server can't
>>>>>>>>>>>> spawn a
>>>>>>>>>>>> put fsm, which means that a put cannot be done for some reason or
>>>>>>>>>>>> another.  Are there any other messages that appear around this
>>>>>>>>>>>> time
>>>>>>>>>>>> that might indicate why?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 3, 2013 at 12:09 AM, Ingo Rockel
>>>>>>>>>>>> <ingo.rockel at bluelionmobile.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> we have some performance issues with our riak cluster, from time
>>>>>>>>>>>>> to
>>>>>>>>>>>>> time
>>>>>>>>>>>>> we
>>>>>>>>>>>>> have a sudden drop in performance (already asked the list about
>>>>>>>>>>>>> this,
>>>>>>>>>>>>> no-one
>>>>>>>>>>>>> had an idea though). Although not the same time but on the
>>>>>>>>>>>>> problematic
>>>>>>>>>>>>> nodes
>>>>>>>>>>>>> we have a lot of these messages from time to time:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2013-04-02 21:41:11.173 [warning] <0.25646.475> ** Can not start
>>>>>>>>>>>>> proc_lib:init_p
>>>>>>>>>>>>> ,[<0.14556.474>,[<0.9519.474>,riak_api_pb_sup,riak_api_sup,<0.1291.0>],riak_kv_p
>>>>>>>>>>>>> ut_fsm,start_link,[{raw,65032165,<0.9519.474>},{r_object,<<109>>,<<77,115,124,49
>>>>>>>>>>>>> ,53,55,57,56,57,56,50,124,49,51,54,52,57,51,49,54,49,49,53,49,50,52,53,54>>,[{r_
>>>>>>>>>>>>> content,{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>>>>>>>>>>>>> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},<<>>}],[],{dict,2,16,16,8,8
>>>>>>>>>>>>> 0,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[]
>>>>>>>>>>>>> ,[],[],[[<<99,111,110,116,101,110,116,45,116,121,112,101>>,97,112,112,108,105,99
>>>>>>>>>>>>> ,97,116,105,111,110,47,106,115,111,110]],[],[],[],[],[[<<99,104,97,114,115,101,1
>>>>>>>>>>>>> 16>>,85,84,70,45,56]]}}},<<123,34,115,116,34,58,50,44,34,116,34,58,49,44,34,99,3
>>>>>>>>>>>>> 4,58,34,66,117,116,32,115,104,101,32,105,115,32,103,111,110,101,44,32,110,32,101
>>>>>>>>>>>>> ,118,101,110,32,116,104,111,117,103,104,32,105,109,32,110,111,116,32,105,110,32,
>>>>>>>>>>>>> 117,114,32,99,105,116,121,32,105,32,108,111,118,101,32,117,32,110,100,32,105,32,
>>>>>>>>>>>>> 109,101,97,110,32,105,116,32,58,39,40,34,44,34,114,34,58,49,52,51,52,54,52,51,57
>>>>>>>>>>>>> ,44,34,115,34,58,49,53,55,57,56,57,56,50,44,34,99,116,34,58,49,51,54,52,57,51,49
>>>>>>>>>>>>> ,54,49,49,53,49,50,44,34,97,110,34,58,102,97,108,115,101,44,34,115,107,34,58,49,
>>>>>>>>>>>>> 51,54,52,57,51,49,54,49,49,53,49,50,52,53,54,44,34,115,117,34,58,48,125>>},[{tim
>>>>>>>>>>>>> eout,60000}]]] on 'riak at 172.22.3.12' **
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can anyone explain to me what these messages mean and whether I
>>>>>>>>>>>>> need to do something about them? Could these messages be in any
>>>>>>>>>>>>> way related to the performance issues?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ingo
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> riak-users mailing list
>>>>>>>>>>>>> riak-users at lists.basho.com
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


-- 
Software Architect

Blue Lion mobile GmbH
Tel. +49 (0) 221 788 797 14
Fax. +49 (0) 221 788 797 19
Mob. +49 (0) 176 24 87 30 89

ingo.rockel at bluelionmobile.com
qeep: Hefferwolf

www.bluelionmobile.com
www.qeep.net



