Riak Nodes Crashing

Matthew Von-Maszewski matthewv at basho.com
Fri Dec 5 14:43:36 EST 2014


Satish,

Here is a key line from /var/log/messages:

Dec  5 06:52:43 ip-10-196-72-106 kernel: [26881589.804401] beam.smp invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0

The log entry does NOT match the timestamps of the crash.log and error.log below, but that is OK.  The operating system killed off Riak, so there would have been no notification in the Riak logs of the operating system's actions.
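
For anyone searching their own logs, here is a rough sketch of how to pull out entries like the one above.  It assumes the kernel messages land in /var/log/messages, as on this node; other platforms may use a different file.  The function name oom_events is made up for illustration, and the exact wording of the kernel lines can vary between kernel versions.

```shell
# Hypothetical helper: list kernel OOM-killer activity in a system log file.
# Defaults to /var/log/messages; pass another path (e.g. /var/log/syslog) as $1.
oom_events() {
    # "invoked oom-killer" marks the start of an OOM event; the other two
    # patterns usually name the process the kernel chose to kill.
    grep -E 'invoked oom-killer|Out of memory|Killed process' "${1:-/var/log/messages}"
}
```

Running `oom_events /var/log/messages` (typically as root) should print one line per match, with the syslog timestamp at the start of each line for comparison against the crash times.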

The fact that the out-of-memory monitor, oom-killer, killed Riak further supports the change to max_open_files.  I recommend we now wait to see if the problem occurs again.
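
To double-check that the lowered value is in place on every node, something along these lines may help.  It is only a sketch: the function name show_max_open_files is hypothetical, the path to app.config depends on the install, and the pattern assumes the setting is written as {max_open_files, N} on a single line.

```shell
# Hypothetical helper: print the max_open_files value(s) found in an app.config.
# $1 is the path to app.config (the location varies by install).
show_max_open_files() {
    # Print only the number inside a {max_open_files, N} tuple.
    # Note: a commented-out tuple on its own line would also match.
    sed -n 's/.*{max_open_files,[[:space:]]*\([0-9][0-9]*\)}.*/\1/p' "$1"
}
```

After the change described in this thread, calling the function on a node's app.config should print 150.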


Matthew


On Dec 5, 2014, at 2:35 PM, ender <extropy at gmail.com> wrote:

> Hey Matthew,
> 
> The crash occurred around 3:00am:
> 
> -rw-rw-r-- 1 riak riak    920 Dec  5 03:01 crash.log
> -rw-rw-r-- 1 riak riak    617 Dec  5 03:01 error.log
> 
> I have attached the syslog that covers that time.  I also went ahead and changed max_open_files in app.config to 150 from 315.
> 
> Satish
> 
> 
> On Fri, Dec 5, 2014 at 11:29 AM, Matthew Von-Maszewski <matthewv at basho.com> wrote:
> Satish,
> 
> The "key" system log varies by Linux platform.  Yes, /var/log/messages may hold some key clues.  Again, be sure the file covers the time of a crash.
> 
> Matthew
> 
> 
> On Dec 5, 2014, at 1:29 PM, ender <extropy at gmail.com> wrote:
> 
>> Hey Matthew,
>> 
>> I see a /var/log/messages file, but no syslog or system.log etc.  Is it the messages file you want?
>> 
>> Satish
>> 
>> 
>> On Fri, Dec 5, 2014 at 10:06 AM, Matthew Von-Maszewski <matthewv at basho.com> wrote:
>> Satish,
>> 
>> I find nothing compelling in the log or the app.config.  Therefore I have two additional suggestions/requests:
>> 
>> - lower max_open_files in app.config to 150 from 315.  There was one other customer report regarding the limit not properly stopping out of memory (OOM) conditions.
>> 
>> - try to locate a /var/log/syslog* file from a node that contains the time of the crash.  There may be helpful information there.  Please send that along.
>> 
>> 
>> Unrelated to this crash … 1.4.7 has a known bug in its active anti-entropy (AAE) logic.  This bug is NOT known to cause a crash.  The bug does cause AAE to be unreliable for data restoration.  The proper steps for upgrading to the current release (1.4.12) are:
>> 
>> -- across the entire cluster
>> - disable anti_entropy in app.config on all nodes: {anti_entropy, {off, []}}
>> - perform a rolling restart of all nodes … AAE is now disabled in the cluster 
>> 
>> -- on each node
>> - stop the node
>> - remove (erase all files and directories) /vol/lib/riak/anti_entropy
>> - update Riak to the new software revision
>> - start the node again
>> 
>> -- across the entire cluster
>> - enable anti_entropy in app.config on all nodes: {anti_entropy, {on, []}}
>> - perform a rolling restart of all nodes … AAE is now enabled in the cluster 
>> 
>> The nodes will start rebuilding the AAE hash data.  I suggest performing the last rolling restart during a period of low cluster utilization.
>> 
>> 
>> Matthew
>> 
>> 
>> On Dec 5, 2014, at 11:02 AM, ender <extropy at gmail.com> wrote:
>> 
>>> Hi Matthew,
>>> 
>>> Riak version: 1.4.7
>>> 5 Nodes in cluster
>>> RAM: 30GB
>>> 
>>> The leveldb logs are attached.
>>> 
>>> 
>>> 
>>> On Thu, Dec 4, 2014 at 1:34 PM, Matthew Von-Maszewski <matthewv at basho.com> wrote:
>>> Satish,
>>> 
>>> Some questions:
>>> 
>>> - what version of Riak are you running?  logs suggest 1.4.7
>>> - how many nodes in your cluster?
>>> - what is the physical memory (RAM size) of each node?
>>> - would you send the leveldb LOG  files from one of the crashed servers:
>>>     tar -czf satish_LOG.tgz /vol/lib/riak/leveldb/*/LOG*
>>> 
>>> 
>>> Matthew
>>> 
>>> On Dec 4, 2014, at 4:02 PM, ender <extropy at gmail.com> wrote:
>>> 
>>> > My Riak installation has been running successfully for about a year.  This week nodes suddenly started randomly crashing.  The machines have plenty of memory and free disk space, and looking in the ring directory nothing appears to be amiss:
>>> >
>>> > [ec2-user@ip-10-196-72-247 ~]$ ls -l /vol/lib/riak/ring
>>> > total 80
>>> > -rw-rw-r-- 1 riak riak 17829 Nov 29 19:42 riak_core_ring.default.20141129194225
>>> > -rw-rw-r-- 1 riak riak 17829 Dec  3 19:07 riak_core_ring.default.20141203190748
>>> > -rw-rw-r-- 1 riak riak 17829 Dec  4 16:29 riak_core_ring.default.20141204162956
>>> > -rw-rw-r-- 1 riak riak 17847 Dec  4 20:45 riak_core_ring.default.20141204204548
>>> >
>>> > [ec2-user@ip-10-196-72-247 ~]$ du -h /vol/lib/riak/ring
>>> > 84K   /vol/lib/riak/ring
>>> >
>>> > I have attached a tarball with the app.config file plus all the logs from the node at the time of the crash.  Any help much appreciated!
>>> >
>>> > Satish
>>> >
>>> > <riak-crash-data.tar.gz>_______________________________________________
>>> > riak-users mailing list
>>> > riak-users at lists.basho.com
>>> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>> 
>>> 
>>> <satish_LOG.tgz>
>> 
>> 
> 
> 
> <messages>
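
The "on each node" upgrade steps quoted above (stop, clear the anti_entropy directory, update Riak, start) could be sketched roughly as follows.  This is an illustration under assumptions, not a tested procedure: the helper names run and upgrade_node are hypothetical, the package upgrade command is left to the caller because it depends on the platform, and the script only echoes what it would do unless DRY_RUN=0 is set.  It does not cover the cluster-wide anti_entropy changes in app.config or the rolling restarts.

```shell
# Echo the command in dry-run mode (the default); execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical per-node sequence.
# Usage: upgrade_node <anti_entropy dir> <package upgrade command...>
upgrade_node() {
    if [ "$#" -lt 2 ]; then
        echo "usage: upgrade_node <anti_entropy dir> <upgrade command...>" >&2
        return 1
    fi
    aae_dir="$1"
    shift
    run riak stop                             # stop the node
    run find "$aae_dir" -mindepth 1 -delete   # erase everything under the AAE directory
    run "$@"                                  # caller-supplied package upgrade
    run riak start                            # start the node again
}
```

With the paths from this thread, the first argument would be /vol/lib/riak/anti_entropy.  Reviewing the dry-run output before setting DRY_RUN=0 is the point of the echo mode.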
