Riak Nodes Crashing

Matthew Von-Maszewski matthewv at basho.com
Fri Dec 5 13:06:55 EST 2014


Satish,

I find nothing compelling in the log or the app.config.  Therefore I have two additional suggestions/requests:

- lower max_open_files in app.config from 315 to 150.  There was one other customer report of this limit failing to prevent out-of-memory (OOM) conditions.
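For reference, the eleveldb section of app.config would then look something like this (a sketch; the data_root shown matches the path used elsewhere in this thread, and any other eleveldb settings you have stay as they are):

```
%% app.config -- eleveldb section (sketch; other settings unchanged)
{eleveldb, [
    {data_root, "/vol/lib/riak/leveldb"},
    {max_open_files, 150}   %% lowered from 315 to reduce memory pressure
]},
```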

- try to locate a /var/log/syslog* file on one of the nodes that covers the time of the crash.  There may be helpful information there.  Please send that along.
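As a sketch (the exact timestamp pattern depends on your syslog configuration, and the time shown is only a placeholder, not the actual crash time):

```
# Search current and rotated syslogs for entries around the crash window;
# zgrep also reads the gzipped rotations.  Replace the pattern with the
# actual crash timestamp from riak's crash.log.
zgrep -h "Dec  5 13:0" /var/log/syslog* 2>/dev/null
```

The OOM killer, if it fired, logs to syslog rather than to Riak's own logs, which is why this file is worth checking.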


Unrelated to this crash … 1.4.7 has a known bug in its active anti-entropy (AAE) logic.  This bug is NOT known to cause a crash.  The bug does cause AAE to be unreliable for data restoration.  The proper steps for upgrading to the current release (1.4.12) are:

-- across the entire cluster
- disable anti_entropy in app.config on all nodes: {anti_entropy, {off, []}}
- perform a rolling restart of all nodes … AAE is now disabled in the cluster 

-- on each node
- stop the node
- remove (erase all files and directories) /vol/lib/riak/anti_entropy
- update Riak to the new software revision
- start the node again
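The per-node steps above might look like this in practice (a sketch; the package-manager command and the node name are assumptions, so adjust them to your install method and vm.args):

```
# Per-node upgrade (sketch): run on one node at a time
riak stop                                  # stop the node
rm -rf /vol/lib/riak/anti_entropy          # erase the AAE hash trees
sudo yum upgrade riak                      # or your platform's package manager
riak start                                 # start the node again
riak-admin wait-for-service riak_kv riak@127.0.0.1   # node name from vm.args
```

Waiting for riak_kv before moving to the next node keeps the rolling upgrade from taking two nodes down at once.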

-- across the entire cluster
- enable anti_entropy in app.config on all nodes: {anti_entropy, {on, []}}
- perform a rolling restart of all nodes … AAE is now enabled in the cluster 

The nodes will then start rebuilding the AAE hash data.  I suggest performing that final rolling restart during a period of low cluster utilization.


Matthew


On Dec 5, 2014, at 11:02 AM, ender <extropy at gmail.com> wrote:

> Hi Matthew,
> 
> Riak version: 1.4.7
> 5 Nodes in cluster
> RAM: 30GB
> 
> The leveldb logs are attached.
> 
> 
> 
> On Thu, Dec 4, 2014 at 1:34 PM, Matthew Von-Maszewski <matthewv at basho.com> wrote:
> Satish,
> 
> Some questions:
> 
> - what version of Riak are you running?  logs suggest 1.4.7
> - how many nodes in your cluster?
> - what is the physical memory (RAM size) of each node?
> - would you send the leveldb LOG  files from one of the crashed servers:
>     tar -czf satish_LOG.tgz /vol/lib/riak/leveldb/*/LOG*
> 
> 
> Matthew
> 
> On Dec 4, 2014, at 4:02 PM, ender <extropy at gmail.com> wrote:
> 
> > My Riak installation has been running successfully for about a year.  This week nodes suddenly started randomly crashing.  The machines have plenty of memory and free disk space, and looking in the ring directory nothing appears to be amiss:
> >
> > [ec2-user at ip-10-196-72-247 ~]$ ls -l /vol/lib/riak/ring
> > total 80
> > -rw-rw-r-- 1 riak riak 17829 Nov 29 19:42 riak_core_ring.default.20141129194225
> > -rw-rw-r-- 1 riak riak 17829 Dec  3 19:07 riak_core_ring.default.20141203190748
> > -rw-rw-r-- 1 riak riak 17829 Dec  4 16:29 riak_core_ring.default.20141204162956
> > -rw-rw-r-- 1 riak riak 17847 Dec  4 20:45 riak_core_ring.default.20141204204548
> >
> > [ec2-user at ip-10-196-72-247 ~]$ du -h /vol/lib/riak/ring
> > 84K   /vol/lib/riak/ring
> >
> > I have attached a tarball with the app.config file plus all the logs from the node at the time of the crash.  Any help much appreciated!
> >
> > Satish
> >
> > <riak-crash-data.tar.gz>_______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> 
> <satish_LOG.tgz>

