Random server restarts, swap and moving nodes to new hardware

Jeff Pollard jeff.pollard at gmail.com
Tue Aug 16 03:46:31 EDT 2011

Hello everyone,

We've got a very interesting problem.  We're hosting our 5-node cluster on EC2
running Ubuntu 10.04 LTS (Lucid Lynx) Server
64-bit<http://aws.amazon.com/amis/4348> on m2.xlarge instances, and over the
past 5 days two of the EC2 servers have randomly restarted on us.  We checked
the logs and saw nothing that indicated why they restarted.  One second they
were happily logging and the next second the server was in the process of
rebooting.  This is particularly bad because every time a node comes back up
we get merge errors due to an existing bug in Riak and have to restore from a
recent backup.

Just today we noticed that the EC2 servers do not have swap enabled
(apparently the norm for xlarge+ instances), which we thought might be
our problem.  My knowledge of what happens when swap is off is pretty
poor, but I have been told that the Linux OOM killer should still be
invoked and start killing processes, rather than the server simply
restarting.  Is that correct?  Also, how would Riak handle swap being
off on a system?  We're using Bitcask if that helps.
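
For anyone wanting to confirm the swap situation on their own nodes, here is
a minimal Python sketch that reads the SwapTotal field from /proc/meminfo.
The function names are made up for illustration; it only reports whether any
swap is configured and says nothing about how Riak behaves without it.

```python
def swap_total_kb(meminfo_text):
    """Return SwapTotal in kB from /proc/meminfo-style text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like: "SwapTotal:       0 kB"
            return int(line.split()[1])
    return None


def swap_enabled(meminfo_path="/proc/meminfo"):
    """True if the machine has a non-zero amount of swap configured."""
    with open(meminfo_path) as f:
        total = swap_total_kb(f.read())
    return bool(total)


if __name__ == "__main__":
    print("swap enabled:", swap_enabled())
```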

Secondly, one of our ops guys here thinks the issue might be related to a
bug <http://ubuntuforums.org/showthread.php?t=1436497> (?) that other
Ubuntu users on the same version seem to have.  In fact, we do see the same
"INFO: task cron:15047 blocked for more than 120 seconds." line in our log
file.  We're also running an AMI that isn't the official one from Canonical,
so the thought is that moving to the official AMI would help.
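
A rough sketch of how the logs could be scanned for both symptoms mentioned
so far: the hung-task line quoted above and any sign that the OOM killer ran.
The log path in the usage section is an assumption (the usual Ubuntu kernel
log location), and the OOM patterns are the common kernel message fragments
rather than an exhaustive list.

```python
import re

# Matches e.g. "INFO: task cron:15047 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
# Common fragments the kernel logs when the OOM killer runs.
OOM = re.compile(r"invoked oom-killer|Out of memory", re.IGNORECASE)


def scan_kernel_log(lines):
    """Return (hung_tasks, oom_lines) found in an iterable of log lines.

    hung_tasks is a list of (task_name, pid, seconds) tuples;
    oom_lines is a list of the raw matching lines.
    """
    hung, oom = [], []
    for line in lines:
        m = HUNG_TASK.search(line)
        if m:
            hung.append((m.group(1), int(m.group(2)), int(m.group(3))))
        elif OOM.search(line):
            oom.append(line.rstrip("\n"))
    return hung, oom


if __name__ == "__main__":
    # Assumed path; adjust to wherever kernel messages are logged.
    with open("/var/log/kern.log", errors="replace") as f:
        hung, oom = scan_kernel_log(f)
    print(len(hung), "hung-task warnings;", len(oom), "OOM-related lines")
```

If the OOM list is empty right up to the reboot, that would at least suggest
the restarts are not preceded by the OOM killer running.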

If we do want to upgrade, it will mean moving each cluster node to new
hardware.  I wanted to ask the list to make sure we are doing it correctly.
Here is the plan to transfer a node to new hardware -- note that these
steps will be done on one node at a time, and we'll make sure the cluster
has stabilized after doing one node before moving on to the next one.

   1. Stop riak on old server.
   2. Copy data directory (including bitcask, mr_queue and ring folders) to
   a shared location.
   3. Shut down old server.
   4. Boot new replacement server, installing (but not starting) Riak.
   5. Transfer data directory from shared location to data folder on new
   server.
   6. Start riak.
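
The steps above can be sketched as a dry-run plan in Python.  This only
builds and prints the commands; it does not run anything.  The data directory
default (/var/lib/riak), the shared path in the usage section, and the use of
rsync are assumptions, step 4 (booting the new server and installing Riak) is
left out because it happens outside the script, and the final
"riak-admin ringready" check is an extra not in the list above.

```python
def migration_commands(shared_dir, data_dir="/var/lib/riak"):
    """Return (where, argv) pairs for moving one node, in order.

    'where' is "old" or "new", meaning which server the command runs on.
    Both paths are assumptions; substitute your own.
    """
    src = data_dir.rstrip("/") + "/"
    dst = shared_dir.rstrip("/") + "/"
    return [
        ("old", ["riak", "stop"]),                   # step 1
        ("old", ["rsync", "-a", src, dst]),          # step 2: bitcask, mr_queue, ring
        ("old", ["sudo", "shutdown", "-h", "now"]),  # step 3
        ("new", ["rsync", "-a", dst, src]),          # step 5
        ("new", ["riak", "start"]),                  # step 6
        ("new", ["riak-admin", "ringready"]),        # extra: wait for ring agreement
    ]


if __name__ == "__main__":
    # Hypothetical shared location, for illustration only.
    for where, argv in migration_commands("/mnt/shared/riak-node1"):
        print("[%s] %s" % (where, " ".join(argv)))
```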

My main concern is whether the ring state will transfer to a new node safely,
assuming the new server has the same hostname and node name as the old
server.  The new server will have a different IP address, but all the node
names in our cluster use hostnames, and those will not be changing.
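
Since the plan depends on the hostname following the node to its new IP, one
sanity check before starting Riak on the replacement is to confirm that the
host part of the node name resolves to the new address from every other node.
A small sketch, assuming node names of the form name@host; the example node
name is hypothetical:

```python
import socket


def node_host(node_name):
    """Split an Erlang-style node name like 'riak@host' and return the host."""
    name, sep, host = node_name.partition("@")
    if not sep or not name or not host:
        raise ValueError("expected name@host, got %r" % node_name)
    return host


def resolve_node(node_name, resolve=socket.gethostbyname):
    """Return the IP address the node's host part currently resolves to.

    'resolve' is injectable so the lookup can be stubbed out in tests.
    """
    return resolve(node_host(node_name))


if __name__ == "__main__":
    # Hypothetical node name; substitute the one from your own config.
    print(resolve_node("riak@riak1.example.com"))
```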
