Random server restarts, swap and moving nodes to new hardware
sean at basho.com
Tue Aug 16 10:18:10 EDT 2011
We highly recommend you upgrade to 10.10 or later. 10.04 has some known
problems when running under Xen (especially on EC2) -- in some cases under
load, the network interface will break, making the node temporarily unreachable.
When you do upgrade, the simplest way (if possible) would be to remount the
attached EBS volumes where your Riak data is stored onto the new nodes.
Otherwise, the steps you list are correct.
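For illustration, remounting an EBS data volume onto a replacement instance might look like the dry-run sketch below. All volume/instance IDs and the device name are placeholders, and the `aws` CLI shown here postdates this thread (the era-appropriate `ec2-api-tools` have equivalent detach/attach commands):

```shell
# Dry-run sketch of moving an EBS data volume to a replacement node.
# All IDs are placeholders; DRY_RUN=echo prints each command instead of running it.
DRY_RUN=echo
VOL=vol-xxxxxxxx      # EBS volume holding the Riak data directory (placeholder)
OLD=i-aaaaaaaa        # old instance (placeholder)
NEW=i-bbbbbbbb        # replacement instance (placeholder)

$DRY_RUN aws ec2 detach-volume --volume-id "$VOL" --instance-id "$OLD"
$DRY_RUN aws ec2 attach-volume --volume-id "$VOL" --instance-id "$NEW" --device /dev/sdf
$DRY_RUN mount /dev/sdf /var/lib/riak   # then mount it where Riak expects its data
```

Because the data never leaves the volume, this avoids copying the bitcask files over the network entirely.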
Regarding swap, whether you have it on or not is a personal decision. Riak
will "do the right thing" and exit when it can't allocate more memory,
allowing you to figure out what went wrong -- as opposed to grinding the
machine into IO oblivion while consuming more and more swap. That said, in
some deployments (notably not on EC2), swap can be helpful.
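To check whether a given host actually has swap configured, something like this works on Linux:

```shell
# Report how much swap (if any) is configured on this Linux host.
# SwapTotal of 0 kB means no swap at all -- the EC2 default on larger instances.
SWAP_KB=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
echo "SwapTotal: ${SWAP_KB} kB"
```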
Hope that helps,
Sean Cribbs <sean at basho.com>
Basho Technologies, Inc.
On Tue, Aug 16, 2011 at 3:46 AM, Jeff Pollard <jeff.pollard at gmail.com> wrote:
> Hello everyone,
> We've got a very interesting problem. We're hosting our 5-node cluster on EC2
> running Ubuntu 10.04 LTS (Lucid Lynx) Server 64-bit <http://aws.amazon.com/amis/4348> using
> m2.xlarge instance types, and over the past 5 days we've had two EC2 servers
> randomly restart on us. We've checked the logs and there was nothing that
> we saw that indicated why they restarted. One second they were happily
> logging and the next second the server was in the process of rebooting.
> This is particularly bad because every time the node comes back up we get
> merge errors due to an existing bug in Riak and have to restore from a
> recent backup.
> Just today we noticed that the EC2 servers did not have swap enabled
> (apparently the norm for xlarge+ instances), which we thought might have
> been our problem. My knowledge of what happens when swap is off is pretty
> poor - but I have been told that the Linux OOM killer should still be
> invoked and start trying to kill processes, rather than the server simply
> restarting. Is that correct? Also, how would Riak hypothetically handle
> swap being off on a system? We're using Bitcask if that helps.
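One way to check after the fact whether the OOM killer actually fired is to look for its signature in the kernel log (path assumes Ubuntu/Debian; a hard kernel panic and reboot would leave no such trace):

```shell
# Look for OOM-killer activity in the kernel log (path assumes Ubuntu/Debian),
# and show the current overcommit policy (0 = heuristic overcommit, the default).
grep -i "killed process" /var/log/kern.log 2>/dev/null || echo "no OOM kills logged"
OVERCOMMIT=$(cat /proc/sys/vm/overcommit_memory)
echo "vm.overcommit_memory = $OVERCOMMIT"
```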
> Secondly, one of our ops guys here thinks the issue might be related to a
> bug <http://ubuntuforums.org/showthread.php?t=1436497> (?) that other
> Ubuntu users of the same version seem to have. In fact, we do see the same
> "INFO: task cron:15047 blocked for more than 120 seconds" line in our log
> file. We're also running an AMI that isn't the official one from Canonical,
> so our thought is that an upgrade to the official AMI would help.
> If we do want to upgrade, it will mean moving each cluster node to new
> hardware. I wanted to ask the list to make sure we were doing it correctly.
> Here is the plan to transfer a node to new hardware -- note that these
> steps will be done on one node at a time, and we'll make sure the cluster
> has stabilized after doing one node before moving on to the next one.
> 1. Stop riak on old server.
> 2. Copy data directory (including bitcask, mr_queue and ring folders)
> to a shared location.
> 3. Shutdown old server.
> 4. Boot new replacement server, installing (but not starting) Riak.
> 5. Transfer data directory from shared location to data folder on new
> server.
> 6. Start riak.
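In shell terms, the plan above might be sketched as follows. Paths assume a default Debian/Ubuntu Riak package layout (`/var/lib/riak`), and `SHARED` is a placeholder for whatever shared location is used; `DRY_RUN=echo` makes this print the commands rather than run them:

```shell
# Dry-run sketch of the per-node migration steps described above.
# Paths and SHARED are assumptions -- adjust to the actual install.
DRY_RUN=echo
SHARED=/mnt/shared/riak-backup   # placeholder for the shared location

$DRY_RUN riak stop                            # 1. stop riak on the old server
$DRY_RUN rsync -a /var/lib/riak/ "$SHARED/"   # 2. copy bitcask, mr_queue, ring out
# (steps 3-4: shut down old server, boot replacement, install but don't start riak)
$DRY_RUN rsync -a "$SHARED/" /var/lib/riak/   # 5. restore data onto the new server
$DRY_RUN riak start                           # 6. start riak
```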
> My main concern is whether the ring state will transfer to a new node safely,
> assuming the new server has the same hostname and node name as the old
> server. The new server will have a different IP address, but all the node
> names in our cluster use hostnames, and those will not be changing.
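Since the ring file records Erlang node names, the concrete thing to verify is that the `-name` line in vm.args is identical on the replacement server. A self-contained illustration (the file contents here are hypothetical; the real file lives at something like /etc/riak/vm.args on a packaged install):

```shell
# Illustration with a hypothetical vm.args: the ring stores Erlang node names,
# so the -name value on the new server must match the old one exactly.
VMARGS=$(mktemp)
cat > "$VMARGS" <<'EOF'
-name riak@db1.example.com
-setcookie riak
EOF
NODE=$(awk '/^-name/ {print $2}' "$VMARGS")
echo "node name: $NODE"   # must be unchanged after the move
rm -f "$VMARGS"
```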