Load average spikes during merging?

Jeff Pollard jeff.pollard at gmail.com
Thu Aug 18 17:11:54 EDT 2011


Overnight we do some data collection that is stored in Riak, and just last
night we saw the load on one of our servers spike very high and then drop back
down to more acceptable levels.  You can see a graph of it here:
<https://img.skitch.com/20110818-t8bctnyjx5e3bhrwyqfg9cag2k.png>.
This node happens to be one that we had to restore from a backup a
couple of days ago, so our initial thought was that it was just doing a lot
of read repair and merging, but looking at the erlang.*.logs I only see log
entries for merging during some of the periods of high load, not all of
them.  The other nodes did exhibit some spiky load average last
night, but the one I linked to was certainly the most egregious offender.

Another datapoint to consider is that our data collection job is also very
cyclical.  It will be doing 2,500 requests/minute (~2,000 GET, 500 PUT) to
Riak and then one minute later that will suddenly jump to 7,000
requests/minute (~5,500 GET, 1,600 PUT).  This cycle repeats for somewhere
between 2 and 4 hours overnight.  Anecdotally, I've seen the load on our Riak
nodes spike when the requests/minute on them goes from more-or-less flat to
a high request rate, so I thought that perhaps the fluctuation in request
rate to the Riak nodes was causing some sort of problem.  After the
initial 2-4 hours at the 2,500/7,000 request rate, we have a similarly
shaped but lower-throughput cyclical request pattern (500/2,000
requests/minute), during which the load is much lower (< 4).

So my main question for the list is: is this normal or abnormal behavior?
Should we be concerned?  These nodes are hosted on EC2 with ephemeral
disks, so the high load average is probably simply due to I/O wait.  I
checked, and the CPU usage of Riak itself during the high load averages was
very small (< 10% of total), so as far as I'm concerned the source of the
high load has to be I/O wait.  What I'm not sure about is whether I should
be alarmed by the high load average in general.  We're in the process of adding I/O
wait to the monitoring system for our Riak nodes, so I'll likely have more
data tomorrow on I/O wait during overnight data collection.
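
For reference, a minimal sketch of how I/O wait could be sampled next to the
load average, assuming the nodes run Linux and expose /proc/stat and
/proc/loadavg.  The function names (parse_cpu_times, iowait_percent,
sample_iowait) and the 5-second interval are illustrative choices, not part
of Riak or of any particular monitoring system.  It may also help to keep in
mind that on Linux the load average counts tasks in uninterruptible sleep
(typically blocked on disk I/O) as well as runnable tasks, which is one way
a high load can coexist with low CPU usage.

```python
import time


def parse_cpu_times(stat_text):
    """Return the aggregate CPU counters from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu"; per-core lines are cpu0, cpu1, ...
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    if len(before) < 5 or len(after) < 5:
        raise ValueError("samples do not include an iowait field")
    deltas = [b - a for a, b in zip(before, after)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so only the
    # first eight fields go into the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=5.0):
    """Return (1-minute load average, iowait %) measured over `interval` seconds."""
    with open("/proc/stat") as f:
        before = parse_cpu_times(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_cpu_times(f.read())
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    return load1, iowait_percent(before, after)
```

Calling sample_iowait() once a minute during the overnight collection and
logging both numbers would show whether the load spikes line up with I/O
wait, under the assumptions above.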