Riak cluster unresponsive after single node failure
Scott Lystig Fritchie
fritchie at snookles.com
Tue May 8 18:26:14 EDT 2012
>>> "ar" == Armon Dadgar <armon.dadgar at gmail.com> wrote:
ar> All the nodes appeared to have been blocked trying to talk to
ar> riak001 which was the ring claimant at the time. Doing this seems to
ar> have cleared the state enough for the cluster to make progress.
Armon, it's quite unlikely that the ring claimant was doing anything
special because the claimant only acts when cluster membership changes.
Instead, it's quite likely that riak001 was busy doing a set of LevelDB
compactions. There have been a number of recent changes to reduce the
worst-case time that a LevelDB compaction can block an Erlang process
scheduler; a blocked scheduler blocks *everything* running on it,
including the keep-alive messages that are sent between Erlang
nodes. The longest LevelDB-related
stoppage that I've seen was 7.5 minutes. :-( When that happens on
node X, all other nodes will complain (almost simultaneously) that
node X is down. It's not *down*, it's just reallyreallyreally slow to
respond to messages ... which is effectively the same as being down.
Checking for big LevelDB compaction storms is pretty easy using
DTrace or SystemTap, but you're probably not using a kernel that
has user-space SystemTap available. There are compaction messages
in the "LOG" file of each LevelDB data directory. The hassle is the
need to look at all of them in parallel.
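If it helps, here's a rough Python sketch of that kind of parallel
look: it walks a data directory (the /var/lib/riak/leveldb path is just
a guess, adjust for your install), pulls the compaction-related lines
out of every LOG file, and interleaves them by timestamp:

    #!/usr/bin/env python
    # Rough sketch, not battle-tested: walk DATA_ROOT, collect the
    # compaction-related lines from every LevelDB "LOG" file, and
    # print them interleaved in timestamp order.
    import os, sys

    DATA_ROOT = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/riak/leveldb"

    events = []
    for dirpath, _dirs, files in os.walk(DATA_ROOT):
        if "LOG" not in files:
            continue
        for line in open(os.path.join(dirpath, "LOG")):
            # Matching on "ompact" catches "Compacting"/"Compacted" variants.
            if "ompact" in line:
                # LOG lines begin with a timestamp (e.g. 2012/05/08-18:26:14),
                # so sorting on that first token sorts on time.
                events.append((line.split()[0], dirpath, line.rstrip()))

    for _stamp, vnode_dir, line in sorted(events):
        print("%s: %s" % (vnode_dir, line))

A burst of compaction lines across many vnode directories at the same
wall-clock time is the storm signature you're looking for.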
A secondary symptom is visible if you watch write ops via "iostat -x
1": the amount of data written spikes much higher than the writes
triggered by Riak client operations alone. (Read ops would spike too,
except that many of the files input to a compaction are already cached
by the OS.)
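If you want a quick-and-dirty number to watch instead of eyeballing
iostat, a sketch like this (Linux-only; the device name "sda" is my
assumption, substitute whatever backs your Riak data) samples
/proc/diskstats once per second:

    #!/usr/bin/env python
    # Minimal sketch: print KB written per second for one block device
    # by sampling /proc/diskstats.  Field 10 of each line is sectors
    # written; sectors are 512 bytes.
    import time

    DEV = "sda"   # assumption: substitute the device under your Riak data dir

    def sectors_written(dev):
        for line in open("/proc/diskstats"):
            f = line.split()
            if f[2] == dev:
                return int(f[9])
        raise ValueError("device %s not in /proc/diskstats" % dev)

    prev = sectors_written(DEV)
    while True:
        time.sleep(1)
        cur = sectors_written(DEV)
        print("%.1f KB/s written" % ((cur - prev) * 512 / 1024.0))
        prev = cur

If that number stays several times higher than the write volume your
clients are generating, compaction is a good suspect.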
Your primary keys look UID'ish. If they are not lexicographically
adjacent to other keys inserted at around the same time, you will cause
many more LevelDB compaction events than if your keys were adjacent
(e.g. by prefixing them with a wall-clock timestamp).
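To make that concrete, here's a toy sketch (the key format is only an
illustration, not a recommendation for your schema) of how a timestamp
prefix keeps same-time inserts adjacent in LevelDB's sorted keyspace:

    #!/usr/bin/env python
    # Toy sketch: bare UUID-style keys written at the same moment
    # scatter across the sorted keyspace, but a wall-clock-timestamp
    # prefix keeps each batch in one contiguous run.
    import time, uuid

    def batch(ts, n=3):
        # zero-padded epoch seconds so the prefix sorts lexicographically
        return ["%010d-%s" % (ts, uuid.uuid4().hex) for _ in range(n)]

    now = int(time.time())
    old_batch, new_batch = batch(now - 3600), batch(now)

    # LevelDB stores keys in sorted order; the two batches stay in two
    # contiguous runs instead of interleaving the way bare UUIDs would.
    for key in sorted(old_batch + new_batch):
        print(key)

Fewer overlapping key ranges per write burst means fewer files drawn
into each compaction.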