Riak cluster unresponsive after single node failure
armon.dadgar at gmail.com
Tue May 8 18:54:47 EDT 2012
My mistake, I was not sure if the claimant was responsible for convergence.
If this was a compaction, it was not one that would ever finish… The node went
down at about 1AM, and by 9AM when I started to resolve the issue it was in
the same state. I was unable to investigate the state of that machine, as it
was refusing any SSH connections.
Thanks for mentioning the keys. We've been thinking of doing just that
to make keys written at the same time lexicographically adjacent.
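
A minimal sketch of the timestamp-prefix idea, assuming keys are plain strings built on the client side. The helper name time_prefixed_key and the "millis-uid" layout are hypothetical, chosen only for illustration; the point is that a zero-padded wall-clock prefix makes keys inserted close together in time sort next to each other.

```python
import time
import uuid


def time_prefixed_key(uid=None, now=None):
    """Return a key whose leading characters sort by creation time.

    Hypothetical helper: the zero-padded millisecond timestamp makes keys
    written close together in time compare as lexicographic neighbours.
    """
    if uid is None:
        uid = uuid.uuid4().hex  # stand-in for whatever UID is used today
    if now is None:
        now = time.time()
    millis = int(now * 1000)
    # 13 digits covers millisecond timestamps until the year 2286.
    return "%013d-%s" % (millis, uid)
```

The trade-off is that a reader needs the full prefixed key to fetch the object, so the timestamp has to be stored or derivable wherever the UID alone is used today.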
On Tuesday, May 8, 2012 at 3:26 PM, Scott Lystig Fritchie wrote:
> > > > "ar" == Armon Dadgar <armon.dadgar at gmail.com> wrote:
> ar> All the nodes appeared to have been blocked trying to talk to riak
> ar> 001 which was the ring claimant at the time. Doing this seems to
> ar> have cleared the state enough for the cluster to make progress
> ar> again.
> Armon, it's quite unlikely that the ring claimant was doing anything
> special because the claimant only acts when cluster membership changes.
> Instead, it's quite likely that riak001 was busy doing a set of LevelDB
> compactions. There have been a number of changes recently to reduce how
> long a worst-case LevelDB compaction can block the Erlang process
> schedulers, which blocks *everything*, including the keep-alives
> that are sent between Erlang nodes. The longest LevelDB-related
> stoppage that I've seen was 7.5 minutes. :-( When that happens on a
> node X, then all other nodes will complain (almost simultaneously) that
> node X is down. It's not *down*, it's just reallyreallyreally slow to
> respond to messages ... which is effectively the same as being down.
> Checking for big LevelDB compaction storms is pretty easy using
> DTrace or SystemTap, but you're probably not using a kernel that
> has user-space SystemTap available. There are compaction messages
> in the "LOG" file of each LevelDB data directory. The hassle is the
> need to look at all of them in parallel.
> A secondary effect shows up when watching write ops via "iostat -x 1": the
> amount of data written spikes much higher than writes triggered only by
> Riak client operations. (Read ops would go higher too, except that many
> files input to a compaction are already cached by the OS.)
> Your primary keys look UID'ish. If they are not lexicographically adjacent
> to other keys inserted at the same time, you will cause many more LevelDB
> compaction events than if your keys were adjacent (e.g. prefixing them with
> a wall-clock timestamp).
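
On the "LOG" files mentioned above: a rough sketch of one way to look at all of them together, by merging their compaction lines into a single time-ordered list. It assumes one LOG file per sub-directory of a data root, and the timestamp prefix written by stock LevelDB (for example "2012/05/08-01:02:03.123456"); the LevelDB bundled with Riak may log differently, so the regular expression may need adjusting. The function name compaction_lines is made up for this example.

```python
import glob
import os
import re

# Leading timestamp as written by stock LevelDB, e.g. "2012/05/08-01:02:03.123456".
# Assumption: the LevelDB build bundled with Riak uses the same prefix.
STAMP = re.compile(r"^(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d+)\s")


def compaction_lines(data_root):
    """Collect compaction lines from every <data_root>/*/LOG, sorted by time."""
    hits = []
    for path in glob.glob(os.path.join(data_root, "*", "LOG")):
        vnode = os.path.basename(os.path.dirname(path))
        with open(path, errors="replace") as fh:
            for line in fh:
                m = STAMP.match(line)
                # Matches both "Compacting ..." and "Compacted ..." messages.
                if m and "Compact" in line:
                    hits.append((m.group(1), vnode, line.rstrip()))
    # The stock timestamp is fixed-width, so string order is chronological.
    return sorted(hits)
```

Calling compaction_lines() on the LevelDB data root and printing the result gives one interleaved timeline, which makes a burst of compactions across many directories easier to spot than opening each LOG by hand.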