Riak cluster unresponsive after single node failure

Armon Dadgar armon.dadgar at gmail.com
Tue May 8 18:58:32 EDT 2012


Where by "competition", I meant "compaction". Derp.

Best Regards,

Armon Dadgar


On Tuesday, May 8, 2012 at 3:54 PM, Armon Dadgar wrote:

> Hey Scott,  
>  
> My mistake, I was not sure if the claimant was responsible for convergence.
>  
> If this was a competition, it was not one that would ever finish… The node went
> down at about 1AM, and by 9AM, when I started to resolve the issue, it was still
> in the same state. I was unable to investigate the state of that machine, as it
> was refusing any SSH connections.
>  
> Thanks for mentioning the keys. We've been thinking of doing just that
> to get keys lexicographically near each other.
>  
> Best Regards,
>  
> Armon Dadgar
>  
>  
> On Tuesday, May 8, 2012 at 3:26 PM, Scott Lystig Fritchie wrote:
>  
> > > > > "ar" == Armon Dadgar <armon.dadgar at gmail.com (mailto:armon.dadgar at gmail.com)> wrote:
> > > >  
> > >  
> >  
> >  
> > ar> All the nodes appeared to have been blocked trying to talk to riak
> > ar> 001 which was the ring claimant at the time. Doing this seems to
> > ar> have cleared the state enough for the cluster to make progress
> > ar> again.
> >  
> > Armon, it's quite unlikely that the ring claimant was doing anything
> > special because the claimant only acts when cluster membership changes.
> >  
> > Instead, it's quite likely that riak001 was busy doing a set of LevelDB
> > compactions. There have been a number of changes recently to reduce the
> > worst-case time that LevelDB compaction can block the Erlang process
> > schedulers; when the schedulers block, *everything* blocks, including the
> > keep-alives that are sent between Erlang nodes. The longest LevelDB-related
> > stoppage that I've seen was 7.5 minutes. :-( When that happens on a
> > node X, all other nodes will complain (almost simultaneously) that
> > node X is down. It's not *down*, it's just reallyreallyreally slow to
> > respond to messages ... which is effectively the same as being down.
> >  
> > Checking for big LevelDB compaction storms is pretty easy using
> > DTrace or SystemTap, but you're probably not using a kernel that
> > has user-space SystemTap available. There are compaction messages
> > in the "LOG" file of each LevelDB data directory. The hassle is the
> > need to look at all of them in parallel.
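> > 
> > If you just want a quick look without DTrace or SystemTap, a few lines of
> > Python can sweep every LOG file at once. This is only a sketch: the glob
> > pattern assumes a Linux-ish data directory layout, so adjust the path for
> > your install.
> > 
> >     import glob
> > 
> >     # Hypothetical location of the LevelDB data dirs; adjust for your setup.
> >     for path in sorted(glob.glob("/var/lib/riak/leveldb/*/LOG")):
> >         with open(path) as f:
> >             for line in f:
> >                 # LevelDB logs "Compacting ..." / "compacted ..." entries.
> >                 if "ompact" in line:
> >                     print("%s: %s" % (path, line.rstrip()))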
> >  
> > A secondary symptom shows up in write ops via "iostat -x 1": the
> > amount of data written spikes much higher than Riak client operations
> > alone would trigger. (Read ops would spike too, except that many
> > files input to a compaction are already cached by the OS.)
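> > 
> > If you'd rather script that than eyeball iostat, a rough Linux-only sketch
> > (untested; "sda" is just a placeholder for whatever device backs your data
> > dir) can poll sectors written per second from /proc/diskstats:
> > 
> >     import time
> > 
> >     DEV = "sda"  # placeholder: the device backing your Riak data dir
> > 
> >     def sectors_written(dev):
> >         # /proc/diskstats field 10 is sectors written since boot.
> >         with open("/proc/diskstats") as f:
> >             for line in f:
> >                 fields = line.split()
> >                 if fields[2] == dev:
> >                     return int(fields[9])
> >         raise ValueError("device not found: %s" % dev)
> > 
> >     prev = sectors_written(DEV)
> >     while True:
> >         time.sleep(1)
> >         cur = sectors_written(DEV)
> >         print("%d sectors/sec written" % (cur - prev))
> >         prev = cur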
> >  
> > Your primary keys look UID'ish. If they are not lexicographically adjacent
> > to other keys inserted at the same time, you will cause many more LevelDB
> > compaction events than if your keys were adjacent (e.g. by prefixing them
> > with a wall-clock timestamp).
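> > 
> > To make the timestamp-prefix idea concrete, here's a tiny sketch (nothing
> > Riak-specific; the key format is made up): zero-padded epoch seconds keep
> > lexicographic order the same as insertion order, so keys written around
> > the same time land next to each other.
> > 
> >     import time
> >     import uuid
> > 
> >     def make_key():
> >         # Zero-padding keeps string sort order == numeric (time) order.
> >         return "%010d:%s" % (int(time.time()), uuid.uuid4())
> > 
> >     print(make_key())  # e.g. "1336514072:0f8c...-..."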
> >  
> > -Scott  
>  
