RiakCS node crash +90% disk i/o

Jon Meredith jmeredith at basho.com
Wed Dec 3 10:05:43 EST 2014


It's most likely that you haven't increased the limit from the default;
the backends in Riak require a large number of file descriptors.  If you
see a repeat of the problem, could you please run lsof against the
beam.smp process and provide a full listing of all of the files under the
leveldb and bitcask directories, so we can check whether any unlinked
files are still open?
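
If lsof isn't handy, roughly the same information can be pulled from
/proc on Linux. This is only a sketch of that alternative, not the lsof
run itself: it reads /proc/<pid>/fd, where the kernel appends
" (deleted)" to the link target of a file that has been unlinked but is
still open. The function names and the data directory paths in the usage
note are assumptions for illustration; substitute the pid of beam.smp and
your own data directories.

```python
import os


def open_files(pid):
    """Return sorted (fd, target) pairs for the descriptors a process holds.

    Linux only: reads the symlinks under /proc/<pid>/fd.
    """
    fd_dir = f"/proc/{pid}/fd"
    entries = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        entries.append((int(name), target))
    return sorted(entries)


def unlinked_under(entries, directories):
    """Keep entries that are unlinked and lived under one of the directories."""
    prefixes = tuple(d.rstrip("/") + "/" for d in directories)
    return [
        (fd, target)
        for fd, target in entries
        if target.endswith(" (deleted)") and target.startswith(prefixes)
    ]
```

Something like unlinked_under(open_files(pid), ["/var/lib/riak/leveldb",
"/var/lib/riak/bitcask"]) would then list the suspects, and
len(open_files(pid)) gives the total descriptor count to compare against
the limit. Reading another user's /proc/<pid>/fd needs to be done as that
user or as root.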

Riak tries to warn you, but you may not see the warning if Riak is started
through init scripts or systemd. It would be good to move the check into
the actual Riak startup code and write it to console.log, rather than
leaving it in the shell scripts we start with.

*riak* *develop* % bin/riak start

!!!!
!!!! WARNING: ulimit -n is 512; 65536 is the recommended minimum.
!!!!
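
Since that warning is easy to miss when the node is started by init
scripts or systemd, it can help to check the limit the running process
actually has. A minimal sketch, assuming Linux, where /proc/<pid>/limits
shows the limits in effect for that process; the function names are made
up for illustration, and 65536 is the figure from the warning above.

```python
RECOMMENDED_MINIMUM = 65536  # figure quoted in the startup warning


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    None stands for "unlimited".
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns after the three-word name: soft limit, hard limit, units
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' line found")


def below_recommended(soft):
    """True when the soft limit is lower than the recommended minimum."""
    return soft is not None and soft < RECOMMENDED_MINIMUM


def check_pid(pid):
    """Read the limits of a running process, e.g. beam.smp, and report them."""
    with open(f"/proc/{pid}/limits") as fh:
        soft, hard = parse_max_open_files(fh.read())
    return soft, hard, below_recommended(soft)
```

Calling check_pid with the pid of beam.smp returns the soft limit, the
hard limit, and whether the soft limit is under the recommended minimum.
This reflects what the process was started with, which may differ from
what ulimit -n reports in your login shell.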




On Wed, Dec 3, 2014 at 7:45 AM, Alex Millar <alex at gobonfire.com> wrote:

> Thanks Jon.
>
> I had thought I had the ulimit bumped up and will need to do some more
> reading on this.
>
> Is it possible a node could have had dangling file descriptor references?
> (Effectively no “garbage collection” happening and thus this was just a
> tipping point)
>
> I’m assuming the more likely case was I didn’t have it increased from the
> default setting on the OS and thus hit the limit and everything crashed.
>
>   *Alex Millar*, CTO
> Office: 1-800-354-8010 ext. 704 <+18003548010>
> Mobile: 519-729-2539 <+15197292539>
> *GoBonfire*.com <http://GoBonfire.com>
>
> On December 3, 2014 at 9:31:22 AM, Jon Meredith (jmeredith at basho.com)
> wrote:
>
> Hi Alex.
>
> It looks like you exceeded the open files ulimit. Information on how to
> fix it is here:
>
> http://docs.basho.com/riak/latest/ops/tuning/open-files-limit/#Changing-the-limit
>
> Jon
>
> On Dec 3, 2014, at 7:15 AM, Alex Millar <alex at gobonfire.com> wrote:
>
>   Good morning Riak-Users
>
>  Last night one of the nodes in my 5 node RiakCS cluster went haywire and
> shot up to +90% disk i/o utilization seemingly out of the blue.
>
>  Looking at the riak error.log I saw the following being continuously
> written.
>
>  2014-12-02 21:57:13.220 [error] <0.29210.3089> CRASH REPORT Process
> <0.29210.3089> with 0 neighbours exited with reason: no match of right hand
> value {error,{db_open,"IO error:
> /var/lib/riak/anti_entropy/570899077082383952423314387779798054553098649600/CURRENT:
> Too many open files"}} in hashtree:new_segment_store/2 line 505 in
> gen_server:init_it/6 line 328
> 2014-12-02 21:57:13.226 [error] <0.29211.3089> CRASH REPORT Process
> <0.29211.3089> with 0 neighbours exited with reason: no match of right hand
> value {error,{db_open,"IO error:
> /var/lib/riak/anti_entropy/776422744832042175295707567380525354192214163456/LOCK:
> Too many open files"}} in hashtree:new_segment_store/2 line 505 in
> gen_server:init_it/6 line 328
> 2014-12-02 21:57:13.226 [error] <0.29212.3089> CRASH REPORT Process
> <0.29212.3089> with 0 neighbours exited with reason: no match of right hand
> value {error,{db_open,"IO error:
> /var/lib/riak/anti_entropy/570899077082383952423314387779798054553098649600/CURRENT:
> Too many open files"}} in hashtree:new_segment_store/2 line 505 in
> gen_server:init_it/6 line 328
> 2014-12-02 21:57:13.226 [error] <0.29213.3089> CRASH REPORT Process
> <0.29213.3089> with 0 neighbours exited with reason: no match of right hand
> value {error,{db_open,"IO error:
> /var/lib/riak/anti_entropy/776422744832042175295707567380525354192214163456/CURRENT:
> Too many open files"}} in hashtree:new_segment_store/2 line 505 in
> gen_server:init_it/6 line 328
> 2014-12-02 21:57:13.286 [error] <0.29215.3089> CRASH REPORT Process
> <0.29215.3089> with 0 neighbours exited with reason: no match of right hand
> value {error,{db_open,"IO error:
> /var/lib/riak/anti_entropy/776422744832042175295707567380525354192214163456/LOCK:
> Too many open files"}} in hashtree:new_segment_store/2 line 505 in
> gen_server:init_it/6 line 328
> 2014-12-02 21:57:13.286 [error] <0.29214.3089> CRASH REPORT Process
> <0.29214.3089> with 0 neighbours exited with reason: no match of right hand
> value {error,{db_open,"IO error:
> /var/lib/riak/anti_entropy/570899077082383952423314387779798054553098649600/LOCK:
> Too many open files"}} in hashtree:new_segment_store/2 line 505 in
> gen_server:init_it/6 line 328
> 2014-12-02 21:57:13.286 [error] <0.29217.3089> CRASH REPORT Process
> <0.29217.3089> with 0 neighbours exited with reason: no match of right hand
> value {error,{db_open,"IO error:
> /var/lib/riak/anti_entropy/570899077082383952423314387779798054553098649600/LOCK:
> Too many open files"}} in hashtree:new_segment_store/2 line 505 in
> gen_server:init_it/6 line 328
> 2014-12-02 21:57:13.287 [error] <0.29216.3089> CRASH REPORT Process
> <0.29216.3089> with 0 neighbours exited with reason: no match of right hand
> value {error,{db_open,"IO error:
> /var/lib/riak/anti_entropy/776422744832042175295707567380525354192214163456/LOCK:
> Too many open files"}} in hashtree:new_segment_store/2 line 505 in
> gen_server:init_it/6 line 328
> 2014-12-02 21:57:13.312 [error] <0.29219.3089> CRASH REPORT Process
> <0.29219.3089> with 0 neighbours exited with reason: no match of right hand
> value {error,{db_open,"IO error:
> /var/lib/riak/anti_entropy/570899077082383952423314387779798054553098649600/LOCK:
> Too many open files"}} in hashtree:new_segment_store/2 line 505 in
> gen_server:init_it/6 line 328
> 2014-12-02 21:57:15.634 [error] <0.29218.3089> CRASH REPORT Process
> <0.29218.3089> with 0 neighbours exited with reason: no match of right hand
> value {error,{db_open,"IO error:
> /var/lib/riak/anti_entropy/776422744832042175295707567380525354192214163456/CURRENT:
> Too many open files"}} in hashtree:new_segment_store/2 line 505 in
> gen_server:init_it/6 line 328
>
>  Leading up to this there didn’t appear to be any significant load on our
> cluster.
>
>  I simply restarted the node and the issue went away but I wanted to reach
> out to get some help as to why / how this arose in the first place.
>
>  Regards,
>
>   *Alex Millar*, CTO
> Office: 1-800-354-8010 ext. 704 <+18003548010>
> Mobile: 519-729-2539 <+15197292539>
> *GoBonfire*.com <http://GoBonfire.com>
>
>  _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
>  http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>


-- 
Jon Meredith
Chief Architect
Basho Technologies, Inc.
jmeredith at basho.com