Leveldb segfault during Riak startup

Luke Bakken lbakken at basho.com
Fri Dec 18 09:45:08 EST 2015


Hi Antti,

Riak is not tested on btrfs and the file system is not officially
supported. We recommend ext4 or xfs for Linux. ZFS is an option on
Solaris derivatives and FreeBSD.

--
Luke Bakken
Engineer
lbakken at basho.com


On Fri, Dec 18, 2015 at 6:14 AM, Antti Kuusela
<antti.kuusela at firstbeat.fi> wrote:
> Hi,
>
> I have been testing Riak and Riak CS as a possible solution for our future
> storage needs. I have a five server cluster running Centos 7. Riak version
> is 2.1.3 (first installed as 2.1.1, updated twice via Basho repo) and Riak
> CS version is 2.1.0. The servers each have 64GB RAM and six 4TB disks in
> raid 6 using btrfs.
>
> I have been pushing random data into Riak-CS via s3cmd to see how the system
> behaves. Smallest objects have been 2000 bytes, largest 100MB. I have also
> been making btrfs snapshots of the entire platform data dir nightly for
> backup purposes. Stop Riak CS, wait 10 seconds, stop Riak, wait 10, make
> snapshot, start Riak, wait 180 seconds, start Riak CS. This is performed on
> each of the servers in turn with a five minute wait in between. I have added
> the waits to try spread the startup load and allow the system time to get
> things running. New data is constantly pushed to the S3 API but restarting
> the nodes in rotation causes by far the highest stress on the system.
>
> I have encountered one problem in particular. Quite often one of the Riak
> nodes starts up but after a couple of minutes it just drops, all processes
> exited except for epmd.
>
> Following is from /var/log/riak/console, most of the lines skipped for sake
> of brevity. Normal startup stuff, as far as I can see:
>
> 2015-12-16 00:26:04.446 [info] <0.7.0> Application lager started on node
> 'riak at 192.168.50.32'
> ...
> 2015-12-16 00:26:04.490 [info] <0.72.0> alarm_handler:
> {set,{system_memory_high_watermark,[]}}
> ...
> 2015-12-16 00:26:04.781 [info]
> <0.206.0>@riak_core_capability:process_capability_changes:555 New
> capability: {riak_core,vnode_routing} = proxy
> ...
> 2015-12-16 00:26:04.869 [info] <0.7.0> Application riak_core started on node
> 'riak at 192.168.50.32'
> ...
> 2015-12-16 00:26:04.969 [info] <0.407.0>@riak_kv_env:doc_env:46 Environment
> and OS variables:
> 2015-12-16 00:26:05.124 [warning] <0.6.0> lager_error_logger_h dropped 9
> messages in the last second that exceeded the limit of 100 messages/sec
> 2015-12-16 00:26:05.124 [info] <0.407.0> riak_kv_env: Open file limit: 65536
> 2015-12-16 00:26:05.124 [warning] <0.407.0> riak_kv_env: Cores are disabled,
> this may hinder debugging
> 2015-12-16 00:26:05.124 [info] <0.407.0> riak_kv_env: Erlang process limit:
> 262144
> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Erlang ports limit:
> 65536
> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: ETS table count limit:
> 256000
> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Thread pool size: 64
> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Generations before
> full sweep: 0
> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Schedulers: 12 for 12
> cores
> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: sysctl vm.swappiness
> is 0 greater than or equal to 0)
> 2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: sysctl
> net.core.wmem_default is 8388608 lesser than or equal to 8388608)
> ...
> 2015-12-16 00:26:05.139 [info] <0.478.0>@riak_core:wait_for_service:504
> Waiting for service riak_kv to start (0 seconds)
> 2015-12-16 00:26:05.158 [info]
> <0.495.0>@riak_kv_entropy_manager:set_aae_throttle_limits:790 Setting AAE
> throttle limits: [{-1,0},{200,10},{500,50},{750,250},{900,1000},{1100,5000}]
> ...
> 2015-12-16 00:26:30.160 [info]
> <0.495.0>@riak_kv_entropy_manager:perhaps_log_throttle_change:853 Changing
> AAE throttle from undefined -> 5000 msec/key, based on maximum vnode mailbox
> size {unknown_mailbox_sizes,node_list,['riak at 192.168.50.32']} from
> ['riak at 192.168.50.32']
> 2015-12-16 00:27:12.053 [info] <0.478.0>@riak_core:wait_for_service:504
> Waiting for service riak_kv to start (60 seconds)
> 2015-12-16 00:28:25.057 [info] <0.478.0>@riak_core:wait_for_service:504
> Waiting for service riak_kv to start (120 seconds)
>
> And then nothing
>
> From /var/log/messages:
>
> Dec 16 00:26:02 storage2 su: (to riak) root on none
> Dec 16 00:26:04 storage2 riak[48174]: Starting up
> Dec 16 00:28:59 storage2 kernel: traps: beam.smp[48492] general protection
> ip:7fcaf9402f16 sp:7fca6affcdd0 error:0 in eleveldb.so[7fcaf93b5000+93000]
> Dec 16 00:28:59 storage2 run_erl[48172]: Erlang closed the connection.
>
> On another node at a different time /var/log/riak/console.log had similar
> messages, and also some warnings about invalid hint files, such as:
>
> 2015-12-13 00:15:41.232 [warning] <0.815.0> Hintfile
> '/data/riak/bitcask/570899077082383952423314387779798054553098649600/56.bitcask.hint'
> invalid
>
> In this latter example riak was started with "systemctl start riak" rather
> than "riak start". From /var/log/messages:
>
> Dec 13 00:15:29 storage1 riak: Starting riak: [  OK  ]
> Dec 13 00:15:29 storage1 systemd: Started SYSV: Riak is a distributed data
> store.
> Dec 13 00:15:56 storage1 kernel: beam.smp[131820]: segfault at 160 ip
> 00007f24c0902ce6 sp 00007f24337fddd0 error 4 in
> eleveldb.so[7f24c08b5000+93000]
> Dec 13 00:15:56 storage1 run_erl[131501]: Erlang closed the connection.
>
> Of reported Riak bugs, this is similar to
> https://github.com/basho/riak/issues/790 . However, the poster of that issue
> reported that his problem was fixed by repairing leveldb partitions. I
> looked at this following
> http://docs.basho.com/riak/latest/ops/running/recovery/repairing-leveldb/
> but didn't find any errors.
>
> Incidentally, I started having problems with btrfs as well. On one node
> btrfs caused a kernel crash and on another kernel killed beam.smp process
> after it stopped responding for over 120 seconds while syncing to btrfs. The
> kernel in Centos 7 probably isn't best suited for working with btrfs.
> Advertised version is 3.10.0.
>
> So, my question is what is your take on this? Is this a bug in the leveldb
> library? The same one already reported? What log data would help debug or
> reproduce it? Or is there potentially some problem with my setup? Or could
> this be caused by a bug in btrfs? What is your take on using Riak with
> btrfs?
>
> --
> Antti Kuusela, M.Sc
> Senior Software Developer
> Firstbeat Technologies Ltd.
>
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com




More information about the riak-users mailing list