Issue with Riak partition allocation

Peter Bakkum peter at quizlet.com
Mon Sep 8 13:01:12 EDT 2014


Hey all,

Looking for some guidance on a problem we're seeing in production right
now. We're not Riak experts so please bear with us.

A member of our 6-node Riak cluster appeared to fall out of the ring
(riak-admin member-status on that node showed only itself). I ran a
riak-admin join and riak-admin commit to get the node back into the
cluster. Node discovery appears to work now, but that node is now using a
huge amount of disk space. The partition rebalancing process appears to be
creating this condition, and it still hasn't completed after ~16 hours. The
cluster remains functional and is serving our production traffic, and
taking the entire cluster offline isn't an option for us.

Most of our nodes use about 450 GB of space; this node in particular is
using around 1.2 TB, which is pushing the limit of its disk.

Questions:
What's happening here? Is this expected?

What's the best course of action? Should we clear out this node and attempt
to join the cluster again?
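If clearing and rejoining is the answer, the sequence we would attempt is sketched below, based on our reading of the staged-clustering commands (node name is a placeholder for one of our members; please correct this if it's wrong):

```shell
# On the oversized node: stage a graceful departure so its data
# hands off to the rest of the cluster before it leaves the ring
riak-admin cluster leave

# Review the staged changes, then apply them
riak-admin cluster plan
riak-admin cluster commit

# Once handoff finishes and the node is out, wipe its data directory,
# restart Riak, and rejoin through an existing member, e.g.:
#   riak-admin cluster join 'xxxx_prod_cluster@192.168.72.135'
#   riak-admin cluster plan && riak-admin cluster commit
```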

Here are some stats from the node in question. Let me know if anything else
would be helpful.

Thanks for your help.


[root at 192.168.72.19 /data/lib/riak] # riak-admin member-status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      20.3%     16.4%    'xxxx_prod_cluster at 192.168.72.135'
valid      18.0%     17.2%    'xxxx_prod_cluster at 192.168.72.170'
valid      20.3%     17.2%    'xxxx_prod_cluster at 192.168.72.176'
valid       7.0%     16.4%    'xxxx_prod_cluster at 192.168.72.19'
valid      17.2%     16.4%    'xxxx_prod_cluster at 192.168.72.7'
valid      17.2%     16.4%    'xxxx_prod_cluster at 192.168.72.74'


[root at 192.168.72.19 /data/lib/riak] # riak-admin status
1-minute stats for 'xxxx_prod_cluster at 192.168.72.19'
-------------------------------------------
riak_kv_stat_ts : 1410194287
vnode_gets : 1607
vnode_gets_total : 563683
vnode_puts : 39
vnode_puts_total : 5459724
vnode_index_refreshes : 0
vnode_index_refreshes_total : 0
vnode_index_reads : 0
vnode_index_reads_total : 0
vnode_index_writes : 39
vnode_index_writes_total : 5459724
vnode_index_writes_postings : 0
vnode_index_writes_postings_total : 5227558
vnode_index_deletes : 0
vnode_index_deletes_total : 0
vnode_index_deletes_postings : 39
vnode_index_deletes_postings_total : 30613
node_gets : 3602
node_gets_total : 2463956
node_get_fsm_siblings_mean : 1
node_get_fsm_siblings_median : 1
node_get_fsm_siblings_95 : 2
node_get_fsm_siblings_99 : 3
node_get_fsm_siblings_100 : 12
node_get_fsm_objsize_mean : 52047
node_get_fsm_objsize_median : 26936
node_get_fsm_objsize_95 : 167435
node_get_fsm_objsize_99 : 267979
node_get_fsm_objsize_100 : 1313716
node_get_fsm_time_mean : 12223
node_get_fsm_time_median : 6675
node_get_fsm_time_95 : 37390
node_get_fsm_time_99 : 87046
node_get_fsm_time_100 : 345380
node_puts : 39
node_puts_total : 24915
node_put_fsm_time_mean : 4419
node_put_fsm_time_median : 2444
node_put_fsm_time_95 : 12890
node_put_fsm_time_99 : 18775
node_put_fsm_time_100 : 18775
read_repairs : 0
read_repairs_total : 0
coord_redirs_total : 17022
executing_mappers : 0
precommit_fail : 0
postcommit_fail : 0
index_fsm_create : 0
index_fsm_create_error : 0
index_fsm_active : 0
list_fsm_create : 0
list_fsm_create_error : 0
list_fsm_active : 0
pbc_active : 0
pbc_connects : 1
pbc_connects_total : 508
node_get_fsm_active : 1
node_get_fsm_active_60s : 3530
node_get_fsm_in_rate : 55
node_get_fsm_out_rate : 56
node_get_fsm_rejected : 0
node_get_fsm_rejected_60s : 0
node_get_fsm_rejected_total : 0
node_put_fsm_active : 0
node_put_fsm_active_60s : 67
node_put_fsm_in_rate : 1
node_put_fsm_out_rate : 1
node_put_fsm_rejected : 0
node_put_fsm_rejected_60s : 0
node_put_fsm_rejected_total : 0
leveldb_read_block_error : 0
riak_pipe_stat_ts : 1410194286
pipeline_active : 0
pipeline_create_count : 0
pipeline_create_one : 0
pipeline_create_error_count : 0
pipeline_create_error_one : 0
cpu_nprocs : 426
cpu_avg1 : 1352
cpu_avg5 : 1260
cpu_avg15 : 1137
mem_total : 15666507776
mem_allocated : 15479640064
disk : [{"/",8256952,60},
        {"/dev/shm",7649660,0},
        {"/tmpfs",1048576,14},
        {"/tmpfs_mp3",1048576,0},
        {"/data",1514123712,81}]
nodename : 'xxxx_prod_cluster at 192.168.72.19'
connected_nodes : ['xxxx_prod_cluster at 192.168.72.170',
                   'xxxx_prod_cluster at 192.168.72.176',
                   'xxxx_prod_cluster at 192.168.72.74',
                   'xxxx_prod_cluster at 192.168.72.135',
                   'xxxx_prod_cluster at 192.168.72.7']
sys_driver_version : <<"2.0">>
sys_global_heaps_size : 0
sys_heap_type : private
sys_logical_processors : 4
sys_otp_release : <<"R15B01">>
sys_process_count : 2469
sys_smp_support : true
sys_system_version : <<"Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true]">>
sys_system_architecture : <<"x86_64-unknown-linux-gnu">>
sys_threads_enabled : true
sys_thread_pool_size : 64
sys_wordsize : 8
ring_members : ['xxxx_prod_cluster at 192.168.72.135',
                'xxxx_prod_cluster at 192.168.72.170',
                'xxxx_prod_cluster at 192.168.72.176',
                'xxxx_prod_cluster at 192.168.72.19',
                'xxxx_prod_cluster at 192.168.72.7',
                'xxxx_prod_cluster at 192.168.72.74']
ring_num_partitions : 128
ring_ownership : <<"[{'xxxx_prod_cluster at 192.168.72.170',23},
                    {'xxxx_prod_cluster at 192.168.72.74',22},
                    {'xxxx_prod_cluster at 192.168.72.135',26},
                    {'xxxx_prod_cluster at 192.168.72.176',26},
                    {'xxxx_prod_cluster at 192.168.72.7',22},
                    {'xxxx_prod_cluster at 192.168.72.19',9}]">>
ring_creation_size : 128
storage_backend : riak_kv_eleveldb_backend
erlydtl_version : <<"0.7.0">>
riak_control_version : <<"1.4.10-0-g73c43c3">>
cluster_info_version : <<"1.2.4">>
riak_search_version : <<"1.4.10-0-g6e548e7">>
merge_index_version : <<"1.3.2-0-gcb38ee7">>
riak_kv_version : <<"1.4.10-0-g64b6ad8">>
sidejob_version : <<"0.2.0">>
riak_api_version : <<"1.4.10-0-gc407ac0">>
riak_pipe_version : <<"1.4.10-0-g9353526">>
riak_core_version : <<"1.4.10">>
bitcask_version : <<"1.6.6-0-g230b6d6">>
basho_stats_version : <<"1.0.3">>
webmachine_version : <<"1.10.4-0-gfcff795">>
mochiweb_version : <<"1.5.1p6">>
inets_version : <<"5.9">>
erlang_js_version : <<"1.2.2">>
runtime_tools_version : <<"1.8.8">>
os_mon_version : <<"2.2.9">>
riak_sysmon_version : <<"1.1.3">>
ssl_version : <<"5.0.1">>
public_key_version : <<"0.15">>
crypto_version : <<"2.1">>
sasl_version : <<"2.2.1">>
lager_version : <<"2.0.1">>
goldrush_version : <<"0.1.5">>
compiler_version : <<"4.8.1">>
syntax_tools_version : <<"1.6.8">>
stdlib_version : <<"1.18.1">>
kernel_version : <<"2.15.1">>
memory_total : 130705264
memory_processes : 55557705
memory_processes_used : 55341757
memory_system : 75147559
memory_atom : 545377
memory_atom_used : 527226
memory_binary : 12172712
memory_code : 11674242
memory_ets : 11913912