Issue with Riak partition allocation

Peter Bakkum peter at quizlet.com
Tue Sep 9 13:06:59 EDT 2014


Hey Luke,

Here you go. Last night I believe we had ~70 partitions still to transfer; in the output below it's down to 2 + 2 + 21 = 25 waiting, plus the two active transfers.

[root at 192.168.72.19 ~] # riak-admin transfers
'xxxx_prod_cluster at 192.168.72.74' waiting to handoff 2 partitions
'xxxx_prod_cluster at 192.168.72.7' waiting to handoff 2 partitions
'xxxx_prod_cluster at 192.168.72.19' waiting to handoff 21 partitions

Active Transfers:

transfer type: ownership_transfer
vnode type: riak_kv_vnode
partition: 422465317040964124793252646957050560369293000704
started: 2014-09-09 16:15:11 [48.93 min ago]
last update: 2014-09-09 17:04:06 [887.56 ms ago]
total size: 47814241779 bytes
objects transferred: 326825

                          111 Objs/s
xxxx_prod_cluster  =======>  xxxx_prod_cluster
@192.168.72.176              @192.168.72.19
        |===================                        |  45%
                          7.13 MB/s

transfer type: hinted_handoff
vnode type: riak_kv_vnode
partition: 1004782375664995756265033322492444576013453623296
started: 2014-09-09 16:36:37 [27.50 min ago]
last update: 2014-09-09 17:04:07 [411.93 ms ago]
total size: 47179582874 bytes
objects transferred: 213001

                          129 Objs/s
xxxx_prod_cluster  =======>  xxxx_prod_cluster
@192.168.72.19               @192.168.72.7
        |=============                              |  31%
                          8.48 MB/s
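
For what it's worth, a rough back-of-the-envelope on the first transfer
above: 47814241779 bytes is ~44.5 GiB, so at the reported 7.13 MB/s the
whole partition would take on the order of 1.8 hours, and the remaining
55% maybe another hour. To keep an eye on progress we've just been using
a plain shell loop (nothing Riak-specific here):

    # re-run the transfer status every 60 seconds
    while true; do riak-admin transfers; sleep 60; done

Would raising the handoff concurrency via riak-admin transfer-limit
speed this up, or just add I/O pressure? (I believe the default limit is
2, but correct me if I'm wrong.)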



On Tue, Sep 9, 2014 at 7:22 AM, Luke Bakken <lbakken at basho.com> wrote:

> Hi Peter,
>
> Could you please provide the output of "riak-admin transfers"?
> --
> Luke Bakken
> Engineer / CSE
> lbakken at basho.com
>
>
> On Mon, Sep 8, 2014 at 10:01 AM, Peter Bakkum <peter at quizlet.com> wrote:
> > Hey all,
> >
> > Looking for some guidance on a problem we're seeing in production right
> > now. We're not Riak experts, so please bear with us.
> >
> > We had a member of our 6-node Riak cluster appear to fall out
> > (riak-admin member-status on that node showed only itself). So I ran a
> > riak-admin join and riak-admin commit to get the node back into the
> > cluster. Node discovery appears to work now, but for some reason that
> > node is now using a huge amount of disk space. It appears that the
> > partition rebalancing process is creating this condition, and it still
> > hasn't completed after ~16 hours. The cluster is still functional and
> > serving our production traffic, and taking the entire cluster offline
> > isn't an option for us.
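> >
> > Roughly the commands we ran to rejoin it (from memory, so the exact
> > form may be off; this is the 1.4 staged-clustering sequence, and the
> > join target is just one of our other members):
> >
> >     # on the node that fell out of the cluster
> >     riak-admin cluster join xxxx_prod_cluster@192.168.72.135
> >     riak-admin cluster plan     # review the proposed ownership changes
> >     riak-admin cluster commit   # kick off the rebalance / handoff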
> >
> > Most of our nodes use about 450GB of space; this node in particular is
> > using around 1.2TB, which is pushing the limit of its disk.
> >
> > Questions:
> > What's happening here? Is this expected?
> >
> > What's the best course of action? Should we clear out this node and
> > attempt to join the cluster again?
> >
> > Here are some stats from the node in question. Let me know if anything
> > else would be helpful.
> >
> > Thanks for your help.
> >
> >
> > [root at 192.168.72.19 /data/lib/riak] # riak-admin member-status
> > ================================= Membership ==================================
> > Status     Ring    Pending    Node
> >
> > -------------------------------------------------------------------------------
> > valid      20.3%     16.4%    'xxxx_prod_cluster at 192.168.72.135'
> > valid      18.0%     17.2%    'xxxx_prod_cluster at 192.168.72.170'
> > valid      20.3%     17.2%    'xxxx_prod_cluster at 192.168.72.176'
> > valid       7.0%     16.4%    'xxxx_prod_cluster at 192.168.72.19'
> > valid      17.2%     16.4%    'xxxx_prod_cluster at 192.168.72.7'
> > valid      17.2%     16.4%    'xxxx_prod_cluster at 192.168.72.74'
> >
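> > (Reading the table above: Ring is current ownership and Pending is the
> > target once the transfers finish, so 192.168.72.19 should end up around
> > 16.4% but currently owns only 7.0%. riak-admin ring-status gives a more
> > detailed view of the pending ownership changes, if that's useful.)
> >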
> >
> > [root at 192.168.72.19 /data/lib/riak] # riak-admin status
> > 1-minute stats for 'xxxx_prod_cluster at 192.168.72.19'
> > -------------------------------------------
> > riak_kv_stat_ts : 1410194287
> > vnode_gets : 1607
> > vnode_gets_total : 563683
> > vnode_puts : 39
> > vnode_puts_total : 5459724
> > vnode_index_refreshes : 0
> > vnode_index_refreshes_total : 0
> > vnode_index_reads : 0
> > vnode_index_reads_total : 0
> > vnode_index_writes : 39
> > vnode_index_writes_total : 5459724
> > vnode_index_writes_postings : 0
> > vnode_index_writes_postings_total : 5227558
> > vnode_index_deletes : 0
> > vnode_index_deletes_total : 0
> > vnode_index_deletes_postings : 39
> > vnode_index_deletes_postings_total : 30613
> > node_gets : 3602
> > node_gets_total : 2463956
> > node_get_fsm_siblings_mean : 1
> > node_get_fsm_siblings_median : 1
> > node_get_fsm_siblings_95 : 2
> > node_get_fsm_siblings_99 : 3
> > node_get_fsm_siblings_100 : 12
> > node_get_fsm_objsize_mean : 52047
> > node_get_fsm_objsize_median : 26936
> > node_get_fsm_objsize_95 : 167435
> > node_get_fsm_objsize_99 : 267979
> > node_get_fsm_objsize_100 : 1313716
> > node_get_fsm_time_mean : 12223
> > node_get_fsm_time_median : 6675
> > node_get_fsm_time_95 : 37390
> > node_get_fsm_time_99 : 87046
> > node_get_fsm_time_100 : 345380
> > node_puts : 39
> > node_puts_total : 24915
> > node_put_fsm_time_mean : 4419
> > node_put_fsm_time_median : 2444
> > node_put_fsm_time_95 : 12890
> > node_put_fsm_time_99 : 18775
> > node_put_fsm_time_100 : 18775
> > read_repairs : 0
> > read_repairs_total : 0
> > coord_redirs_total : 17022
> > executing_mappers : 0
> > precommit_fail : 0
> > postcommit_fail : 0
> > index_fsm_create : 0
> > index_fsm_create_error : 0
> > index_fsm_active : 0
> > list_fsm_create : 0
> > list_fsm_create_error : 0
> > list_fsm_active : 0
> > pbc_active : 0
> > pbc_connects : 1
> > pbc_connects_total : 508
> > node_get_fsm_active : 1
> > node_get_fsm_active_60s : 3530
> > node_get_fsm_in_rate : 55
> > node_get_fsm_out_rate : 56
> > node_get_fsm_rejected : 0
> > node_get_fsm_rejected_60s : 0
> > node_get_fsm_rejected_total : 0
> > node_put_fsm_active : 0
> > node_put_fsm_active_60s : 67
> > node_put_fsm_in_rate : 1
> > node_put_fsm_out_rate : 1
> > node_put_fsm_rejected : 0
> > node_put_fsm_rejected_60s : 0
> > node_put_fsm_rejected_total : 0
> > leveldb_read_block_error : 0
> > riak_pipe_stat_ts : 1410194286
> > pipeline_active : 0
> > pipeline_create_count : 0
> > pipeline_create_one : 0
> > pipeline_create_error_count : 0
> > pipeline_create_error_one : 0
> > cpu_nprocs : 426
> > cpu_avg1 : 1352
> > cpu_avg5 : 1260
> > cpu_avg15 : 1137
> > mem_total : 15666507776
> > mem_allocated : 15479640064
> > disk : [{"/",8256952,60},
> >         {"/dev/shm",7649660,0},
> >         {"/tmpfs",1048576,14},
> >         {"/tmpfs_mp3",1048576,0},
> >         {"/data",1514123712,81}]
> > nodename : 'xxxx_prod_cluster at 192.168.72.19'
> > connected_nodes : ['xxxx_prod_cluster at 192.168.72.170',
> >                    'xxxx_prod_cluster at 192.168.72.176',
> >                    'xxxx_prod_cluster at 192.168.72.74',
> >                    'xxxx_prod_cluster at 192.168.72.135',
> >                    'xxxx_prod_cluster at 192.168.72.7']
> > sys_driver_version : <<"2.0">>
> > sys_global_heaps_size : 0
> > sys_heap_type : private
> > sys_logical_processors : 4
> > sys_otp_release : <<"R15B01">>
> > sys_process_count : 2469
> > sys_smp_support : true
> > sys_system_version : <<"Erlang R15B01 (erts-5.9.1) [source] [64-bit]
> > [smp:4:4] [async-threads:64] [kernel-poll:true]">>
> > sys_system_architecture : <<"x86_64-unknown-linux-gnu">>
> > sys_threads_enabled : true
> > sys_thread_pool_size : 64
> > sys_wordsize : 8
> > ring_members : ['xxxx_prod_cluster at 192.168.72.135',
> >                 'xxxx_prod_cluster at 192.168.72.170',
> >                 'xxxx_prod_cluster at 192.168.72.176',
> >                 'xxxx_prod_cluster at 192.168.72.19',
> >                 'xxxx_prod_cluster at 192.168.72.7',
> >                 'xxxx_prod_cluster at 192.168.72.74']
> > ring_num_partitions : 128
> > ring_ownership : <<"[{'xxxx_prod_cluster at 192.168.72.170',23},\n
> > {'xxxx_prod_cluster at 192.168.72.74',22},\n
> > {'xxxx_prod_cluster at 192.168.72.135',26},\n
> > {'xxxx_prod_cluster at 192.168.72.176',26},\n
> > {'xxxx_prod_cluster at 192.168.72.7',22},\n
> > {'xxxx_prod_cluster at 192.168.72.19',9}]">>
> > ring_creation_size : 128
> > storage_backend : riak_kv_eleveldb_backend
> > erlydtl_version : <<"0.7.0">>
> > riak_control_version : <<"1.4.10-0-g73c43c3">>
> > cluster_info_version : <<"1.2.4">>
> > riak_search_version : <<"1.4.10-0-g6e548e7">>
> > merge_index_version : <<"1.3.2-0-gcb38ee7">>
> > riak_kv_version : <<"1.4.10-0-g64b6ad8">>
> > sidejob_version : <<"0.2.0">>
> > riak_api_version : <<"1.4.10-0-gc407ac0">>
> > riak_pipe_version : <<"1.4.10-0-g9353526">>
> > riak_core_version : <<"1.4.10">>
> > bitcask_version : <<"1.6.6-0-g230b6d6">>
> > basho_stats_version : <<"1.0.3">>
> > webmachine_version : <<"1.10.4-0-gfcff795">>
> > mochiweb_version : <<"1.5.1p6">>
> > inets_version : <<"5.9">>
> > erlang_js_version : <<"1.2.2">>
> > runtime_tools_version : <<"1.8.8">>
> > os_mon_version : <<"2.2.9">>
> > riak_sysmon_version : <<"1.1.3">>
> > ssl_version : <<"5.0.1">>
> > public_key_version : <<"0.15">>
> > crypto_version : <<"2.1">>
> > sasl_version : <<"2.2.1">>
> > lager_version : <<"2.0.1">>
> > goldrush_version : <<"0.1.5">>
> > compiler_version : <<"4.8.1">>
> > syntax_tools_version : <<"1.6.8">>
> > stdlib_version : <<"1.18.1">>
> > kernel_version : <<"2.15.1">>
> > memory_total : 130705264
> > memory_processes : 55557705
> > memory_processes_used : 55341757
> > memory_system : 75147559
> > memory_atom : 545377
> > memory_atom_used : 527226
> > memory_binary : 12172712
> > memory_code : 11674242
> > memory_ets : 11913912
> >
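> > Two sanity checks on the numbers above, assuming I'm reading them
> > right: the disk figures are in KB, so /data is 1514123712 KB ~= 1.4 TiB
> > and 81% used is the ~1.2TB mentioned earlier; and ring_ownership shows
> > this node holding 9 of the 128 partitions, i.e. 9/128 ~= 7.0%, matching
> > the Ring column in member-status.
> >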
> >
> >
> > _______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >
>