Ownership handoff timed out

Vladyslav Zakhozhai v.zakhozhai at smartweb.com.ua
Tue Oct 27 05:54:56 EDT 2015


Hi,

Jon, thank you for the answer. While my mail was waiting for approval on
this list, I troubleshot the issue in more depth. And yes, you are right:
neither {error, enotconn} nor max_concurrency is my problem.

I'm going to migrate my cluster entirely to eleveldb, i.e. I need to
stop using bitcask. I talked with Basho support and they said that
it is tricky to tune bitcask on servers with 32 GB RAM (and I guess that it
is not just tricky but impossible, because bitcask loads all keys into
memory regardless of how much RAM is free). With LevelDB I have the
opportunity to tune RAM usage on the servers.

So I have 15 nodes with multibackend (bitcask for data and leveldb for
metadata). 2 additional servers are without multibackend, with leveldb
only. Now I'm not sure whether I still need to use multibackend when
leveldb is the only backend.

And my problem (as I mentioned earlier) is the following: on the
leveldb-only nodes I see handoffs time out and make no further progress.
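
To pick the stalled partitions out of a long riak-admin transfers listing, something like the following could be used. It is only a minimal sketch: it assumes the "partition:" and "last update: no updates seen" lines look exactly like the output quoted further down in this thread, and the function name stalled_transfers is made up for illustration.

```python
import re

def stalled_transfers(text):
    """Return partition ids whose block reports 'last update: no updates seen'.

    Assumes each transfer block lists its 'partition:' line before its
    'last update:' line, as in the riak-admin transfers output in this thread.
    """
    stalled = []
    partition = None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"partition:\s*(\d+)$", line)
        if m:
            # Remember the partition the following lines belong to.
            partition = m.group(1)
        elif line == "last update: no updates seen" and partition is not None:
            stalled.append(partition)
            partition = None
    return stalled

# Sample block copied from the transfers output in this thread.
sample = """\
transfer type: ownership_transfer
vnode type: riak_kv_vnode
partition: 331121464707782692405522344912282871640797216768
started: 2015-10-21 08:32:55 [46.66 min ago]
last update: no updates seen
total size: unknown
objects transferred: unknown
"""
print(stalled_transfers(sample))
```

In practice the text would come from the saved output of riak-admin transfers on one of the nodes rather than from a pasted sample.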

On the multibackend hosts I have this configuration:

{riak_kv, [
       {add_paths, ["/usr/lib/riak-cs/lib/riak_cs-1.5.0/ebin"]},
       {storage_backend, riak_cs_kv_multi_backend},
       {multi_backend_prefix_list, [{<<"0b:">>, be_blocks}]},
       {multi_backend_default, be_default},
       {multi_backend, [
           {be_default, riak_kv_eleveldb_backend, [
               {max_open_files, 50},
               {data_root, "/var/lib/riak/leveldb"}
           ]},
           {be_blocks, riak_kv_bitcask_backend, [
               {data_root, "/var/lib/riak/bitcask"}
           ]}
       ]},

And for hosts with leveldb-only backend:

{riak_kv, [
            {storage_backend, riak_kv_eleveldb_backend},
...
{eleveldb, [
            {data_root, "/var/lib/riak/leveldb"}
(default values for leveldb)

The leveldb logs show nothing that could help me (no errors).
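
For completeness, this is roughly how the LOG files can be checked in bulk. It is a sketch under assumptions: it walks the data_root from the configuration above and treats every file named LOG as a leveldb log, and the search words "error" and "corruption" are only a guess at what a problem would look like. The function name scan_leveldb_logs is hypothetical.

```python
import os

def scan_leveldb_logs(data_root, needles=("error", "corruption")):
    """Return a list of (path, line) for LOG lines containing any needle.

    Matching is case-insensitive. Only files named exactly 'LOG' are read;
    the directory layout under data_root is not assumed, it is walked.
    """
    hits = []
    for dirpath, _dirs, files in os.walk(data_root):
        for name in files:
            if name != "LOG":
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on odd bytes in a log.
            with open(path, errors="replace") as fh:
                for line in fh:
                    lowered = line.lower()
                    if any(n in lowered for n in needles):
                        hits.append((path, line.rstrip("\n")))
    return hits
```

Calling scan_leveldb_logs("/var/lib/riak/leveldb") on a node returns an empty list when no LOG line contains one of the search words.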


On Mon, Oct 26, 2015 at 3:57 PM Jon Meredith <jmeredith at basho.com> wrote:

> Hi,
>
> I suspect your {error,enotconn} messages are unrelated - that's likely to
> be caused by an HTTP client closing the connection while Riak looks up
>  some networking information about the requestor.
>
> The max_concurrency message you are seeing is related to the handoff
> transfer limit - it should be labelled as informational. When a node has
> data to handoff it starts the handoff sender process and if there are
> either too many local handoff processes or too many on the remote side it
> exits with max_concurrency.  You could increase with riak-admin
> transfer-limit but that probably won't help if you're timing out.
>
> As you're using the multi-backend you're transferring data from bitcask
> and leveldb.  The next place I would look is in the leveldb LOG files to
> see if there are any leveldb vnodes that are having problems that's
> preventing repair.
>
> Jon
>
> On Mon, Oct 26, 2015 at 7:15 AM Vladyslav Zakhozhai <
> v.zakhozhai at smartweb.com.ua> wrote:
>
>> Hello,
>>
>> I have a problem with persistent timeouts during ownership handoffs. I've
>> searched the Internet and this mailing list, but without success.
>>
>> I have a Riak 1.4.12 cluster with 17 nodes. Almost all nodes use
>> multibackend with bitcask and eleveldb as storage backends (we need
>> multiple backends for Riak CS 1.5.0 integration).
>>
>> Now I'm working to migrate the Riak cluster to eleveldb as the primary and
>> only backend. For now I have 2 nodes with the eleveldb backend in the same
>> cluster.
>>
>> During the ownership handoff process I constantly see timeout errors from
>> handoff receivers and senders.
>>
>> Here is partial output of riak-admin transfers:
>> ...
>> transfer type: ownership_transfer
>> vnode type: riak_kv_vnode
>> partition: 331121464707782692405522344912282871640797216768
>> started: 2015-10-21 08:32:55 [46.66 min ago]
>> last update: no updates seen
>> total size: unknown
>> objects transferred: unknown
>>
>>                            unknown
>> riak at taipan.pleiad.uaprom  =======>  riak at eggeater.pleiad.uapr
>>                                      om
>>         |                                           |   0%
>>                            unknown
>>
>> transfer type: ownership_transfer
>> vnode type: riak_kv_vnode
>> partition: 336830455478606531929755488790080852186328203264
>> started: 2015-10-21 08:32:54 [46.68 min ago]
>> last update: no updates seen
>> total size: unknown
>> objects transferred: unknown
>> ...
>>
>> Some partition handoffs never update their state; others terminate
>> after handing off some of the objects and never start again.
>>
>> I see nothing in the logs except the following:
>>
>> On receiver side:
>>
>> 2015-10-21 11:33:55.131 [error]
>> <0.25390.1266>@riak_core_handoff_receiver:handle_info:105 Handoff receiver
>> for partition 331121464707782692405522344912282871640797216768 timed out
>> after processing 0 objects.
>>
>> On sender side:
>>
>> 2015-10-21 11:01:58.879 [error] <0.13177.1401> CRASH REPORT Process
>> <0.13177.1401> with 0 neighbours crashed with reason: no function clause
>> matching webmachine_request:peer_from_peername({error,enotconn},
>> {webmachine_request,{wm_reqstate,#Port<0.50978116>,[],undefined,undefined,undefined,{wm_reqdata,...},...}})
>> line 150
>> 2015-10-21 11:32:50.055 [error] <0.207.0> Supervisor
>> riak_core_handoff_sender_sup had child riak_core_handoff_sender started
>> with {riak_core_handoff_sender,start_link,undefined} at <0.22312.1090> exit
>> with reason max_concurrency in context child_terminated
>>
>> {error, enotconn} seems to be a network issue, but I have no problems
>> with the network. All hosts resolve their neighbors correctly and
>> /etc/hosts on each node is correct.
>>
>> I've tried increasing handoff_timeout and handoff_receive_timeout, but
>> without success.
>>
>> Forcing handoff helped, but only for a short period of time:
>>
>> rpc:multicall([node() | nodes()], riak_core_vnode_manager, force_handoffs, []).
>>
>> I see the handoffs progress (riak-admin transfers), but then they time
>> out again.
>>
>> A week ago I joined 4 nodes with bitcask, and there were no such problems.
>>
>> I'm a little confused and need to understand my next steps in
>> troubleshooting this issue.
>>
>>