Ownership handoff timed out

Jon Meredith jmeredith at basho.com
Mon Oct 26 09:56:51 EDT 2015


Hi,

I suspect your {error,enotconn} messages are unrelated - they are most
likely caused by an HTTP client closing the connection while Riak looks up
some networking information about the requestor.

The max_concurrency message you are seeing is related to the handoff
transfer limit - it should be labelled as informational. When a node has
data to hand off, it starts the handoff sender process, and if there are
either too many local handoff processes or too many on the remote side, the
sender exits with max_concurrency. You could increase the limit with
riak-admin transfer-limit, but that probably won't help if you're timing out.
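
Before raising the limit, it may help to see which transfers are actually
stalled. This is a minimal sketch, assuming the riak-admin transfers output
has the same shape as the sample quoted below (a "partition:" line followed
by a "last update:" line). stalled_partitions is a hypothetical helper name,
not part of Riak:

```shell
#!/bin/sh
# Print the partition id of every transfer whose "last update" line
# reports "no updates seen". Reads `riak-admin transfers` output on stdin.
# Assumption: each transfer block lists "partition:" before "last update:".
stalled_partitions() {
    awk '
        /partition:/                   { part = $NF }   # remember the current partition id
        /last update: no updates seen/ { print part }   # this transfer has made no progress
    '
}

# Typical use on a cluster node (not run here):
#   riak-admin transfers | stalled_partitions
```

If the same partitions show up run after run, the problem is more likely on
the vnode side than in the concurrency limit.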

As you're using the multi-backend, you're transferring data from both
bitcask and leveldb. The next place I would look is in the leveldb LOG files,
to see whether any leveldb vnodes are having problems that are preventing
repair.
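
One way to check the LOG files in bulk is sketched below. It assumes each
partition directory under the leveldb data root contains a file named LOG,
and that problems show up as lines containing "Compaction error" or
"Corruption". Both the search patterns and the example path are assumptions
to adapt to your own configuration, and scan_leveldb_logs is a hypothetical
helper name:

```shell
#!/bin/sh
# List every leveldb LOG file under a data root that mentions a
# compaction error or corruption.
# $1: leveldb data root (assumed layout: one subdirectory per partition).
scan_leveldb_logs() {
    root="$1"
    # grep -l prints only the names of files that contain a match.
    find "$root" -type f -name LOG \
        -exec grep -l -E 'Compaction error|Corruption' {} + | sort
}

# Example call (the path is an assumption, check your backend's data_root):
#   scan_leveldb_logs /var/lib/riak/leveldb
```

Each path printed contains the partition directory, which you can compare
against the partitions whose handoffs keep timing out.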

Jon

On Mon, Oct 26, 2015 at 7:15 AM Vladyslav Zakhozhai <
v.zakhozhai at smartweb.com.ua> wrote:

> Hello,
>
> I have a problem with persistent timeouts during ownership handoffs. I've
> searched the Internet and this mailing list without success.
>
> I have a Riak 1.4.12 cluster with 17 nodes. Almost all nodes use the
> multibackend with bitcask and eleveldb as storage backends (we need
> multiple backends for Riak CS 1.5.0 integration).
>
> Now I'm working to migrate the Riak cluster to eleveldb as the primary and
> only backend. For now I have 2 nodes with the eleveldb backend in the same
> cluster.
>
> During the ownership handoff process I constantly see timeout errors from
> handoff receivers and senders.
>
> Here is partial output of riak-admin transfers:
> ...
> transfer type: ownership_transfer
> vnode type: riak_kv_vnode
> partition: 331121464707782692405522344912282871640797216768
> started: 2015-10-21 08:32:55 [46.66 min ago]
> last update: no updates seen
> total size: unknown
> objects transferred: unknown
>
>                            unknown
> riak at taipan.pleiad.uaprom  =======>  riak at eggeater.pleiad.uapr
>                                      om
>         |                                           |   0%
>                            unknown
>
> transfer type: ownership_transfer
> vnode type: riak_kv_vnode
> partition: 336830455478606531929755488790080852186328203264
> started: 2015-10-21 08:32:54 [46.68 min ago]
> last update: no updates seen
> total size: unknown
> objects transferred: unknown
> ...
>
> The state of some partition handoffs never updates; others terminate
> after handing off some of their objects and never start again.
>
> I see nothing in the logs except the following:
>
> On receiver side:
>
> 2015-10-21 11:33:55.131 [error]
> <0.25390.1266>@riak_core_handoff_receiver:handle_info:105 Handoff receiver
> for partition 331121464707782692405522344912282871640797216768 timed out
> after processing 0 objects.
>
> On sender side:
>
> 2015-10-21 11:01:58.879 [error] <0.13177.1401> CRASH REPORT Process
> <0.13177.1401> with 0 neighbours crashed with reason: no function clause
> matching webmachine_request:peer_from_peername({error,enotconn},
> {webmachine_request,{wm_reqstate,#Port<0.50978116>,[],undefined,undefined,undefined,{wm_reqdata,...},...}})
> line 150
> 2015-10-21 11:32:50.055 [error] <0.207.0> Supervisor
> riak_core_handoff_sender_sup had child riak_core_handoff_sender started
> with {riak_core_handoff_sender,start_link,undefined} at <0.22312.1090> exit
> with reason max_concurrency in context child_terminated
>
> {error, enotconn} seems to be a network issue, but I have no network
> problems. All hosts resolve their neighbors correctly, and /etc/hosts on
> each node is correct.
>
> I've tried increasing handoff_timeout and handoff_receive_timeout, but
> without success.
>
> Forcing handoff helped, but only for a short period of time:
>
> rpc:multicall([node() | nodes()], riak_core_vnode_manager, force_handoffs, []).
>
>
> I see the handoffs make progress (riak-admin transfers), but then they time out again.
>
>
> A week ago I joined 4 nodes with bitcask, and there were no such problems.
>
>
> I'm a little confused and need to understand my next steps in troubleshooting this issue.