Product Advisory - Riak 2.1.0: Default Configuration For Handoff May Cause Data Loss

Tyler Hannan tyler at basho.com
Fri May 1 16:51:20 EDT 2015


UPDATE: Further investigation has shown that fallback transfers
(hinted handoff) are affected in the same way as ownership transfers.

Our engineering team is working on fixing this as a high priority. We
apologize for any impact this is having on our users.

Cheers,

Tyler Hannan  |  Director of Technical Marketing
Basho Technologies
t: @tylerhannan
c: 720-280-9216


On Fri, May 1, 2015 at 12:36 PM, Tyler Hannan <tyler at basho.com> wrote:
> -Description-
>
> In Riak 2.1.0, the default configuration for handoff.ip causes vnodes
> marked for transfer during handoff to be removed without transferring
> data to their new destination nodes. A mandatory change to
> configuration (riak.conf) will resolve this issue. While not all users
> are impacted by this issue, we recommend that all 2.1.0 users upgrade
> to 2.1.1 which will be released shortly.
>
> NOTE: This is known to occur for ownership handoff. Investigation as
> to whether hinted handoff is affected is ongoing and this advisory
> will be updated when more information is available.
>
> -Affected Users-
>
> All users of 2.1.0 using riak.conf to configure their clusters are
> potentially impacted. Users that are using app.config and vm.args to
> configure their clusters are unaffected but should upgrade to 2.1.1
> upon release.
>
> To verify whether you are affected, run the command below on each
> node in your cluster:
>      riak config effective | grep handoff.ip
>
> Affected nodes will report a handoff.ip of 127.0.0.1:
>      handoff.ip = 127.0.0.1
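The check above can be scripted when there are many nodes to inspect. The following is a minimal sketch, not an official Basho tool: the function name check_handoff_ip is hypothetical, and it only classifies a line in the "handoff.ip = ..." form shown above.

```shell
# Hypothetical helper: classify one line of `riak config effective` output.
# Prints "affected" for the loopback address, "ok" for any other value,
# and "unknown" when no handoff.ip value could be found in the line.
check_handoff_ip() {
    # Strip the key, the '=' sign, and surrounding whitespace to get the value.
    value=$(printf '%s\n' "$1" \
        | sed -n 's/^[[:space:]]*handoff\.ip[[:space:]]*=[[:space:]]*//p' \
        | tr -d '[:space:]')
    if [ -z "$value" ]; then
        echo "unknown"
    elif [ "$value" = "127.0.0.1" ]; then
        echo "affected"
    else
        echo "ok"
    fi
}

# On a node where riak is on the PATH, usage would look like:
#   check_handoff_ip "$(riak config effective | grep handoff.ip)"
```

The check still has to be run on every node, since each node has its own effective configuration.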
>
> -Impact-
>
> This bug impacts vnodes that are in the process of handoff. Handoff data
> will be looped back to the source node during ownership handoff rather
> than being transferred to the destination node. Once ownership handoff
> is completed the data is removed from the source node. In the event of
> significant ownership handoff, which can happen during cluster
> expansion or contraction, all replicas of an object may be lost. Data
> loss occurs if all replicas of an object are lost as a result of this
> configuration issue. Replica loss can be triggered by cluster
> membership changes or other Riak cluster activity that triggers
> handoff behavior. Data loss can be avoided as long as at least one
> replica still exists and the steps below are followed.
>
> -Mitigation-
>
> You can immediately mitigate the issue by setting the transfer limit
> to zero across the cluster by issuing the following on any node:
>
>      riak-admin transfer-limit 0
>
> Then configure handoff.ip in riak.conf to an external IP address or
> 0.0.0.0 on all nodes.
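For clusters with many nodes, the riak.conf change described above could be applied with a small script. This is only a sketch under stated assumptions: the function name set_handoff_ip is hypothetical, the location of riak.conf depends on how Riak was installed, and the setting is assumed to appear in the "handoff.ip = value" form shown earlier in this advisory.

```shell
# Hypothetical helper: set handoff.ip in a riak.conf-style file.
# Usage: set_handoff_ip /path/to/riak.conf 0.0.0.0
set_handoff_ip() {
    conf=$1
    ip=$2
    tmp=$(mktemp) || return 1
    if grep -q '^[[:space:]]*handoff\.ip[[:space:]]*=' "$conf"; then
        # An uncommented setting exists: rewrite its value.
        sed "s/^[[:space:]]*handoff\.ip[[:space:]]*=.*/handoff.ip = $ip/" \
            "$conf" > "$tmp"
    else
        # No active setting: keep the file as-is and append one.
        cat "$conf" > "$tmp"
        echo "handoff.ip = $ip" >> "$tmp"
    fi
    # Overwrite contents in place so the file keeps its owner and permissions.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

Back up riak.conf before editing it, and confirm the result with the config check shown earlier once the node has been restarted.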
>
> Perform a rolling restart of Riak across your cluster to activate the
> new setting.
>
> After correcting the configuration and restarting the nodes, you
> should run Riak KV repair on each cluster member as documented at
> http://docs.basho.com/riak/latest/ops/running/recovery/repairing-partitions/
> to recreate any missing replicas from available replicas elsewhere in
> the cluster.  It is recommended to perform the Riak KV repair in a
> round-robin fashion on each node of your cluster (node0, node1, node2,
> etc). Repeat this round-robin repair “n_val - 1” times. For example:
> the default configuration for n_val is 3, which means you would run
> Riak KV repair twice across the entire cluster.
>
> NOTE: It is important to ensure that you execute in a round-robin
> fashion: node0, node1, node2 and then repeat.
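The round-robin ordering above can be written out before starting, so that no node is skipped or repaired out of turn. The sketch below only prints the order in which to visit nodes. The function name repair_schedule is hypothetical, and the repair itself should be performed as described in the linked documentation.

```shell
# Hypothetical helper: print the round-robin repair order.
# Usage: repair_schedule N_VAL NODE...
# Emits (n_val - 1) passes, each visiting every node in the order given.
repair_schedule() {
    n_val=$1
    shift
    pass=1
    while [ "$pass" -lt "$n_val" ]; do
        for node in "$@"; do
            echo "pass $pass: $node"
        done
        pass=$((pass + 1))
    done
}

# Default n_val of 3 on a three-node cluster: two full passes, six repairs.
repair_schedule 3 node0 node1 node2
```

With the default n_val of 3 this lists node0, node1, node2 for pass 1 and then again for pass 2, matching the "twice across the entire cluster" guidance.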
>
> A forthcoming 2.1.1 release will provide an updated default configuration.
>
> Questions?
>
> Please open a ticket with Basho if you have any questions about the above issue.
>
> Cheers,
>
> Tyler Hannan  |  Director of Technical Marketing
> Basho Technologies
> t: @tylerhannan
> c: 720-280-9216



