Endless AAE keys repairing

Ryan Zezeski ryan at zinascii.com
Thu Jul 17 08:41:26 EDT 2014


On Jul 17, 2014, at 4:30 AM, Daniil Churikov <ddosia at gmail.com> wrote:

> Hello,
>
> In our test env we have a 3-node Riak 1.4.8-1 cluster on Debian.
> According to the logs:
>
> 2014-07-17 02:48:03.748 [info] <0.10542.85>@riak_kv_exchange_fsm:key_exchange:206 Repaired 1 keys during active anti-entropy exchange of {936274486415109681974235595958868809467081785344,3} between {936274486415109681974235595958868809467081785344,'riak@10.3.13.96'} and {981946412581700398168100746981252653831329677312,'riak@10.3.13.96'}
>
> Messages like this appear constantly. There is not much load on this
> test cluster and I expected that eventually everything would be fixed,
> but these messages keep coming day after day. In the past we had several
> issues with one of the cluster participants, and as a result we enabled
> AAE to fix it. What could be the reason for this?

This is probably caused by regular puts.  When AAE performs an exchange it snapshots each of the two hash trees while writes are still being accepted, and the two snapshots are not taken at the same instant.  This means that a snapshot could occur while replicas for a given object are still in flight.  For example:

1. User writes object O.
2. Coordinator sends O to 3 partitions A, B, and C.
3. Partition A accepts O and updates hash tree.
4. Entropy manager on the node which owns partition A decides to perform an exchange between A and B.
5. Snapshot is taken of hash tree for A.
6. Snapshot is taken of hash tree for B.
7. Partition B accepts O and updates hash tree (but the update is not reflected in the snapshot just taken).
8. Partition C accepts O and updates hash tree.
9. Exchange between A & B determines object is missing on B and performs a read repair.
10. Read repair notices that object O exists on all three partitions and there is nothing to be done.
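
The steps above can be sketched as a toy simulation. This is only an illustration of the race, not Riak's actual code: the Partition class, the dict standing in for a hash tree, and the exchange/read_repair helpers are all made up for the example.

```python
import hashlib


def obj_hash(value):
    # Stand-in for the per-object hash stored in a hash tree.
    return hashlib.sha1(value.encode()).hexdigest()


class Partition:
    """Hypothetical partition: a key/value store plus a flat 'hash tree'."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> value
        self.tree = {}   # key -> hash (a real tree is hierarchical)

    def put(self, key, value):
        self.store[key] = value
        self.tree[key] = obj_hash(value)

    def snapshot(self):
        # A copy; later puts do not show up in it.
        return dict(self.tree)


def exchange(snap_a, snap_b):
    """Keys whose hashes differ between the two snapshots."""
    keys = snap_a.keys() | snap_b.keys()
    return sorted(k for k in keys if snap_a.get(k) != snap_b.get(k))


def read_repair(key, partitions):
    """Return the names of partitions that really lacked the object."""
    current = next(p.store[key] for p in partitions if key in p.store)
    fixed = []
    for p in partitions:
        if p.store.get(key) != current:
            p.put(key, current)
            fixed.append(p.name)
    return fixed


def simulate():
    a, b, c = Partition("A"), Partition("B"), Partition("C")
    a.put("O", "v1")                       # step 3
    snap_a = a.snapshot()                  # step 5
    snap_b = b.snapshot()                  # step 6
    b.put("O", "v1")                       # step 7: too late for snap_b
    c.put("O", "v1")                       # step 8
    differing = exchange(snap_a, snap_b)   # step 9
    fixed = {k: read_repair(k, [a, b, c]) for k in differing}  # step 10
    return differing, fixed
```

Calling simulate() reports O as differing between the two snapshots, yet the read repair finds no partition that actually needs it. That is the situation that would be logged as a "repaired" key.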

The higher the load, the more keys can end up in one snapshot but not the other.  I would say that any time your cluster is accepting writes it is normal to see a handful of keys getting “repaired”.  But if you see, say, more than 10 in a single exchange (especially if there are 0 outstanding writes) then that is probably a sign of real repair.
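
One rough way to apply that rule of thumb is to scan the log for exchange messages and pick out the ones above the threshold. The sketch below assumes the message text matches the line quoted above ("Repaired N keys during active anti-entropy exchange"); the function names and the default threshold of 10 are just illustrative, not anything Riak ships.

```python
import re

# Matches the message format seen in the quoted log line.
REPAIR_RE = re.compile(
    r"Repaired (\d+) keys during active anti-entropy exchange"
)


def repaired_counts(lines):
    """Return the repaired-key count from each matching log line."""
    counts = []
    for line in lines:
        match = REPAIR_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def suspicious_exchanges(lines, threshold=10):
    """Counts above the threshold, i.e. likely real repairs."""
    return [n for n in repaired_counts(lines) if n > threshold]
```

For example, passing the lines of a console log (opened with open(path) and iterated) to suspicious_exchanges() returns only the exchanges that repaired more than 10 keys. Counts at or below the threshold are treated as the in-flight-write noise described above.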

-Z


