Riak 1.0.2 RC1

Joseph Blomstedt joe at basho.com
Mon Nov 7 14:11:43 EST 2011


In the release notes, we mention 256, 512, or 1024 as reasonable
maximum ring sizes depending on the performance of your underlying
hardware. If you have the hardware to spare, you could try setting up
a duplicate cluster and then upgrade it to 1.0 and see how things work
out. If not, you may want to wait until the issue is resolved. If
we're talking about the 75 node case here, then I imagine duplication
isn't an option. So, waiting may in fact be best. This issue is
entirely related to riak_core itself, so any apps that build on
riak_core will have this limitation.

This is a huge priority for Basho. This issue should be completely
resolved in the next major release of Riak. The underlying problem is
a set of operations that should be independent of ring size, but are
currently linearly dependent on it due to how vnodes are implemented.
Current work that will be going into riak_core's master branch in the
next 2 to 4 weeks will be eliminating this dependency. So, it may be
possible to pull some of those changes into your code base earlier
than the next release if your risk profile is comfortable with
pre-release code.

You are correct that the issue only occurs during ring changes.
However, upgrading from 0.14.2 to 1.0 will result in a ring divergence
that will trigger at least one period of increased gossip. If your
cluster can make it through this upgrade and convergence window
without nodes running out of RAM and crashing, then you will be fine
after the cluster converges provided you don't add/remove additional
nodes or perform any operations that change the ring metadata.
However, if your nodes do start crashing, then you'll end up in state
that will take a lot of manual effort to resolve and get things
working smoothly again.

At Basho, we are currently considering a stopgap fix for this issue
for a 1.0.x release until the next major release ships. The current
idea is to build in a rate limiting feature that will slow down the
number of gossip messages sent over a period of time, and therefore
allow users to tune this setting to prevent the cluster from being
overloaded. The cluster will still need to process the same number of
messages in order to converge, but could be set to do so at a slower
rate and therefore not become overloaded. It would essentially trade
convergence time for cluster stability. The downside is that a large
75 node, 1024 partition cluster could take hours to converge after a
ring divergence with a reasonable limit. However, if ring divergence
is rare (which is usually is), then this is a good option. We are
currently deep in the QA cycle for our 1.0.2 release, so any rate
limiting feature will likely ship in a 1.0.3 release. However, there
is an older pre-existing branch with this feature (from before the 1.0
release) that may be clean to merge into the current 1.0.2 code base
for those that need an option today rather than tomorrow:
https://github.com/basho/riak_core/tree/jdb-gossip-limit

Again, the real fix going into the next major release is an
architectural change that removes ring size as a variable in how many
gossip messages are necessary for convergence.

-Joe

On Mon, Nov 7, 2011 at 10:31 AM, Anthony Molinaro
<anthonym at alumni.caltech.edu> wrote:
> Lets say I already have a system with a ring size of 1024 in 0.14.2,
> should I wait to upgrade until this is sorted out?  And how long will
> that be?  Where is this in terms of Basho's priorities?
>
> You say stay under 1024, so I assume that means the max size you
> recommend would be 512?   Does this also apply to applications using
> riak_core?  We have a riak_core appliction which currently has about
> 75 nodes with 1024 partitions, should we not update riak_core to
> the version shipped in the 1.0.x releases.
>
> Are there things that can be done if we require a 1024 ring size to
> mitigate the issue?  It appears the issue might only be during ring
> changes, so might not be that awful?
>
> Thanks,
>
> -Anthony
>
> On Sun, Nov 06, 2011 at 08:10:52PM -0700, Joseph Blomstedt wrote:
>> > RE: large ring size warning in the release notes, is the performance
>> > degradation linear below 256? That is, until the major release that fixes
>> > this, is it best to keep ring sizes at 64 for best performance?
>>
>> The large ring size warning in the release notes is predominately
>> related to an issue with Riak's ring gossip functionality.
>> Adding/removing nodes, changing bucket properties, and setting cluster
>> metadata all result in a brief period (usually a few seconds) where
>> gossip traffic increases significantly. The size of the ring
>> determines both the number of gossip messages that occur during this
>> window, as well as the size of each message. With large rings, it is
>> possible that messages are generated faster than they can be handled,
>> resulting in large message queues that impact cluster performance and
>> tie up system memory until the message queues are fully drained. In
>> general, there is no problem as long as your hardware is fast enough
>> to process the brief spike in gossip traffic in close to real time.
>> Concerning this specific issue, choosing a ring size smaller than the
>> maximum you can handle does not provide any additional performance
>> gains.
>>
>> However, unrelated to this issue, there are general performance
>> considerations related to ring size versus the number of nodes in your
>> cluster. Given a fixed number of nodes, a larger ring results in more
>> vnodes per node. This allows more process concurrency which may
>> improvement performance. However, each vnode runs it's own backend
>> instance that has it's own set of on-disk files, and competing
>> reads/writes to different files may result in additional I/O
>> contention than having fewer vnodes per node. The overall performance
>> is going to depend largely on your OS, your file system, and your
>> traffic pattern. So, it's hard to give specific hard and fast rules.
>> The other issue is 2I performance. Secondary indexes send request to a
>> covering set of vnodes, which works out to ring_size / N requests;
>> therefore increasing the ring_size without increasing N leads to more
>> 2I requests. Again, the right answer depends largely on your use case.
>>
>> Overall, I believe we normally recommend between 10 and 50 vnodes per
>> node, with the ring size rounded up to the next power of two.
>> Personally, I think 16 vnodes per node is a good number, which matches
>> the common 64 partition, 4 node Riak cluster. Thus, choosing a ring
>> size based on that ratio and your expected number of future nodes is a
>> reasonable choice. Just be sure to stay under 1024 until the issue
>> with gossip overloading the cluster is resolved.
>>
>> -Joe
>>
>> On Sat, Nov 5, 2011 at 9:49 AM, Jim Adler <jim.adler at comcast.net> wrote:
>> > RE: large ring size warning in the release notes, is the performance
>> > degradation linear below 256? That is, until the major release that fixes
>> > this, is it best to keep ring sizes at 64 for best performance?
>> > Jim
>> >
>> > Sent from my phone. Please forgive the typos.
>> > On Nov 4, 2011, at 7:20 PM, Jared Morrow <jared at basho.com> wrote:
>> >
>> > As we've mentioned earlier, the 1.0.2 release of Riak is coming soon.   Like
>> > we did with Riak 1.0.0, we are provided a release candidate for test before
>> > we release 1.0.2 final.
>> > You can find the packages here:
>> > http://downloads.basho.com/riak/riak-1.0.2rc1/
>> > The release notes have been updated and can be found here:
>> >  https://github.com/basho/riak/blob/1.0/RELEASE-NOTES.org
>> > Thank you, as always, for continuing to provide bug reports and feedback.
>> > -Jared
>> >
>> > _______________________________________________
>> > riak-users mailing list
>> > riak-users at lists.basho.com
>> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> >
>> > _______________________________________________
>> > riak-users mailing list
>> > riak-users at lists.basho.com
>> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> >
>> >
>>
>>
>>
>> --
>> Joseph Blomstedt <joe at basho.com>
>> Software Engineer
>> Basho Technologies, Inc.
>> http://www.basho.com/
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro                           <anthonym at alumni.caltech.edu>
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>



-- 
Joseph Blomstedt <joe at basho.com>
Software Engineer
Basho Technologies, Inc.
http://www.basho.com/




More information about the riak-users mailing list