Strange spike

Nam Nguyen nam at tinyco.com
Fri Jun 1 12:45:11 EDT 2012


Sounds like a good plan, Greg.

By the way, I rejoined the old node to the cluster (making it a 6-node cluster now), and it seems that the spike occurs only on that node.

To summarize the problem so far, for the benefit of other list members:

1. The latency spike occurred on one particular node.

2. I took that node out of the cluster and put in another one. Spikes seemed to happen on all nodes while the cluster was converging and shortly afterwards. The new node seemed to perform slightly worse than the rest of the cluster.

3. I rejoined the old node to the cluster. The spike seems to be localized to it again, though it is less intense and a little shorter in duration. The new node seems to be performing on par with the rest.

I'll wait for version 1.2.

Cheers,
Nam


On Jun 1, 2012, at 2:36 AM, Greg Burd wrote:

> Hey Nam,
> 
> It is safe to restart, but my advice is to wait until we release 1.2.0 in the next week or two. It has a boat-load of fixes to LevelDB, one of which I'm fairly sure is impacting you. If my intuition is correct, changing these settings as you've indicated is unlikely to make a difference.
> 
> -greg 
> 
> @gregburd
> Developer Advocate, Basho Technologies | http://basho.com | @basho
> 
> 
> On Thursday, May 31, 2012 at 9:22 PM, Nam Nguyen wrote:
> 
>> Hi Seth,
>> 
>> Yes, I am using the default config.
>> 
>> Is it safe to change these values and restart riak?
>> 
>> Nam
>> 
>> On May 31, 2012, at 11:24 AM, Seth Benton wrote:
>>> Hey,
>>> 
>>> Apologies if this is the wrong place for this, but I just updated the eLevelDB wiki page to mention randomization of the write buffer length (via the write_buffer_size_min and write_buffer_size_max settings). Previously there was no mention of these config parameters. Perhaps people were just using LevelDB's 4MB default buffer size, causing all the vnodes to compact at the same time? Or are there default write_buffer_size_min and write_buffer_size_max parameters under the hood?
>>> 
>>> http://wiki.basho.com/LevelDB.html
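>>> 
>>> For anyone who wants to experiment before 1.2, here is a minimal sketch of how those two settings would sit in the eleveldb section of app.config. The data_root path and the byte values are only placeholders for illustration, not recommendations:
>>> 
>>>     %% app.config (excerpt) -- illustrative values only
>>>     {eleveldb, [
>>>         {data_root, "/var/lib/riak/leveldb"},
>>>         %% each vnode picks a write buffer size at random between min and max,
>>>         %% so the vnodes' compactions are staggered rather than firing together
>>>         {write_buffer_size_min, 31457280},   %% 30 MB
>>>         {write_buffer_size_max, 62914560}    %% 60 MB
>>>     ]},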
>>> 
>>> P.S. Mathew V is getting back to me shortly on changes to this page due to changes in 1.2.
>>> 
>>> Seth
>>> (Tech Writer)
>>> 
>>> 
>>> On Thu, May 31, 2012 at 9:26 AM, Nam Nguyen <nam at tinyco.com> wrote:
>>>> Hi Sean,
>>>> 
>>>> You are right. At first I thought it was localized to that one particular node. Now others are also exhibiting the same symptom.
>>>> 
>>>> I am putting in another node. 
>>>> 
>>>> Cheers,
>>>> Nam
>>>> 
>>>> 
>>>> On May 30, 2012, at 11:23 PM, Sean Cribbs wrote:
>>>>> Nam,
>>>>> 
>>>>> The LevelDB storage backend has a known issue where compaction can stall a heavily-loaded node for a long time (we've seen 60 seconds or more in production clusters). We're very sorry about this, but an improvement will be available in the next release. In the meantime, DO NOT make the node leave the cluster - this will only make things worse! It might be worth adding another node to the cluster, but I suggest you wait until the node finishes compaction.
>>>>> 
>>>>> On Wed, May 30, 2012 at 10:43 PM, Nam Nguyen <nam at tinyco.com> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> My 5-node cluster exhibits a strange spike on one particular node.
>>>>>> 
>>>>>> Overall, the mean get time is about 1ms. This node occasionally shoots up to 40ms.
>>>>>> 
>>>>>> During those times, %iowait is the same as it was before the spike, and there are no errors. The console log shows many lines like the one below, which I don't think are relevant to the spike.
>>>>>> 
>>>>>> 2012-05-30 21:29:50.591 [info] <0.72.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.938.0> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}] [{timeout,185},{old_heap_block_size,0},{heap_block_size,2584},{mbuf_size,0},{stack_size,55},{old_heap_size,0},{heap_size,804}]
>>>>>> 
>>>>>> The cluster is set up uniformly: Ubuntu 64-bit on m2.2xlarge instances, running Riak 1.1.2 with the LevelDB backend.
>>>>>> 
>>>>>> What would be the best course of action for me?
>>>>>> 
>>>>>> I plan to:
>>>>>> 
>>>>>> - riak-admin leave on that node
>>>>>> - set up new instance
>>>>>> - riak-admin reip the new instance
>>>>>> - riak-admin join it to the cluster
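>>>>>> 
>>>>>> In command form, that plan would look roughly like this (node names are placeholders):
>>>>>> 
>>>>>>     # on the node being retired
>>>>>>     riak-admin leave
>>>>>> 
>>>>>>     # on the freshly provisioned instance, after installing and configuring Riak
>>>>>>     # (reip is run while the node is stopped; it rewrites the node name in the ring)
>>>>>>     riak-admin reip riak@old-node.example.com riak@new-node.example.com
>>>>>> 
>>>>>>     # then start the node and join it via any existing cluster member
>>>>>>     riak-admin join riak@existing-node.example.com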
>>>>>> 
>>>>>> Cheers,
>>>>>> Nam
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> riak-users mailing list
>>>>>> riak-users at lists.basho.com
>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Sean Cribbs <sean at basho.com>
>>>>> Software Engineer
>>>>> Basho Technologies, Inc.
>>>>> http://basho.com/
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 
> 
> 




