LevelDB compaction and timeouts

Matthew Von-Maszewski matthewv at basho.com
Wed Jan 9 11:06:08 EST 2013


Parnell,

I confirmed with the Basho team that "list_keys" is a read-only operation.  Yes, some read operations could initiate compactions in Riak 1.1, but you are running 1.2.1.  I therefore suspect that there is a secondary issue.

Would you mind gathering the LOG files from one of the machines that you had to mark down, and telling me the date/time of the last problem on that machine?  The following command (with the path changed as appropriate) should gather the LOGs just fine.  Then tar/zip the output file before emailing it.

    sort /var/lib/riak/leveldb/*/LOG* >LOG_all.txt
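
For convenience, the gather-and-compress steps above can be wrapped in a small shell function.  This is only a sketch: the function name gather_logs and its two arguments are made up for illustration, and it assumes each LOG line begins with a timestamp, so that a plain sort interleaves the entries from all vnode directories in time order.

```shell
# Hypothetical helper (not part of Riak or leveldb): merge every LOG file
# found one directory below DATA_ROOT into a single sorted file, then
# write a gzip-compressed copy next to it for emailing.
# Usage: gather_logs DATA_ROOT OUTPUT_FILE
gather_logs() {
    data_root="$1"   # e.g. the data_root from the eleveldb section of app.config
    out="$2"         # e.g. LOG_all.txt

    # Assumption: LOG lines start with a timestamp, so sorting the
    # concatenated files yields one chronological stream.
    sort "$data_root"/*/LOG* > "$out" || return 1

    # Keep the plain-text file and produce OUTPUT_FILE.gz alongside it.
    gzip -c "$out" > "$out.gz"
}
```

With the data_root quoted further down in this thread, the call would be: gather_logs /var/riak/data/leveldb LOG_all.txt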

I am going to guess that the 24 core machine is not having this problem.  Would you send a zip file of its LOG also?  I want to compare the throughput differences.  

Matthew


On Jan 8, 2013, at 2:53 PM, Parnell Springmeyer wrote:

> Matthew,
> 
> 1. 1.2.1
> 2.
> {eleveldb, [
>             {data_root, "/var/riak/data/leveldb"}
>            ]}
> 3. I'm running 5 physical servers with one Riak node per server.
> 4. Unfortunately all the machines are a hodge podge of parts; we're soon
> going to move to buying our own hardware and coloing it; here's the
> server list (all machines are FreeBSD 9):
> 
> Cores  CPU Model    CPU Speed  RAM    HDD Model                                              HDD Size
> 24     Xeon X5650   2.67GHz    48GiB  2X INTEL SSDSA2CW30 on LSI MegaRaid SAS 2108 mirrored  280GB
> 4      Xeon E5560   2.13GHz    12GiB  Barracuda ST3500418AS                                  500GB
> 8      Xeon E31230  3.2GHz     8GiB   WDC WD1600JS                                           160GB
> 4      Core2 Q8400  2.66GHz    8GiB   Seagate ST500DM002-1BD142                              500GB
> 4      Xeon L5320   1.86GHz    12GiB                                                         500GB
> 
> 5. There were no "waiting" entries, but quite a few compaction entries.
> I haven't studied leveldb enough to know if that's "normal" or if it
> indicates heavy compaction.
> 
> The compaction event seemed to be triggered by someone issuing a
> list_keys operation; four servers pretty much became unresponsive while
> they were doing compaction. After about an hour only two were dealing
> with compaction but it was still causing the entire cluster to respond
> with timeouts to index().run() queries and M/R jobs.
> 
> I took down those two nodes and marked them as down (riak-admin down)
> and the timeouts disappeared and the cluster operated as it should. So I
> waited till 1AM last night to start the two machines up so they could
> finish compaction. I'm somewhat surprised there isn't a method for
> marking machines as "unavailable" in the event of heavy compaction -
> that way they can finish compacting and the cluster can treat the node
> as unavailable. I don't know how difficult that is though.
>> Parnell,
>> 
>> Would appreciate some configuration info:
>> 
>> - what version of Riak are you running?
>> 
>> - would you copy/paste the eleveldb section of your app.config?
>> 
>> - how many vnodes and physical servers are you running?
>> 
>> - what is the hardware? CPU, memory, disk arrays
>> 
>> - are you seeing the word "waiting" in your LOG files?
>> 
>> 
>> Not sure that the above info will lead to a solution.  But it is a start.




