debugging riak behavior by looking at the network
d at d2fn.com
Thu Apr 19 20:25:05 EDT 2012
I just wrote a new blog post debugging some issues I'm seeing with riak by
looking at the network. Lots of words and pretty pictures here:
What seems to be happening is that cleanup tasks in our app eventually
become the primary workload of the cluster (our app is using the
riak-java-client+pb 1.0.3 btw). Those cleanup tasks look like the following.
1. Look in the 2i $key index for keys that fall within a known range (never
really takes more than 7s or so)
2. Sequentially delete those keys from the app (takes several minutes in
the worst case)
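The two steps above can be sketched with an in-memory stand-in (a sorted map playing the role of the $key index; this is a hypothetical simplification, not the real riak-java-client calls, which issue a 2i range query and then one delete round-trip per key):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the cleanup workload's access pattern.
public class CleanupSketch {
    // Stand-in for the sorted $key index.
    static final TreeMap<String, byte[]> store = new TreeMap<String, byte[]>();

    // Step 1: a single range scan over the key space (cheap, ~seconds).
    static List<String> keysInRange(String from, String to) {
        return new ArrayList<String>(store.subMap(from, true, to, true).keySet());
    }

    // Step 2: one delete per key (this is the part that takes minutes
    // in the worst case, since each delete is its own round trip).
    static int deleteAll(List<String> keys) {
        int deleted = 0;
        for (String k : keys) {
            if (store.remove(k) != null) deleted++;
        }
        return deleted;
    }
}
```

The point of the sketch is the shape of the load: one cheap range scan followed by a long tail of sequential deletes.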
As these cleanup tasks come to dominate the cluster, individual deletion
latencies grow (more detailed measurements of the shape of this degradation
are forthcoming if that would be helpful).
We are using riak 1.1 built directly from
https://github.com/basho/riak/tree/1.1 with the eleveldb backend. The
eleveldb-specific configuration follows, but fiddling with these settings
hasn't noticeably impacted the behavior we've seen. We're planning to set
delete_mode to immediate and see if that helps.
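For reference, delete_mode lives in the riak_kv section of app.config; a minimal sketch of the change we're planning (the default is a tombstone reap after 3000ms):

```erlang
%% app.config, riak_kv section -- remove tombstones immediately
%% rather than after the default 3s reap interval
{riak_kv, [
    {delete_mode, immediate}
]}
```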
Here's some other info that might be helpful, but feel free to ask for more:
N = 3 (changing to 2) on 9 physical nodes with 32GB of memory each
Our leveldb config looks like this:
%% eLevelDB Config
Changes we've considered making to avoid the need for cleanup tasks:
- use the bitcask backend and have it handle key expiration for us (can't
because our keys definitely won't fit in memory)
- round-robin keys to avoid cleanup tasks entirely: make the applications
smart enough to translate logical keys (time) into stored keys (0-N) -- this
is time-series data. We're unsure how leveldb would respond to overwriting
keys this way.
- write a custom backend or riak_core app for storage
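For what it's worth, the round-robin idea above can be sketched as a pure key-mapping function (the name, prefix, and bucketing period here are hypothetical, not anything our app does today):

```java
// Hypothetical mapping from a logical time-series key to one of N fixed
// stored keys. Writing to an old slot overwrites it in place, so expiry
// becomes an overwrite instead of a range-scan-plus-delete cleanup task.
public class SlotKeys {
    // epochSeconds: the logical key (time)
    // bucketSeconds: width of one time bucket
    // slots: N, the fixed number of stored keys to rotate through
    static String storedKey(long epochSeconds, long bucketSeconds, int slots) {
        long bucket = epochSeconds / bucketSeconds;
        return "ts-slot-" + (bucket % slots);
    }
}
```

With hourly buckets and slots = 24, a key written now silently replaces the data from the same hour yesterday, which is exactly the open question about how leveldb handles steady overwrites.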
Comments appreciated as I dig into this.