riak java client causing OOMs on high latency

Dietrich Featherston d at d2fn.com
Mon Jan 7 17:26:14 EST 2013


We're seeing instances of a JVM app that talks to riak run out of
memory when riak operations rise in latency or riak becomes otherwise
unresponsive. A heap dump of the JVM at the time of the OOM shows that
91% of the 1G (active) heap is consumed by large byte[] instances. In
our case, 3 of those byte[]s are in the 200MB range, with sizes
dropping off after that. The byte[] instances cannot be traced back to
a specific field because their only references appear to be local
variables on method stacks. But, based on the thread's name, we can
tell that the thread is doing a store operation against
riak@localhost.

Inspection of the data in one of these byte[]s shows what looks like
an r_object response, with header and footer boilerplate around our
object payload. This 200+MB byte[] is filled with 0s after the 338th
element, which is really confusing and suggests that far too much
space is being allocated to read the protobuf payload. Here's a dump
of one of these instances:
https://gist.github.com/40ef9b2ff561e973a72c
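
In case it's useful, this is the sort of check I mean by "filled with
0s after the 338th element": a hypothetical helper for inspecting an
array pulled out of the dump, not anything from the client itself.

    // Hypothetical helper, nothing from the client: given a byte[]
    // recovered from the heap dump, report how much of it was actually
    // written before the trailing zeros begin.
    class DumpInspection {
        static int populatedLength(byte[] buf) {
            int lastNonZero = -1;
            for (int i = 0; i < buf.length; i++) {
                if (buf[i] != 0) {
                    lastNonZero = i;
                }
            }
            return lastNonZero + 1;  // ~338 here vs. a 200MB+ capacity
        }
    }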

It's also worth mentioning that, according to /stats,
get_fsm_objsize_100 is consistently under 1MB, so there is no reason
to think that our objects are actually this large.

At this point I'm suspicious that the following code creates too large
a byte[] from a possibly too-large value returned by dis.readInt():

https://github.com/basho/riak-java-client/blob/master/src/main/java/com/basho/riak/pbc/RiakConnection.java#L110
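
For context, the pattern there is roughly: read a 4-byte length,
allocate a byte[] of that size up front, then fill it from the
stream. The sketch below is a paraphrase rather than the driver's
exact code, and the length guard is purely hypothetical (something I'd
expect to see, not something that's there today).

    import java.io.DataInputStream;
    import java.io.IOException;

    class FrameReadSketch {
        // Paraphrased length-prefixed read; not the driver's exact code.
        static byte[] readFrame(DataInputStream dis) throws IOException {
            // If this value is bogus, so is everything below.
            int len = dis.readInt();
            // Hypothetical sanity check: refuse lengths far beyond any
            // object we actually store.
            if (len < 0 || len > 50 * 1024 * 1024) {
                throw new IOException("implausible frame length: " + len);
            }
            // These are the 200MB+ allocations showing up in the heap dump.
            byte[] data = new byte[len];
            dis.readFully(data);
            return data;
        }
    }

If the length were ever read from the wrong four bytes of the stream,
that would fit what the heap dump shows: a buffer allocated at a bogus
size, readFully copying in the few hundred bytes that actually arrive,
and the thread then sitting blocked while the mostly-empty buffer pins
the heap.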

Unsure whether that indicates a problem in the driver or in riak's
Erlang protobuf server.

My suspicion is that requests pile up and many of these byte[]s are
live at once, enough to cause an OOM. It's possible that they are
always this large but normally short-lived enough not to cause a
problem, and that rising latencies briefly increase how many exist at
a time. Three of them at roughly 200MB each already account for about
600MB of the 1G heap, so it wouldn't take many more to exhaust it.

Thoughts?

Thanks,
D



