riak java client causing OOMs on high latency
d at d2fn.com
Mon Jan 7 17:26:14 EST 2013
We're seeing instances of a JVM app that talks to riak run out of
memory when riak operations rise in latency or riak otherwise becomes
unresponsive. A heap dump of the JVM at the time of the OOM shows that
91% of the 1G (active) heap is consumed by large byte[] instances. In
our case 3 of those byte[]s are in the 200MB range, with sizes
dropping off after that. The byte[] instances cannot be traced back to
a specific variable, as their references appear to be stack-allocated
local method variables. But, based on the name of the thread, we can
tell that the thread is doing a store operation against
riak@localhost.
Inspection of the data in one of these byte[]s shows what looks like
an r_object response with header and footer boilerplate around our
object payload. This 200+MB byte[] is filled with 0s after the 338th
element, which is really confusing and indicates that far too much
space is being allocated to read the protobuf payload. Here's a dump
of one of these instances:
It's also worth mentioning that, according to /stats,
get_fsm_objsize_100 is consistently under 1MB so there is no reason to
think that our objects are actually this large.
At this point I'm suspicious of the following code creating too large
a byte[] from a possibly too large return value from dis.readInt().
I'm unsure whether that indicates a problem in the driver or in the
server-side Erlang protobuf server.
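
To illustrate the concern, here is a minimal sketch of a length-prefixed
read with a sanity check on the length. It is not the client's actual
code. It assumes the usual riak protocol buffers framing (a 4-byte
big-endian length, a 1-byte message code, then the payload), and the
names FrameReader, readFrame and MAX_FRAME_BYTES are hypothetical. The
point is only that whatever readInt() returns gets allocated up front,
so a bogus length turns directly into a huge byte[].

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

final class FrameReader {
    // Hypothetical cap (an assumption, not a riak setting): choose a value
    // comfortably above the largest object you expect to store.
    static final int MAX_FRAME_BYTES = 16 * 1024 * 1024;

    // Reads one frame: 4-byte big-endian length, 1-byte message code, payload.
    // The length is assumed to cover the code byte plus the payload.
    static byte[] readFrame(DataInputStream dis) throws IOException {
        int len = dis.readInt();
        if (len < 1 || len > MAX_FRAME_BYTES) {
            // Refuse to allocate instead of trusting the wire value.
            throw new IOException("implausible frame length: " + len);
        }
        dis.readUnsignedByte();            // message code, ignored in this sketch
        byte[] data = new byte[len - 1];   // allocation is driven by the length
        dis.readFully(data);
        return data;
    }

    public static void main(String[] args) throws IOException {
        // Well-formed frame: length 3 = 1 code byte + 2 payload bytes.
        byte[] good = {0, 0, 0, 3, 12, 1, 2};
        System.out.println(
            readFrame(new DataInputStream(new ByteArrayInputStream(good))).length);

        // Four header bytes that decode to 201326592 (about 192MB).
        byte[] bad = {0x0C, 0, 0, 0, 12};
        try {
            readFrame(new DataInputStream(new ByteArrayInputStream(bad)));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Without the check, the second input would try to allocate roughly 192MB
before reading a single payload byte, which is the kind of mostly-zero
buffer seen in the heap dump. Whether the real length value is wrong on
the wire or misread by the driver is still an open question.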
I suspect that requests pile up and many of these byte[]s are hanging
around at once, enough to cause an OOM. It's possible that they are
always very large but are short-lived enough not to cause a problem
until latencies rise, briefly increasing their numbers.