riak java client causing OOMs on high latency

Brian Roach roach at basho.com
Tue Jan 8 09:50:06 EST 2013


The code you cite reads a size (32-bit int, network byte order) and a
message code (8-bit int) from the socket. It then allocates a byte[]
sized to hold the response data that Riak is sending back to the
client. (The docs here describe this format:
http://docs.basho.com/riak/latest/references/apis/protocol-buffers/ )
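
To make the framing concrete, here is a minimal sketch of that kind of
read. It is not the client's actual code. It assumes the length prefix
counts the message-code byte plus the body; the class name FrameReader,
the encode() helper, and the maxFrameBytes sanity check are
hypothetical and are there only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical illustration of a length-prefixed frame reader.
public class FrameReader {

    /** One decoded frame: the message code plus the raw payload bytes. */
    public static final class Frame {
        public final int code;
        public final byte[] body;

        Frame(int code, byte[] body) {
            this.code = code;
            this.body = body;
        }
    }

    /**
     * Reads one frame: 4-byte big-endian length, 1-byte message code,
     * then (length - 1) bytes of payload. The maxFrameBytes bound is an
     * assumption added here so a bogus length cannot trigger a huge
     * allocation.
     */
    public static Frame readFrame(DataInputStream in, int maxFrameBytes)
            throws IOException {
        int len = in.readInt(); // DataInputStream reads big-endian
        if (len < 1 || len > maxFrameBytes) {
            throw new IOException("Unreasonable frame length: " + len);
        }
        int code = in.readUnsignedByte();
        byte[] body = new byte[len - 1]; // allocation is driven by the prefix
        in.readFully(body);
        return new Frame(code, body);
    }

    /** Builds a frame in the same layout, for testing the reader. */
    public static byte[] encode(int code, byte[] body) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(body.length + 1); // code byte + payload
        out.writeByte(code);
        out.write(body);
        out.flush();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = encode(10, new byte[] {1, 2, 3});
        Frame f = readFrame(
                new DataInputStream(new ByteArrayInputStream(wire)), 1024);
        System.out.println(f.code + " " + f.body.length); // prints "10 3"
    }
}
```

The point of the sketch is that the size of the byte[] comes entirely
from the length prefix on the wire, so the reader allocates whatever
the sender says is coming.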

That byte[] is then passed to the code generated by Google protocol
buffers, which deserializes the appropriate protocol buffers object(s)
from those bytes. The information they contain is copied into our own
objects, which are returned to the caller as the response.

From the client's perspective, if that's how much data you're getting,
that's how much data you've requested, and how much Riak has sent you.

Thanks,
Brian Roach

On Mon, Jan 7, 2013 at 3:26 PM, Dietrich Featherston <d at d2fn.com> wrote:
> We're seeing instances of a JVM app which talks to riak run out of
> memory when riak operations rise in latency or riak becomes otherwise
> unresponsive. A heap dump of the JVM at the time of the OOM shows that
> 91% of the 1G (active) heap is consumed by large byte[] instances. In
> our case 3 of those byte[]s are in the 200MB range with size dropping
> off after that. The byte[] instances cannot be traced back to a
> specific variable as their references appear to be stack-allocated
> local method variables. But, based on the name of the thread, we can
> tell that the thread is doing a store operation against
> riak at localhost.
>
> Inspection of the data in one of these byte[]s shows what looks like
> an r_object response with headers and footer boilerplate around our
> object payload. This 200+MB byte[] is filled with 0s after the 338th
> element which is really confusing and indicates that far too much
> space is being allocated to read the protobuf payload. Here's a dump
> of one of these instances:
> https://gist.github.com/40ef9b2ff561e973a72c
>
> It's also worth mentioning that, according to /stats,
> get_fsm_objsize_100 is consistently under 1MB so there is no reason to
> think that our objects are actually this large.
>
> At this point I'm suspicious of the following code creating too large
> a byte[] from possibly too large a return from dis.readInt()
>
> https://github.com/basho/riak-java-client/blob/master/src/main/java/com/basho/riak/pbc/RiakConnection.java#L110
>
> Unsure if that indicates a problem in the driver or the server-side
> erlang protobuf server.
>
> Suspicious that requests pile up and many of these byte[]s are hanging
> out--enough to cause an OOM. It's possible that they are always very
> large, but are short-lived enough as to not cause a problem until
> latencies rise increasing their numbers briefly.
>
> Thoughts?
>
> Thanks,
> D