Riak crashing due to "eheap_alloc: Cannot allocate xxxx bytes of memory"

Jeff Pollard jeff.pollard at gmail.com
Wed Jul 6 00:28:44 EDT 2011


Thanks to some help from Aphyr and Sean Cribbs on IRC, we narrowed the issue
down to us having several documents of multiple hundred megabytes each and
one 1.1 gig document.  Since we deleted those documents, the cluster has been
running quite happily for 3+ hours, whereas before nodes were crashing after
15 minutes.
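For anyone hunting for the same kind of oversized objects, here is a rough Python sketch of how one might scan a bucket over Riak's HTTP interface and report keys whose stored size exceeds a limit. It assumes a node at 127.0.0.1:8098 (Riak's default HTTP port), the `/riak/<bucket>` URL scheme, and that a HEAD request reports the object's `Content-Length`; the bucket name and all helper names are hypothetical. Listing every key in a bucket is expensive on a live cluster, and the node probably still has to read each object to answer the HEAD, so this is best run carefully, ideally against a staging copy.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical node address; 8098 is Riak's default HTTP port.
RIAK = "http://127.0.0.1:8098"


def list_keys(bucket):
    """Return every key in a bucket. Key listing is expensive on a live cluster."""
    url = f"{RIAK}/riak/{urllib.parse.quote(bucket, safe='')}?keys=true&props=false"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["keys"]


def object_size(bucket, key):
    """Return the Content-Length a HEAD request reports, or None if it is absent."""
    path = f"{urllib.parse.quote(bucket, safe='')}/{urllib.parse.quote(key, safe='')}"
    req = urllib.request.Request(f"{RIAK}/riak/{path}", method="HEAD")
    with urllib.request.urlopen(req) as resp:
        length = resp.headers.get("Content-Length")
        return int(length) if length is not None else None


def find_large_objects(bucket, limit_bytes, keys=list_keys, size_of=object_size):
    """Return (key, size) pairs larger than limit_bytes, biggest first."""
    sized = ((key, size_of(bucket, key)) for key in keys(bucket))
    large = [(key, size) for key, size in sized if size is not None and size > limit_bytes]
    return sorted(large, key=lambda pair: pair[1], reverse=True)
```

For example, `find_large_objects("docs", 100 * 1024 * 1024)` would list keys over 100 MB in a hypothetical `docs` bucket. The key lister and size function are parameters only so the filtering can be exercised without a cluster. Objects with siblings are not handled (the HEAD would likely come back as a non-200 status and raise).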

I've managed to delete most of the large documents, but there are still a
handful (3) that I am unable to delete.  Attempts to curl -X DELETE them
result in a 503 error from Riak:

    < HTTP/1.1 503 Service Unavailable
    < Server: MochiWeb/1.1 WebMachine/1.7.3 (participate in the frantic)
    < Date: Wed, 06 Jul 2011 04:20:15 GMT
    < Content-Type: text/plain
    < Content-Length: 18
    <
    request timed out
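If the 503s are transient, for example because the node is briefly under memory pressure, a DELETE can be retried with a growing pause between attempts. The following is a minimal Python sketch of that idea, not a fix for the underlying problem; the URL, timeout, and attempt counts are hypothetical, and it assumes Riak answers a successful delete with 204 and a missing key with 404. The "request timed out" body probably comes from Riak's own internal timeout rather than the client's, so raising the client timeout by itself may not help.

```python
import time
import urllib.error
import urllib.request


def http_delete(url, timeout):
    """Send one DELETE and return the HTTP status code, error statuses included."""
    req = urllib.request.Request(url, method="DELETE")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def delete_with_retry(url, attempts=5, base_delay=1.0, timeout=120.0,
                      send=http_delete, sleep=time.sleep):
    """Retry a DELETE while it returns 503, doubling the pause each time.

    Any other status (e.g. 204 deleted, 404 already gone) ends the loop.
    """
    status = None
    for attempt in range(attempts):
        status = send(url, timeout)
        if status != 503:
            return status
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status
```

Usage would look like `delete_with_retry("http://127.0.0.1:8098/riak/docs/big-key")`, where the bucket and key are made up. `send` and `sleep` are injectable only so the retry logic can be checked without a live node.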


In the erlang.log, I see this right before the timeout comes back:

    =INFO REPORT==== 5-Jul-2011::21:26:35 ===
    [{alarm_handler,{set,{process_memory_high_watermark,<0.10425.0>}}}]
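For context, that alarm is raised by Erlang's os_mon `memsup` when a single Erlang process allocates more than a set fraction of the machine's memory, 5% by default (`process_memory_high_watermark`). A quick back-of-the-envelope calculation in Python shows what that means on a 32 GB box, assuming the default has not been changed; the function name is just for illustration:

```python
GIB = 1024 ** 3


def procmem_threshold_bytes(total_bytes, fraction=0.05):
    """Bytes a single Erlang process may use before memsup sets the alarm.

    fraction=0.05 mirrors memsup's default process_memory_high_watermark (5%).
    """
    return int(total_bytes * fraction)


threshold = procmem_threshold_bytes(32 * GIB)
print(f"per-process alarm threshold: {threshold / GIB:.1f} GiB")  # 1.6 GiB
```

So one process handling a document of several hundred megabytes, plus whatever copies of it get made along the way, could plausibly cross that line. This is a guess at the connection, not a diagnosis.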


Does anyone have any ideas on what's going on here and how to fix it?

On Tue, Jul 5, 2011 at 8:58 AM, Jeff Pollard <jeff.pollard at gmail.com> wrote:

> Over the last few days we've had random nodes in our 5-node cluster crash
> with "eheap_alloc: Cannot allocate xxxx bytes of memory" errors in the
> erl_crash.dump file.  In general, the crashes happen while trying to
> allocate 13-20 gigs of memory (our boxes have 32 gigs total).  As far as I
> can tell, the crashes don't coincide with any particular requests to
> Riak.  I've tried to make some sense of the erl_crash.dump file but haven't
> had any luck.  I'm also in the process of restoring our Riak backups to our
> staging cluster in hopes of reproducing the issue more accurately in a less
> noisy environment.
>
> My questions for the list are:
>
>    1. Any clue how to further diagnose the issue? I can attach my
>    erl_crash.dump if needed.
>    2. Is it possible/likely this is due to large m/r requests?  We have a
>    couple of m/r requests.  One goes over no more than 4 documents at a time,
>    while the other goes over anywhere between 60 and 10,000 documents, though
>    usually closer to the lower end.  We use 16 JS VMs, each with a max VM
>    memory and stack size of 32 MB.
>    3. We're running Riak 0.14.1.  Would upgrading to 0.14.2 help?
>
> Thanks!
>


More information about the riak-users mailing list