Riak crashing due to "eheap_alloc: Cannot allocate xxxx bytes of memory"

Aphyr aphyr at aphyr.com
Wed Jul 6 00:36:23 EDT 2011


Since you were able to create and write these objects in the first 
place, you probably had enough RAM at one point to load and save them. I 
would try bringing each node up in isolation, then issuing a delete 
request against the local node, then restarting the node in normal, 
talking-to-the-ring mode. If there are any local processes you can stop 
to free up memory, try that too.

When I encountered this problem, I was able to use riak:local_client 
from the Erlang shell to delete my huge objects--so long as other 
processes weren't hammering the node with requests.
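
If it helps, a rough sketch of what that looked like (from memory, so 
treat the exact calls as approximate; the bucket and key below are 
placeholders, and the trailing argument to delete is the RW value):

    %% From "riak attach" on a node that holds the data:
    {ok, C} = riak:local_client().
    %% Bucket and key are placeholders; the 1 is the RW value.
    C:delete(<<"my_bucket">>, <<"huge_object_key">>, 1).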

--Kyle

On 07/05/2011 09:28 PM, Jeff Pollard wrote:
> Thanks to some help from Aphyr + Sean Cribbs on IRC, we narrowed the
> issue down to several multiple-hundred-megabyte documents and one
> 1.1 gig document.  Deletion of those documents has kept the cluster
> running quite happily for 3+ hours now, where before nodes were
> crashing after 15 minutes.
>
> I've managed to delete most of the large documents, but there are still
> a handful (3) that I am unable to delete.  Attempts to curl -X DELETE
> them result in a 503 error from Riak:
>
>     < HTTP/1.1 503 Service Unavailable
>     < Server: MochiWeb/1.1 WebMachine/1.7.3 (participate in the frantic)
>     < Date: Wed, 06 Jul 2011 04:20:15 GMT
>     < Content-Type: text/plain
>     < Content-Length: 18
>
>     <
>     request timed out
>
>
> In the erlang.log, I see this right before the timeout comes back:
>
>     =INFO REPORT==== 5-Jul-2011::21:26:35 ===
>     [{alarm_handler,{set,{process_memory_high_watermark,<0.10425.0>}}}]
>
>
> Anyone have any help/ideas on what's going on here and how to fix it?
>
> On Tue, Jul 5, 2011 at 8:58 AM, Jeff Pollard
> <jeff.pollard at gmail.com> wrote:
>
>     Over the last few days we've had random nodes in our 5-node cluster
>     crash with "eheap_alloc: Cannot allocate xxxx bytes of memory"
     errors in the erl_crash.dump file.">
>     errors in the erl_crash.dump file.  In general, the crashes happen
>     while trying to allocate 13-20 gigs of memory (our boxes have 32
>     gigs total).  As far as I can tell, crashing doesn't seem to
>     coincide with any particular requests to Riak.  I've tried to make
>     some sense of the erl_crash.dump file but haven't had any luck.  I'm
>     also in the process of restoring our Riak backups to our staging
>     cluster in hopes of more accurately reproducing the issue in a less
>     noisy environment.
>
>     My questions for the list are:
>
>        1. Any clue how to further diagnose the issue? I can attach my
>           erl_crash.dump if needed.
>        2. Is it possible/likely this is due to large m/r requests?  We
>           have a couple of m/r requests.  One goes over no more than 4
>           documents at a time, while the other goes over anywhere between
>           60 and 10,000 documents, though usually closer to the smaller
>           end of that range.  We use 16 JS VMs, with max VM memory and
>           thread stack of 32 MB each (relevant app.config lines are
>           sketched after this list).
>        3. We're running riak 0.14.1.  Would upgrading to 0.14.2 help?
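>
>     For reference, those JS VM settings live in the riak_kv section of
>     app.config.  Paraphrasing from memory (so the exact counts and other
>     entries may differ), the relevant knobs look roughly like this:
>
>         {riak_kv, [
>             %% ...other riak_kv settings omitted...
>             %% pool of JS VMs started for map phases
>             {map_js_vm_count, 16},
>             %% per-VM memory ceiling and thread stack size, in MB
>             {js_max_vm_mem, 32},
>             {js_thread_stack, 32}
>         ]}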
>
>     Thanks!
>
>
>
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com



