Riak crashing due to "eheap_alloc: Cannot allocate xxxx bytes of memory"
aphyr at aphyr.com
Wed Jul 6 00:36:23 EDT 2011
Since you were able to create and write these objects in the first
place, you probably had enough RAM at one point to load and save them. I
would try bringing each node up in isolation, then issuing a delete
request against the local node, then restarting the node in normal,
talking-to-the-ring mode. If there are any local processes you can stop
to free up memory, try that too.
When I encountered this problem, I was able to use the riak:local_client
at the erlang shell to delete my huge objects--so long as other
processes weren't hammering it with requests.
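In case it helps, the shell session looked something like this (run from
riak attach; the bucket and key here are placeholders, and the exact
client API may vary across 0.14.x releases):

    {ok, C} = riak:local_client().
    %% RW of 1 so the delete can succeed against just this node
    C:delete(<<"my_bucket">>, <<"huge_key">>, 1).

Once the big objects are gone on each node, bring it back up normally.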
On 07/05/2011 09:28 PM, Jeff Pollard wrote:
> Thanks to some help from Aphyr + Sean Cribbs on IRC, we narrowed the
> issue down to us having several multi-hundred-megabyte documents and
> one 1.1 GB document. Deleting those documents has kept the cluster
> running quite happily for 3+ hours now, where before nodes were
> crashing after 15 minutes.
> I've managed to delete most of the large documents, but there are still
> a handful (3) that I am unable to delete. Attempts to curl -X DELETE
> them result in a 503 error from Riak:
> < HTTP/1.1 503 Service Unavailable
> < Server: MochiWeb/1.1 WebMachine/1.7.3 (participate in the frantic)
> < Date: Wed, 06 Jul 2011 04:20:15 GMT
> < Content-Type: text/plain
> < Content-Length: 18
> request timed out
> In the erlang.log, I see this right before the timeout comes back:
> =INFO REPORT==== 5-Jul-2011::21:26:35 ===
> Anyone have any help/ideas on what's going on here and how to fix it?
> On Tue, Jul 5, 2011 at 8:58 AM, Jeff Pollard <jeff.pollard at gmail.com
> <mailto:jeff.pollard at gmail.com>> wrote:
> Over the last few days we've had random nodes in our 5-node cluster
> crash with "eheap_alloc: Cannot allocate xxxx bytes of memory"
> errors in the erl_crash.dump file. In general, the crashes occur
> while trying to allocate 13-20 GB of memory (our boxes have 32 GB
> total). As far as I can tell the crashes don't seem to
> coincide with any particular requests to Riak. I've tried to make
> some sense of the erl_crash.dump file but haven't had any luck. I'm
> also in the process of restoring our Riak backups to our staging
> cluster in hopes of more accurately reproducing the issue in a less
> noisy environment.
> My questions for the list are:
> 1. Any clue how to further diagnose the issue? I can attach my
> erl_crash.dump if needed.
> 2. Is it possible/likely this is due to large m/r requests? We
> have a couple of m/r requests. One covers no more than 4
> documents at a time, while the other covers anywhere between
> 60 and 10,000 documents, though usually toward the smaller
> end of that range. We use 16 JS VMs, each with a max VM
> memory and thread stack of 32 MB.
> 3. We're running riak 0.14.1. Would upgrading to 0.14.2 help?
> riak-users mailing list
> riak-users at lists.basho.com