Riak crashing due to "eheap_alloc: Cannot allocate xxxx bytes of memory"

Since you were able to create and write these objects in the first
place, you probably had enough RAM at one point to load and save them. I
would try bringing each node up in isolation, issuing a delete
request against the local node, and then restarting the node in normal,
talking-to-the-ring mode. If there are any local processes you can stop
to free up memory, try that too.
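
A minimal sketch of that local delete over HTTP, assuming the node
listens on the default 127.0.0.1:8098 and serves the /riak/<bucket>/<key>
URL scheme. The helper names, the rw=1 quorum, and the 10-minute timeout
are illustrative assumptions on my part, not something tested against
your cluster:

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def delete_url(bucket, key, host="127.0.0.1", port=8098, rw=1):
    """Build the HTTP URL for deleting one object via the local node.

    rw=1 is an assumed quorum so a single reachable node can answer;
    adjust it to whatever your bucket settings require.
    """
    return "http://%s:%d/riak/%s/%s?rw=%d" % (
        host,
        port,
        quote(bucket, safe=""),  # percent-encode anything unsafe in a path
        quote(key, safe=""),
        rw,
    )


def delete_object(bucket, key, timeout=600, **kwargs):
    """Send the DELETE and return the HTTP status code.

    The long timeout is a guess: a huge object may take a while to load.
    urlopen raises urllib.error.HTTPError on 4xx/5xx responses (e.g. 503).
    """
    req = Request(delete_url(bucket, key, **kwargs), method="DELETE")
    with urlopen(req, timeout=timeout) as resp:
        return resp.status
```

You would call something like delete_object("mybucket", "hugekey") on
each node in turn (bucket and key here are placeholders), while nothing
else is sending that node requests.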

When I encountered this problem, I was able to use the riak:local_client
at the Erlang shell to delete my huge objects--so long as other
processes weren't hammering the node with requests.


> Thanks to some help from Aphyr + Sean Cribbs on IRC, we narrowed the
> issue down to us having several multiple-hundred-megabyte sized
> documents and one 1.1 gig document.  Deleting those documents has
> kept the cluster running quite happily for 3+ hours now, where before
> nodes were crashing after 15 minutes.
> I've managed to delete most of the large documents, but there are still
> a handful (3) that I am unable to delete.  Attempts to curl -X DELETE
> them result in a 503 error from Riak:
> In the erlang.log, I see this right before the timeout comes back:
>     =INFO REPORT==== 5-Jul-2011::21:26:35 ===
>     [{alarm_handler,{set,{process_memory_high_watermark,<0.10425.0>}}}]
> Anyone have any help/ideas on what's going on here and how to fix it?
>     Over the last few days we've had random nodes in our 5-node cluster
>     crash with "eheap_alloc: Cannot allocate xxxx bytes of memory"
>     errors in the erl_crash.dump file.  In general, the nodes
>     seem to crash trying to allocate 13-20 gigs of memory (our boxes
>     have 32 gigs total).  As far as I can tell crashing doesn't seem to
>     coincide with any particular requests to Riak.  I've tried to make
>     some sense of the erl_crash.dump file but haven't had any luck.  I'm
>     also in the process of restoring our Riak backups to our staging
>     cluster in hopes of more accurately reproducing the issue in a less
>     noisy environment.
>     My questions for the list are:
>        1. Any clue how to further diagnose the issue? I can attach my
>           erl_crash.dump if needed.
>        2. Is it possible/likely this is due to large m/r requests?  We
>           have a couple m/r requests.  One goes over no more than 4
>           documents at a time while the other goes over anywhere between
>           60 and 10,000 documents, though more towards the smaller
>           number of documents.  We use 16 js VMs with max memory for the
>           VM and stack of 32 MB, each.
>        3. We're running Riak 0.14.1.  Would upgrading to 0.14.2 help?
>     Thanks!
