Recovering Riak data if it can no longer load in memory

Vikram Lalit vikramlalit at
Tue Jul 12 16:04:43 EDT 2016

Thanks very much, Matthew. Yes, the servers are low on memory since this is
only development right now - I'm using AWS micro instances, so 1 GB RAM and
1 vCPU each.

Thanks for the tip - let me try moving the data to a larger instance and see
how that works. Rather than reducing the memory footprint in dev, my concern
was more about how to react to a possible production scenario where the db
stops responding due to memory overload. I understand now that moving to a
larger instance should be possible. Thanks again.
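
For anyone checking a node before moving it, here is a minimal sketch (not
from Matthew's reply) that lists how large the leveldb MANIFEST files are
under a Riak data directory. The default path /var/lib/riak/leveldb and the
helper names are assumptions for illustration; adjust the path to your
platform_data_dir. MANIFEST size alone does not tell you how much memory
leveldb needs, so treat the numbers only as a rough indicator when picking
the larger instance.

```python
import os
import sys


def manifest_sizes(data_root):
    """Return {path: size_in_bytes} for each leveldb MANIFEST-* file under data_root."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(data_root):
        for name in filenames:
            # leveldb names its manifest files MANIFEST-<number>
            if name.startswith("MANIFEST-"):
                path = os.path.join(dirpath, name)
                sizes[path] = os.path.getsize(path)
    return sizes


def summarize(sizes):
    """Return (total_bytes, largest_bytes) across all MANIFEST files found."""
    total = sum(sizes.values())
    largest = max(sizes.values(), default=0)
    return total, largest


if __name__ == "__main__":
    # Assumed default location; pass the real leveldb directory as argv[1].
    root = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/riak/leveldb"
    found = manifest_sizes(root)
    total, largest = summarize(found)
    print(f"{len(found)} MANIFEST files, total {total} bytes, largest {largest} bytes")
```

Run it with the node stopped, for example: python manifest_sizes.py
/var/lib/riak/leveldb (the script file name is also hypothetical).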

On Tue, Jul 12, 2016 at 12:26 PM, Matthew Von-Maszewski <matthewv at> wrote:

> It would be helpful if you described the physical characteristics of the
> servers: memory size, logical CPU count, etc.
>
> Google created leveldb to be highly reliable in the face of crashes. If
> it is not restarting, that suggests to me that you have a low-memory
> condition in which leveldb's MANIFEST file cannot be loaded. That is easily
> fixed by moving the dataset to a machine with more memory.
>
> There is also a special flag to reduce Riak's leveldb memory footprint
> during development work. The setting reduces leveldb performance, but
> lets you run with less memory.
>
> In riak.conf, set:
>
> leveldb.limited_developer_mem = true
>
> Matthew
>
> > On Jul 12, 2016, at 11:56 AM, Vikram Lalit <vikramlalit at> wrote:
> >
> > Hi - I've been testing a Riak cluster (of 3 nodes) with an ejabberd
> > messaging cluster in front of it that writes data to the Riak nodes.
> > While load testing the platform (by creating 0.5 million ejabberd users
> > via Tsung), I found that the Riak nodes suddenly crashed. My question is:
> > how do we recover from such a situation if it were to occur in production?
> >
> > To provide further context, the leveldb log files storing the data
> > suddenly grew too large for the AWS Riak instances to load into memory.
> > We now get a core dump whenever 'riak start' is run on those instances.
> > I had n_val = 2, and all 3 nodes went down almost simultaneously, so in
> > such a scenario we cannot even rely on a second copy of the data. One way
> > to prevent this in the first place would of course be auto-scaling, but
> > I'm wondering whether there is an after-the-fact recovery that can be
> > performed in such a scenario. Is it possible to simply copy the leveldb
> > data to a larger-memory instance, or to trim the data so that it loads on
> > the same instance?
> >
> > I'd appreciate any input - I'm a tad concerned about how we could
> > recover from such a situation if it were to happen in production (apart
> > from using auto-scaling as a preventive measure).
> >
> > Thanks!
> >
> > _______________________________________________
> > riak-users mailing list
> > riak-users at
> >