Recovering Riak data if it can no longer load in memory
matthewv at basho.com
Tue Jul 12 16:18:50 EDT 2016
You can further reduce memory used by leveldb with the following setting in riak.conf:
leveldb.threads = 5
The value "5" needs to be a prime number. The system defaults to 71. Many Linux implementations will allocate 8 MB per thread for stack, so bunches of threads lead to bunches of memory reserved for stack. That is fine on servers with more memory, but it is probably part of your problem on a small-memory machine.
The thread count is high to promote parallelism across vnodes on the same server, especially with "entropy = active". So again, this setting is sacrificing performance to save memory.
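A rough sketch of the arithmetic behind the thread setting, assuming the 8 MB per-thread stack figure mentioned above. The helper names are made up for illustration and are not part of Riak or leveldb; the result is an upper bound on reserved stack space, not measured memory use.

```python
STACK_MB_PER_THREAD = 8  # assumption: the per-thread stack figure quoted above


def is_prime(n: int) -> bool:
    """Return True if n is prime (trial division; fine for small thread counts)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def stack_reserved_mb(threads: int, stack_mb: int = STACK_MB_PER_THREAD) -> int:
    """Upper bound on memory reserved for thread stacks, in MB."""
    return threads * stack_mb


if __name__ == "__main__":
    # Compare the default thread count (71) with the reduced value (5).
    for threads in (71, 5):
        print(threads, is_prime(threads), stack_reserved_mb(threads), "MB")
```

Under that assumption, the default of 71 threads reserves up to 568 MB for stacks, more than half of a 1 GB instance, while 5 threads reserve up to 40 MB.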
P.S. You really want 8 CPU cores, 4 as a dirt minimum. And review this for more cpu performance info:
> On Jul 12, 2016, at 4:04 PM, Vikram Lalit <vikramlalit at gmail.com> wrote:
> Thanks much Matthew. Yes the server is low-memory given only development right now - I'm using an AWS micro instance, so 1 GB RAM and 1 vCPU.
> Thanks for the tip - let me try moving the manifest file to a larger instance and see how that works. More than reducing the memory footprint in dev, my concern was about reacting to a possible production scenario where the db stops responding due to memory overload. Understood now that moving to a larger instance should be possible. Thanks again.
> On Tue, Jul 12, 2016 at 12:26 PM, Matthew Von-Maszewski <matthewv at basho.com> wrote:
> It would be helpful if you described the physical characteristics of the servers: memory size, logical cpu count, etc.
> Google created leveldb to be highly reliable in the face of crashes. If it is not restarting, that suggests to me that you have a low memory condition that is not able to load leveldb's MANIFEST file. That is easily fixed by moving the dataset to a machine with larger memory.
> There is also a special flag to reduce Riak's leveldb memory footprint during development work. The setting reduces leveldb performance, but lets you run with less memory.
> In riak.conf, set:
> leveldb.limited_developer_mem = true
> > On Jul 12, 2016, at 11:56 AM, Vikram Lalit <vikramlalit at gmail.com> wrote:
> > Hi - I've been testing a Riak cluster (of 3 nodes) with an ejabberd messaging cluster in front of it that writes data to the Riak nodes. Whilst load testing the platform (by creating 0.5 million ejabberd users via Tsung), I found that the Riak nodes suddenly crashed. My question is how do we recover from such a situation if it were to occur in production?
> > To provide further context: the leveldb log files storing the data suddenly grew too large for the AWS Riak instances to load into memory, so we get a core dump if 'riak start' is run on those instances. I had n_val = 2, and all 3 nodes went down almost simultaneously, so in such a scenario we cannot even rely on a second copy of the data. One way to prevent this in the first place would of course be auto-scaling, but I'm wondering whether there is an after-the-fact recovery that can be performed in such a scenario. Is it possible to simply copy the leveldb data to a larger-memory instance, or to trim the data so that it loads on the same instance?
> > I'd appreciate your input - I'm a tad concerned as to how we could recover from such a situation if it were to happen in production (apart from leveraging auto-scaling as a preventive measure).
> > Thanks!