Riak cluster-f#$%

Michael Truog mjtruog at gmail.com
Mon Oct 1 17:07:57 EDT 2012


Still, doesn't that failure show a typical overload of Riak's usage of mochiglobal (i.e., the code_server needing to lock all Erlang schedulers)?  I understand that running more than one node on a single machine is not a realistic deployment.  However, I don't see why it would cause errors, unless Riak was unable to handle the incoming requests.

On 10/01/2012 01:54 PM, Alexander Sicular wrote:
> Any time you overload one box you run into all sorts of I/O dreck; once you also screw with your conf files and mess with your versions, you have too many variables in the mix to get anything meaningful out of what you were trying to do. Since this is a test, just tear the whole thing down and start clean.
>
> If you want to dev test your app, just use one node and dial the n_val down to one in app.config. That setting isn't actually there by default, so you'll have to add it manually to the riak_core section, like so (along with some other bucket properties):
>
> {default_bucket_props, [{n_val,1},
>    {allow_mult,false},
>    {last_write_wins,false},
>    {precommit, []},
>    {postcommit, []},
>    {chash_keyfun, {riak_core_util, chash_std_keyfun}}
> ]}  
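>
> For reference, a minimal sketch of roughly where that lands in your app.config (the other values shown are just placeholders for whatever is already in your file):
>
> [
>  {riak_core, [
>        %% existing riak_core settings stay as they are; this value is only a placeholder
>        {ring_creation_size, 64},
>        %% add the default_bucket_props block from above here
>        {default_bucket_props, [{n_val,1},
>                                {allow_mult,false}]}
>       ]}
>  %% ... the riak_kv and other application sections follow here ...
> ].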
>
> (Hey Basho people, that stuff should be in the app.config file by default. Making people go fish for it and figure out how and where to add this stuff is kinda unnecessary. Here is an example of a great conf file with everything you can conf and a whole bunch of docs: https://github.com/antirez/redis/blob/unstable/redis.conf ).
>
> If you want to performance test your app, make your dev system as similar to your prod system as possible and knock it out.
>
>
> -Alexander Sicular
>
> @siculars
>
> On Oct 1, 2012, at 4:30 PM, Callixte Cauchois wrote:
>
>> Thank you, but can you explain a bit more?
>> I mean, I understand why it is a bad thing with regard to reliability and in the case of hardware issues. But does it also have an impact on the behaviour when the hardware is performing correctly and the load on the machines is the same?
>>
>> On Mon, Oct 1, 2012 at 1:25 PM, Alexander Sicular <siculars at gmail.com> wrote:
>>
>>     Inline.
>>
>>     -Alexander Sicular
>>
>>     @siculars
>>
>>     On Oct 1, 2012, at 3:23 PM, Callixte Cauchois wrote:
>>
>>     > Hi there,
>>     >
>>     > so, I am currently evaluating Riak to see how it can fit into our platform. To do so, I have set up a cluster of 4 nodes on SmartOS, all of them on the same physical box.
>>
>>     Mistake. Just stop here; nothing else matters. Do not put all your virtual machines (Riak nodes) on one physical machine. Put them on different physical machines, fix the config files, and try again.
>>
>>     > I then built a simple application in node.js that gets log events from our production system through a RabbitMQ queue and stores them in my cluster. I let Riak generate the ids, but I have added two secondary indices so I can more easily retrieve all the log events that belong to a single session.
>>     > Everything was going fine; events coming in at around 130 messages per second are easily ingested by Riak. When I stop it and then restart it, there is a bit of an issue, as the events are read from the queue at 1500 messages per second and the insertion times go up, so I need some retries to actually store everything.
>>     > I wanted to tweak the LevelDB params to increase the throughput. To do so, I first upgraded from 1.1.6 to 1.2.0. I chose what I thought was the safest way: node by node, I had each node leave the cluster, then upgraded it, then had it join again. During the whole process I kept inserting.
>>     > It went quite well. But when I ran some queries using 2i, it gave me errors, and I realized that for two of my four nodes I had forgotten to put back eLevelDB as the default engine. As soon as I ran this query, everything went haywire: a lot of inserts failed, and some nodes were not reachable using the ping URL.
>>     > I changed the default engine back and restarted those nodes, but nothing changed. I tried to make them leave the cluster; after two days, they are still leaving. riak-admin transfers says that a lot of transfers need to occur, but the system is stuck: the numbers there do not change.
>>     >
>>     > I guess I have done several things wrong. It is test data, so it doesn't really matter if I lose data or have to restart from scratch, but I want to understand what went wrong and how I could have fixed it. Or whether I can even recover from here now.
>>     >
>>     > Thank you.
>>     > C.
>>
>>
>
>
