Riak cluster-f#$%

Callixte Cauchois ccauchois at virtuoz.com
Tue Oct 2 11:01:28 EDT 2012

I understand that having everything on one box has side effects and that I
may be limited by disk I/O. But, still, I do not understand why the same
bucket can use a different engine on different nodes, or why Riak Control was
reporting everything as OK when some nodes were not responding.

On Mon, Oct 1, 2012 at 2:07 PM, Michael Truog <mjtruog at gmail.com> wrote:

> Still, doesn't that failure show a typical overload of Riak's usage of
> mochiglobal (i.e., the code_server needing to lock all Erlang schedulers)?
> I understand that running more than one node on a single machine is not
> a realistic deployment.  However, I don't see why it would cause errors,
> unless Riak was unable to handle the incoming requests.
> On 10/01/2012 01:54 PM, Alexander Sicular wrote:
> Any time you overload one box, you run into all sorts of I/O dreck. Screw
> with your conf files and mess with your versions, and you just have too many
> variables in the mix to get anything meaningful out of what you were trying
> to do. Since this is a test, just tear the whole thing down and start
> clean.
>  If you want to dev test your app just use one node and dial the n val
> down to one in the app.config, which isn't actually there so you'll have to
> add it manually to the riak_core section like so (with some other stuff):
>  {default_bucket_props, [{n_val,1},
>    {allow_mult,false},
>    {last_write_wins,false},
>    {precommit, []},
>    {postcommit, []},
>    {chash_keyfun, {riak_core_util, chash_std_keyfun}}
>  ]}
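>
>  As a rough sketch of where that fragment would sit, assuming a stock
> 1.2-era app.config (the ring_state_dir entry is only a stand-in for
> whatever riak_core settings are already in the file):
>
> ```erlang
> %% app.config (sketch) -- the existing entry shown is an assumption
> [
>  {riak_core, [
>    %% ... existing riak_core settings stay as they are, e.g.:
>    {ring_state_dir, "./data/ring"},
>
>    %% added by hand: defaults for buckets with no explicit properties
>    {default_bucket_props, [{n_val,1},
>      {allow_mult,false},
>      {last_write_wins,false},
>      {precommit, []},
>      {postcommit, []},
>      {chash_keyfun, {riak_core_util, chash_std_keyfun}}
>    ]}
>  ]}
>  %% other sections (riak_kv, ...) omitted from this sketch
> ].
> ```
>
>  The node has to be restarted before a change to app.config takes effect.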
>  (Hey Basho people, that stuff should be in the app.config file by
> default. Making people go fish for it and figure out how and where to add
> this stuff is kinda unnecessary. Here is an example of a great conf file
> with everything you can conf and a whole bunch of docs:
> https://github.com/antirez/redis/blob/unstable/redis.conf ).
>  If you want to performance test your app make your dev system as similar
> to your prod system as possible and knock it out.
> -Alexander Sicular
>  @siculars
>  On Oct 1, 2012, at 4:30 PM, Callixte Cauchois wrote:
> Thank you, but can you explain a bit more?
> I mean I understand why it is a bad thing with regards to reliability and
> in case of hardware issues. But does it also have an impact on the
> behaviour when the hardware is performing correctly and the load on the
> machines is the same?
> On Mon, Oct 1, 2012 at 1:25 PM, Alexander Sicular <siculars at gmail.com> wrote:
> Inline.
> -Alexander Sicular
> @siculars
> On Oct 1, 2012, at 3:23 PM, Callixte Cauchois wrote:
> > Hi there,
> >
> > so, I am currently evaluating Riak to see how it can fit in our
> > platform. To do so I have set up a cluster of 4 nodes on SmartOS, all of
> > them on the same physical box.
>  Mistake. Just stop here. Everything else doesn't matter. Do not put all
> your virtual machines (riak nodes) on one physical machine. Put em on
> different physical machines. Fix the config files and try again.
> > I then built a simple application in node.js that gets log events from
> > our production system through a RabbitMQ queue and stores them in my
> > cluster. I let Riak generate the ids, but I have added two secondary
> > indices to be able to retrieve more easily all the log events that belong
> > to a single session.
> > Everything was going fine: events come in at around 130 messages per
> > second and are easily ingested by Riak. When I stop it and then restart
> > it, there is a bit of an issue, as the events are read from the queue at
> > 1500 messages per second and the insertion times go up, so I need some
> > retries to actually store everything.
> > I wanted to tweak the LevelDB params to increase the throughput. To do
> > so, I first upgraded from 1.1.6 to 1.2.0. I chose what I thought was the
> > safest way: node by node, I have them leave the cluster, then I upgrade,
> > then join again. During the whole process I kept inserting.
> > It went quite well. But, when I ran some queries using 2i, it gave me
> > errors and I realized that on two of my four nodes, I had forgotten to put
> > back eLevelDB as the default engine. As soon as I ran this query,
> > everything went haywire: a lot of inserts failed, and some nodes were not
> > reachable using the ping URL.
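>
>  For reference, a sketch of the per-node setting in question, assuming the
> 1.2-era app.config layout. The backend is chosen in each node's own riak_kv
> section, so a node whose file is replaced during an upgrade falls back to
> the default (Bitcask), which does not support secondary indexes:
>
> ```erlang
> %% app.config on EVERY node (sketch; other riak_kv settings omitted)
> [
>  {riak_kv, [
>    %% 2i queries need the LevelDB backend; the shipped default is
>    %% riak_kv_bitcask_backend
>    {storage_backend, riak_kv_eleveldb_backend}
>  ]}
> ].
> ```
>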
> > I changed the default engine and restarted those nodes; nothing changed.
> > I tried to make them leave the cluster; after two days, they are still
> > leaving. riak-admin transfers reports that a lot of transfers need to
> > occur, but the system is stuck: the numbers there do not change.
> >
> > I guess I have done several things wrong. It is test data, so it doesn't
> > really matter if I lose data or if I have to restart from scratch, but I
> > want to understand what went wrong and how I could have fixed it. Or if I
> > can even recover from here now.
> >
> > Thank you.
> > C.
> > _______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com