Riak cluster-f#$%

Alexander Sicular siculars at gmail.com
Mon Oct 1 16:25:19 EDT 2012


-Alexander Sicular


On Oct 1, 2012, at 3:23 PM, Callixte Cauchois wrote:

> Hi there,
> so, I am currently evaluating Riak to see how it can fit in our platform. To do so I have set up a cluster of 4 nodes on SmartOS, all of them on the same physical box.

Mistake. Just stop here. Everything else doesn't matter. Do not put all your virtual machines (Riak nodes) on one physical machine. Put them on different physical machines. Fix the config files and try again.

> I then built a simple application in node.js that gets log events from our production system through a RabbitMQ queue and stores them in my cluster. I let Riak generate the IDs, but I have added two secondary indices so that I can more easily retrieve all the log events that belong to a single session.
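
For reference, the write path described above can be sketched against Riak's HTTP interface. This is a minimal illustration, not the poster's code: the bucket name, the index names (session_bin, timestamp_int) and the base URL are assumptions made up for the example. It only builds the request so the shape of the headers is visible; nothing is sent.

```python
import json
import urllib.parse
import urllib.request


def build_store_request(base_url, bucket, event, session_id, timestamp):
    """Build (but do not send) a POST that stores one log event with two 2i entries."""
    # POSTing to /buckets/<bucket>/keys with no key in the path lets Riak generate the key.
    url = f"{base_url}/buckets/{bucket}/keys"
    headers = {
        "Content-Type": "application/json",
        # Hypothetical index names; the _bin / _int suffix declares the index type.
        "X-Riak-Index-session_bin": session_id,
        "X-Riak-Index-timestamp_int": str(timestamp),
    }
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def session_query_url(base_url, bucket, session_id):
    """URL for an exact-match 2i lookup of every key tagged with this session."""
    value = urllib.parse.quote(session_id, safe="")
    return f"{base_url}/buckets/{bucket}/index/session_bin/{value}"
```

Note that a 2i lookup like this only works when every node runs a backend that supports secondary indices, which is what goes wrong later in this thread.
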
> Everything was going fine: events arrive at around 130 messages per second and are easily ingested by Riak. When I stop it and then restart it, there is a bit of an issue, as the events are read from the queue at 1500 messages per second and the insertion times go up, so I need some retries to actually store everything.
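
The retry behaviour mentioned above could look roughly like the following. It is a sketch under assumptions: store is a hypothetical callable that writes one event and raises on failure, and the attempt count and delays are arbitrary. Exponential backoff with jitter is one common choice, not necessarily what the original application does.

```python
import random
import time


def store_with_retry(store, event, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call store(event), retrying failed attempts with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return store(event)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The base delay doubles on each attempt; random jitter on top
            # keeps many retrying writers from hitting the cluster in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

Backing off slows the consumer down when the cluster is saturated, which matters most when a restart drains a backlog at ten times the normal rate.
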
> I wanted to tweak the LevelDB params to increase the throughput. To do so, I first upgraded from 1.1.6 to 1.2.0. I chose what I thought was the safest way: node by node, I had each node leave the cluster, upgraded it, then had it join again. During the whole process I kept inserting.
> It went quite well. But when I ran some queries using 2i, I got errors, and I realized that on two of my four nodes I had forgotten to put eLevelDB back as the default engine. As soon as I ran this query, everything went haywire: a lot of inserts failed, and some nodes were not reachable using the ping URL.
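
A mixed-backend cluster like this can be caught before a node rejoins. The sketch below scans the text of each node's app.config for the storage_backend setting and lists the nodes that are not on eLevelDB. The helper names and the node-name-to-text mapping are assumptions for illustration, and the comment stripping is naive: it treats any % as the start of an Erlang comment, including one inside a quoted string.

```python
import re

# Matches an Erlang tuple such as {storage_backend, riak_kv_eleveldb_backend}
BACKEND_RE = re.compile(r"\{\s*storage_backend\s*,\s*(\w+)\s*\}")


def configured_backend(app_config_text):
    """Return the storage_backend atom found in app.config text, or None."""
    # Drop everything after % on each line so a commented-out setting is ignored.
    uncommented = "\n".join(
        line.split("%", 1)[0] for line in app_config_text.splitlines()
    )
    match = BACKEND_RE.search(uncommented)
    return match.group(1) if match else None


def nodes_missing_eleveldb(configs):
    """Given {node_name: app.config text}, list nodes not set to eLevelDB."""
    return sorted(
        node
        for node, text in configs.items()
        if configured_backend(text) != "riak_kv_eleveldb_backend"
    )
```

Running a check of this kind after each package upgrade would flag a node whose config had reverted to another backend before any 2i query reached it.
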
> I changed the default engine and restarted those nodes, but nothing changed. I tried to make them leave the cluster; after two days, they are still leaving. riak-admin transfers reports that a lot of transfers still need to occur, but the system is stuck: the numbers there do not change.
> I guess I have done several things wrong. It is test data, so it doesn't really matter if I lose data or have to restart from scratch, but I want to understand what went wrong and how I could have fixed it, or whether I can even recover from here now.
> Thank you.
> C.
