speeding up riaksearch precommit indexing
John D. Rowell
me at jdrowell.com
Sat Jun 18 19:54:11 EDT 2011
The "real" queues like HornetQ and others can take care of this without a
single point of failure but it's a pain (in my opinion) to set them up that
way, and usually with all the cluster and failover features active they get
quite slow for writes.We use Redis for this because it's simpler and
lightweight. The problem is that there is no real clustering option for
Redis today, even thought there are some hacks that get close. When we
cannot afford a single point of failure or any downtime, we tend to use
MongoDB for simple queues. It has full cluster support and the performance
is pretty close to what you get with Redis in this use case.
OTOH you could keep it all Riak and setup a separate small cluster with a
RAM backend and use that as a queue, probably with similar performance. The
idea here is that you can scale these clusters (the "queue" and the indexed
production data) independently in response to your load patterns, and have
optimum hardware and I/O specs for the different cluster nodes.
2011/6/18 Les Mikesell <lesmikesell at gmail.com>
> Is there a good way to handle something like this with redundancy all the
> way through? On simple key/value items you could have two readers write the
> same things to riak and let bitcask cleanup eventually discard one, but with
> indexing you probably need to use some sort of failover approach up front.
> Do any of those queue managers handle that without adding their own single
> point of failure? Assuming there are unique identifiers in the items being
> written, you might use the CAS feature of redis to arbitrate writes into its
> queue, but what happens when the redis node fails?
> On 6/17/11 11:48 PM, John D. Rowell wrote:
>> Why not decouple the twitter stream processing from the indexing? More
>> likely you have a single process consuming the spritzer stream, so you can
>> the fetched results in a queue (hornetq, beanstalk, or even a simple Redis
>> queue) and then have workers pull from the queue and insert into Riak. You
>> run one worker per node and thus insert in parallel into all nodes. If you
>> free CPU (e.g. for searches), just throttle the workers to some sane
>> level. If
>> you see the queue getting bigger, add another Riak node (and thus another
>> 2011/6/13 Steve Webb <swebb at gnip.com <mailto:swebb at gnip.com>>
>> Ok, I've changed my two VMs to each have:
>> 3 CPUs, 1GB ram, 120GB disk
>> I'm ingesting the twitter spritzer stream (about 10-20 tweets per
>> approx 2k of data per tweet). One bucket is storing the non-indexed
>> in full. Another bucket is storing the indexed tweet string, id, date
>> username. A maximum of 20 clients can be hitting the 'cluster' at any
>> one time.
>> I'm using n_val=2 so there is replication going on behind the scenes.
>> I'm using a hardware load-balancer to distribute the work amongst the
>> nodes and now I'm seeing about 75% CPU usage as opposed to 100% on one
>> and 50% on the replicating-only node.
>> I've monitored the VM over the last few days and it seems to be mostly
>> CPU-bound. The disk I/O is low. The Network I/O is low.
>> Q: Can I change the pre-commit to a post-commit trigger or something
>> or will that make any difference at all? I'm ok if the tweet stuff
>> get indexed immediately and there's a slight lag in indexing if it
>> saves on CPU.
>> riak-users mailing list
>> riak-users at lists.basho.com
> riak-users mailing list
> riak-users at lists.basho.com
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the riak-users