speeding up riaksearch precommit indexing
lesmikesell at gmail.com
Sat Jun 18 13:08:27 EDT 2011
Is there a good way to handle something like this with redundancy all the way
through? On simple key/value items you could have two readers write the same
things to riak and let bitcask cleanup eventually discard one, but with indexing
you probably need to use some sort of failover approach up front. Do any of
those queue managers handle that without adding their own single point of
failure? Assuming there are unique identifiers in the items being written, you
might use the CAS feature of redis to arbitrate writes into its queue, but what
happens when the redis node fails?
On 6/17/11 11:48 PM, John D. Rowell wrote:
> Why not decouple the twitter stream processing from the indexing? More than
> likely you have a single process consuming the spritzer stream, so you can put
> the fetched results in a queue (hornetq, beanstalk, or even a simple Redis
> queue) and then have workers pull from the queue and insert into Riak. You could
> run one worker per node and thus insert in parallel into all nodes. If you need
> free CPU (e.g. for searches), just throttle the workers to some sane level. If
> you see the queue getting bigger, add another Riak node (and thus another local
> 2011/6/13 Steve Webb <swebb at gnip.com <mailto:swebb at gnip.com>>
> Ok, I've changed my two VMs to each have:
> 3 CPUs, 1GB ram, 120GB disk
> I'm ingesting the twitter spritzer stream (about 10-20 tweets per second,
> approx 2k of data per tweet). One bucket is storing the non-indexed tweets
> in full. Another bucket is storing the indexed tweet string, id, date and
> username. A maximum of 20 clients can be hitting the 'cluster' at any one time.
> I'm using n_val=2 so there is replication going on behind the scenes.
> I'm using a hardware load-balancer to distribute the work amongst the two
> nodes and now I'm seeing about 75% CPU usage as opposed to 100% on one node
> and 50% on the replicating-only node.
> I've monitored the VM over the last few days and it seems to be mostly
> CPU-bound. The disk I/O is low. The Network I/O is low.
> Q: Can I change the pre-commit to a post-commit trigger or something perhaps
> or will that make any difference at all? I'm ok if the tweet stuff doesn't
> get indexed immediately and there's a slight lag in indexing if it saves on CPU.
> riak-users mailing list
> riak-users at lists.basho.com
More information about the riak-users