Roberto Calero roberto_calero at hotmail.com
Wed Jan 25 10:12:27 EST 2012

From: jeremiah.peschka at gmail.com
Date: Wed, 25 Jan 2012 06:48:45 -0800
Subject: Re: Should Riak have used dedicated nodes for secondary indices?
To: runar.jordahl at gmail.com
CC: riak-users at lists.basho.com

Good news! Riak doesn't use sharding.

Data locality is critical in a distributed system. When you create an index, your structure looks something like:


Reading from an index requires locating indexed_value, finding all matching values, and then retrieving all matching record_ids. By keeping index data on the same node as the source data, Riak avoids having to remote the query to retrieve object data. This is a Good Thing. The network is slow and unreliable. Just ask an Australian.
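
To make the locality point concrete, here is a minimal sketch in Python of a node that keeps its index entries next to the objects they describe. This is not Riak's implementation; the `Node` class and its methods are hypothetical names made up for illustration, and only `indexed_value` and `record_id` come from the description above.

```python
from collections import defaultdict


class Node:
    """Hypothetical storage node: objects and their index entries live together."""

    def __init__(self):
        self.objects = {}               # record_id -> stored object
        self.index = defaultdict(set)   # indexed_value -> {record_id, ...}

    def put(self, record_id, obj, indexed_value):
        # Write the object and its index entry on the same node.
        # (Sketch only: re-indexing an updated record is not handled.)
        self.objects[record_id] = obj
        self.index[indexed_value].add(record_id)

    def query(self, indexed_value):
        # Locate indexed_value, collect the matching record_ids, then fetch
        # the objects locally: no network hop to another machine is needed.
        record_ids = sorted(self.index.get(indexed_value, ()))
        return [self.objects[rid] for rid in record_ids]


node = Node()
node.put("r1", {"name": "kettle"}, "red")
node.put("r2", {"name": "teapot"}, "red")
node.put("r3", {"name": "mug"}, "blue")
red_items = node.query("red")
```

In this sketch `red_items` holds the two "red" objects, and both the index lookup and the object fetch happened inside one `Node`.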

Riak's approach is intended to provide a uniform system where you can treat any node equally. The idea that there should be an unsharded index node is a bit ludicrous. Let's say you have 1TB of raw data. Your indexes are pretty light and are only about 20% of your data size. This means that you need 200GB of good storage (not some cheap $150 SATA HDD you found on NewEgg). 200GB of RAID 10 SAS storage isn't that pricey to put in a single unsharded machine. Over time as your data grows and your indexing changes, you may have 10TB and your index size is ~40% of your data. Your unsharded index server now has to have 4TB of fast, reliable storage. And, since this is an unsharded system, you'll want multiple replicas of your unsharded index server to make sure that a hardware hiccup doesn't take down your ability to perform fast lookups. Besides - a single indexing server becomes a single bottleneck and a single point of failure in your system.
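
The sizing arithmetic above can be sketched in a few lines of Python. The function name, the decimal convention of 1TB = 1000GB, and the replica count of 3 are assumptions for illustration only; the message says "multiple replicas" without giving a number.

```python
def index_storage_gb(raw_data_gb, index_percent, replicas=1):
    """Storage needed for a dedicated index tier, in GB (integer math)."""
    per_copy = raw_data_gb * index_percent // 100
    return per_copy * replicas


# 1TB of raw data, indexes at about 20% of the data size
small = index_storage_gb(1_000, 20)

# 10TB of raw data, indexes at about 40% of the data size
large = index_storage_gb(10_000, 40)

# Same 10TB case with an assumed 3 replicas of the index server
large_replicated = index_storage_gb(10_000, 40, replicas=3)

print(small, large, large_replicated)  # 200 4000 12000
```

A tenfold growth in data combined with heavier indexing turns 200GB into 4TB per copy, and every replica of the index server needs that much fast storage again.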

Most people using Lucene as their indexing store are sharding Lucene. From an anecdotal standpoint, about 70% of the people I've talked to using Lucene are getting to the point of sharding their replicated Lucene indexes.

I'm not saying that either approach is good or bad; just remember that every solution has drawbacks.
---
Jeremiah Peschka, SQL Server MVP
Managing Director, Brent Ozar PLF, LLC

On Wed, Jan 25, 2012 at 5:15 AM, Runar Jordahl <runar.jordahl at gmail.com> wrote:

Siddharth Anand says that secondary indices (for a key-value store)
are best placed on a separate node, avoiding the need to look up 1 / N
nodes during a query:

"Systems that shard data based on a primary key will do well when
routed by that key. When routed by a secondary key, the system will
need to “spray” a query across all shards. If one of the shards is
experiencing high latency, the system will return either no results or
incomplete (i.e. inconsistent) results. For this reason, it would make
sense to store the secondary index on an unsharded (but replicated)
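
The "spray" behaviour in the quote can be sketched as a simulated scatter-gather query in Python. Everything here is hypothetical: the `Shard` class, the `spray_query` function, and the latency numbers are invented for illustration, and latency is modelled as a plain number rather than a real network call.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    name: str
    latency_ms: int                             # simulated response time
    index: dict = field(default_factory=dict)   # secondary key -> record ids


def spray_query(shards, secondary_key, timeout_ms):
    """Ask every shard for a secondary key; keep answers that arrive in time."""
    results, timed_out = [], []
    for shard in shards:
        if shard.latency_ms > timeout_ms:
            # A slow shard contributes nothing, so the result set is incomplete.
            timed_out.append(shard.name)
            continue
        results.extend(shard.index.get(secondary_key, []))
    return sorted(results), timed_out


shards = [
    Shard("a", 5, {"red": ["r1"]}),
    Shard("b", 7, {"red": ["r2"], "blue": ["r3"]}),
    Shard("c", 900, {"red": ["r4"]}),
]
matches, missing = spray_query(shards, "red", timeout_ms=100)
print(matches, missing)  # ['r1', 'r2'] ['c']
```

Record `r4` is lost because shard `c` is slow, which is the incomplete-results case the quote describes.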



If I understand Riak correctly, it takes the opposite approach,
storing secondary indices together with the data.

To me it appears that Riak’s approach gives a more uniform system,
with all nodes having the same responsibilities. Does anyone else have
any thoughts on this?

Kind regards
Runar Jordahl



riak-users mailing list

riak-users at lists.basho.com

