Map Reduce Requirements

Jeremiah Peschka jeremiah.peschka at
Tue Aug 23 10:01:29 EDT 2011

On Aug 22, 2011, at 8:50 PM, bill robertson wrote:

> I wonder if it would be feasible to deploy an erlang web-service in the riak node's webmachine instance that could translate meta-data into Erlang funs and drive the map reduce operation that way. I'm not sure if I could get around having specific knowledge of the protobuf structures baked into that code, but I don't think it matters in this case.
> I also wonder how much 1.0 will change this picture.
> > Additionally, are secondary indexes meta-data? i.e. If I built some secondary indices, these are stored in some form internal to Riak, and therefore available for query regardless of the type of data they're associated with. Is this correct?
> Secondary indexes are a separate physical structure, or so I gather. (Rusty could be full of lies.) They're stored separately from the initial data and not as metadata in the object headers. So, yes, you can store whatever you want in secondary indexes and query it however you want, provided there's an API that supports what you're doing.
> Would secondary indexes eliminate the need for key-filtering? Logically, it would seem that you could do with indexes, but do they have similar performance characteristics?  (i.e. does one suck more than the other?)

Key filters will always perform a list-keys operation, meaning that they result in an in-memory scan of all keys in the key space.
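
To make that concrete, here's a rough sketch in Python. The first function builds a MapReduce job with a key filter in the JSON job format as I understand it; the "trades" bucket and the "NYSE:F:" prefix are made up for illustration. The second function is only a local stand-in for the scan (it is not Riak code) and shows that every key gets examined no matter how few match.

```python
import json

def key_filter_job(bucket, prefix):
    """Build a MapReduce job whose inputs are narrowed by a key filter."""
    return {
        "inputs": {
            "bucket": bucket,
            # Keep only keys that start with the given prefix.
            "key_filters": [["starts_with", prefix]],
        },
        "query": [
            # Built-in JavaScript map function that returns each value as JSON.
            {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}
        ],
    }

def simulate_key_filter(all_keys, prefix):
    """Local stand-in for the scan: every key is examined, matches are kept."""
    examined = 0
    matches = []
    for key in all_keys:
        examined += 1
        if key.startswith(prefix):
            matches.append(key)
    return matches, examined

# Hypothetical bucket and key prefix, for illustration only.
job = key_filter_job("trades", "NYSE:F:")
print(json.dumps(job, indent=2))
```

The filter is cheap to write, but the cost is proportional to the number of keys in the bucket, not the number of keys that match.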

I don't know exactly how indexes are implemented internally (reading the source is on my to-do list), so I can only guess from my experience with other databases how this would work. Indexes generally work best when you have a low search cardinality - when you're seeking only a few records from the index. As long as you can structure secondary indexes to answer the questions you're asking, indexes make it easy to perform fast queries.

The difference comes down to your storage mechanism. With bitcask, all keys are in memory, so that list-keys scan only happens between RAM and CPU and isn't THAT expensive an operation. If indexes are not a memory-resident structure, then a scan of an index (when you're doing a search that's some kind of substring or ends-with operation) will be painfully slow - much like when you have to perform a table scan in an RDBMS.
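
Here's a small Python sketch of the seek versus scan distinction. To be clear, this is a generic sorted-index illustration based on how other databases behave, not how Riak's secondary indexes are actually implemented; the function names and sample terms are made up.

```python
import bisect

def prefix_seek(sorted_terms, prefix):
    """Seek: binary-search to the first candidate, then walk until the prefix stops matching."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    examined = 0
    for term in sorted_terms[start:]:
        examined += 1
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
    return matches, examined

def suffix_scan(sorted_terms, suffix):
    """Scan: sort order says nothing about suffixes, so every term must be checked."""
    matches = [term for term in sorted_terms if term.endswith(suffix)]
    return matches, len(sorted_terms)

# Hypothetical index terms, for illustration only.
terms = sorted([
    "NYSE:F:20110819",
    "NYSE:F:20110822",
    "NYSE:GE:20110822",
    "NASDAQ:MSFT:20110822",
    "NASDAQ:AAPL:20110822",
])

print(prefix_seek(terms, "NYSE:F:"))    # examines only the neighbouring terms
print(suffix_scan(terms, ":20110822"))  # examines every term
```

With five terms the difference is trivial, but the seek examines only the matches plus one term, while the scan grows with the size of the index. If that index lives on disk, the scan is where the pain comes from.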

The upside of key filtering, and composite key names in general, is that you can create meaningful keys that you can assemble on the fly. For example, to get yesterday's trades of Ford stock on the NYSE (assuming you have a trades bucket), you could get at yesterday's trading history through something like http://my_riak_server:8091/riak/trades/NYSE:F:20110822. Being able to perform ad hoc seeks like that is really powerful.
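
Assembling a key like that on the fly takes only a couple of lines. A minimal sketch in Python, assuming the exchange:symbol:date layout from the example above; the helper names are mine, and the host and port are just the ones from the example URL.

```python
from datetime import date
from urllib.parse import quote

def trade_key(exchange, symbol, day):
    """Assemble a composite key of the form EXCHANGE:SYMBOL:YYYYMMDD."""
    return f"{exchange}:{symbol}:{day:%Y%m%d}"

def trade_url(base, bucket, key):
    """Build the object URL; colons stay unescaped so the key remains readable."""
    return f"{base}/riak/{quote(bucket, safe='')}/{quote(key, safe=':')}"

# Ford on the NYSE for 22 August 2011, matching the example above.
key = trade_key("NYSE", "F", date(2011, 8, 22))
url = trade_url("http://my_riak_server:8091", "trades", key)
print(url)
```

Because the key is derived entirely from things the caller already knows, fetching the object is a direct lookup with no scan and no index involved.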

TL;DR - key filters and secondary indexes serve different purposes.

> Thanks again,
> Bill Robertson

Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
Microsoft SQL Server MVP
