Map Reduce Requirements

Brian Rowe rowe at muxspace.com
Tue Aug 23 11:12:19 EDT 2011


I'm a little late to the party, but I've been handling marshaling
with an explicit map/reduce phase that does the marshaling and/or
data massaging. You can chain map phases together by
using the special bucket/key pair {none,none} and passing the
intermediate data via the KeyData. This also makes the phases more
portable if you wish to re-use them in other situations. I wrote a
blog post about chaining phases a while back, which might be useful:
http://cartesianfaith.wordpress.com/2011/07/27/mapreduce-tips-and-tricks-in-riak/
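
To make the chaining idea concrete, here is a minimal sketch in plain Python. It is only a toy simulation of the data flow, not the Riak client API or the code from the blog post: the names run_phases, fetch_phase, massage_phase and PLACEHOLDER, the sample record, and its "field=value" encoding are all hypothetical. Each phase returns a placeholder bucket/key pair plus the intermediate data, and the next phase reads only that KeyData.

```python
# Toy simulation of chained map phases; this is NOT the Riak client API.
# All names and the sample record below are hypothetical.
PLACEHOLDER = ("none", "none")  # stand-in for the {none,none} bucket/key pair


def fetch_phase(obj, key_data, arg):
    """First map phase: unmarshal the stored value and pass it on via KeyData."""
    record = dict(field.split("=") for field in obj.split(";"))
    return [(PLACEHOLDER, record)]


def massage_phase(obj, key_data, arg):
    """Second map phase: there is no stored object, so work only from key_data."""
    return [(PLACEHOLDER, {**key_data, "price": float(key_data["price"])})]


def run_phases(inputs, store, phases):
    """Feed each phase's (bucket_key, key_data) outputs into the next phase."""
    current = [(bucket_key, None) for bucket_key in inputs]
    for phase in phases:
        following = []
        for bucket_key, key_data in current:
            obj = store.get(bucket_key)  # None for the placeholder pair
            following.extend(phase(obj, key_data, None))
        current = following
    return [key_data for _, key_data in current]


store = {("trades", "NYSE:F:20110822"): "symbol=F;price=10.5"}
results = run_phases(
    [("trades", "NYSE:F:20110822")], store, [fetch_phase, massage_phase]
)
```

Because massage_phase never touches the stored object, it can be reused behind any phase that emits the same KeyData shape, which is the portability benefit mentioned above.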

HTH,
Brian


On Tue, Aug 23, 2011 at 10:01 AM, Jeremiah Peschka
<jeremiah.peschka at gmail.com> wrote:
> On Aug 22, 2011, at 8:50 PM, bill robertson wrote:
>
>> I wonder if it would be feasible to deploy an Erlang web service in the Riak node's Webmachine instance that could translate metadata into Erlang funs and drive the map/reduce operation that way. I'm not sure if I could get around having specific knowledge of the protobuf structures baked into that code, but I don't think it matters in this case.
>>
>> I also wonder how much 1.0 will change this picture.
>>
>> > Additionally, are secondary indexes metadata? That is, if I built some secondary indexes, they are stored in some form internal to Riak, and are therefore available for query regardless of the type of data they're associated with. Is this correct?
>>
>> Secondary indexes are a separate physical structure, or so I gather. (Rusty could be full of lies.) They're stored separately from the initial data and not as metadata in the object headers. So, yes, you can store whatever you want in secondary indexes and query it however you want, provided there's an API that supports what you're doing.
>>
>> Would secondary indexes eliminate the need for key filtering? Logically, it would seem that you could do the same thing with indexes, but do they have similar performance characteristics? (i.e. does one suck more than the other?)
>
> Key filters always perform a list-keys operation, meaning that they result in an in-memory scan of all keys in the key space.
>
> Not knowing entirely how indexes are implemented internally (reading the source is on my TO DO list), I can only guess from my experience with other databases how this would work. Indexes generally work best when you have a low search cardinality - when you're seeking only a few records from the index. As long as you can structure secondary indexes to answer the questions you're asking, then indexes make it easy to perform fast queries.
>
> The difference comes down to your storage mechanism. With bitcask, all keys are in memory, so that list-keys scan only happens between RAM and CPU and isn't THAT expensive an operation. If indexes are not a memory-resident structure, then a scan of an index (when your search is some kind of substring or ends-with operation) will be painfully slow, much like when you have to perform a table scan in an RDBMS.
>
> The upside of key filtering, and composite key names in general, is that you can create meaningful keys that you can assemble on the fly. For example, to get yesterday's trades of Ford stock on the NYSE (assuming you have a trades bucket), you could get at yesterday's trading history through something like <http://my_riak_server:8091/riak/trades/NYSE:F:20110822>. Being able to perform ad hoc seeks like that is really powerful.
>
> TL;DR - key filters and secondary indexes serve different purposes.
>
>>
>> Thanks again,
>> Bill Robertson
>
>
> ---
> Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
> Microsoft SQL Server MVP
>
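
The composite-key example quoted above (NYSE:F:20110822) and the point that a key filter has to visit every key can be sketched in plain Python. This is only an illustration under stated assumptions: a dict stands in for the trades bucket, the stored values are made up, and trade_key and key_filter_scan are hypothetical helpers, not Riak calls.

```python
from datetime import date


def trade_key(exchange, symbol, day):
    """Assemble a composite key such as NYSE:F:20110822 on the fly."""
    return f"{exchange}:{symbol}:{day:%Y%m%d}"


def key_filter_scan(keys, token_index, wanted):
    """Mimic a key filter: visit every key, split on ':', compare one token."""
    return [key for key in keys if key.split(":")[token_index] == wanted]


# A dict standing in for the trades bucket; the values are made-up placeholders.
bucket = {
    "NYSE:F:20110822": "ford trades, aug 22",
    "NYSE:F:20110819": "ford trades, aug 19",
    "NYSE:GE:20110822": "ge trades, aug 22",
}

# Direct seek: one lookup, because the full key can be built from known parts.
key = trade_key("NYSE", "F", date(2011, 8, 22))
direct_hit = bucket[key]

# Filter-style query: every key in the bucket is examined.
ford_keys = key_filter_scan(bucket, 1, "F")
```

The direct seek touches one key, while the filter-style query walks the whole key space, which is why its cost depends so heavily on whether the keys live in memory.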



