Looping in MR and multiple input phases
fearsome.lucidity at gmail.com
Sat Nov 12 16:45:05 EST 2011
I figure the following is not currently possible in Riak, so I'd like to
propose them as potential features.
We are processing millions of documents with Riak and we need to very
quickly find that ones we want. One problem we've had is that some of the
terms we want to search on are either very low cardinality (such as
timestamps), or very high cardinality (e.g. they match 50% of documents).
This is compounded by the very low IO performance of EC2. We've deal with
this by creating the equivalent of compound indexes by adding fields to our
documents that combine two or more fields we wanted to index and that we
could query together. We've also performed some quantization of things
like timestamps. This has allows us bring the cardinality terms to
a reasonable level and spread our the values among terms more evenly. We
are using Riak search for this, as secondary indexes does not currently
support compound queries.
The downside of this approach is that if we want to answer a question such
as, more recent documents from some point in time, we have to loop over
these terms trying to find one with some data. That can potentially
involve a lot of round trips between the client and the server.
Ideally we'd like to be able to write a MR job that takes some input, and
if the starting input was not sufficient to produce a result, have it
restart at the beginning of the job with a new set of inputs it
generates programmatically. This can be somewhat emulated by having
repeating groups chains of map-reduce phases, so if an early group did not
find what it was looking for, the last reduce phase group can output a new
set of bucket-key pairs as input for the next group of map-reduce phases.
The problem with this is that is not efficient (the first group could have
found what it needed, and the other phases becomes superflous), you may not
know head of time how many of this groups to chain, and most problematic
for us, we may not know what bucket-key pairs to pass to the next group.
Thus, you need looping, but also a way to specify a new input, not just
with bucket-key pairs, but as search or secondary indexes query.
I suppose the last problem can be solved if we where using Erlang for our
phases, as we could then use the Erlang search client to generate a new set
of keys based on a new search query. Maybe the a search API could be made
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the riak-users