Fwd: Secondary Indexes - Feedback?

Gordon Tillman gtillman at mezeo.com
Thu Nov 17 09:45:14 EST 2011


I forgot to CC the mailing list with this response.

--g

From: Gordon Tillman <gtillman at mezeo.com>
Subject: Re: Secondary Indexes - Feedback?
Date: November 16, 2011 14:55:00 CST
To: Rusty Klophaus <rusty at basho.com>

On Nov 16, 2011, at 13:53, Rusty Klophaus wrote:

> Hi Gordon,
> 
> Thanks for your feedback! Some follow up questions below:
> 
> For example, with search I can specify something like this to generate my input to map-reduce:
> 
> p:foo AND t:bar  (give me all the objects whose parent "p" is foo and that have tag "t" of bar).
> 
> So that would get fed to map-reduce where additional processing (think filtering, sorting, pagination) is done.
> 
> I can do the same thing with secondary indexes but would have to move some of that into the map phase.
> 
> So in this case I would use secondary indexes to grab all of the items whose parent "p" is "foo".  This would generate the input phase and at that point I would have to use map to filter out all of the items that did not contain the tag "t"  of "bar".
> 
> It is doable, but not as performant as I think it could be.
> 
> So to be clear, this is mainly about performance, not convenience? In other words, you don't mind writing your own map function, so long as it is fast?

That is correct, I don't mind doing that at all.  We already have a bunch of M/R code, and it's all in Erlang, so it is pretty fast.  Here is where my comment about speed came from.

Assume theoretical objects that have these fields: parent, tag, date, data, stored as JSON.  Our goal is to retrieve all objects where parent="foo", tag="bar", and date<20111116 in an M/R job.

(1) We could use "input: bucket" (a full key listing) and do all of the filtering in a map phase.

(2) Conversely, if we are using search to index our data, we could use search as the input phase:

"parent:foo AND tag:bar AND date:[00000000 TO 20111115]

and you are pretty much done.

Everything else is somewhere in between.  So when using secondary indexes we can pick one of those three fields to generate the input phase (say parent_bin=foo) and do the rest of the filtering in a map phase.

I am operating on the assumption that option (1) is the slowest and option (2) is the fastest, so the solution using secondary indexes would fall somewhere in between.  I am probably oversimplifying, but that is what motivated my remark about speed.
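
To make that concrete, here is roughly what the three input shapes look like from the Erlang PB client.  The bucket name, Pid, and the obj_mr:map_filter module/function are just placeholders, and I am writing the input tuples from memory, so double-check them against the client docs:

%% (1) Full-bucket input: every key in the bucket is listed and fed to the map phase.
{ok, R1} = riakc_pb_socket:mapred(Pid, <<"objects">>,
    [{map, {modfun, obj_mr, map_filter}, none, true}]).

%% (2) Search input: riak_search narrows the object set before M/R ever runs.
{ok, R2} = riakc_pb_socket:mapred(Pid,
    {modfun, riak_search, mapred_search,
        [<<"objects">>,
         <<"parent:foo AND tag:bar AND date:[00000000 TO 20111115]">>]},
    [{map, {modfun, obj_mr, map_filter}, none, true}]).

%% (3) Secondary-index input: narrow on one field (parent_bin) and let the
%% map phase filter on tag and date.
{ok, R3} = riakc_pb_socket:mapred(Pid,
    {index, <<"objects">>, <<"parent_bin">>, <<"foo">>},
    [{map, {modfun, obj_mr, map_filter}, none, true}]).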


> Also, let's say that part 1 of the query is getting a list of keys where "p" == "foo", part 2 is turning those keys into objects, and part 3 is filtering those objects. Are all parts too slow for your application, or is only a specific part of the query too slow?
> 
> Hope that makes sense, this is a nuanced point.

In the example above I would combine parts 2 and 3 into one map phase.  I would extract a JSON representation of each object with something like this (the assumption here, of course, is that allow_mult = false for the bucket in question):

%% Return the decoded JSON term for a Riak object, or null if the object
%% is missing or is not stored as application/json.
get_json({error, notfound}) ->
    null;
get_json(RiakObject) ->
    ObjMD = riak_object:get_metadata(RiakObject),
    case dict:find(<<"content-type">>, ObjMD) of
        {ok, "application/json"} ->
            mochijson2:decode(riak_object:get_value(RiakObject));
        _ ->
            null
    end.

I would then check to see if the object meets all of the filter criteria.  If not, return []; otherwise return whatever subset of the JSON data is required.  Since there is no other map phase following this one, I don't have to return [[bucket, key]]; I can just return the data.
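
For completeness, the map phase I have in mind would be roughly the following.  It leans on the get_json helper above; map_filter and the hard-coded field values are just for illustration, and it assumes date is stored as a number:

%% Map phase: decode the object, apply the remaining filter criteria, and
%% return the data directly since no other map phase follows this one.
map_filter(RiakObject, _KeyData, _Arg) ->
    case get_json(RiakObject) of
        null ->
            [];
        {struct, Props} ->
            Parent = proplists:get_value(<<"parent">>, Props),
            Tag    = proplists:get_value(<<"tag">>, Props),
            Date   = proplists:get_value(<<"date">>, Props),
            case Parent =:= <<"foo">> andalso
                 Tag =:= <<"bar">> andalso
                 is_integer(Date) andalso Date < 20111116 of
                true  -> [proplists:get_value(<<"data">>, Props)];
                false -> []
            end;
        _Other ->
            []
    end.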

So really, the only thing that might slow this down is that the initial set of objects generated during the input phase would be larger with secondary indexes than with search.

> 
> -- 
> Rusty Klophaus (@rustyio)
> Basho Technologies, Inc.
> www.basho.com
> 

Honestly, Rusty, I don't think that is the biggest performance issue I'm worried about.  I'm really interested in being able to implement distributed reduce phases (specifically to do a partial sort) and then have that output handled by a final reduce phase that could perform an efficient merge sort and stream results back to the client.  That would be really cool!
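
Just to sketch the shape of that final phase (assuming each partial-sort phase emits its chunk as a single sorted list), the merge itself is nearly free in Erlang:

%% Final reduce: every input is a sorted list produced by an upstream
%% partial sort, so the whole phase is just an N-way merge.  Returning the
%% merged list wrapped in a list keeps re-reduce safe.
merge_sorted(SortedChunks, _Arg) ->
    [lists:merge(SortedChunks)].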

--g
