Riak 2i http query much faster than python api?

Jeff Peck jeffp at tnrglobal.com
Wed Apr 10 22:49:19 EDT 2013


> Out of curiosity, how are you planning on segmenting the data?
> 

My plan to segment the data would be to have a secondary index on a key called seg_id (or something similar).

When I add an object to Riak, I will set seg_id to be the first three characters of the md5 of the object's key, which should yield an even distribution.

Then, when querying the data, I will run map-reduce against each segment (so for 3 hexadecimal characters, it would be 4,096 map-reduce queries).
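
On the write side, something like this is what I have in mind (just a sketch against the 1.x Python client; store_with_segment is a name I made up):

import hashlib
import riak

client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)
bucket = client.bucket("mybucket")

# Tag each object with the first three hex characters of the md5 of
# its key, splitting the keyspace into 16^3 = 4,096 even segments.
def store_with_segment(key, data):
    seg_id = hashlib.md5(key).hexdigest()[:3]  # e.g. "aaa", "2f4", "5ee"
    obj = bucket.new(key, data=data)
    obj.add_index("seg_id_bin", seg_id)
    obj.store()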

The inputs part of the query would look like this:

"inputs":{
       "bucket":"mybucket",
       "index":"seg_id_bin",
       "key":"aaa"
    }

I would run the map-reduce queries in parallel.
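
Roughly like this (again just a sketch; mr_query is a made-up helper, the map phase is a stand-in, and I have not tuned the pool size):

import itertools
import json
import urllib2
from multiprocessing.dummy import Pool  # thread pool, not processes

# All 4,096 three-character hex segments: "000" .. "fff".
SEGMENTS = ["".join(c) for c in itertools.product("0123456789abcdef", repeat=3)]

def mr_query(seg):
    # POST one map-reduce job per segment to Riak's HTTP endpoint.
    job = {
        "inputs": {"bucket": "mybucket", "index": "seg_id_bin", "key": seg},
        "query": [{"map": {"language": "javascript",
                           "name": "Riak.mapValuesJson"}}],
    }
    req = urllib2.Request("http://localhost:8098/mapred",
                          json.dumps(job),
                          {"Content-Type": "application/json"})
    return urllib2.urlopen(req).read()

pool = Pool(16)  # one worker per core
results = pool.map(mr_query, SEGMENTS)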

It sounds like a lot of work just to get the value of one field, which makes me think that there is a better way. Plus, I do not know whether this will actually run as fast as I expect, which is why I am asking here before implementing it.

> Also, how are you setting up your servers? Single nodes? Multiple nodes?
> 

I am using the default Riak installation (with LevelDB as the backend and search turned on). I am on a 16-core 3 GHz node with 20 GB of memory; however, it appears that Riak is not using all of the resources available to it. I suspect that this can be resolved by modifying the configuration.

That said, if you, or anyone reading this, could suggest a configuration better suited to performing a relatively small batch operation across 900k (and soon about 5 million) objects, that would be greatly appreciated.

Thanks!

- Jeff


On Apr 10, 2013, at 10:32 PM, Shuhao Wu <shuhao at shuhaowu.com> wrote:

> Out of curiosity, how are you planning on segmenting the data? Map-reduce will execute over the entire data set.
> 
> Also, how are you setting up your servers? Single nodes? Multiple nodes?
> 
> Shuhao
> Sent from my phone.
> 
> On 2013-04-10 10:25 PM, "Jeff Peck" <jeffp at tnrglobal.com> wrote:
> As a follow-up to this thread and my thread from earlier today, I am basically looking for a simple way to extract the value of a single field (which happens to be indexed) from approximately 900,000 documents. I have been trying many options, including a map-reduce function that executes entirely over HTTP (taking any Python client bottlenecks out of the picture). I let that run for over an hour before I stopped it; it did not return any output.
> 
> I have also tried grabbing a list of the 900k keys from a secondary index (very fast, about 11 seconds) and then fetching each key in parallel (using curl and GNU parallel). That was also too slow to be feasible.
> 
> Is there something basic that I am missing?
> 
> One idea that I thought of was to have a secondary index that is intended to split all of my data into segments. I would use the first three characters of the md5 of the document's key in hexadecimal format. So, the index would contain strings like "ae1", "2f4", "5ee", etc. Then, I can run my map-reduce query against *each* segment individually, and possibly even in parallel.
> 
> I have observed that map-reduce is very fast with small sets of data (e.g. 5,000 objects), but with 900,000 objects its running time does not appear to scale proportionally. So, the idea is to divide the data into segments that map-reduce can handle better.
> 
> Before I implement this, I want to ask: Does this seem like the appropriate way to handle this type of operation? And, is there any better way to do this in the current version of Riak?
> 
> 
> On Apr 10, 2013, at 6:10 PM, Shuhao Wu <shuhao at shuhaowu.com> wrote:
> 
>> There are some inefficiencies in the Python client... I've been profiling it recently and found that the Python client occasionally takes longer even when you're on the same machine.
>> 
>> Perhaps Sean could comment?
>> 
>> Shuhao
>> Sent from my phone.
>> 
>> On 2013-04-10 4:04 PM, "Jeff Peck" <jeffp at tnrglobal.com> wrote:
>> Thanks Evan. I tried doing it in Python like this (realizing that my previous approach used MapReduce) and had better results. It finished in 3.5 minutes, but that is still nowhere close to the 15 seconds from the straight HTTP query:
>> 
>> import riak
>> 
>> bucket_name = "mybucket"
>> 
>> # Connect over protocol buffers (port 8087) instead of HTTP.
>> client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)
>> bucket = client.bucket(bucket_name)
>> 
>> # Fetch every key whose status_bin secondary index is 'PERSISTED'.
>> results = bucket.get_index('status_bin', 'PERSISTED')
>> 
>> print len(results)
>> 
>> 
>> On Apr 10, 2013, at 4:00 PM, Evan Vigil-McClanahan <emcclanahan at basho.com> wrote:
>> 
>> > get_index() is the right function there, I think.
>> >
>> > On Wed, Apr 10, 2013 at 2:53 PM, Jeff Peck <jeffp at tnrglobal.com> wrote:
>> >> I can grab over 900,000 keys from an index using an HTTP query in about 15 seconds, whereas the same operation in Python times out after 5 minutes. Does this indicate that I am using the Python API incorrectly? Should I be relying on an HTTP request when I need to grab this many keys?
>> >>
>> >> (Note: This is tied to the question that I asked earlier, but is also a general question to help understand the proper usage of the python API.)
>> >>
>> >> Thanks! Examples are below.
>> >>
>> >> - Jeff
>> >>
>> >> ---
>> >>
>> >> HTTP:
>> >>
>> >> $ time curl -s http://localhost:8098/buckets/mybucket/index/status_bin/PERSISTED | grep -o , | wc -l
>> >> 926047
>> >>
>> >> real    0m14.583s
>> >> user    0m2.500s
>> >> sys     0m0.270s
>> >>
>> >> ---
>> >>
>> >> Python:
>> >>
>> >> import riak
>> >>
>> >> bucket = "mybucket"
>> >>
>> >> # HTTP transport (port 8098); query the same secondary index with
>> >> # a 5-minute timeout, given in milliseconds.
>> >> client = riak.RiakClient(port=8098)
>> >> results = client.index(bucket, 'status_bin', 'PERSISTED').run(timeout=5*60*1000)
>> >> print len(results)
>> >>
>> >> (times out after 5 minutes)
>> 
>> 
> 
