Riak 2i http query much faster than python api?

Shuhao shuhao at shuhaowu.com
Thu Apr 11 12:47:57 EDT 2013


Evan: Who is working on the streaming interface?

Also, it seems that the python API is a bit slow on its own; from my 
profiling, a lot of the time is actually spent inside the client rather 
than waiting on Riak.

Shuhao

On 13-04-11 10:18 AM, Evan Vigil-McClanahan wrote:
> Have you tried streaming the results?  Sometimes the
> disproportionately slow responses from large mapreduce jobs are due to
> swapping and memory pressure.  Streaming should lower the amount of
> memory used, potentially allowing you to do this in a reasonable
> amount of time.  The downside here is that streaming currently isn't
> supported in the python client (although a new version that does
> should be out sometime soon-ish).
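>
> One stopgap until then: the mapreduce HTTP endpoint can already stream
> on its own if you add chunked=true to the /mapred URL, so you can
> drive it from python without the client. A minimal, untested sketch
> using the requests library (the job body and the "status" field name
> are placeholders, not something from your setup):
>
> import json
> import requests
>
> # 2i input feeding an inline javascript map phase that pulls out one
> # field. chunked=true asks Riak to send results back as they are
> # produced instead of buffering the whole result set in memory.
> job = {
>     "inputs": {"bucket": "mybucket",
>                "index": "status_bin",
>                "key": "PERSISTED"},
>     "query": [{"map": {"language": "javascript",
>                        "source":
>         "function(v){ return [JSON.parse(v.values[0].data).status]; }"}}],
> }
>
> resp = requests.post("http://localhost:8098/mapred?chunked=true",
>                      data=json.dumps(job),
>                      headers={"Content-Type": "application/json"},
>                      stream=True)
> for chunk in resp.iter_content(chunk_size=None):
>     print chunk  # each chunk is a multipart/mixed part with partial results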
>
> On Wed, Apr 10, 2013 at 9:25 PM, Jeff Peck <jeffp at tnrglobal.com> wrote:
>> As a follow-up to this thread and my thread from earlier today, I am
>> basically looking for a simple way to extract the value of a single field
>> (which happens to be indexed) from approximately 900,000 documents. I have
>> tried many options, including a map-reduce job that executes entirely over
>> http (taking any python client bottlenecks out of the picture). I let that
>> run for over an hour before I stopped it, and it never returned any output.
>>
>> I have also tried grabbing the list of the 900k keys from a secondary index
>> (very fast, about 11 seconds) and then fetching each key in parallel
>> (using curl and gnu parallel), but that was also too slow to be feasible.
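>>
>> For reference, the fan-out I tried looks roughly like this in python
>> instead of curl/parallel (untested sketch; assumes the requests
>> library and the futures backport on python 2, that keys holds the
>> list from the 2i query, and that each document is JSON with a
>> "status" field, which is a placeholder name):
>>
>> import requests
>> from concurrent.futures import ThreadPoolExecutor
>>
>> URL = "http://localhost:8098/buckets/mybucket/keys/%s"
>>
>> def fetch(key):
>>     # One GET per key: with 900k keys the per-request overhead
>>     # dominates, no matter how many workers run in parallel.
>>     return requests.get(URL % key).json().get("status")
>>
>> with ThreadPoolExecutor(max_workers=64) as pool:
>>     values = list(pool.map(fetch, keys))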
>>
>> Is there something basic that I am missing?
>>
>> One idea that I thought of was to have a secondary index that is intended to
>> split all of my data into segments. I would use the first three characters
>> of the md5 of the document's key, in hexadecimal format, so the index would
>> contain strings like "ae1", "2f4", "5ee", etc. Then I could run my map-reduce
>> query against *each* segment individually, and possibly even in parallel.
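>>
>> Concretely, the segment value would be computed like this (minimal
>> sketch; segment_bin is a made-up index name, and three hex characters
>> give 16**3 = 4096 roughly even segments):
>>
>> import hashlib
>>
>> def segment(key):
>>     # First three hex characters of md5(key); md5 spreads keys close
>>     # to uniformly across the 4096 segments.
>>     return hashlib.md5(key).hexdigest()[:3]
>>
>> # At write time each object would get the extra index entry, e.g.:
>> # obj.add_index("segment_bin", segment(obj.key))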
>>
>> I have observed that map-reduce is very fast with small sets of data (e.g.
>> 5,000 objects), but with 900,000 objects its running time does not scale
>> anywhere near proportionately. So the idea is to divide the data into
>> segments that map-reduce can handle more comfortably, as sketched below.
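>>
>> The per-segment pass would then look something like this (untested
>> sketch, reusing the client.index(...).run() call from my earlier
>> message, quoted below; serial here for clarity, but each segment
>> could go to its own worker):
>>
>> results = []
>> for i in range(4096):
>>     seg = "%03x" % i  # "000" through "fff"
>>     results.extend(client.index("mybucket", "segment_bin", seg).run())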
>>
>> Before I implement this, I want to ask: Does this seem like the appropriate
>> way to handle this type of operation? And, is there any better way to do
>> this in the current version of Riak?
>>
>>
>> On Apr 10, 2013, at 6:10 PM, Shuhao Wu <shuhao at shuhaowu.com> wrote:
>>
>> There are some inefficiencies in the python client... I've been profiling it
>> recently and found that the client occasionally takes longer than the
>> equivalent raw http query, even when you're on the same machine.
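>>
>> (For the curious, this is the sort of measurement I mean; a minimal
>> sketch with the standard cProfile module, run against the same 2i
>> lookup as before:)
>>
>> import cProfile
>> import riak
>>
>> client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)
>> bucket = client.bucket("mybucket")
>>
>> # Sorting by cumulative time shows whether the transport, response
>> # decoding, or object construction eats the time inside the client.
>> cProfile.run("bucket.get_index('status_bin', 'PERSISTED')",
>>              sort="cumulative")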
>>
>> Perhaps Sean could comment?
>>
>> Shuhao
>> Sent from my phone.
>>
>> On 2013-04-10 4:04 PM, "Jeff Peck" <jeffp at tnrglobal.com> wrote:
>>>
>>> Thanks Evan. I tried doing it in python like this (realizing that the
>>> way I did it previously used MapReduce) and got better results. It finished
>>> in 3.5 minutes, which is much better but still nowhere close to the 15
>>> seconds of the straight http query:
>>>
>>> import riak
>>>
>>> bucket_name = "mybucket"
>>>
>>> # Protocol Buffers transport on its default port, 8087.
>>> client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)
>>> bucket = client.bucket(bucket_name)
>>>
>>> # Plain 2i lookup: returns the list of matching keys.
>>> results = bucket.get_index('status_bin', 'PERSISTED')
>>>
>>> print len(results)
>>>
>>>
>>> On Apr 10, 2013, at 4:00 PM, Evan Vigil-McClanahan <emcclanahan at basho.com>
>>> wrote:
>>>
>>>> get_index() is the right function there, I think.
>>>>
>>>> On Wed, Apr 10, 2013 at 2:53 PM, Jeff Peck <jeffp at tnrglobal.com> wrote:
>>>>> I can grab over 900,000 keys from an index using an http query in
>>>>> about 15 seconds, whereas the same operation in python times out after 5
>>>>> minutes. Does this indicate that I am using the python API incorrectly?
>>>>> Should I be falling back to a raw http request when I need to grab this
>>>>> many keys?
>>>>>
>>>>> (Note: This is tied to the question that I asked earlier, but is also a
>>>>> general question to help understand the proper usage of the python API.)
>>>>>
>>>>> Thanks! Examples are below.
>>>>>
>>>>> - Jeff
>>>>>
>>>>> ---
>>>>>
>>>>> HTTP:
>>>>>
>>>>> $ time curl -s http://localhost:8098/buckets/mybucket/index/status_bin/PERSISTED \
>>>>>     | grep -o , | wc -l
>>>>> 926047
>>>>>
>>>>> real    0m14.583s
>>>>> user    0m2.500s
>>>>> sys     0m0.270s
>>>>>
>>>>> ---
>>>>>
>>>>> Python:
>>>>>
>>>>> import riak
>>>>>
>>>>> bucket = "mybucket"
>>>>>
>>>>> # Default http transport on port 8098.
>>>>> client = riak.RiakClient(port=8098)
>>>>>
>>>>> # 5 minute timeout, in milliseconds.
>>>>> results = client.index(bucket, 'status_bin', 'PERSISTED').run(timeout=5*60*1000)
>>>>> print len(results)
>>>>>
>>>>> (times out after 5 minutes)
>>>
>>>
>>
>>
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>



