Riak 2i http query much faster than python api?

Evan Vigil-McClanahan emcclanahan at basho.com
Thu Apr 11 12:51:49 EDT 2013


Sean wrote it, it's already in master, iirc.

That wouldn't surprise me, honestly, re performance.  It's something
we've been meaning to look into, just to make sure that nothing is
extravagantly horrible, but I don't think that any systematic effort
has started as yet.  Have you managed to replicate and profile Jeff's
case?
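
If it helps, something like this minimal sketch is what I had in mind
(assuming a local node and the bucket/index names from Jeff's examples):

import cProfile
import pstats

import riak

# Profile a single 2i query through the Python client.
client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)
bucket = client.bucket("mybucket")

cProfile.run("bucket.get_index('status_bin', 'PERSISTED')", "2i.prof")
pstats.Stats("2i.prof").sort_stats("cumulative").print_stats(20)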

On Thu, Apr 11, 2013 at 11:47 AM, Shuhao <shuhao at shuhaowu.com> wrote:
> Evan: Who is working on the streaming interface?
>
> Also, it seems that the python API is a bit slow on its own and a lot of
> time is actually spent in the client as well.
>
> Shuhao
>
>
> On 13-04-11 10:18 AM, Evan Vigil-McClanahan wrote:
>>
>> Have you tried streaming the results?  Sometimes the
>> disproportionately slow responses from large mapreduce jobs are due to
>> swapping and memory pressure.   Streaming should lower the amount of
>> memory used, potentially allowing you to do this in a reasonable
>> amount of time.  The downside here is that streaming isn't currently
>> supported in the Python client (although a new version that does
>> support it should be out soon-ish).
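>>
>> To give a rough idea, a streaming 2i count in the new client would look
>> something like the sketch below; stream_index() is my guess at the
>> upcoming API, not something in the current release:
>>
>> import riak
>>
>> client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)
>> bucket = client.bucket('mybucket')
>>
>> count = 0
>> for keys in bucket.stream_index('status_bin', 'PERSISTED'):
>>     # assumed to yield batches of matching keys rather than one huge list
>>     count += len(keys)
>> print count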
>>
>> On Wed, Apr 10, 2013 at 9:25 PM, Jeff Peck <jeffp at tnrglobal.com> wrote:
>>>
>>> As a follow-up to this thread and my thread from earlier today: I am
>>> basically looking for a simple way to extract the value of a single
>>> field (which happens to be indexed) from approximately 900,000
>>> documents. I have tried many options, including a map-reduce function
>>> that executes entirely over HTTP (taking any Python client bottlenecks
>>> out of the picture). I let that run for over an hour before stopping
>>> it; it never returned any output.
>>>
>>> I also tried grabbing a list of the 900k keys from a secondary index
>>> (very fast, about 11 seconds) and then fetching each key in parallel
>>> (using curl and GNU parallel). That was also too slow to be feasible.
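>>>
>>> In Python terms, what I tried is roughly equivalent to this sketch
>>> (the pool size and exact URLs here are approximations, not what I
>>> actually ran):
>>>
>>> import riak
>>> import urllib2
>>> from multiprocessing.dummy import Pool  # thread pool from the stdlib
>>>
>>> # Grab the keys with the Python client, then fetch each object over
>>> # HTTP from a thread pool.
>>> client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)
>>> keys = client.bucket('mybucket').get_index('status_bin', 'PERSISTED')
>>>
>>> def fetch(key):
>>>     url = 'http://localhost:8098/buckets/mybucket/keys/%s' % key
>>>     return urllib2.urlopen(url).read()
>>>
>>> pool = Pool(32)                 # pool size is a guess
>>> values = pool.map(fetch, keys)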
>>>
>>> Is there something basic that I am missing?
>>>
>>> One idea that I thought of was to have a secondary index that is
>>> intended to split all of my data into segments. I would use the first
>>> three characters of the md5 of the document's key in hexadecimal
>>> format, so the index would contain strings like "ae1", "2f4", "5ee",
>>> etc. Then I could run my map-reduce query against *each* segment
>>> individually, and possibly even in parallel.
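>>>
>>> At write time, tagging each object with its segment would look roughly
>>> like this (the index name 'segment_bin' and the helper are just
>>> illustrations; I haven't written this yet):
>>>
>>> import hashlib
>>>
>>> import riak
>>>
>>> client = riak.RiakClient(port=8087, transport_class=riak.RiakPbcTransport)
>>> bucket = client.bucket('mybucket')
>>>
>>> def store_with_segment(key, data):
>>>     # First three hex chars of the md5 of the key, e.g. "ae1", so a
>>>     # later map-reduce can run over one segment (of 4096) at a time.
>>>     segment = hashlib.md5(key).hexdigest()[:3]
>>>     obj = bucket.new(key, data=data)
>>>     obj.add_index('segment_bin', segment)
>>>     obj.store()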
>>>
>>> I have observed that map-reduce is very fast with small sets of data
>>> (e.g. 5,000 objects), but with 900,000 objects its runtime does not
>>> appear to scale proportionately. So the idea is to divide the data
>>> into segments that map-reduce can handle more easily.
>>>
>>> Before I implement this, I want to ask: Does this seem like the
>>> appropriate
>>> way to handle this type of operation? And, is there any better way to do
>>> this in the current version of Riak?
>>>
>>>
>>> On Apr 10, 2013, at 6:10 PM, Shuhao Wu <shuhao at shuhaowu.com> wrote:
>>>
>>> There are some inefficiencies in the python client... I've been
>>> profiling it recently and found that the Python client itself
>>> occasionally takes noticeably longer, even when you're running on the
>>> same machine.
>>>
>>> Perhaps Sean could comment?
>>>
>>> Shuhao
>>> Sent from my phone.
>>>
>>> On 2013-04-10 4:04 PM, "Jeff Peck" <jeffp at tnrglobal.com> wrote:
>>>>
>>>>
>>>> Thanks Evan. I tried doing it in Python like this (realizing that the
>>>> previous way I did it goes through MapReduce) and had better results.
>>>> It finished in 3.5 minutes, but that is still nowhere close to the 15
>>>> seconds of the straight HTTP query:
>>>>
>>>> import riak
>>>> from pprint import pprint
>>>>
>>>> bucket_name = "mybucket"
>>>>
>>>> client = riak.RiakClient(port=8087,
>>>>                          transport_class=riak.RiakPbcTransport)
>>>> bucket = client.bucket(bucket_name)
>>>> results = bucket.get_index('status_bin', 'PERSISTED')
>>>>
>>>> print len(results)
>>>>
>>>>
>>>> On Apr 10, 2013, at 4:00 PM, Evan Vigil-McClanahan
>>>> <emcclanahan at basho.com>
>>>> wrote:
>>>>
>>>>> get_index() is the right function there, I think.
>>>>>
>>>>> On Wed, Apr 10, 2013 at 2:53 PM, Jeff Peck <jeffp at tnrglobal.com> wrote:
>>>>>>
>>>>>> I can grab over 900,000 keys from an index using an HTTP query in
>>>>>> about 15 seconds, whereas the same operation in Python times out
>>>>>> after 5 minutes. Does this indicate that I am using the Python API
>>>>>> incorrectly? Should I be relying on a raw HTTP request when I need
>>>>>> to grab this many keys?
>>>>>>
>>>>>> (Note: This is tied to the question that I asked earlier, but it is
>>>>>> also a general question about the proper usage of the Python API.)
>>>>>>
>>>>>> Thanks! Examples are below.
>>>>>>
>>>>>> - Jeff
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> HTTP:
>>>>>>
>>>>>> $ time curl -s http://localhost:8098/buckets/mybucket/index/status_bin/PERSISTED | grep -o , | wc -l
>>>>>> 926047
>>>>>>
>>>>>> real    0m14.583s
>>>>>> user    0m2.500s
>>>>>> sys     0m0.270s
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Python:
>>>>>>
>>>>>> import riak
>>>>>>
>>>>>> bucket = "my bucket"
>>>>>> client = riak.RiakClient(port=8098)
>>>>>> results = client.index(bucket, 'status_bin',
>>>>>> 'PERSISTED').run(timeout=5*60*1000) # 5 minute timeout
>>>>>> print len(results)
>>>>>>
>>>>>> (times out after 5 minutes)
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>



