Riak 2i http query much faster than python api?

Christian Dahlqvist christian at basho.com
Thu Apr 11 03:43:00 EDT 2013


Hi Jeff,

Segmenting the keys based on some random number will allow you to run smaller jobs, but if you still need to run all the batches in order to get a result, I am not sure you have gained much. The best way to process and prepare the data depends heavily on your use case, what your access patterns will look like, and how you need to query the data.

One way to spread the cost of aggregating data may be to add a timestamp as a secondary index and periodically aggregate data for specific time periods that make sense to your application. Further aggregation can then later be based on these aggregation records rather than the individual records, which will most likely be much faster.
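For illustration, a minimal sketch of the first step of that approach: deriving the key of the aggregation record that a raw record's timestamp rolls up into. The "agg-" prefix, the function name, and the period names are my own, not part of any Riak API; the periodic 2i range query and the storing of the aggregate record are only described in the comments.

```python
import time

def aggregation_key(epoch_seconds, period="hour"):
    """Derive the key of the aggregation record that a raw record
    with this timestamp would roll up into. The 'agg-' prefix and
    the period names are illustrative, not part of any Riak API."""
    fmt = {"hour": "%Y%m%d%H", "day": "%Y%m%d"}[period]
    return "agg-" + time.strftime(fmt, time.gmtime(epoch_seconds))

# A periodic job could then run a 2i range query over the timestamp
# index for one period, compute the aggregate, and store the result
# under this key; later roll-ups read only the far fewer 'agg-*'
# records instead of the individual raw records.
```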

If you can provide additional details about your use case, expected access patterns and queries we might be able to help better. If you do not feel comfortable sharing it here on the list, please feel free to email me directly.

Best regards,

Christian



On 11 Apr 2013, at 03:49, Jeff Peck <jeffp at tnrglobal.com> wrote:

>> Out of curiosity, how are you planning on segmenting the data?
>> 
> 
> My plan to segment the data would be to have a secondary index on a key called seg_id (or something similar).
> 
> When I add an object to Riak, I will set seg_id to be the first three characters of the md5 of the object's key, which should yield an even distribution.
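A minimal sketch of that seg_id derivation (the function name is my own); three hex characters give 16**3 = 4,096 segments, and MD5 spreads keys across them roughly evenly:

```python
import hashlib

def seg_id(key):
    """First three hex characters of the MD5 of the object's key,
    yielding one of 16**3 = 4,096 roughly evenly sized segments."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:3]

print(seg_id("foo"))  # -> "acb"
```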
> 
> Then, when querying the data, I will run map-reduce against each segment (so for 3 hexadecimal characters, it would be 4,096 map-reduce queries).
> 
> The inputs part of the query would look like this:
> 
> "inputs":{
>        "bucket":"mybucket",
>        "index":"seg_id_bin",
>        "key":"aaa"
>     }
> 
> I would run the map-reduce queries in parallel.
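A sketch of generating the 4,096 per-segment "inputs" sections, assuming the bucket and index names above; the actual parallel dispatch (e.g. a thread pool POSTing each job to Riak's /mapred endpoint) is omitted:

```python
from itertools import product

HEX = "0123456789abcdef"

def segment_inputs(bucket="mybucket", index="seg_id_bin"):
    """Build the 'inputs' section of one map-reduce job per segment.
    Bucket and index names follow the example above; each dict would
    go into the JSON body of a POST to Riak's /mapred endpoint."""
    return [{"bucket": bucket, "index": index, "key": "".join(c)}
            for c in product(HEX, repeat=3)]

inputs = segment_inputs()
print(len(inputs))       # -> 4096
print(inputs[0]["key"])  # -> "000"
```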
> 
> It sounds like a lot of work to just get the value of one field, which makes me think that there is a better way. Plus, I do not know that this will actually work as fast as I expect it to. That's why I'm asking here before I implement it.
>> Also, how are you setting up your servers? Single nodes? Multiple nodes?
>> 
> 
> I am using the default Riak installation (with leveldb as the backend and search turned on). I am on a 16-core 3 GHz node with 20 GB of memory; however, it appears that Riak is not using all of the resources available to it. I suspect that this can be resolved by modifying the configuration.
> 
> That said, if you, or anyone reading this, could suggest a configuration that is better suited to performing a relatively small batch operation across 900k (and soon to be about 5 million) objects, that would be greatly appreciated.
> 
> Thanks!
> 
> - Jeff
> 
> 
> On Apr 10, 2013, at 10:32 PM, Shuhao Wu <shuhao at shuhaowu.com> wrote:
> 
>> Out of curiosity, how are you planning on segmenting the data? Map reduce will execute over the entire data set.
>> 
>> Also, how are you setting up your servers? Single nodes? Multiple nodes?
>> 
>> Shuhao
>> Sent from my phone.
>> 
>> On 2013-04-10 10:25 PM, "Jeff Peck" <jeffp at tnrglobal.com> wrote:
>> As a follow-up to this thread and my thread from earlier today, I am basically looking for a simple way to extract the value of a single field from approximately 900,000 documents (which happens to be indexed). I have been trying many options including a map-reduce function that executes entirely over http (taking out any python client bottlenecks). I let that run for over an hour before I stopped it. It did not return any output.
>> 
>> I also have tried grabbing a list of the 900k keys from a secondary index (very fast, about 11 seconds) and then trying to fetch each key in parallel (using curl and gnu parallel). That was also too slow to be feasible.
>> 
>> Is there something basic that I am missing?
>> 
>> One idea that I thought of was to have a secondary index that is intended to split all of my data into segments. I would use the first three characters of the md5 of the document's key in hexadecimal format. So, the index would contain strings like "ae1", "2f4", "5ee", etc. Then, I can run my map-reduce query against *each* segment individually and possibly even in parallel.
>> 
>> I have observed that map-reduce is very fast with small sets of data (i.e. 5,000 objects), but with 900,000 objects it does not appear to run in a proportionately fast time. So, the idea is to divide the data into segments that can be better handled by map-reduce.
>> 
>> Before I implement this, I want to ask: Does this seem like the appropriate way to handle this type of operation? And, is there any better way to do this in the current version of Riak?
>> 
>> 
>> On Apr 10, 2013, at 6:10 PM, Shuhao Wu <shuhao at shuhaowu.com> wrote:
>> 
>>> There are some inefficiencies in the python client... I've been profiling it recently and found that it occasionally takes the python client longer when you're on the same machine.
>>> 
>>> Perhaps Sean could comment?
>>> 
>>> Shuhao
>>> Sent from my phone.
>>> 
>>> On 2013-04-10 4:04 PM, "Jeff Peck" <jeffp at tnrglobal.com> wrote:
>>> Thanks Evan. I tried doing it in python like this (realizing that the previous way I did it used MapReduce) and I had better results. It finished in 3.5 minutes, but still nowhere close to the 15 seconds from the straight http query:
>>> 
>>> import riak
>>> from pprint import pprint
>>> 
>>> bucket_name = "mybucket"
>>> 
>>> client = riak.RiakClient(port=8087,transport_class=riak.RiakPbcTransport)
>>> bucket = client.bucket(bucket_name)
>>> results = bucket.get_index('status_bin', 'PERSISTED')
>>> 
>>> print len(results)
>>> 
>>> 
>>> On Apr 10, 2013, at 4:00 PM, Evan Vigil-McClanahan <emcclanahan at basho.com> wrote:
>>> 
>>> > get_index() is the right function there, I think.
>>> >
>>> > On Wed, Apr 10, 2013 at 2:53 PM, Jeff Peck <jeffp at tnrglobal.com> wrote:
>>> >> I can grab over 900,000 keys from an index, using an http query, in about 15 seconds, whereas the same operation in python times out after 5 minutes. Does this indicate that I am using the python API incorrectly? Should I be relying on an http request initially when I need to grab this many keys?
>>> >>
>>> >> (Note: This is tied to the question that I asked earlier, but is also a general question to help understand the proper usage of the python API.)
>>> >>
>>> >> Thanks! Examples are below.
>>> >>
>>> >> - Jeff
>>> >>
>>> >> ---
>>> >>
>>> >> HTTP:
>>> >>
>>> >> $ time curl -s http://localhost:8098/buckets/mybucket/index/status_bin/PERSISTED | grep -o , | wc -l
>>> >> 926047
>>> >>
>>> >> real    0m14.583s
>>> >> user    0m2.500s
>>> >> sys     0m0.270s
>>> >>
>>> >> ---
>>> >>
>>> >> Python:
>>> >>
>>> >> import riak
>>> >>
>>> >> bucket = "mybucket"
>>> >> client = riak.RiakClient(port=8098)
>>> >> results = client.index(bucket, 'status_bin', 'PERSISTED').run(timeout=5*60*1000) # 5 minute timeout
>>> >> print len(results)
>>> >>
>>> >> (times out after 5 minutes)
>>> >> _______________________________________________
>>> >> riak-users mailing list
>>> >> riak-users at lists.basho.com
>>> >> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>> 
>>> 
>> 
> 
