ListKeys or MapReduce

Christian Dahlqvist christian at basho.com
Wed Feb 13 13:08:31 EST 2013


Hi,

In addition to the $key index, there is also a $bucket index available by default. This contains the name of the bucket, and can be used to get all keys in a specific bucket.
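For reference, here is a minimal sketch of what such a query looks like over Riak's HTTP interface; the exact-match term for the $bucket index is simply the bucket name itself (the host, port, and bucket below are placeholders):

```python
import json

def bucket_index_url(host, port, bucket):
    # Exact-match query on the built-in $bucket index; the match term
    # is the bucket name, so this returns every key in the bucket.
    return "http://%s:%d/buckets/%s/index/$bucket/%s" % (host, port, bucket, bucket)

url = bucket_index_url("127.0.0.1", 8098, "products")

# Riak answers an index query with a JSON body shaped like this:
sample_response = '{"keys": ["key1", "key2", "key3"]}'
keys = json.loads(sample_response)["keys"]
```

As with any 2i query, this only works against a backend that supports secondary indexes (i.e. LevelDB).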

Best regards,

Christian


On 12 Feb 2013, at 22:39, Jeremiah Peschka <jeremiah.peschka at gmail.com> wrote:

> As best I understand the magical $key index, you need to provide a range in order to query anything from the index. A RiakBucketKeyInput accepts a bucket/key pair - you can add many of these to an MR input phase if you already know which keys in a bucket need to be acted upon.
> 
> Riak's secondary indices only allow for two operations - exact match and range scan (see the message description if you're interested [1]). To get a full range scan, you'll want to pick a range_max value that is outside the bounds of your largest key. If you know you're only dealing with ASCII characters, you can easily pick an ASCII character [2] that's outside the bounds of your data set. This gets trickier if you have to deal with Unicode data.
> 
> [1]: http://docs.basho.com/riak/latest/references/apis/protocol-buffers/PBC-Index/ 
> [2]: http://www.asciitable.com/
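To illustrate the range-bound trick for ASCII keys, here is a small Python sketch; the particular sentinel characters are just one reasonable choice, not the only one:

```python
# Riak compares binary keys byte-wise (lexicographically), which for
# ASCII strings matches Python's own string ordering.
keys = ["apple", "banana", "item-42", "zebra"]

range_min = "!"  # 0x21: sorts before every ASCII letter and digit
range_max = "~"  # 0x7E: sorts after every ASCII letter and digit

in_range = [k for k in keys if range_min <= k <= range_max]
# Every ASCII key falls inside the range. Keys containing bytes above
# 0x7E (e.g. UTF-8 encoded non-ASCII text) would escape it - that's
# the Unicode trickiness mentioned above.
```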
> 
> ---
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> MCITP: SQL Server 2008, MVP
> Cloudera Certified Developer for Apache Hadoop
> 
> 
> On Tue, Feb 12, 2013 at 2:24 PM, Kevin Burton <rkevinburton at charter.net> wrote:
> Is there a reason why you selected a range and not just the bucket and key (in the example)? My concern is that I don’t want to hard-code any dependencies or foreknowledge into the code if possible. Using a range assumes that all of the keys are in the range. As I see it, if you just specify the bucket and key there is no “assumption”. Right?
> 
>  
> 
> From: Jeremiah Peschka [mailto:jeremiah.peschka at gmail.com] 
> Sent: Tuesday, February 12, 2013 1:52 PM
> 
> 
> To: Kevin Burton
> Cc: riak-users
> Subject: Re: ListKeys or MapReduce
> 
>  
> 
> Oh, and an example can be found at https://gist.github.com/peschkaj/4772825
> 
> 
> 
> ---
> 
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> 
> MCITP: SQL Server 2008, MVP
> 
> Cloudera Certified Developer for Apache Hadoop
> 
>  
> 
> On Tue, Feb 12, 2013 at 11:44 AM, Jeremiah Peschka <jeremiah.peschka at gmail.com> wrote:
> 
> ...and fixed!
> 
>  
> 
> You can get this right now if you're adventurous and want to build CorrugatedIron from source by grabbing the develop branch [1]. We have several other issues to clean up and verify before we release CI 1.1.1 in the next day or so. Or you can download it from [2] if you don't want to build it yourself and don't want to wait for NuGet. Once we push 1.1.1 to NuGet we'll respond to this thread or email you directly.
> 
>  
> 
> I make no guarantees that the new DLL won't eat your hard drive or turn your computer into a killer robot.
> 
>  
> 
> [1]: https://github.com/DistributedNonsense/CorrugatedIron/tree/develop
> 
> [2]: http://clientresources.brentozar.com.s3.amazonaws.com/CorrugatedIron-111-alpha.zip
> 
> 
> 
> ---
> 
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> 
> MCITP: SQL Server 2008, MVP
> 
> Cloudera Certified Developer for Apache Hadoop
> 
>  
> 
> On Tue, Feb 12, 2013 at 11:13 AM, Jeremiah Peschka <jeremiah.peschka at gmail.com> wrote:
> 
> Good news! You've found a bug in CorrugatedIron. Because of index naming, we mangle index names to add a suffix of _bin or _int, depending on the index type. This shouldn't be happening for $key, but it is. I'll create a GitHub issue and get that taken care of.
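The suffixing behavior being described - and the fix it needs - can be pictured with a small sketch; the helper below is hypothetical, not CorrugatedIron's actual code:

```python
# Built-in Riak indexes that must never be suffixed.
SPECIAL_INDEXES = {"$key", "$bucket"}

def mangle_index_name(name, is_integer_index=False):
    # CorrugatedIron-style suffixing: user-defined indexes get a type
    # suffix (_bin or _int), but built-in indexes like $key must pass
    # through untouched - that pass-through is the missing piece.
    if name in SPECIAL_INDEXES:
        return name
    suffix = "_int" if is_integer_index else "_bin"
    return name if name.endswith(suffix) else name + suffix
```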
> 
> 
> 
> ---
> 
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> 
> MCITP: SQL Server 2008, MVP
> 
> Cloudera Certified Developer for Apache Hadoop
> 
>  
> 
> On Tue, Feb 12, 2013 at 7:56 AM, Kevin Burton <rkevinburton at charter.net> wrote:
> 
> I forgot to mention that when I execute this code I get the error:
> 
>  
> 
>     {not_found,
>      {<<"products">>, <<"$keys">>},
>      undefined}}}:[{mochijson2,json_encode,2,
>                     [{file,"src/mochijson2.erl"},{line,149}]},
>                    {mochijson2,'-json_encode_array/2-fun-0-',3,
>                     [{file,"src/mochijson2.erl"},{line,157}]},
>                    {lists,foldl,3,
>                     [{file,"lists.erl"},{line,1197}]},
>                    {mochijson2,json_encode_array,2,
>                     [{file,"src/mochijson2.erl"},{line,159}]},
>                    {riak_kv_pb_mapred,process_stream,3,
>                     [{file,"src/riak_kv_pb_mapred.erl"},{line,97}]},
>                    {riak_api_pb_server,process_stream,5,
>                     [{file,"src/riak_api_pb_server.erl"},{line,227}]},
>                    {riak_api_pb_server,handle_info,2,
>                     [{file,"src/riak_api_pb_server.erl"},{line,158}]},
>                    {gen_server,handle_msg,5,
>                     [{file,"gen_server.erl"},{line,607}]}] - CommunicationError
> 
>  
> 
>  
> 
> From: riak-users [mailto:riak-users-bounces at lists.basho.com] On Behalf Of Kevin Burton
> Sent: Tuesday, February 12, 2013 9:48 AM
> To: 'Jeremiah Peschka'
> Cc: 'riak-users'
> Subject: RE: ListKeys or MapReduce
> 
>  
> 
> The name is “$keys”? Something like:
> 
>  
> 
>     using (IRiakEndPoint cluster = RiakCluster.FromConfig("riakConfig"))
>     {
>         IRiakClient riakClient = cluster.CreateClient();
>         RiakBucketKeyInput bucketKeyInput = new RiakBucketKeyInput();
>         bucketKeyInput.AddBucketKey(productBucketName, "$keys");
>         RiakMapReduceQuery query = new RiakMapReduceQuery()
>            .Inputs(bucketKeyInput)
>            .MapJs(m => m.Name("Riak.mapValuesJson").Keep(true));
>         RiakResult<RiakMapReduceResult> result = riakClient.MapReduce(query);
>         if (result.IsSuccess)
>         {
> 
>  
> 
>  
> 
> From: Jeremiah Peschka [mailto:jeremiah.peschka at gmail.com] 
> Sent: Tuesday, February 12, 2013 9:18 AM
> To: Kevin Burton
> Cc: riak-users
> Subject: Re: ListKeys or MapReduce
> 
>  
> 
> It would be queried like any other index as an MR input. I'll create an issue and will try to get this in some time in the next few days - no promises, though.
> 
> 
> 
> ---
> 
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> 
> MCITP: SQL Server 2008, MVP
> 
> Cloudera Certified Developer for Apache Hadoop
> 
>  
> 
> On Tue, Feb 12, 2013 at 7:09 AM, Kevin Burton <rkevinburton at charter.net> wrote:
> 
> I will read the other URLs that you mentioned. Thank you.
> 
>  
> 
> Would you mind giving a short example (preferably using CI) of the $keys index?
> 
>  
> 
> From: Jeremiah Peschka [mailto:jeremiah.peschka at gmail.com] 
> Sent: Tuesday, February 12, 2013 8:52 AM
> To: Kevin Burton
> Cc: riak-users
> Subject: Re: ListKeys or MapReduce
> 
>  
> 
> They're both pretty crappy in terms of performance - they read all data off of disk. If you're using LevelDB you can use the $key index to pull back just the keys that are in a single bucket.
> 
>  
> 
> A better approach is to maintain a separate bucket - e.g. DocumentCount - that is used for counting documents. Unfortunately, you can't guarantee transactional consistency around counts in Riak today, so you'll want to move maintaining the counts out of Riak and into something else. If you search the list archives [1], you'll find that Redis has been mentioned as a good way to solve this problem - counters are stored in Redis and flushed to Riak on a regular schedule. Because of the lack of consistency (especially around MapReduce operations), Riak isn't the best choice if you require counters/aggregations to be stored in the database.
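The pattern described - increment fast in one place, periodically fold into durable storage - might be sketched like this; plain dicts stand in for the real Redis and Riak clients:

```python
class CounterFlusher:
    def __init__(self):
        self.pending = {}  # stand-in for fast Redis counters
        self.store = {}    # stand-in for a durable Riak bucket

    def incr(self, counter, by=1):
        # Hot path: cheap in-memory increment, no round trip to Riak.
        self.pending[counter] = self.pending.get(counter, 0) + by

    def flush(self):
        # Cold path: on a schedule, fold the accumulated deltas into
        # durable storage and reset the fast-path counters.
        for name, delta in self.pending.items():
            self.store[name] = self.store.get(name, 0) + delta
        self.pending.clear()
```

The flush interval is the window of counts you can lose if the fast store dies, which is the trade-off this pattern accepts.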
> 
>  
> 
> Once CRDTs [2] make it into mainstream Riak, you can make use of those data structures to implement distributed counters in Riak.
> 
>  
> 
> [1]: http://riak.markmail.org
> 
> [2]: http://vimeo.com/52414903
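For the curious, the simplest counter CRDT (a grow-only G-Counter) fits in a few lines; this is an illustrative sketch of the idea, not what Riak itself implements:

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own
    slot, and merge takes the per-replica maximum, so merges are
    commutative, associative, and idempotent."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, by=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + by

    def value(self):
        # The counter's value is the sum over all replicas' slots.
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: safe to apply in any order, any number of times.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)
```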
> 
> 
> 
> ---
> 
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> 
> MCITP: SQL Server 2008, MVP
> 
> Cloudera Certified Developer for Apache Hadoop
> 
>  
> 
> On Mon, Feb 11, 2013 at 10:30 AM, <rkevinburton at charter.net> wrote:
> 
> Say I need to determine how many documents there are in my database. For a CorrugatedIron application I can do ListKeys and get the warning that it is an expensive operation, or I can do a MapReduce query. Which is the least expensive? Is there an option that I am missing?
> 
> 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
