ListKeys or MapReduce

Jeremiah Peschka jeremiah.peschka at gmail.com
Tue Feb 12 17:39:28 EST 2013


As best as I understand the magical $key index, you need to provide a range
in order to query anything from the index. A RiakBucketKeyInput accepts a
bucket/key pair - you can add many of these to an MR input phase if you
already know which keys in a bucket need to be acted upon.
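
To make the difference concrete, here is a rough sketch (in Python, not CorrugatedIron) of the two input shapes as Riak's JSON MapReduce interface expresses them. The bucket name "products", the sample keys, the helper function names, and the "0"/"~" bounds are all made up for illustration; only the "inputs"/"query" layout and the Riak.mapValuesJson built-in come from Riak itself.

```python
import json


def mapreduce_job(inputs):
    """Build a MapReduce job body with a single JavaScript map phase."""
    return {
        "inputs": inputs,
        "query": [
            {"map": {"language": "javascript",
                     "name": "Riak.mapValuesJson",
                     "keep": True}}
        ],
    }


def bucket_key_inputs(bucket, keys):
    """Explicit inputs: one [bucket, key] pair per object you already know."""
    return [[bucket, key] for key in keys]


def key_index_range_inputs(bucket, range_min, range_max):
    """Index inputs: objects whose key falls within [range_min, range_max]."""
    return {"bucket": bucket, "index": "$key",
            "start": range_min, "end": range_max}


if __name__ == "__main__":
    # Hypothetical bucket and keys, for illustration only.
    known = mapreduce_job(bucket_key_inputs("products", ["sku-1", "sku-2"]))
    scanned = mapreduce_job(key_index_range_inputs("products", "0", "~"))
    print(json.dumps(known, indent=2))
    print(json.dumps(scanned, indent=2))
```

The first job only ever touches the keys you list; the second asks the index for everything between the two bounds, which is why the bounds matter.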

Riak's secondary indices only allow for two operations - exact match and
range scan (see the message description if you're interested [1]). To get a
full range scan, you'll want to pick a range_max value that is outside the
bounds of your largest key. If you know you're only dealing with ASCII
characters you can easily pick an ASCII character [2] that's outside the
bounds of your data set. This gets trickier if you have to deal with
Unicode data.
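
As a quick sanity check on a candidate range, something like the following Python sketch can show which keys a given pair of bounds would miss. It assumes index values are compared byte by byte (so keys are UTF-8 encoded before comparing); the function names and sample keys are invented for the example.

```python
def covered_by_range(range_min, range_max, key):
    """Return True if key sorts within [range_min, range_max] byte-wise.

    Assumption: index values are compared as raw bytes, so everything is
    encoded as UTF-8 before comparing.
    """
    raw = key.encode("utf-8")
    return range_min.encode("utf-8") <= raw <= range_max.encode("utf-8")


def uncovered_keys(range_min, range_max, keys):
    """List the keys that a scan over [range_min, range_max] would miss."""
    return [k for k in keys if not covered_by_range(range_min, range_max, k)]


if __name__ == "__main__":
    # Hypothetical keys; "0" (0x30) and "~" (0x7E) bracket ASCII digits
    # and letters, but not "-" (0x2D) or any multi-byte UTF-8 character.
    sample = ["0001", "product-42", "zebra", "-draft", "éclair"]
    print(uncovered_keys("0", "~", sample))  # ['-draft', 'éclair']
```

The two misses illustrate both pitfalls: a key starting with punctuation that sorts below the lower bound, and a non-ASCII key whose first byte sorts above the upper bound.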

[1]: http://docs.basho.com/riak/latest/references/apis/protocol-buffers/PBC-Index/

[2]: http://www.asciitable.com/

---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop


On Tue, Feb 12, 2013 at 2:24 PM, Kevin Burton <rkevinburton at charter.net> wrote:

> Is there a reason why you selected a range and not just the bucket and key
> (in the example)? My concern is that I don’t want to hard-code any
> dependencies or foreknowledge in the code if possible. Using a range
> assumes that all of the keys are in the range. As I see it, if you just
> specify the bucket and key there is no “assumption”. Right?
>
> *From:* Jeremiah Peschka [mailto:jeremiah.peschka at gmail.com]
> *Sent:* Tuesday, February 12, 2013 1:52 PM
> *To:* Kevin Burton
> *Cc:* riak-users
> *Subject:* Re: ListKeys or MapReduce
>
> Oh, and an example can be found at https://gist.github.com/peschkaj/4772825
>
> On Tue, Feb 12, 2013 at 11:44 AM, Jeremiah Peschka <
> jeremiah.peschka at gmail.com> wrote:
>
> ...and fixed!
>
> You can get this right now if you're adventurous and want to build
> CorrugatedIron from source by grabbing the develop branch [1]. We have
> several other issues to clean up and verify before we release CI 1.1.1 in
> the next day or so. Or you can download it from [2] if you don't want to
> build it yourself and don't want to wait for NuGet. Once we push 1.1.1 to
> NuGet we'll respond to this thread or email you directly.
>
> I make no guarantees that the new DLL won't eat your hard drive or turn
> your computer into a killer robot.
>
> [1]: https://github.com/DistributedNonsense/CorrugatedIron/tree/develop
>
> [2]: http://clientresources.brentozar.com.s3.amazonaws.com/CorrugatedIron-111-alpha.zip
>
> On Tue, Feb 12, 2013 at 11:13 AM, Jeremiah Peschka <
> jeremiah.peschka at gmail.com> wrote:
>
> Good news! You've found a bug in CorrugatedIron. Because of index naming,
> we muck with index names to give them a suffix of _bin or _int, depending
> on the index type. This shouldn't be happening on $key, but it is. I'll
> create a GitHub issue and get that taken care of.
>
> On Tue, Feb 12, 2013 at 7:56 AM, Kevin Burton <rkevinburton at charter.net>
> wrote:
>
> I forgot to mention that when I execute this code I get the error:
>
> {not_found,
>  {<<"products">>,
>   <<"$keys">>},
>  undefined}}}:[{mochijson2,json_encode,2,
>                 [{file,"src/mochijson2.erl"},{line,149}]},
>                {mochijson2,'-json_encode_array/2-fun-0-',3,
>                 [{file,"src/mochijson2.erl"},{line,157}]},
>                {lists,foldl,3,
>                 [{file,"lists.erl"},{line,1197}]},
>                {mochijson2,json_encode_array,2,
>                 [{file,"src/mochijson2.erl"},{line,159}]},
>                {riak_kv_pb_mapred,process_stream,3,
>                 [{file,"src/riak_kv_pb_mapred.erl"},{line,97}]},
>                {riak_api_pb_server,process_stream,5,
>                 [{file,"src/riak_api_pb_server.erl"},{line,227}]},
>                {riak_api_pb_server,handle_info,2,
>                 [{file,"src/riak_api_pb_server.erl"},{line,158}]},
>                {gen_server,handle_msg,5,
>                 [{file,"gen_server.erl"},{line,607}]}] - CommunicationError
>
> *From:* riak-users [mailto:riak-users-bounces at lists.basho.com] *On Behalf
> Of *Kevin Burton
> *Sent:* Tuesday, February 12, 2013 9:48 AM
> *To:* 'Jeremiah Peschka'
> *Cc:* 'riak-users'
> *Subject:* RE: ListKeys or MapReduce
>
> The name is “$keys”? Something like:
>
>     using (IRiakEndPoint cluster = RiakCluster.FromConfig("riakConfig"))
>     {
>         IRiakClient riakClient = cluster.CreateClient();
>         RiakBucketKeyInput bucketKeyInput = new RiakBucketKeyInput();
>         bucketKeyInput.AddBucketKey(productBucketName, "$keys");
>         RiakMapReduceQuery query = new RiakMapReduceQuery()
>            .Inputs(bucketKeyInput)
>            .MapJs(m => m.Name("Riak.mapValuesJson").Keep(true));
>         RiakResult<RiakMapReduceResult> result = riakClient.MapReduce(query);
>         if (result.IsSuccess)
>         {
>
> *From:* Jeremiah Peschka [mailto:jeremiah.peschka at gmail.com]
> *Sent:* Tuesday, February 12, 2013 9:18 AM
> *To:* Kevin Burton
> *Cc:* riak-users
> *Subject:* Re: ListKeys or MapReduce
>
> It would be queried like any other index as an MR input. I'll create an
> issue and will try to get this in some time in the next few days - no
> promises, though.
>
> On Tue, Feb 12, 2013 at 7:09 AM, Kevin Burton <rkevinburton at charter.net>
> wrote:
>
> I will read the other URLs that you mentioned. Thank you.
>
> Would you mind giving a short example (preferably using CI) of the $keys
> index?
>
> *From:* Jeremiah Peschka [mailto:jeremiah.peschka at gmail.com]
> *Sent:* Tuesday, February 12, 2013 8:52 AM
> *To:* Kevin Burton
> *Cc:* riak-users
> *Subject:* Re: ListKeys or MapReduce
>
> They're both pretty crappy in terms of performance - they read all data
> off of disk. If you're using LevelDB you can use the $keys index to pull
> back just the keys that are in a single bucket.
>
> A better approach is to maintain a separate bucket - e.g. DocumentCount -
> that is used for counting documents. Unfortunately, you can't guarantee
> transactional consistency around counts in Riak today, so you'll want to
> move maintaining the counts out of Riak and into something else. If you
> search the list archives [1], you'll find that Redis has been mentioned as
> a good way to solve this problem - counters are stored in Redis and flushed
> to Riak on a regular schedule. Because of the lack of consistency
> (especially around MapReduce operations), Riak isn't the best choice if you
> require counters/aggregations to be stored in the database.
>
> Once CRDTs [2] make it into mainstream Riak, you can make use of those
> data structures to implement distributed counters in Riak.
>
> [1]: http://riak.markmail.org
>
> [2]: http://vimeo.com/52414903
>
> On Mon, Feb 11, 2013 at 10:30 AM, <rkevinburton at charter.net> wrote:
>
> Say I need to determine how many documents there are in my database. For a
> CorrugatedIron application I can do ListKeys and get the warning that it is
> an expensive operation or I can do a MapReduce query. Which is the
> least expensive? Is there an option that I am missing?
>