Getting all the Keys

Dmitri Zagidulin dzagidulin at basho.com
Thu Apr 25 13:20:18 EDT 2013


In addition, to reiterate what Alexander said in the email thread above,
keep in mind that doing a 'list keys' on a bucket forces Riak to iterate
through ALL of the keys in a cluster, not just those belonging to a bucket.

Meaning, if your cluster has 100 million keys, but a particular bucket has
only 5 keys, doing a list keys on that bucket will iterate not through 5,
but through all 100 million keys, just to bring you that list of 5.
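
To make the cost concrete, here is a toy model in Python. It is only a sketch: the flat dictionary keyed by (bucket, key) and the list_keys helper are hypothetical stand-ins, not Riak's actual storage code. It shows that when keys from all buckets share one keyspace, the work done tracks the total number of keys, not the size of the bucket you asked about.

```python
from typing import Dict, List, Tuple

def list_keys(store: Dict[Tuple[str, str], bytes], bucket: str) -> Tuple[List[str], int]:
    """Return (keys found in `bucket`, number of keys examined).

    Hypothetical model: every key in the cluster lives in one flat
    keyspace, so finding one bucket's keys means visiting all of them.
    """
    examined = 0
    found = []
    for (b, k) in store:
        examined += 1          # every key is visited, whatever its bucket
        if b == bucket:
            found.append(k)
    return found, examined

# 100,000 keys in one bucket, 5 keys in another (scaled down from 100 million).
store = {("big", f"k{i}"): b"" for i in range(100_000)}
store.update({("small", f"s{i}"): b"" for i in range(5)})

keys, examined = list_keys(store, "small")
print(len(keys), examined)  # 5 100005
```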

Since Riak version 1.1, which introduced the 'list keys backpressure'
feature, iterating through all the keys in a cluster is not completely
disastrous, provided you use the streaming version of list keys.
But it's also not an operation to be done frivolously -- it's reserved for
when there's no other alternative, such as when you have to perform a
logical backup of your bucket (extract the whole thing to disk).
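
If you do end up streaming a key list over HTTP, the client side might look something like the sketch below. It assumes that a request such as http://localhost:8098/riak/mytable?keys=stream (the URL from earlier in this thread, with "stream" in place of "true") answers with a series of concatenated JSON objects, each carrying a "keys" array. That wire format and the helper name iter_streamed_keys are assumptions for illustration; check them against your Riak version.

```python
import json
from typing import Iterable, Iterator

def iter_streamed_keys(chunks: Iterable[str]) -> Iterator[str]:
    """Yield keys from a stream of concatenated JSON objects.

    Assumption: each object looks like {"keys": ["a", "b"]}, and an
    object may be split across chunk boundaries.
    """
    decoder = json.JSONDecoder()
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # object is incomplete; wait for the next chunk
            buf = buf[end:]
            # Objects without a "keys" field (e.g. bucket properties) are skipped.
            yield from obj.get("keys", [])

# Simulated response body, deliberately split in the middle of an object.
demo_chunks = ['{"keys":["a","b"]}{"ke', 'ys":["c"]}', '{"keys":[]}']
for key in iter_streamed_keys(demo_chunks):
    print(key)
```

The point of processing keys as they arrive is that nothing has to hold the whole key list in memory, which is what makes the streaming form tolerable for something like a logical backup.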

You're probably wondering what you should do instead of list keys, to
simulate the SELECT operation you're used to in relational databases.
The answer is: use an index of some sort. (Incidentally, with relational
databases, you're usually using an index, too -- very rare are the cases
where you're doing SELECT * on a table with no primary index.)  This means
one of three things:
1) Use Secondary Indexes to keep track of the subset of keys that you care
about and will want listed.
2) Use Riak Search (which also builds an index of the documents it cares
about) to keep track of the subset of documents you care about.
3) If the subset is small and you know what you're doing, keep track of the
keys yourself -- for example, store the list of keys in a JSON object and
store that object in Riak.
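
As a rough illustration of option 1, the sketch below builds the two pieces a Secondary Index setup needs over HTTP: the header that tags an object with an index entry when you store it, and the URL that later asks for every key carrying that entry. The host and port, the "status" field, the "active" value and both helper names are made up for the example, and the header and URL shapes are my assumption about the HTTP interface, so verify them against the documentation for your Riak version. Note also that Secondary Indexes need a backend that supports them (LevelDB rather than the default Bitcask).

```python
from typing import Dict
from urllib.parse import quote

BASE = "http://localhost:8098"  # assumption: a local node on the default HTTP port

def index_headers(field: str, value: str) -> Dict[str, str]:
    """Header to send along with the PUT that stores the object.

    Assumption: binary (string) indexes use the "_bin" suffix.
    """
    return {f"x-riak-index-{field}_bin": value}

def index_query_url(bucket: str, field: str, value: str, base: str = BASE) -> str:
    """URL to GET in order to list the keys tagged with field == value."""
    return (
        f"{base}/buckets/{quote(bucket, safe='')}"
        f"/index/{quote(field, safe='')}_bin/{quote(value, safe='')}"
    )

print(index_headers("status", "active"))
print(index_query_url("mytable", "status", "active"))
```

The query touches only the index entries for that one value, so its cost follows the size of the subset you tagged rather than the size of the cluster.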

Dmitri



On Thu, Apr 25, 2013 at 12:38 PM, n6mac41717 <csh at stanfordalumni.org> wrote:

> I know it's been over two years since this post, and I'm wondering if the
> latest version of Riak has made improvements to list keys--I tried the
> query with "keys=true" and I didn't seem to have TSA/octomom-related wait
> times.
>
> I was originally hoping that I could get a list of keys via the RESTful API
> which led me to this thread.  In other words, a GET url/bucket/key will
> indeed return what I shoved into the bucket at that key, but I was hoping
> that a GET url/bucket (I guess to be truly RESTful, I should make the
> bucket plural) would return the keys.
>
> Thoughts?
>
> Thanks in advance, Chuck
>
>
> Alexander Sicular wrote
> > Hi Thomas,
> >
> > This is a topic that has come up many times. Lemme just hit a couple of
> > high notes in no particular order:
> >
> > - If you must do a list keys op on a bucket, you must must must use
> > "?keys=stream". True will block on the coordinating node until all nodes
> > return their keys. Stream will start sending keys as soon as the first
> > node returns.
> >
> > - "list keys" is one of the most expensive native operations you can
> > perform in Riak. Not only does it do a full key scan of all the keys in
> > your bucket, but all the keys in your cluster. It is obnoxiously
> > expensive and only more so as the number of keys in your cluster grows.
> > There have been discussions about changing this but everything comes
> > with a cost (more open file descriptors) and I do not believe a decision
> > has been made yet.
> >
> > -Riak is in no way a relational system. It is, in fact, about as opposite
> > as you can get. Incidentally, "select *" is generally not recommended in
> > the Kingdom of Relations and regarded as wasteful. You need a bit of a
> > mind shift from relational world to have success with nosql in general
> > and Riak in particular.
> >
> > -There are no native indices in Riak. By default Riak uses the bitcask
> > backend. Bitcask has many advantages but one disadvantage is that all
> > keys (key length + a bit of overhead) must fit in ram.
> >
> > -Do not use "?keys=true". Your computer will melt. And then your face.
> >
> > -As of Riak 0.14 your m/r can filter on key name. I would highly
> > recommend that your data architecture take this into account by using
> > keys that have meaningful names. This will allow you to not scan every
> > key in your cluster.
> >
> > -Buckets are analogous to relational tables but only just. In Riak, you
> > can think of a bucket as a namespace holder (it is used as part of the
> > default circular hash function) but primarily as a mechanism to
> > differentiate system settings from one group of keys to the next.
> >
> > -There is no penalty for unlimited buckets except for when their settings
> > deviate from the system defaults. By settings I mean things like hooks,
> > replication values and backends among others.
> >
> > -One should list keys by truth if one enjoys sitting in parking lots on
> > the freeway on a scorching summer's day or perhaps waiting in a TSA line
> > at your nearest international point of embarkation surrounded by octomom
> > families all the while juggling between the grope or the pr0n slideshow.
> > If that is for you, use "?keys=true".
> >
> > -Virtually everything in Riak is transient. Meaning, for the most part
> > (not including the 60 seconds or so of m/r cache), there is no caching
> > going on in Riak outside of the operating system. I.e., your subsequent
> > queries will do more or less the same work as their predecessors. You
> > need to cache your own results if you want to reuse them... quickly.
> >
> >
> >
> > Oh, there's more but I'm pretty jelloed from last night. Welcome to the
> > fold, Thomas. Can I call you Tom?
> >
> > Cheers,
> > -Alexander Sicular
> >
> > @siculars
> >
> > On Jan 22, 2011, at 10:19 AM, Thomas Burdick wrote:
> >
> >> I've been playing around with riak lately as really my first usage of a
> >> distributed key/value store. I quite like many of the concepts and
> >> possibilities of Riak and what it may deliver, however I'm really stuck
> >> on an issue.
> >>
> >> Doing the equivalent of a select * from sometable in riak is seemingly
> >> slow. As a quick test I tried...
> >>
> >> http://localhost:8098/riak/mytable?keys=true
> >>
> >> Before even iterating over the keys this was unbearably slow already.
> >> This took almost half a second on my machine where mytable is completely
> >> empty!
> >>
> >> I'm a little baffled, I would assume that getting all the keys of a
> >> table is an incredibly common task?  How do I get all the keys of a
> >> table quickly? By quickly I mean a few milliseconds or less as I would
> >> expect of even a "slow" rdbms with an empty table, even some tables
> >> with 1000's of items can get all the primary keys of a sql table in a
> >> few milliseconds.
> >>
> >> Tom Burdick
> >>
> >> _______________________________________________
> >> riak-users mailing list
> >> riak-users at lists.basho.com
> >> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >
> >
>
>
>
>
>
> --
> View this message in context:
> http://riak-users.197444.n3.nabble.com/Getting-all-the-Keys-tp2308764p4027757.html
> Sent from the Riak Users mailing list archive at Nabble.com.
>