Expected vs Actual Bucket Behavior

Justin Sheehy justin at basho.com
Tue Jul 20 15:02:55 EDT 2010


Hi, Eric!  Thanks for your thoughts.

On Tue, Jul 20, 2010 at 12:39 PM, Eric Filson <efilson at gmail.com> wrote:

> I would think this requirement,
> retrieving all objects in a bucket, to be a _very_ commonplace
> occurrence for modern web development and perhaps (depending on
> requirements) _the_ most common function aside from retrieving a single k/v
> pair.

Mostly I see people trying to write applications that don't select
everything from a whole bucket/table/whatever as a very frequent
operation, but different people have different requirements.
Certainly, it is sometimes unavoidable.

> In my mind, this seems to leave the only advantage to buckets in this
> application to be namespacing... While certainly important, I'm fuzzy on
> what the downside would be to allowing buckets to exist as a separate
> partition/pseudo-table/etc... so that retrieving all objects in a bucket
> would not need to read all objects in the entire system

The namespacing aspect is a huge advantage for many people.  Besides
the obvious way in which that allows people to avoid collisions, it is
a powerful tool for data modeling.  For example, sets of 1-to-1
relationships can be very nicely represented as something like
"bucket1/keyA, bucket2/keyA, bucket3/keyA", which allows related items
to be fetched without any intermediate queries at all.
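
As a rough sketch of that key-identity pattern: the snippet below uses a
plain Python dict as a stand-in for the store, not a Riak client, and the
bucket names, keys, and the fetch_related helper are all made up for
illustration.

```python
# Toy in-memory stand-in for a key/value store, keyed by (bucket, key).
# This is an illustration only; it is not the Riak client API.
store = {
    ("users", "alice"): {"name": "Alice"},
    ("profiles", "alice"): {"bio": "Erlang enthusiast"},
    ("settings", "alice"): {"theme": "dark"},
}


def fetch_related(store, buckets, key):
    """Look up the same key in each bucket; no index or query step needed."""
    return {bucket: store.get((bucket, key)) for bucket in buckets}


# Knowing the key "alice" is enough to reach every related object directly.
related = fetch_related(store, ["users", "profiles", "settings"], "alice")
print(related["settings"])
```

Each lookup is an independent single-key fetch, so nothing has to scan a
bucket or consult an intermediate list to find the related items.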

One of the things that many users have become happily used to is that
buckets in Riak are generally "free"; they come into existence on
demand, and you can use as many of them as you want in the above or
any other fashion.  This is in essence what conflicts with your
desire.  Making buckets more fundamentally isolated from each other
would be difficult without incurring some incremental cost per bucket.

> I might recommend a hybrid
> solution (based in my limited knowledge of Riak)... What about allowing a
> bucket property named something like "key_index" that points to a key
> containing a value of "keys in bucket".  Then, when calling GET
> /riak/bucket, Riak would use the key_index to immediately reduce its result
> set before applying m/r funcs.  While I understand this is essentially what
> a developer would do, it would certainly alleviate some code requirements
> (application side) as well as make the behavior of retrieving a bucket's
> contents more "expected" and efficient.

A much earlier incarnation of Riak actually stored bucket keylists
explicitly in a fashion somewhat like what you describe.  We removed
this as one of our biggest goals is predictable and understandable
behavior in a distributed systems sense, and a model like this one
turns each write operation into at least two operations.  This isn't
just a performance issue, but also adds complexity.  For instance, it
is not immediately obvious what should be returned to the client if a
data item write succeeds but the read/write of the index fails.
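
To make that failure case concrete, here is a minimal sketch of a store
that keeps an explicit per-bucket key index.  It is a toy in-memory
model, not how Riak was or is implemented; IndexedStore, IndexWriteError,
and the fail_index_writes switch are hypothetical names used only to
simulate the index update failing.

```python
class IndexWriteError(Exception):
    """Raised when the (simulated) index update fails."""


class IndexedStore:
    """Toy store where every put is two operations: object, then index."""

    def __init__(self, fail_index_writes=False):
        self.objects = {}      # (bucket, key) -> value
        self.key_index = {}    # bucket -> set of keys
        # Hypothetical switch used only to simulate a failed index update.
        self.fail_index_writes = fail_index_writes

    def put(self, bucket, key, value):
        # Operation 1: write the data item itself.
        self.objects[(bucket, key)] = value
        # Operation 2: read-modify-write the bucket's key list.
        if self.fail_index_writes:
            raise IndexWriteError(f"could not update index for {bucket!r}")
        self.key_index.setdefault(bucket, set()).add(key)

    def get(self, bucket, key):
        return self.objects.get((bucket, key))

    def list_keys(self, bucket):
        return sorted(self.key_index.get(bucket, set()))


indexed = IndexedStore(fail_index_writes=True)
try:
    indexed.put("users", "alice", {"name": "Alice"})
except IndexWriteError:
    # The client sees an error, yet the object was stored.
    pass

print(indexed.get("users", "alice"))   # the object is readable by key
print(indexed.list_keys("users"))      # but the key list does not include it
```

After the failed put, the object can be fetched by key while the bucket
listing omits it, which is exactly the ambiguity about what the client
should be told.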

Most people using distributed data systems (including but not limited
to Riak) do explicit data modeling, using things like key identity as
above, or objects that contain links to each other (Riak has great
support for this) or other data modeling means to plan out their
expected queries in advance.

> Anyway, information is pretty limited on riak right now, seeing as how it's
> so new, but talk in my development circles is very positive and lively.

Please do let us know any aspects of information on Riak that you
think are missing.  We think that between the wiki, the web site, and
various other materials, the information is pretty good.  Riak's been
open source for about a year, and in use longer than that; while there
are many things much older than Riak, we don't see relative youth as a
reason not to do things right.

Thanks again for your thoughts, and I hope that this helps with your
understanding.

-Justin



