performance impact of many buckets?

Bryan Fink bryan at basho.com
Thu Oct 8 09:51:51 EDT 2009


On Wed, Oct 7, 2009 at 11:18 PM, Brian Hammond <brian at brianhammond.com> wrote:
> I'm considering writing a Twitter clone (the post-modern "Hello World" of
> nosql) as a means to learn the ins and outs of Riak.

Excellent choice.  I did the same with a very early version of Riak,
and I definitely learned some things in the process.

> Here's some off-the-top-of-my-head ideas on how to design something like (an
> incomplete) Twitter.  Please comment on what is and isn't a good idea
> design-wise due to performance implications or feature usage or lack
> thereof.
>
> Alright, let's just start with Users.
>
> Either:
> 1) each user is a bucket (/jiak/brian); or
> 2) each user is a document in the 'users' bucket (/jiak/users/brian).
>
> Thoughts?  Any implications of having a "large" number of buckets, or having
> a "large" number of documents/keys in a single bucket?  Any general design
> guidelines here?

I'd suggest making each user a document in the 'users' bucket.  Riak
can support "large" numbers of buckets in certain situations
(native Erlang client, default bucket parameters), but when using the
HTTP interface ("Jiak"), every additional bucket makes the cluster's
ringstate grow, because bucket metadata is currently stored there.  A
very large ringstate isn't necessarily a problem, but it could have
performance implications.  We plan to move the bucket metadata out of
the ringstate at some point, but we haven't done it yet.

There is absolutely no problem in Riak with a "large" number of
documents/keys in a single bucket.
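
As a rough illustration of option 2, here is a minimal Python sketch.
The `store` dict and the `put`/`get` helpers are hypothetical in-memory
stand-ins, not a Riak client API; the point is only that the number of
buckets stays at one no matter how many users are added.

```python
# Hypothetical in-memory stand-in for a cluster: {bucket: {key: document}}.
# This is NOT a Riak client; it only models the bucket/key layout.
store = {}

def put(bucket, key, doc):
    """Store a document under bucket/key, creating the bucket if needed."""
    store.setdefault(bucket, {})[key] = doc

def get(bucket, key):
    """Fetch the document stored under bucket/key."""
    return store[bucket][key]

# Option 2: every user is a key in the single 'users' bucket.
put("users", "brian", {"name": "Brian"})
put("users", "bryan", {"name": "Bryan"})

bucket_count = len(store)          # stays at 1 as users are added
user_count = len(store["users"])   # grows with the number of users
```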

> Following and Followers.  User A follows B, C and is followed by D, E.  I
> suppose this could be links in the user's document.  Perhaps the links would
> be to the other user's documents and the link tag per link either
> 'following' or 'follower'.  Thoughts?

Sounds fantastic to me.  Excellent use of link tags, imho.
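
A minimal sketch of that link layout in Python, assuming each link is
a (bucket, key, tag) triple held in the user's document.  The helper
names (`make_user`, `follow`, `linked_keys`) are made up for
illustration and are not part of any Riak client.

```python
def make_user(key):
    """Build a user document; links are (bucket, key, tag) triples (assumed shape)."""
    return {"bucket": "users", "key": key, "links": []}

def follow(follower, followee):
    """Record the relationship on both user documents, one tag per direction."""
    follower["links"].append(("users", followee["key"], "following"))
    followee["links"].append(("users", follower["key"], "follower"))

def linked_keys(user, tag):
    """Return the keys of all links on this user that carry the given tag."""
    return [key for (_bucket, key, t) in user["links"] if t == tag]

# User A follows B and C, and is followed by D and E.
a, b, c, d, e = (make_user(k) for k in "abcde")
follow(a, b)
follow(a, c)
follow(d, a)
follow(e, a)

following = linked_keys(a, "following")
followers = linked_keys(a, "follower")
```

Keeping both tags on each document means either list can be read from
a single user document, at the cost of two writes per follow.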

> Tweets.  Either links in the user's document with link tag 'tweet' or
> perhaps stored in the user's document directly.  Thoughts?

I think this comes down to the classic fight between normalized and
denormalized data.  Depending on your common access patterns, one or
the other may be "better".  I might even recommend a split solution,
where each tweet is stored in its own document, but a copy of the
user's most recent tweets is also stored inside that user's document.
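
The split approach could be sketched like this in Python.  Everything
here is an assumption made for illustration: the `tweets` dict stands
in for a bucket of tweet documents, `recent_tweets` is a made-up field
name, and `RECENT_LIMIT` is an arbitrary cap.

```python
RECENT_LIMIT = 3  # hypothetical cap on the denormalized copy

def post_tweet(tweets, user, tweet_key, text):
    """Store the tweet as its own document and refresh the user's recent copy."""
    # Normalized copy: one document per tweet.
    tweets[tweet_key] = {"author": user["key"], "text": text}
    # Denormalized copy: newest first, trimmed to the cap.
    recent = [{"key": tweet_key, "text": text}] + user["recent_tweets"]
    user["recent_tweets"] = recent[:RECENT_LIMIT]

tweets = {}
brian = {"key": "brian", "recent_tweets": []}
for n in range(1, 5):
    post_tweet(tweets, brian, f"tweet-{n}", f"message {n}")

recent_keys = [t["key"] for t in brian["recent_tweets"]]
```

Reading a user's latest tweets then needs only the user document,
while older tweets remain available as individual documents.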

> Feel free to extend this very trivial model if you feel it would better
> explain certain things about treating Riak well.

You might also consider using the timestamp of the tweet as the tag of
its link.  This way, you could sort tweet links in chronological
order, and select just the ones in a given time range, before actually
requesting the objects from Riak.
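
A small Python sketch of that idea, assuming links are (bucket, key,
tag) triples and that the tag holds a Unix timestamp in seconds as a
string.  Both the encoding and the `tweets_between` helper are
assumptions for illustration.

```python
def tweets_between(links, start, end):
    """Return tweet keys tagged with a timestamp in [start, end), oldest first.

    Only the link list is inspected; no tweet objects are fetched.
    """
    selected = [
        (int(tag), key)
        for (bucket, key, tag) in links
        # The bucket check runs first, so non-timestamp tags are never parsed.
        if bucket == "tweets" and start <= int(tag) < end
    ]
    return [key for (_ts, key) in sorted(selected)]

# Links as they might appear on a user document, in arbitrary order.
links = [
    ("tweets", "t3", "1254990000"),
    ("users", "bryan", "following"),
    ("tweets", "t1", "1254900000"),
    ("tweets", "t2", "1254950000"),
]

in_range = tweets_between(links, 1254900000, 1254990000)
all_sorted = tweets_between(links, 0, 2**31)
```

Only the keys that come back would then need to be requested.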

-Bryan



More information about the riak-users mailing list