Buckets versus Documents - Limits

Rusty Klophaus rusty at basho.com
Fri Feb 5 10:26:40 EST 2010


Hi Jason,

Just got word that some backends (currently only the Innostore-based
backend, but perhaps more in the future) keep a separate file open per
bucket. So add that as the third and biggest reason to model your data using
a reasonably low number of buckets, depending on the backend that you
choose.
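
To make the cost concrete, here is a rough back-of-the-envelope sketch in Python. It assumes exactly one open file per bucket, as described above, and compares the two layouts from the original question using the 5 million artists figure. The function name and the limit of 1024 open files are illustrative assumptions, not Riak or Innostore settings.

```python
def open_files_needed(bucket_count, files_per_bucket=1):
    """Estimate open files for a backend that keeps one file open per bucket."""
    return bucket_count * files_per_bucket


# Hypothetical per-process open-file limit, used for illustration only.
ASSUMED_FD_LIMIT = 1024

# Bucket counts for the two layouts discussed in this thread.
layouts = {
    "two buckets (artists, albums)": 2,
    "one bucket per artist": 5_000_000,
}

for name, bucket_count in layouts.items():
    needed = open_files_needed(bucket_count)
    status = "fits" if needed <= ASSUMED_FD_LIMIT else "exceeds the assumed limit"
    print(f"{name}: {needed} open files ({status})")
```

The real limit depends on the operating system and how the node is configured, so treat the comparison as an order-of-magnitude check only.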

Best,
Rusty

On Thu, Feb 4, 2010 at 11:21 AM, Rusty Klophaus <rusty at basho.com> wrote:

> Hi Jason,
>
> Great questions.
>
> For the use case you described, I would recommend having two buckets, one
> for artists and one for albums, with a link from artist to album, and
> possibly a link back. In most scenarios you should consider a bucket like a
> table.
>
> There are two main reasons:
>
> First, it conforms to the design expected by map/reduce and linkwalking.
> You can run a map/reduce across all keys in a bucket to operate on all
> artists or all albums. This is much harder to do if each artist is in a
> separate bucket. And with linkwalking, you can select the links to visit
> based on the bucket name. This is only useful when you have a well-known
> bucket name that is consistent across the data to which you are linking.
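
The "bucket as a table" idea can be sketched with a toy in-memory model in Python. This is not the Riak client API: the dictionaries, keys, and helper names below are made up for illustration, and they only mimic the two behaviours described above (running a function over every key in a bucket, and following links filtered by bucket name).

```python
# Toy model: each bucket is a dict of key -> document.
# Links are (bucket, key) pairs. All names and data here are hypothetical.
buckets = {
    "artists": {
        "a1": {"name": "Artist One", "links": [("albums", "al1"), ("albums", "al2")]},
        "a2": {"name": "Artist Two", "links": [("albums", "al3")]},
    },
    "albums": {
        "al1": {"title": "First", "links": [("artists", "a1")]},
        "al2": {"title": "Second", "links": [("artists", "a1")]},
        "al3": {"title": "Third", "links": [("artists", "a2")]},
    },
}


def map_bucket(bucket, fn):
    """Apply fn to every document in one bucket, like a map phase over all keys."""
    return [fn(key, doc) for key, doc in sorted(buckets[bucket].items())]


def walk_links(bucket, key, target_bucket):
    """Follow only the links of one document that point into target_bucket."""
    doc = buckets[bucket][key]
    return [buckets[b][k] for (b, k) in doc["links"] if b == target_bucket]


# Operate on all artists at once because they share one well-known bucket.
artist_names = map_bucket("artists", lambda key, doc: doc["name"])

# Select links by bucket name: only the albums linked from artist "a1".
albums_of_a1 = [doc["title"] for doc in walk_links("artists", "a1", "albums")]

print(artist_names)
print(albums_of_a1)
```

With one bucket per artist, `map_bucket` would have to be called once per artist, and `walk_links` would have no single bucket name to filter on.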
>
> Second, you can customize a bucket by setting bucket parameters to
> configure things like:
>
> - How many replicas to store (n_val)
> - Whether to propagate conflicting edits through to the client (allow_mult)
> - What link function to use (linkfun)
> - etc.
>
> Generally, you want these customizations to apply to all data of the same
> type. Plus, it's easier to manage these customizations on a smaller number
> of buckets.
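
As a minimal sketch of what a bucket customization might look like, the Python below builds a JSON body holding only the properties being changed. It assumes the `{"props": {...}}` envelope accepted by Riak's HTTP interface for bucket properties; the helper name is hypothetical, nothing is sent over the network, and the exact URL to send the body to depends on the Riak version, so check the documentation for your release. `linkfun` is left out because its value format is not described here.

```python
import json


def bucket_props_body(n_val=None, allow_mult=None):
    """Build a JSON body containing only the bucket properties being customized."""
    props = {}
    if n_val is not None:
        props["n_val"] = n_val  # how many replicas to store
    if allow_mult is not None:
        props["allow_mult"] = allow_mult  # surface conflicting edits to the client
    return json.dumps({"props": props}, sort_keys=True)


# Example: an "albums" bucket that stores 5 replicas and exposes conflicts.
body = bucket_props_body(n_val=5, allow_mult=True)
print(body)  # {"props": {"allow_mult": true, "n_val": 5}}
```

Keeping one such body per data type (artists, albums) is easier to manage than repeating it across many buckets.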
>
> That said, the number of buckets is limited only by physical resources
> ***unless*** you customize the bucket. If you leave the default bucket
> settings in place, then a bucket takes no additional overhead, allowing you
> to create millions of buckets. If you customize the bucket, then the
> bucket's properties are stored in the ringstate (as you noted) so it's a bad
> idea to have a large number of buckets with non-standard configuration.
> (Note that it is possible to override the default bucket configuration by
> setting 'default_bucket_props' in the app.config file.)
>
> Hope that helps.
>
> Best,
> Rusty
>
> On Thu, Feb 4, 2010 at 10:31 AM, Jason Tanner <jt4websites at googlemail.com> wrote:
>
>> Hi,
>>
>> Let's say I had 100 million albums generated by 5 million artists.
>>
>> This could be modelled in Riak in a number of ways.
>>
>> For example, having 2 buckets, one for albums, one for artists and linking
>> documents in the two buckets.
>>
>> Alternatively, I could have a bucket per artist containing the albums they
>> created.
>>
>> Obviously there are other ways to model this as well.
>>
>> My point is to try to identify the limitations in Riak with regard to
>> its design choices so that I in turn can design my stuff with that in mind.
>>
>> Are there any penalties to consider when having large numbers of buckets
>> compared to documents in the buckets?
>>
>> I read somewhere about bucket information being kept in the ringstate, and
>> although I didn't fully understand the implications of that I kind of
>> guessed it meant that perhaps having huge numbers of buckets was not a good
>> idea.
>>
>> Is this true? Is there a point at which having a lot of buckets would
>> actually penalise you in terms of performance?
>>
>> Jason
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>