Millions of buckets?

pablochacin pablochacin at
Tue Aug 9 08:27:13 EDT 2011


I'm resurrecting this topic because I have a similar requirement: storing in
Riak streams of items coming from sources with unique ids. In your post you wrote:

Ryan Kennedy wrote:
> At Yammer we have a notion of streams (notifications is one of our
> streams). Each stream has a list of stream items. For instance, "Bob
> liked your message" or "Jenny replied to your message" or "Charlie
> mentioned you in a thread". Each stream item has a uniquely generated,
> monotonically increasing ID. That's great, that gives us something to
> sort and dedupe on. We store the stream items for a user in a single
> key/value. Each stream type has its own bucket. To get to my
> notifications, I would fetch /riak/notifications/ryan. To keep things
> simple (and bounded) we only store the most recent 1,000 or so stream
> items for each user. Older notifications age out of the system as
> newer ones replace them. That's fine…for nearly all of our users 1,000
> notifications would represent a significant amount of calendar time.
> More than they could be expected to page back through.

What confuses me is the phrase "We store the stream items for a user in a
single key/value". Does this mean all the items are put together under a
single key? If so, when a new item arrives, you need to read the key, update
the value and re-write it. Doesn't this affect performance?
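
As a rough illustration of the read/update/re-write cycle being asked about,
here is a minimal sketch. It assumes a plain Python dict as a stand-in for
Riak (it is not the Riak client API), and the names `store`, `append_item`
and `MAX_ITEMS` are made up for the example. The 1,000 item bound and the
sort/dedupe on item ID come from the quoted post.

```python
# Hypothetical sketch: a dict stands in for Riak; no real client calls here.
MAX_ITEMS = 1000  # "most recent 1,000 or so" items per user, per the quote

store = {}  # (bucket, key) -> list of (item_id, payload), newest first


def append_item(bucket, key, item_id, payload):
    """Read the whole value, merge one new item, and write the value back."""
    items = store.get((bucket, key), [])        # 1. read the existing value
    merged = dict(items)                        # 2. dedupe on item ID
    merged[item_id] = payload
    newest_first = sorted(merged.items(), key=lambda kv: kv[0], reverse=True)
    trimmed = newest_first[:MAX_ITEMS]          # 3. age out the oldest items
    store[(bucket, key)] = trimmed              # 4. re-write the whole value
    return len(trimmed)
```

Every new item costs one read and one write of the full value, so the cost
per write grows with the size of the stored list rather than with the size of
the new item.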

In my case, I need to maintain a very high write throughput, so I would
prefer not to update. Would it be efficient to store and retrieve items under
a bucket of the form /source/period, using the timestamp as the key? The
period would be configurable per application and would probably be on the
order of minutes. That way all items from a source for a given period would
be in the same bucket. However, this would lead to millions of buckets very
quickly.
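
The /source/period naming and the resulting bucket count can be sketched as
follows. This is only an illustration: the function names, the 300 second
default period and the example source id are assumptions, and a period is
identified here by the Unix timestamp at which it starts.

```python
SECONDS_PER_DAY = 86400


def bucket_and_key(source_id, ts_seconds, period_seconds=300):
    """Map an item to a (bucket, key) pair: bucket is source/period, key is the timestamp."""
    ts = int(ts_seconds)
    period_start = ts - ts % period_seconds   # round down to the period start
    return f"{source_id}/{period_start}", str(ts)


def buckets_per_day(num_sources, period_seconds=300):
    """Number of distinct buckets created per day if every source is active in every period."""
    return num_sources * (SECONDS_PER_DAY // period_seconds)


print(bucket_and_key("sensor-42", 1312892833))  # ('sensor-42/1312892700', '1312892833')
print(buckets_per_day(10_000))                  # 2880000
```

With 10,000 sources and 5 minute periods this already gives close to three
million buckets per day, which is the growth the question is about.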

Another option would be to batch the items (which are very short) and store
them as an object under /source/period as has been discussed in this thread:
High volume data series storage and queries 
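
A minimal sketch of the batching option, assuming items are buffered in
memory and each finished period is serialized as one JSON object to be stored
under the key source/period. The class `PeriodBatcher`, its methods and the
JSON encoding are hypothetical choices for illustration; the actual write to
Riak is left out.

```python
import json
from collections import defaultdict


class PeriodBatcher:
    """Buffers short items per (source, period) and emits one object per period."""

    def __init__(self, period_seconds=300):
        self.period_seconds = period_seconds
        self.pending = defaultdict(list)  # (source_id, period_start) -> items

    def add(self, source_id, ts_seconds, item):
        ts = int(ts_seconds)
        period_start = ts - ts % self.period_seconds
        self.pending[(source_id, period_start)].append([ts, item])

    def flush(self, now_seconds):
        """Return {"source/period": json_text} for every period that has ended."""
        finished = {}
        for source_id, period_start in list(self.pending):
            if period_start + self.period_seconds <= now_seconds:
                items = self.pending.pop((source_id, period_start))
                items.sort(key=lambda pair: pair[0])  # order by timestamp
                finished[f"{source_id}/{period_start}"] = json.dumps(items)
        return finished
```

Each returned entry would then be written once and never updated, which
avoids the read/update/re-write cycle at the price of holding up to one
period of items in memory and losing them if the writer crashes before the
flush.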

Thanks in advance
