Lots of sparse columns. Efficient like Cassandra? Some measures of my dataset

Sean Cribbs sean at basho.com
Wed Jul 17 08:53:57 EDT 2013


Just to add to Jeremiah's comments, I think you should consider whether you
will be mostly retrieving:

1) all 1000 columns
2) some subset of columns
3) single columns

That will greatly influence how you design your keyspace. Remember, with
Riak it's just key-value in the end. One of my favorite examples of building
a column-like system on top of pure key-value is Boundary's "Kobayashi"
system: https://vimeo.com/42902962


On Wed, Jul 17, 2013 at 7:25 AM, Jeremiah Peschka <
jeremiah.peschka at gmail.com> wrote:

>
>
> --
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> MCITP: SQL Server 2008, MVP
> Cloudera Certified Developer for Apache Hadoop
>
> On Jul 17, 2013, at 4:38 AM, gbrits <gbrits at gmail.com> wrote:
>
> > Somewhere (can't find it now) I've read that Riak, like Cassandra, could
> > be classified as a column store.
>
> That is incorrect. Riak is a key-value database where the value is an
> opaque blob.
>
> >
> > This is just a name, of course, but what I understand from Cassandra is
> > that this allows for space-efficient encoding of column values. Basically,
> > storage is organized around columns instead of rows, allowing for different
> > persistence strategies on a per-column, or per-column-family, basis.
> > Moreover, it would allow for zero storage overhead for non-existent column
> > values, i.e. it basically allows for efficient storage of sparse data sets.
> >
> > Does Riak have this property as well?
>
> No. Riak will happily store whatever you throw at it. That being said,
> most good serialization libraries will leave off nullable properties.
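
As a toy illustration of that point (plain Python, no Riak client involved;
the column names are invented):

    import json

    # Only the ~50 columns that actually have values go into the dict; the
    # other ~950 never appear, so they take up zero bytes in the stored blob.
    row = {"col_007": "value-a", "col_123": "value-b"}
    blob = json.dumps(row)
    print(len(blob))  # grows with the number of present columns only

So the sparseness handling lives in your serialization format, not in Riak
itself.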
>
> >
> > More specifically, I've got a data structure on paper with the following
> > properties, when mapped to Riak nomenclature:
> >
> > - ~1,000,000 keys (will not grow)
> > - ~1,000 columns (may grow)
> > - A single key has a median of ~50 columns; in other words, the entire
> >   set is ~95% sparse.
> > - Wherever a key has a value for a particular column, that value is
> >   always a string (base 255) of exactly 4KB.
> > - The 4KB values themselves are pretty 'sparse', so they would benefit a
> >   lot from run-length encoding. Is this supported out of the box?
>
> See above.
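
In other words, Riak won't run-length encode anything for you, but since the
value is an opaque blob you can compress it on the client before storing it.
A minimal sketch using Python's standard zlib (DEFLATE rather than pure RLE,
but it handles long runs of repeated bytes well):

    import zlib

    # A sparse 4KB value: mostly zero bytes with a little real data in it.
    value = b"\x00" * 3500 + b"DATA" + b"\x00" * 592
    compressed = zlib.compress(value)
    assert zlib.decompress(compressed) == value
    print(len(value), "->", len(compressed))  # 4096 -> a few dozen bytes

The trade-off is that the stored value is no longer directly readable
server-side (e.g. from MapReduce functions) without decompressing it there
as well.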
>
> >
> > Given these properties, how would Riak hold up? Hard to say, of course,
> > but I'm looking for some general advice.
>
> Riak objects should be no more than ~10MB for performance reasons. You
> should be safe.
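
For a rough sense of scale: even a key that somehow had values for all ~1,000
columns at 4KB each would be an object of roughly 4MB, and the median key
(~50 columns x 4KB) is only around 200KB, well within that guideline.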
>
> >
> > Thanks.
> >
>



-- 
Sean Cribbs <sean at basho.com>
Software Engineer
Basho Technologies, Inc.
http://basho.com/