Lots of sparse columns. Efficient like Cassandra? Some measures of my dataset

Jeremiah Peschka jeremiah.peschka at gmail.com
Wed Jul 17 08:25:06 EDT 2013



--
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop

On Jul 17, 2013, at 4:38 AM, gbrits <gbrits at gmail.com> wrote:

> Somewhere (can't find it now) I've read that Riak, like Cassandra, could be
> classified as a column store.

That is incorrect. Riak is a key-value database where the value is an opaque blob.

> 
> This is just a name of course but what I understand from Cassandra is that
> this allows for space-efficient encoding of column-values. Basically storage
> is organized around columns instead of rows, allowing for different
> persistence strategies on a per-column, or column-family, basis. Moreover,
> it would allow for zero storage overhead for non-existent column values.
> I.e: basically allowing for efficient storage of sparse data-sets.
> 
> Does Riak have this property as well?

No. Riak will happily store whatever you throw at it. That being said, most good serialization libraries will omit properties whose value is null, so absent columns need not take up space in the stored object.
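
As a rough illustration of that point, here is a minimal sketch in Python using only the standard json module. The column names and the to_sparse_json helper are made up for the example, and it assumes one key's columns are held in a dict before being written as a single value:

```python
import json

def to_sparse_json(record):
    """Serialize a dict of column -> value, dropping columns whose value is None."""
    present = {column: value for column, value in record.items() if value is not None}
    return json.dumps(present, sort_keys=True)

# Hypothetical row: only two of the three columns actually hold a value.
row = {"col_0001": "abc", "col_0002": None, "col_0003": "xyz"}
payload = to_sparse_json(row)
```

The resulting payload carries only the columns that exist for that key, so the sparseness is handled by the application's encoding rather than by the database.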

> 
> More specifically, I've got a datastructure on paper with the following
> properties, when mapped to riak nomenclature:
> 
> - ~ 1.000.000 keys (will not grow)
> - ~ 1.000 columns.  (may grow)
> - Each key has a median of ~50 columns. In other words the entire
> set is ~ 95% sparse.
> - Wherever a key has a value for a particular column, that value is always
> exactly a String (base 255) of 4KB length.
> - the 4KB values themselves are pretty 'sparse' so would benefit a lot from
> run-length encoding. Is this supported out of the box?

See above.
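
Since the value is opaque to Riak, any compression of the 4KB strings would have to happen in the application before the write. A minimal sketch, assuming Python's standard zlib module is acceptable (this is DEFLATE, not strict run-length encoding, though it also collapses long runs); the pack/unpack helpers and the sample value are hypothetical:

```python
import zlib

def pack(value: bytes) -> bytes:
    """Compress a value before handing it to the store."""
    return zlib.compress(value, 9)

def unpack(blob: bytes) -> bytes:
    """Restore the original value after reading it back."""
    return zlib.decompress(blob)

# Hypothetical 4096-byte value that is mostly one repeated byte.
sparse_value = b"\x00" * 4000 + b"payload" + b"\x00" * 89
packed = pack(sparse_value)
```

How much this saves depends entirely on how repetitive the real values are, so it is worth measuring on a sample of actual data.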

> 
> Given these properties how would Riak hold up? Hard to say of course, but
> I'm looking for some general advice. 

Riak objects should be no more than ~10MB for performance reasons. You should be safe. 
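
A back-of-the-envelope check of that, as a sketch. It assumes all of a key's columns are serialized into one Riak object and ignores any encoding overhead; the constant names are only for illustration:

```python
VALUE_BYTES = 4 * 1024            # each column value is 4KB
MEDIAN_COLUMNS = 50               # median columns per key, from the question
MAX_COLUMNS = 1000                # a key that has every column
LIMIT_BYTES = 10 * 1024 * 1024    # the ~10MB guideline

median_object = MEDIAN_COLUMNS * VALUE_BYTES   # 204800 bytes, about 200KB
worst_case = MAX_COLUMNS * VALUE_BYTES         # 4096000 bytes, about 4MB

fits = worst_case < LIMIT_BYTES
```

Even a key with all ~1.000 columns populated stays under the guideline, although that margin shrinks if the number of columns grows.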

> 
> Thanks. 



