Lots of sparse columns. Efficient like Cassandra? Some measures of my dataset

gbrits gbrits at gmail.com
Wed Jul 17 10:03:10 EDT 2013


Each key-column value is actually already a rollup of a sparse matrix
(which is why the uncompressed key-column values are always exactly the
same length when they exist).
Having just watched that great talk (thanks), I see it's extremely similar to
how the guys at Boundary are rolling up their data. That validates the
approach, which is awesome!
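
Since the fixed-length values are mostly empty and Riak stores them as opaque
blobs, any run-length encoding would have to happen on the client before the
write. A minimal sketch of that idea, assuming the values are plain bytes; the
function names and the (count, byte) pair layout are my own illustration, not
anything Riak provides:

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (count, byte) pairs, with count in 1..255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte matches, capped at 255
        # so the count always fits in a single byte.
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Invert rle_encode: expand each (count, byte) pair."""
    out = bytearray()
    for j in range(0, len(encoded), 2):
        out += bytes([encoded[j + 1]]) * encoded[j]
    return bytes(out)


# Hypothetical 4KB value that is almost entirely zero bytes.
value = bytearray(4096)
value[100] = 7
value[2000] = 9
packed = rle_encode(bytes(value))
```

A general-purpose compressor such as zlib would probably do as well or better
on data like this; the hand-rolled version is only there to show the idea.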

Having just learned from that same talk that with LevelDB keys don't
have to remain in memory, I'm just going with the logical <keyIndex, colIndex>
pairs as my new aggregated keys, each having a rolled-up sparse matrix as its
value. Hope that makes sense.
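
Roughly what I have in mind for the <keyIndex, colIndex> keys, as a sketch
only. It assumes both indices fit in unsigned 32 bits, and it uses a plain
dict as a stand-in for the real store; make_key, parse_key, put and get are
made-up names, not a Riak client API. Fixed-width big-endian packing keeps
the byte order of the keys the same as their numeric order, which matters
wherever keys are compared bytewise (as LevelDB's default comparator does):

```python
import struct
from typing import Dict, Optional, Tuple

# Two unsigned 32-bit integers, big-endian: 8 bytes per key.
KEY_FORMAT = ">II"


def make_key(key_index: int, col_index: int) -> bytes:
    """Pack (keyIndex, colIndex) into one fixed-width binary key."""
    return struct.pack(KEY_FORMAT, key_index, col_index)


def parse_key(raw: bytes) -> Tuple[int, int]:
    """Recover (keyIndex, colIndex) from a packed key."""
    return struct.unpack(KEY_FORMAT, raw)


# Stand-in for the key-value store (assumption: a dict, not a real client).
store: Dict[bytes, bytes] = {}


def put(key_index: int, col_index: int, rollup: bytes) -> None:
    """Store a rolled-up value; absent pairs are simply never written."""
    store[make_key(key_index, col_index)] = rollup


def get(key_index: int, col_index: int) -> Optional[bytes]:
    """Return the rollup, or None when the pair has no value."""
    return store.get(make_key(key_index, col_index))


put(42, 7, b"rolled-up sparse matrix bytes")
```

Pairs that have no value never get a key at all, so the ~95% of the matrix
that is empty costs nothing in storage.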

Anyway, this feels great!


2013/7/17 Sean Cribbs-2 [via Riak Users] <
ml-node+s197444n4028370h73 at n3.nabble.com>

> Just to add to Jeremiah's comments, I think you should consider whether
> you will be mostly retrieving:
>
> 1) all 1000 columns
> 2) some subset of columns
> 3) single columns
>
> That will greatly influence how you design your keyspace. Remember, with
> Riak it's just key-value in the end. This is one of my favorite examples of
> building a column-like system on top of pure key-value, Boundary's
> "Kobayashi" system: https://vimeo.com/42902962
>
>
> On Wed, Jul 17, 2013 at 7:25 AM, Jeremiah Peschka <[hidden email]> wrote:
>
>>
>>
>> --
>> Jeremiah Peschka - Founder, Brent Ozar Unlimited
>> MCITP: SQL Server 2008, MVP
>> Cloudera Certified Developer for Apache Hadoop
>>
>> On Jul 17, 2013, at 4:38 AM, gbrits <[hidden email]> wrote:
>>
>> > Somewhere (can't find it now) I've read that Riak, like Cassandra, could
>> > be classified as a column store.
>>
>> That is incorrect. Riak is a key value database where the value is an
>> opaque blob.
>>
>> >
>> > This is just a name of course, but what I understand from Cassandra is
>> > that this allows for space-efficient encoding of column values. Basically,
>> > storage is organized around columns instead of rows, allowing for
>> > different persistence strategies on a per-column, or per-column-family,
>> > basis. Moreover, it would allow for zero storage overhead for non-existent
>> > column values, i.e. basically allowing for efficient storage of sparse
>> > data sets.
>> >
>> > Does Riak have this property as well?
>>
>> No. Riak will happily store whatever you throw at it. That being said,
>> most good serialization libraries will leave off nullable properties.
>>
>> >
>>
>> > More specifically, I've got a data structure on paper with the following
>> > properties, when mapped to Riak nomenclature:
>> >
>> > - ~1,000,000 keys (will not grow)
>> > - ~1,000 columns (may grow)
>> > - A given key has a median of ~50 columns. In other words, the entire
>> >   set is ~95% sparse.
>> > - Wherever a key has a value for a particular column, that value is
>> >   always exactly a String (base 255) of 4KB length.
>> > - The 4KB values themselves are pretty 'sparse', so they would benefit a
>> >   lot from run-length encoding. Is this supported out of the box?
>>
>> See above.
>>
>> >
>> > Given these properties, how would Riak hold up? Hard to say of course,
>> > but I'm looking for some general advice.
>>
>> Riak objects should be no more than ~10MB for performance reasons. You
>> should be safe.
>>
>> >
>> > Thanks.
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://riak-users.197444.n3.nabble.com/Lots-of-sparse-columns-Efficient-like-Cassandra-Some-measures-of-my-dataset-tp4028367.html
>> > Sent from the Riak Users mailing list archive at Nabble.com.
>> >
>> > _______________________________________________
>> > riak-users mailing list
>> > [hidden email]
>> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>
>
>
> --
> Sean Cribbs <[hidden email]>
> Software Engineer
> Basho Technologies, Inc.
> http://basho.com/
>
>
>




--
View this message in context: http://riak-users.197444.n3.nabble.com/Lots-of-sparse-columns-Efficient-like-Cassandra-Some-measures-of-my-dataset-tp4028367p4028373.html
Sent from the Riak Users mailing list archive at Nabble.com.