Combining Riak (CS) and Spark/Shark by speaking over the S3 protocol

Geert-Jan Brits gbrits at
Wed Jul 31 04:43:20 EDT 2013


Not sure if I understand the "renaming objects" problem in Riak CS. Can you

"I believe smaller object sizes would not be nearly as efficient as
working with plain Riak if only because of the overhead incurred by Riak
CS". Does this mean a lack of efficiency in disk storage, in memory, or both?
Moreover, I have a nagging thought that having to dig through the
manifest to find the blocks will severely impact read latency for
(multi-key) lookups, as opposed to a normal Bitcask / LevelDB lookup. Is
this correct?
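
One rough way to frame the read-latency question is to count backend
fetches per object. The sketch below is only a model under stated
assumptions, not a measurement: it assumes a Riak CS read costs one
manifest fetch plus one fetch per data block, that a plain Riak read is a
single key lookup, and that blocks are 1 MiB (BLOCK_SIZE is an assumed
value; check the actual Riak CS configuration). The function names are
hypothetical.

```python
import math

# Assumed block size (1 MiB). This is an assumption for illustration;
# verify it against your own Riak CS configuration.
BLOCK_SIZE = 1024 * 1024


def cs_reads(object_size, block_size=BLOCK_SIZE):
    """Modelled backend fetches for one object read through Riak CS:
    one manifest fetch plus one fetch per data block."""
    blocks = max(1, math.ceil(object_size / block_size))
    return 1 + blocks


def plain_riak_reads(object_size):
    """A plain Riak GET is modelled as a single key lookup."""
    return 1


if __name__ == "__main__":
    # Compare a tiny value, a small value, and a multi-block object.
    for size in (512, 10 * 1024, 5 * 1024 * 1024):
        print(size, plain_riak_reads(size), cs_reads(size))
```

Under this model a small value costs two fetches through Riak CS versus
one through plain Riak, so the relative overhead is largest for exactly
the small, multi-key lookups asked about above.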


2013/7/30 Dan Kerrigan <dan.kerrigan at>

> Geert-Jan -
> We're currently working on a somewhat similar project to integrate Flume
> to ingest data into Riak CS for later processing using Hadoop.  The
> limitations of HDFS/S3, when using the s3:// or s3n:// URIs, seem to
> revolve around renaming objects (copy/delete) in Riak CS.  If you can avoid
> that, this link should work fine.
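
To illustrate the rename limitation mentioned above: the S3 API has no
rename operation, so a client has to copy each object to its new key and
then delete the original. The sketch below models that with a toy
in-memory store. FlatObjectStore and rename_prefix are hypothetical names
for illustration only, not part of Riak CS, Hadoop, or any S3 client.

```python
class FlatObjectStore:
    """Toy in-memory stand-in for an S3-style bucket (hypothetical)."""

    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0  # running total of data moved by copies

    def put(self, key, data):
        self.objects[key] = data

    def copy(self, src, dst):
        data = self.objects[src]
        self.objects[dst] = data
        self.bytes_copied += len(data)

    def delete(self, key):
        del self.objects[key]


def rename_prefix(store, old_prefix, new_prefix):
    """Emulate a directory rename: copy every key under old_prefix to the
    new prefix, then delete the original key."""
    for key in [k for k in store.objects if k.startswith(old_prefix)]:
        new_key = new_prefix + key[len(old_prefix):]
        store.copy(key, new_key)
        store.delete(key)


store = FlatObjectStore()
store.put("job/_temporary/part-00000", b"a" * 100)
store.put("job/_temporary/part-00001", b"b" * 200)
rename_prefix(store, "job/_temporary/", "job/")
print(sorted(store.objects))  # ['job/part-00000', 'job/part-00001']
print(store.bytes_copied)     # 300
```

The cost of the rename grows with the total size of the data being
moved, and the operation is not atomic, which is why jobs that commit
output by renaming a temporary directory are the awkward case.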
> Regarding how data is stored in Riak CS, the data block storage is Bitcask
> with manifest storage being held in LevelDB.  Riak CS is optimized for
> larger object sizes and I believe smaller object sizes would not be nearly
> as efficient as working with plain Riak if only because of the overhead
> incurred by Riak CS. The benefits of Riak generally carry over to Riak CS
> so there shouldn't be any need to worry about losing raw power.
> Respectfully -
> Dan Kerrigan
> On Tue, Jul 30, 2013 at 2:21 PM, gbrits <gbrits at> wrote:
>> This may be totally missing the mark, but I've been reading up on ways to
>> do fast iterative processing in Storm or Spark/Shark, with the ultimate
>> goal of results ending up in Riak for fast multi-key retrieval.
>> I want this setup to be as lean as possible for obvious reasons, so I've
>> started to look more closely at the possible Riak CS / Spark combo.
>> Apparently (please correct me if I'm wrong) Riak CS sits on top of Riak and
>> is S3-API compliant. The underlying db for the objects is LevelDB (which
>> would have been my choice anyway, because of the low in-memory key
>> overhead). Apparently Bitcask is also used, although it's not clear to me
>> what for exactly.
>> At the same time, Spark (with Shark on top, which is to Spark what Hive is
>> to Hadoop, if that in any way makes things clearer) can use HDFS or S3 as
>> its so-called 'deep store'.
>> Combining this, it seems Riak CS and Spark/Shark could be a nice, pretty
>> tight combo, providing interactive and ad hoc querying through Shark plus
>> all the excellent stuff of Riak, through the S3 protocol which they both
>> speak. Is this correct?
>> Would I lose any of the raw power of Riak when going with Riak CS? Has
>> anyone ever tried this combo?
>> Thanks,
>> Geert-Jan