Combining Riak (CS) and Spark/Shark by speaking the S3 protocol

Mark Hamstra markhamstra at
Tue Jul 30 18:45:31 EDT 2013

Others have certainly found benefits in combining Spark/Shark with a
Dynamo-style KV store.  With robust Hadoop Input/OutputFormats it's not too
difficult (e.g. see this <>), and it may be possible to do as you suggest
with the S3 API of Riak CS.  What may also be worth exploring is whether
Riak and Spark/Shark can rendezvous via
 That would be more of a research project right now, but it could end up
someplace interesting.
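For concreteness, the InputFormat route mentioned above might be sketched as follows. This is a hypothetical example, not working code: no `RiakInputFormat` ships with Riak or Spark, so the class name and the `riak.bucket` property are placeholders for an InputFormat you would have to write (one that splits a bucket's keyspace and fetches values over Riak's protobuf or HTTP interface). The Spark side, though, is the ordinary Hadoop-RDD API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{BytesWritable, Text}
import spark.SparkContext // Spark 0.7-era package; later versions use org.apache.spark

object RiakIngestSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "riak-ingest")

    // Hypothetical: a custom InputFormat that knows how to split a Riak
    // bucket into partitions and read key/value pairs from each vnode.
    val conf = new Configuration()
    conf.set("riak.bucket", "events") // hypothetical property name

    val kv = sc.newAPIHadoopRDD(conf,
      classOf[RiakInputFormat],       // hypothetical class, see above
      classOf[Text],                  // key type
      classOf[BytesWritable])         // value type

    // From here it's an ordinary RDD: map, filter, join, cache, etc.
    kv.map(_._1.toString).take(10).foreach(println)
  }
}
```

The appeal of this route over the S3 API is data locality: a purpose-built InputFormat could hand Spark partition/host hints, whereas the S3 protocol treats the store as a remote blob service.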

On Tue, Jul 30, 2013 at 1:24 PM, Dan Kerrigan <dan.kerrigan at> wrote:

> Geert-Jan -
> We're currently working on a somewhat similar project, integrating Flume
> to ingest data into Riak CS for later processing with Hadoop.  The
> limitations of HDFS/S3, when using the s3:// or s3n:// URIs, seem to
> revolve around renaming objects (a copy plus a delete) in Riak CS.  If
> you can avoid renames, this approach should work fine.
> Regarding how data is stored in Riak CS: the data blocks are kept in
> Bitcask, while the manifests are held in LevelDB.  Riak CS is optimized
> for larger objects, and I believe small objects would not be nearly as
> efficient as in plain Riak, if only because of the overhead Riak CS adds.
> The benefits of Riak generally carry over to Riak CS, so there shouldn't
> be any need to worry about losing raw power.
> Respectfully -
> Dan Kerrigan
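For anyone wanting to try the s3n:// route against Riak CS, the wiring is roughly as follows. This is a sketch, not a recipe: the hostname and port are placeholders, and exact property handling depends on your Hadoop and JetS3t versions. Hadoop's s3n:// filesystem is backed by the JetS3t library, which reads a `jets3t.properties` file from the classpath, so the endpoint override goes there rather than in Hadoop's own config:

```
# jets3t.properties (on the Hadoop/Spark classpath)
# Point JetS3t at the Riak CS endpoint instead of Amazon S3.
# Hostname and port below are placeholders for your deployment.
s3service.s3-endpoint=riak-cs.example.internal
s3service.s3-endpoint-http-port=8080
s3service.https-only=false
```

The Riak CS credentials then go in `core-site.xml` under the standard s3n properties:

```
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR-RIAK-CS-KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR-RIAK-CS-SECRET</value>
</property>
```

With that in place, `s3n://bucket/path` URIs should resolve against Riak CS — subject to the rename caveat above, since s3n emulates renames as copy-then-delete.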
> On Tue, Jul 30, 2013 at 2:21 PM, gbrits <gbrits at> wrote:
>> This may be totally missing the mark, but I've been reading up on ways
>> to do fast iterative processing in Storm or Spark/Shark, with the
>> ultimate goal of the results ending up in Riak for fast multi-key
>> retrieval.
>> I want this setup to be as lean as possible for obvious reasons, so I've
>> started to look more closely at a possible Riak CS / Spark combo.
>> Apparently (please correct me if I'm wrong) Riak CS sits on top of Riak
>> and is S3-API compliant. The underlying db for the objects is LevelDB
>> (which would have been my choice anyway, because of the low in-memory
>> key overhead). Apparently Bitcask is also used, although it's not clear
>> to me what for exactly.
>> At the same time, Spark (with Shark on top, which is to Spark what Hive
>> is to Hadoop, if that in any way makes things clearer) can use HDFS or
>> S3 as its so-called 'deep store'.
>> Putting this together, it seems Riak CS and Spark/Shark could be a
>> pretty tight combo, providing iterative and ad-hoc querying through
>> Shark plus all the excellent stuff of Riak, over the S3 protocol which
>> they both speak.
>> Is this correct?
>> Would I lose any of the raw power of Riak when going with Riak CS? Has
>> anyone ever tried this combo?
>> Thanks,
>> Geert-Jan
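To make the deep-store idea in the question concrete: Shark accepts HiveQL, so once the s3n:// filesystem has been configured with Riak CS credentials and endpoint, an external table over the object store looks just like one over S3 proper. The bucket, path, column names, and delimiter below are placeholders, and this assumes the s3n wiring is already in place:

```sql
-- Sketch of ad-hoc querying over Riak CS via Shark (HiveQL).
-- Bucket/path and schema are illustrative placeholders.
CREATE EXTERNAL TABLE events (
  user_id STRING,
  action  STRING,
  ts      BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://my-riak-cs-bucket/events/';

SELECT action, COUNT(*)
FROM events
GROUP BY action;
```

For the iterative part, Shark can also materialize a table into cluster memory and re-query it without going back to the deep store on every pass, which is where it pulls ahead of plain Hive over the same data.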
>> --
>> View this message in context:
>> Sent from the Riak Users mailing list archive at
>> _______________________________________________
>> riak-users mailing list
>> riak-users at

More information about the riak-users mailing list