Combining Riak (CS) and Spark/Shark by speaking the S3 protocol

Mark Hamstra markhamstra at gmail.com
Tue Jul 30 18:45:31 EDT 2013


Others have certainly found benefits in combining Spark/Shark with a
Dynamo-type KV store.  With robust Hadoop Input/OutputFormats it's not too
difficult (e.g. see
this <http://www.slideshare.net/EvanChan2/cassandra2013-spark-talk-final> and
this <http://tuplejump.github.io/calliope/>), and it may be possible to do
as you suggest with the S3 API of Riak CS.  What may also be worth
exploring is whether Riak and Spark/Shark can rendezvous via
Tachyon <https://github.com/amplab/tachyon/wiki>.
That would be more of a research project right now, but it could end up
someplace interesting.
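For the S3 route, pointing Hadoop's s3n:// filesystem at a Riak CS endpoint is mostly configuration.  A rough sketch, with the understanding that the hostname, port, and keys below are placeholders for your own deployment, not defaults to rely on: the credentials go in core-site.xml, and since s3n is built on JetS3t, the endpoint override goes in a jets3t.properties file on the classpath.

```xml
<!-- core-site.xml: placeholder Riak CS credentials -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR-RIAK-CS-ACCESS-KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR-RIAK-CS-SECRET-KEY</value>
</property>
```

```properties
# jets3t.properties: point JetS3t at the Riak CS listener instead of AWS
s3service.s3-endpoint=riak-cs.example.com
s3service.s3-endpoint-http-port=8080
s3service.https-only=false
```

With that in place, s3n://bucket/path URIs from Spark should resolve against Riak CS rather than Amazon.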


On Tue, Jul 30, 2013 at 1:24 PM, Dan Kerrigan <dan.kerrigan at gmail.com> wrote:

> Geert-Jan -
>
> We're currently working on a somewhat similar project to integrate Flume
> to ingest data into Riak CS for later processing using Hadoop.  The
> limitations of HDFS/S3, when using the s3:// or s3n:// URIs, seem to
> revolve around renaming objects (copy/delete) in Riak CS.  If you can
> avoid renames, this approach should work fine.
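One way to sidestep that rename limitation is to write output objects directly over the S3 protocol rather than relying on a filesystem-level commit/rename.  A minimal, stdlib-only sketch of the classic S3 signature-v2 request signing that the S3 REST API (and hence Riak CS) uses; the bucket, key, credentials, and date here are made-up placeholders, not anything specific to Riak CS:

```python
# Sketch: building the headers for a direct S3-style PUT, avoiding the
# copy/delete rename step that HDFS-on-s3n performs.  All identifiers
# below (keys, bucket, object key) are placeholders.
import base64
import hashlib
import hmac


def sign_s3_request(secret_key, verb, content_md5, content_type,
                    date, canonical_resource):
    """Compute an AWS signature-v2 value for a single S3 REST request."""
    string_to_sign = "\n".join(
        [verb, content_md5, content_type, date, canonical_resource])
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def put_headers(access_key, secret_key, bucket, key, body,
                content_type="application/octet-stream",
                date="Tue, 30 Jul 2013 18:45:31 GMT"):
    """Build the HTTP headers for a PUT of `body` to /bucket/key."""
    content_md5 = base64.b64encode(
        hashlib.md5(body).digest()).decode("ascii")
    resource = "/%s/%s" % (bucket, key)
    signature = sign_s3_request(secret_key, "PUT", content_md5,
                                content_type, date, resource)
    return {
        "Date": date,
        "Content-MD5": content_md5,
        "Content-Type": content_type,
        "Authorization": "AWS %s:%s" % (access_key, signature),
    }


headers = put_headers("EXAMPLEKEY", "examplesecret",
                      "spark-output", "part-00000", b"result data")
```

Sending these headers with the body in a plain HTTP PUT to the Riak CS endpoint writes the object in one step, no rename needed.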
>
> Regarding how data is stored in Riak CS: data blocks are stored in Bitcask,
> with manifests held in LevelDB.  Riak CS is optimized for
> larger object sizes and I believe smaller object sizes would not be nearly
> as efficient as working with plain Riak if only because of the overhead
> incurred by Riak CS. The benefits of Riak generally carry over to Riak CS
> so there shouldn't be any need to worry about losing raw power.
>
> Respectfully -
> Dan Kerrigan
>
>
> On Tue, Jul 30, 2013 at 2:21 PM, gbrits <gbrits at gmail.com> wrote:
>
>> This may be totally missing the mark, but I've been reading up on ways
>> to do fast iterative processing in Storm or Spark/Shark, with the
>> ultimate goal of results ending up in Riak for fast multi-key retrieval.
>>
>> I want this setup to be as lean as possible for obvious reasons so I've
>> started to look more closely at the possible Riak CS / Spark combo.
>>
>> Apparently, please correct me if I'm wrong, Riak CS sits on top of Riak
>> and is S3-API compliant.  The underlying DB for the objects is LevelDB
>> (which would have been my choice anyway, because of the low in-memory
>> key overhead).  Apparently Bitcask is also used, although it's not clear
>> to me what for exactly.
>>
>> At the same time, Spark (with Shark on top, which is to Spark what Hive
>> is to Hadoop, if that in any way makes things clearer) can use HDFS or
>> S3 as its so-called 'deep store'.
>>
>> Combining these, it seems Riak CS and Spark/Shark could be a pretty
>> tight combo, providing iterative and ad hoc querying through Shark plus
>> all the excellent stuff of Riak, through the S3 protocol which they both
>> speak.
>>
>> Is this correct?
>> Would I lose any of the raw power of Riak when going with Riak CS? Has
>> anyone ever tried this combo?
>>
>> Thanks,
>> Geert-Jan
>>
>>
>>
>> --
>> View this message in context:
>> http://riak-users.197444.n3.nabble.com/combining-Riak-CS-and-Spark-shark-by-speaking-over-s3-protocol-tp4028621.html
>> Sent from the Riak Users mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>
>