combining Riak (CS) and Spark/shark by speaking over s3 protocol

gbrits gbrits at
Wed Jul 31 09:46:03 EDT 2013

I appreciate the clarification. The object size for the dataset I'm currently
investigating is exactly 8 KB, so that would mean a lot of overhead when
going the Riak CS route.
Upfront I had already figured that Riak directly was the more efficient way to
go, but getting a nice Riak + Spark/Shark integration going (through S3) is
worth a lot to me as well. Some experimenting to do, I guess :)
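To make the overhead concrete, here's a quick back-of-envelope sketch. It only assumes what Dan stated below (Riak CS splits object data into 1 MB Bitcask-stored blocks, with a separate manifest in LevelDB); the key counts are internal-record estimates, not real capacity numbers — plug your actual key/value sizes into the Bitcask calculator for those:

```python
# Rough per-object record-count comparison: plain Riak vs. Riak CS.
# Assumption (from Dan's note): Riak CS stores object data as 1 MB
# blocks in Bitcask, plus one manifest entry held in LevelDB.

CS_BLOCK_SIZE = 1024 * 1024  # Riak CS data block size: 1 MB

def cs_data_blocks(object_size, block_size=CS_BLOCK_SIZE):
    """How many Bitcask data blocks Riak CS creates for one object."""
    return max(1, -(-object_size // block_size))  # ceiling division

# Dan's example: a 100 MB object becomes 100 Bitcask-stored blocks.
print(cs_data_blocks(100 * 1024 * 1024))  # 100

# An 8 KB object still costs one data block plus one manifest entry,
# i.e. at least two internal records where plain Riak needs only one.
print(cs_data_blocks(8 * 1024) + 1)  # 2
```

So for a dataset of uniformly 8 KB objects, the record count (and the per-key bookkeeping that goes with it) roughly doubles under Riak CS before any API-level overhead is counted.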


2013/7/31 Dan Kerrigan [via Riak Users]:

> Geert-Jan -
> Riak CS currently doesn't support the S3 Copy command.  Flume and Hadoop
> distcp create a temporary object and then attempt to Copy that object to
> its permanent location.  Rename is a Copy then a Delete, since the S3 API
> doesn't support Rename.
> Regarding efficiency, Riak CS block sizes are 1 MB (a 100 MB object becomes
> 100 Bitcask-stored Riak objects), so you can use the Bitcask calculator at
> [0] to get a rough estimate of the requirements for storing your particular
> dataset.  Regarding the impact on read latency, severe is probably not the
> right word, but there is an impact.  Besides API support, your decision
> will, in part, come down to how large your object sizes are going to be.
> The Riak FAQ [1] currently suggests that Riak object sizes should be less
> than 10 MB.  Riak CS, on the other hand, can handle object sizes up to
> 5 TB.  If you are doing multi-key retrievals of lots of small objects,
> Riak looks like the right choice; otherwise, go with Riak CS.  Some basic
> testing would go a long way toward finding the balance in your case.
> Respectfully -
> Dan Kerrigan
> [0]
> [1]
> On Wed, Jul 31, 2013 at 4:43 AM, Geert-Jan Brits <[hidden email]> wrote:
>> Dan,
>> Not sure I understand the "renaming objects" problem in Riak CS. Can
>> you elaborate?
>> "I believe smaller object sizes would not be nearly as efficient as
>> working with plain Riak if only because of the overhead incurred by Riak
>> CS".  Does this mean a lack of efficiency in disk storage, in memory, or
>> both?
>> Moreover, I'm having this nagging thought that having to dig through the
>> manifest to find the blocks will severely impact read latency for
>> (multi-key) lookups as opposed to a normal Bitcask / LevelDB lookup.  Is
>> this correct?
>> Best,
>> Geert-Jan
>> 2013/7/30 Dan Kerrigan <[hidden email]>
>>> Geert-Jan -
>>> We're currently working on a somewhat similar project, integrating Flume
>>> to ingest data into Riak CS for later processing with Hadoop.  The
>>> limitations of HDFS/S3, when using the s3:// or s3n:// URIs, seem to
>>> revolve around renaming objects (copy + delete) in Riak CS.  If you can
>>> avoid that, this setup should work fine.
>>> Regarding how data is stored in Riak CS: the data blocks are stored in
>>> Bitcask, with the manifests held in LevelDB.  Riak CS is optimized for
>>> larger object sizes, and I believe smaller object sizes would not be
>>> nearly as efficient as working with plain Riak, if only because of the
>>> overhead incurred by Riak CS.  The benefits of Riak generally carry over
>>> to Riak CS, so there shouldn't be any need to worry about losing raw
>>> power.
>>> Respectfully -
>>> Dan Kerrigan
>>> On Tue, Jul 30, 2013 at 2:21 PM, gbrits <[hidden email]> wrote:
>>>> This may be totally missing the mark, but I've been reading up on ways
>>>> to do fast iterative processing in Storm or Spark/Shark, with the
>>>> ultimate goal of the results ending up in Riak for fast multi-key
>>>> retrieval.
>>>> I want this setup to be as lean as possible, for obvious reasons, so
>>>> I've started to look more closely at a possible Riak CS / Spark combo.
>>>> Apparently (please correct me if wrong) Riak CS sits on top of Riak and
>>>> is S3-API compliant.  The underlying db for the objects is LevelDB,
>>>> which would have been my choice anyway because of the low in-memory key
>>>> overhead.  Apparently Bitcask is also used, although it's not clear to
>>>> me what for exactly.
>>>> At the same time, Spark (with Shark on top, which is to Spark what Hive
>>>> is to Hadoop, if that in any way makes things clearer) can use HDFS or
>>>> S3 as its so-called 'deep store'.
>>>> Combining this, it seems Riak CS and Spark/Shark could be a pretty
>>>> tight combo, providing iterative and ad-hoc querying through Shark plus
>>>> all the excellent stuff of Riak, over the S3 protocol which they both
>>>> speak.
>>>> Is this correct?
>>>> Would I lose any of the raw power of Riak when going with Riak CS?
>>>> Anyone ever tried this combo?
>>>> Thanks,
>>>> Geert-Jan

Sent from the Riak Users mailing list archive at

More information about the riak-users mailing list