Combining Riak (CS) and Spark/Shark by speaking over the S3 protocol

gbrits gbrits at gmail.com
Wed Jul 31 09:46:03 EDT 2013


I appreciate the clarification. Object size for the dataset I'm currently
investigating is exactly 8KB, so going the Riak CS route would mean a lot
of overhead.
I already figured upfront that going with Riak directly was the more
efficient way, but getting a nice Riak + Spark/Shark integration going
(through S3) is worth a lot to me as well. Some experimenting to do, I
guess :)
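
For my own sanity, the rough key-count arithmetic behind that overhead worry,
assuming I've understood the CS design right (each object becomes a manifest
plus ceil(size / 1MB) blocks; the dataset size is made up):

    import math

    num_objects = 100 * 1000 * 1000               # made-up dataset size
    object_size = 8 * 1024                        # 8KB, exact
    block_size  = 1024 * 1024                     # Riak CS default block size

    plain_riak_keys = num_objects
    riak_cs_keys = num_objects * (1 + int(math.ceil(object_size / float(block_size))))

    print("plain Riak: %d keys, Riak CS: %d keys" % (plain_riak_keys, riak_cs_keys))
    # so for 8KB objects Riak CS roughly doubles the number of Riak keys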

Thanks,
Geert-Jan


2013/7/31 Dan Kerrigan [via Riak Users]:

> Geert-Jan
>
> Riak CS currently doesn't support the S3 Copy command.  Flume and Hadoop
> distcp create a temporary object and then attempt to copy that object to
> its permanent location.  A rename is a copy followed by a delete, since
> the S3 API doesn't support rename.
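>
> For illustration, a "rename" from an S3 client is really the copy-then-delete
> below; the copy step is the call Riak CS currently rejects.  A minimal boto 2
> sketch, with made-up bucket/key names:
>
>     import boto
>
>     conn = boto.connect_s3()                  # or a connection pointed at Riak CS
>     bucket = conn.get_bucket('my-bucket')
>
>     # "rename" tmp/part-0000 -> final/part-0000
>     bucket.copy_key('final/part-0000', 'my-bucket', 'tmp/part-0000')  # S3 PUT Copy
>     bucket.delete_key('tmp/part-0000')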
>
> Regarding efficiency, Riak CS block sizes are 1MB (a 100 MB object becomes
> 100 Bitcask-stored Riak objects), so you can use the Bitcask calculator at
> [0] to get a rough estimate of the requirements to store your particular
> dataset.  Regarding the impact on read latency, "severe" is probably not
> the right word, but there is an impact.  Besides API support, your decision
> will, in part, come down to how large your objects are going to be.  The
> Riak FAQ [1] currently suggests that Riak object sizes should be less than
> 10MB.  Riak CS, on the other hand, can handle object sizes up to 5TB.  If
> you are doing multi-key retrieves for lots of small objects, Riak looks
> like the right choice; otherwise, go with Riak CS.  Some basic testing
> would go a long way toward finding the balance in your case.
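>
> As a starting point before plugging real numbers into that calculator, the
> keydir estimate is basically the arithmetic below (every constant here is a
> placeholder; [0] has the real per-key overhead for your Riak version):
>
>     num_objects   = 100 * 1000 * 1000   # made-up dataset size
>     n_val         = 3                   # default replication factor
>     avg_key_size  = 36                  # bytes of bucket+key, made up
>     per_key_bytes = 40                  # placeholder static keydir overhead per key
>     num_nodes     = 5                   # made-up cluster size
>
>     # note: via Riak CS each object would be roughly two keys (manifest + block)
>     total_bytes = num_objects * n_val * (per_key_bytes + avg_key_size)
>     print("%.1f GB of keydir RAM per node"
>           % (total_bytes / float(num_nodes) / 1024 ** 3))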
>
> Respectfully -
> Dan Kerrigan
>
> [0] http://docs.basho.com/riak/latest/ops/building/planning/bitcask/
> [1]
> http://docs.basho.com/riak/latest/community/faqs/developing/#is-there-a-limit-on-the-file-size-that-can-be-stor
>
>
>
> On Wed, Jul 31, 2013 at 4:43 AM, Geert-Jan Brits wrote:
>
>> Dan,
>>
>> Not sure if I understand the "renaming objects" problem in Riak CS. Can
>> you elaborate?
>>
>> "I believe smaller object sizes would not be nearly as efficient as
>> working with plain Riak if only because of the overhead incurred by Riak
>> CS". Does this mean a lack of efficiency in disk storage, in memory, or
>> both? Moreover, I have this nagging thought that having to dig through
>> the manifest to find the blocks will severely impact read latency for
>> (multi-key) lookups, as opposed to a normal Bitcask/LevelDB lookup.  Is
>> this correct?
>>
>> Best,
>> Geert-Jan
>>
>>
>> 2013/7/30 Dan Kerrigan:
>>
>>> Geert-Jan -
>>>
>>> We're currently working on a somewhat similar project, integrating Flume
>>> to ingest data into Riak CS for later processing with Hadoop.  The
>>> limitations of HDFS/S3, when using the s3:// or s3n:// URIs, seem to
>>> revolve around renaming objects (copy/delete) in Riak CS.  If you can
>>> avoid that, this kind of link should work fine.
>>>
>>> Regarding how data is stored in Riak CS: data blocks are stored in
>>> Bitcask, with the manifests held in LevelDB.  Riak CS is optimized for
>>> larger object sizes, and I believe smaller objects would not be nearly as
>>> efficient as with plain Riak, if only because of the overhead incurred by
>>> Riak CS.  The benefits of Riak generally carry over to Riak CS, so there
>>> shouldn't be any need to worry about losing raw power.
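>>>
>>> Conceptually, the read path looks roughly like the sketch below -- this is
>>> not real Riak CS internals, just pseudocode with made-up bucket and helper
>>> names to show where the extra hop comes from:
>>>
>>>     def plain_riak_get(client, bucket, key):
>>>         return client.get(bucket, key)               # one lookup
>>>
>>>     def riak_cs_get(client, key):
>>>         manifest = client.get('cs-manifests', key)   # LevelDB-backed manifest
>>>         blocks = [client.get('cs-blocks', block_id)  # one Bitcask read per ~1MB block
>>>                   for block_id in manifest['block_ids']]
>>>         return b''.join(blocks)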
>>>
>>> Respectfully -
>>> Dan Kerrigan
>>>
>>>
>>> On Tue, Jul 30, 2013 at 2:21 PM, gbrits wrote:
>>>
>>>> This may be totally missing the mark, but I've been reading up on ways
>>>> to do fast iterative processing in Storm or Spark/Shark, with the
>>>> ultimate goal of results ending up in Riak for fast multi-key retrieval.
>>>>
>>>> I want this setup to be as lean as possible for obvious reasons, so I've
>>>> started to look more closely at a possible Riak CS / Spark combo.
>>>>
>>>> Apparently (please correct me if I'm wrong) Riak CS sits on top of Riak
>>>> and is S3-API compliant.  The underlying store for the objects is
>>>> LevelDB (which would have been my choice anyway, because of the low
>>>> in-memory key overhead).  Apparently Bitcask is also used, although it's
>>>> not clear to me what for exactly.
>>>>
>>>> At the same time, Spark (with Shark on top, which is to Spark what Hive
>>>> is to Hadoop, if that in any way makes things clearer) can use HDFS or
>>>> S3 as its so-called 'deep store'.
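>>>>
>>>> Something like the PySpark snippet below is what I have in mind for the
>>>> deep-store side (keys and bucket are made up; I believe pointing s3n://
>>>> at a non-AWS endpoint also needs a jets3t.properties with
>>>> s3service.s3-endpoint set to the Riak CS host, but I haven't verified that):
>>>>
>>>>     from pyspark import SparkContext
>>>>
>>>>     sc = SparkContext("local[2]", "riak-cs-s3-test")
>>>>     hconf = sc._jsc.hadoopConfiguration()   # private API, but the usual trick
>>>>     hconf.set("fs.s3n.awsAccessKeyId", "CS-ACCESS-KEY")
>>>>     hconf.set("fs.s3n.awsSecretAccessKey", "CS-SECRET-KEY")
>>>>
>>>>     lines = sc.textFile("s3n://my-bucket/events/")
>>>>     print(lines.count())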
>>>>
>>>> Combining these, it seems Riak CS and Spark/Shark could be a pretty
>>>> tight combo, providing iterative and ad-hoc querying through Shark plus
>>>> all the excellent stuff of Riak, over the S3 protocol which they both
>>>> speak.
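>>>>
>>>> And for the "they both speak S3" part, this is the kind of thing I mean:
>>>> a stock Python S3 client (boto) pointed at a Riak CS node instead of AWS.
>>>> Host, port and keys below are made up:
>>>>
>>>>     import boto
>>>>     from boto.s3.connection import OrdinaryCallingFormat
>>>>
>>>>     conn = boto.connect_s3(
>>>>         aws_access_key_id="CS-ACCESS-KEY",
>>>>         aws_secret_access_key="CS-SECRET-KEY",
>>>>         host="riak-cs.local", port=8080, is_secure=False,
>>>>         calling_format=OrdinaryCallingFormat(),
>>>>     )
>>>>     bucket = conn.get_bucket("my-bucket")
>>>>     print([k.name for k in bucket.list(prefix="events/")])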
>>>>
>>>> Is this correct?
>>>> Would I lose any of the raw power of Riak when going with Riak CS?
>>>> Has anyone ever tried this combo?
>>>>
>>>> Thanks,
>>>> Geert-Jan
>>>>
>>>>
>>>>
>>>
>>>
>>
>



