combining Riak (CS) and Spark/shark by speaking over s3 protocol

Dan Kerrigan dan.kerrigan at gmail.com
Wed Jul 31 08:32:28 EDT 2013


Geert-Jan

Riak CS currently doesn't support the S3 Copy command.  Flume and Hadoop
distcp create a temporary object and then attempt to Copy that object to
its permanent location; a Rename is implemented as a Copy followed by a
Delete, since the S3 API doesn't support Rename.
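
To make that concrete, here is a rough sketch of the two S3 calls a
"rename" turns into, written with Python's boto library (the endpoint,
credentials, bucket and key names below are made-up examples, not anything
from this thread); it is the server-side Copy step that Riak CS rejects:

    import boto
    from boto.s3.connection import S3Connection, OrdinaryCallingFormat

    # Hypothetical Riak CS endpoint and credentials
    conn = S3Connection('ACCESS_KEY', 'SECRET_KEY',
                        host='riak-cs.example.com', port=8080,
                        is_secure=False,
                        calling_format=OrdinaryCallingFormat())
    bucket = conn.get_bucket('mybucket')

    # A "rename" over the S3 API: PUT Copy (x-amz-copy-source) to the
    # new key, then DELETE the old key.
    bucket.copy_key('final/part-00000', 'mybucket', 'tmp/part-00000')
    bucket.delete_key('tmp/part-00000')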

Regarding efficiency, Riak CS block sizes are 1 MB (a 100 MB object is
stored as 100 Bitcask-backed Riak objects), so you can use the Bitcask
calculator at [0] to get a rough estimate of the requirements for storing
your particular dataset. Regarding the impact on read latency, "severe" is
probably not the right word, but there is an impact.  Besides API support,
your decision will, in part, come down to how large your objects are going
to be.  The Riak FAQ [1] currently suggests that Riak object sizes should
be less than 10 MB; Riak CS, on the other hand, can handle objects up to
5 TB.  If you are doing multi-key retrieval of lots of small objects, Riak
looks like the right choice; otherwise, go with Riak CS.  Some basic
testing would go a long way toward finding the right balance in your case.
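
If it helps, a tiny back-of-the-envelope sketch of the kind of numbers to
feed into that calculator (the object size and object count below are
invented examples; only the 1 MB block size comes from the paragraph
above):

    import math

    block_size_mb = 1        # Riak CS block size
    object_size_mb = 100     # assumed average object size (example only)
    num_objects = 50000      # assumed dataset size (example only)

    blocks_per_object = math.ceil(object_size_mb / block_size_mb)  # 100
    total_block_keys = blocks_per_object * num_objects             # 5,000,000

    print("blocks per object:", blocks_per_object)
    print("total block keys:", total_block_keys)

Each of those block keys is what ends up in Bitcask, so total_block_keys
is roughly the key count to plug into the calculator at [0].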

Respectfully -
Dan Kerrigan

[0] http://docs.basho.com/riak/latest/ops/building/planning/bitcask/
[1]
http://docs.basho.com/riak/latest/community/faqs/developing/#is-there-a-limit-on-the-file-size-that-can-be-stor



On Wed, Jul 31, 2013 at 4:43 AM, Geert-Jan Brits <gbrits at gmail.com> wrote:

> Dan,
>
> Not sure if I understand the "renaming objects"-problem in Riak CS. Can
> you elaborate?
>
>  "I believe smaller object sizes would not be nearly as efficient as
> working with plain Riak if only because of the overhead incurred by Riak
> CS". Does this mean lack of efficiency in disk storage, in-mem or both?
> Moreover I'm having this nagging thought that having to dig through the
> manifest to find the blocks will severely impact read latency for
> (multi-key) lookups as opposed to the normal bitcask / levelDB lookup.  Is
> this correct?
>
> Best,
> Geert-Jan
>
>
> 2013/7/30 Dan Kerrigan <dan.kerrigan at gmail.com>
>
>> Geert-Jan -
>>
>> We're currently working on a somewhat similar project, integrating Flume
>> to ingest data into Riak CS for later processing with Hadoop.  The
>> limitations of HDFS/S3, when using the s3:// or s3n:// URIs, seem to
>> revolve around renaming objects (copy/delete) in Riak CS.  If you can
>> avoid that, this link should work fine.
>>
>> Regarding how data is stored in Riak CS, data blocks are stored in
>> Bitcask, with the manifests held in LevelDB.  Riak CS is optimized for
>> larger object sizes, and I believe smaller objects would not be nearly as
>> efficient as working with plain Riak, if only because of the overhead
>> incurred by Riak CS.  The benefits of Riak generally carry over to Riak
>> CS, so there shouldn't be any need to worry about losing raw power.
>>
>> Respectfully -
>> Dan Kerrigan
>>
>>
>> On Tue, Jul 30, 2013 at 2:21 PM, gbrits <gbrits at gmail.com> wrote:
>>
>>> This may be totally missing the mark, but I've been reading up on ways
>>> to do fast iterative processing in Storm or Spark/Shark, with the
>>> ultimate goal of results ending up in Riak for fast multi-key retrieval.
>>>
>>> I want this setup to be as lean as possible for obvious reasons so I've
>>> started to look more closely at the possible Riak CS / Spark combo.
>>>
>>> Apparently (please correct me if I'm wrong) Riak CS sits on top of Riak
>>> and is S3-API compliant. The underlying db for the objects is LevelDB
>>> (which would have been my choice anyway, because of the low in-memory
>>> key overhead). Apparently Bitcask is also used, although it's not clear
>>> to me what for exactly.
>>>
>>> At the same time, Spark (with Shark on top, which is to Spark what Hive
>>> is to Hadoop, if that in any way makes things clearer) can use HDFS or
>>> S3 as its so-called 'deep store'.
>>>
>>> Combining this, it seems Riak CS and Spark/Shark could be a nice, pretty
>>> tight combo, providing iterative and ad-hoc querying through Shark plus
>>> all the excellent stuff of Riak, over the S3 protocol which they both
>>> speak.
>>>
>>> Is this correct?
>>> Would I lose any of the raw power of Riak when going with Riak CS? Has
>>> anyone ever tried this combo?
>>>
>>> Thanks,
>>> Geert-Jan
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://riak-users.197444.n3.nabble.com/combining-Riak-CS-and-Spark-shark-by-speaking-over-s3-protocol-tp4028621.html
>>> Sent from the Riak Users mailing list archive at Nabble.com.
>>>
>>
>>
>