newbie questions and comments [NT]

Ian Plosker ian at basho.com
Fri Aug 26 16:15:32 EDT 2011


Joel,

Let me see if I can help you out. In a previous role I worked on a bioinformatics platform processing reads from next-gen sequencers; I didn't use Riak there, but hopefully that experience is relevant.

> First, a word about my application and its data flow.  I'm doing
> bioinformatics for data streams that employ next-generation
> sequencing.  A typical data set for me consists of 35 million 100-base
> records (reads), and I work with about 100 of them (~3 billion
> records) at any given time.  If an entire sequencing center were run
> through my application, and if each read were a bucket that would
> amount to about a terabyte of data each day and half a billion buckets
> per hour.  I've been developing analysis methods that can scale to
> this size, taking web search technology (lucene, so far) as the basis.
> Indexing has worked so well that I now have to take the database
> portion more seriously.  I care a lot about scaling and flow, not
> so much about reliability, and very little about downtime.

First off, I have a few questions. Are you planning to use Riak Search for Lucene-style queries? If so, are you planning to implement a custom extractor? And if you are, what will you be extracting (n-grams? of what length? with what sensitivity?)? Note that Riak Search indexes are term-partitioned, so you'll have to be careful with common terms; they can result in hotspots.
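To make the extractor question concrete, here's a purely illustrative Python sketch of k-mer (n-gram) extraction from a read. This is not the Riak Search extractor API, just a picture of why term choice matters in a term-partitioned index:

    # Illustrative only: what an n-gram/k-mer "extractor" would produce from a read.
    def kmers(read, k):
        """Yield every k-length substring (k-mer) of a read."""
        for i in range(len(read) - k + 1):
            yield read[i:i + k]

    print(sorted(set(kmers("ACGTACGTAC", 3))))
    # With a 4-letter alphabet there are only 4**3 = 64 possible 3-mers, so short
    # k-mers behave like "common words": a handful of terms dominate the index
    # and, under term partitioning, pile up on a handful of partitions (hotspots).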

> Questions about riak:
> 
> 1.  In terms of number of buckets, what's the largest riak database
> that the list has seen? How were the individual servers sized?

There's no limit to the number of buckets. The only caveat is that buckets with non-default configurations have their configurations gossiped around the ring. If you have a lot of buckets with custom configurations, the gossiped data can get quite large, which can be a problem. Also be aware that there's nothing special about buckets; they're simply namespaces with optional configuration.
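A bucket only picks up configuration of its own when you give it non-default properties. A rough sketch of doing that over the HTTP API (Python requests against a local node on the default port 8098; the bucket name is made up):

    # Sketch: set a non-default n_val on a bucket via the HTTP interface.
    import json, requests

    props = {"props": {"n_val": 2}}
    r = requests.put("http://127.0.0.1:8098/riak/reads",
                     data=json.dumps(props),
                     headers={"Content-Type": "application/json"})
    print(r.status_code)   # 204 indicates the properties were stored

Every bucket configured this way adds to the gossiped state, which is why thousands of custom-configured buckets can hurt.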

> 2. Expiring and deleting data is a critical function for me.  The
> current docs say "there is no straightforward way of deleting a
> bucket".  Does anybody had experience with large-scale deletions?
> Would deletion just leave a compaction issue anyway?

You're right that deletion of entire buckets is non-trivial.
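The usual workaround is to list the bucket's keys and delete them one at a time. A hedged sketch over the HTTP API (local node, made-up bucket name; keys containing characters that need URL-escaping would require quoting):

    # Sketch of the common "delete a bucket" workaround: list keys, delete each.
    # Listing keys walks the entire keyspace, so this is expensive at scale.
    import requests

    base = "http://127.0.0.1:8098/riak/reads"
    keys = requests.get(base, params={"keys": "true"}).json()["keys"]
    for k in keys:
        requests.delete("%s/%s" % (base, k))

And yes, deletes don't free disk space immediately; reclaiming it is left to the backend (e.g. bitcask's merge), so large-scale deletion does turn into a compaction question.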

> 4. How can I estimate the memory footprint per key when using bitcask
> for, say, a 16-byte bucket and 2-byte key IDs?

Take a look at this: https://help.basho.com/entries/335865-if-the-size-of-key-index-exceeds-the-amount-of-memory-how-does-bitcask-handle-it
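The gist of that article is a back-of-the-envelope formula: keydir RAM is roughly (per-entry overhead + bucket name length + key length) x number of keys x n_val. The per-entry overhead constant below is an assumption of mine; check the capacity-planning material for the exact figure for your bitcask version and architecture:

    # Rough bitcask keydir sizing. PER_ENTRY_OVERHEAD is an assumed ballpark
    # figure, not an authoritative constant.
    PER_ENTRY_OVERHEAD = 40     # bytes of keydir bookkeeping per entry (assumed)
    bucket_bytes = 16           # 16-byte bucket name, from your question
    key_bytes = 2               # 2-byte key IDs, from your question
    n_val = 3                   # default replication factor
    total_keys = 3 * 10**9      # ~3 billion reads in flight

    ram = (PER_ENTRY_OVERHEAD + bucket_bytes + key_bytes) * total_keys * n_val
    print(ram / float(2**30), "GiB of keydir RAM across the whole cluster")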

> Comments:
> 
> 1. Key filters look great, do more of that.  I'd love to see javascript
>   filters.  It would also be great to have a dynamic component to key
>   filtering by putting a limited amount (say, 64 bits) of information
>   from each bucket into memory that could be accessed by the key filters,
>   even if the consistency of the information weren't guaranteed.

> 4. Bucket and key listing via the HTTP interface hang for me on 0.14.2
>   under erlang 13, even with only 2 or zero buckets.  This does not
>   inspire confidence.

Bucket listing, key listing, and key filters must scan the entire key space. Their speed isn't dependent on the number of buckets, but rather on the number of keys. Also, note that shipping data from the Erlang VM to the JS VM and back is very expensive, so JavaScript key filters would be slow.
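For reference, key filters ride along as part of a MapReduce job's inputs. A hedged sketch against the HTTP /mapred endpoint (the bucket name and filter are made up; the map phase uses one of the built-in JavaScript functions):

    # Sketch: a MapReduce job whose inputs are narrowed by a key filter.
    import json, requests

    job = {
        "inputs": {
            "bucket": "reads",
            "key_filters": [["ends_with", "_chr1"]],
        },
        "query": [
            {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}
        ],
    }
    r = requests.post("http://127.0.0.1:8098/mapred",
                      data=json.dumps(job),
                      headers={"Content-Type": "application/json"})
    print(r.json())
    # The filter still has to walk the full key listing; it only avoids fetching
    # and mapping the objects whose keys don't match.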

> 
> 2. Secondary indices also look good for my use case.  Features that I'd
>   like to see: ability to plug in user-specified analyzers, indexer
>   support for integer values, ability to read the index directly (for
>   histogramming and linking), ability to delete an index.

I can't go into too much detail about secondary indices (2i), but it's on the user to pull out the data that goes into the index. 2i indexes metadata fields that are prefixed with x-riak-index- and suffixed with the data type, either _bin or _int (warning: this is subject to change).
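As a concrete (and, again, subject-to-change) example, attaching index metadata to an object over HTTP looks roughly like this; the bucket, key, and field names are invented for illustration:

    # Sketch: storing an object with secondary-index metadata via the HTTP API.
    # The x-riak-index-<field>_{bin,int} header convention is as described above.
    import requests

    headers = {
        "Content-Type": "application/json",
        "x-riak-index-sample_bin": "run42",   # binary (string) index field
        "x-riak-index-length_int": "100",     # integer index field
    }
    requests.put("http://127.0.0.1:8098/riak/reads/read0001",
                 data='{"seq": "ACGT..."}',
                 headers=headers)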

> 5. I was able to find patches for 0.14.2 to run under erlang 14 that
>   enabled riak to compile, but it had run-time errors related to
>   NIFs when I ran the python tests, perhaps because I made the
>   NIF compat header files all be the same.  /riak/tests worked,
>   but I gave up on 0.14.2.

If you must run Riak under the latest Erlang release, you can use the master branch on GitHub. There's no particular reason to run the 0.14 series on R14, though; it was designed and tested under R13.

> 7. Building: "make" and "make install" > "make rel", IMHO.

"make install" implies that the application will be installed into /usr/local or wherever --prefix points. "make rel" simply builds a self-contained Riak release; this is the common convention for Erlang applications.

> 9. Docs: Depending on which page I stumbled across first, it seemed that
>   I should start by either doing "make rel" or "make devrel".  The latter
>   starts riak at a different port than the former and I wasted
>   some time before I found the correct port. It would be good to
>   point the new user to a riak browser so that they can see a different
>   view of the changes they are making than the one they get from reading
>   curl output.

make devrel (development release) is used for building test clusters on a single machine, hence the non-default ports. make rel builds a normal release of Riak.

Ian Plosker
Developer Advocate
Basho Technologies
