Joel R. Berendzen
joelb at lanl.gov
Fri Aug 26 15:09:50 EDT 2011
I've been reading about riak and started testing it, and I have some
questions. Apologies in advance for length.
First, a word about my application and its data flow. I'm doing
bioinformatics for data streams that employ next-generation
sequencing. A typical data set for me consists of 35 million 100-base
records (reads), and I work with about 100 of them (~3 billion
records) at any given time. If an entire sequencing center were run
through my application, and if each read were a bucket, that would
amount to about a terabyte of data each day and half a billion buckets
per hour. I've been developing analysis methods that can scale to
this size, taking web search technology (lucene, so far) as the basis.
Indexing has worked so well that I now have to take the database
portion more seriously. I care a lot about scaling and flow, not
so much about reliability, and very little about downtime.
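
A rough sanity check of the figures above, as a minimal sketch. It assumes
one byte per base and no per-record overhead (headers, quality scores);
both are simplifications that the post itself does not state.

```python
# Figures quoted in the post.
READ_LENGTH_BASES = 100
READS_PER_DATASET = 35_000_000
DATASETS_IN_FLIGHT = 100
READS_PER_HOUR = 500_000_000  # "half a billion buckets per hour"

# Assumption (not from the post): one byte stored per base.
BYTES_PER_BASE = 1

# Records held at any given time.
working_set_reads = READS_PER_DATASET * DATASETS_IN_FLIGHT

# Raw sequence volume per day at the quoted hourly rate.
bytes_per_day = READS_PER_HOUR * 24 * READ_LENGTH_BASES * BYTES_PER_BASE

print(f"working set: {working_set_reads / 1e9:.1f} billion reads")
print(f"daily volume: {bytes_per_day / 1e12:.1f} TB")
```

Under these assumptions the hourly rate and the daily volume agree: half a
billion 100-base reads per hour is on the order of a terabyte per day.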
Questions about riak:
1. In terms of number of buckets, what's the largest riak database
that the list has seen? How were the individual servers sized?
2. Expiring and deleting data is a critical function for me. The
current docs say "there is no straightforward way of deleting a
bucket". Has anybody had experience with large-scale deletions?
Would deletion just leave a compaction issue anyway?
3. As an alternative to deleting buckets, I could segment databases
into pieces that could be retired en bloc. Does anybody have
experience with running multiple (10, say) riak instances per machine?
4. How can I estimate the memory footprint per key when using bitcask
for, say, a 16-byte bucket and 2-byte key IDs?
5. It would be helpful to have estimates of the optimal
storage (RAM/Memory Appliance/SSD/spinning metal) balance.
(a) Does anybody have estimates of performance improvements using SSDs?
I haven't been able to find them on the blog or mailing list,
despite comments left saying that such estimates were being made.
(b) Does anybody have numbers about tradeoffs in performance
versus RAM size? Aside from one brief discussion on this list,
I haven't been able to find much.
(c) RAM caching of specific files? HugeTLB?
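
For question 4, a back-of-the-envelope estimate might look like the sketch
below. Bitcask keeps every key in an in-memory keydir, so RAM grows with
the number of keys. The fixed per-entry overhead of 40 bytes is an
assumption (a ballpark figure, not a measured one, and it varies by
version), and the function name keydir_bytes is hypothetical. The default
n_val of 3 is riak's default replication factor.

```python
def keydir_bytes(n_keys, bucket_len, key_len, overhead=40, n_val=3):
    """Rough cluster-wide RAM needed for bitcask keydir entries.

    overhead is an ASSUMED fixed cost per entry in bytes; measure it
    on your own build before relying on it.
    """
    per_entry = overhead + bucket_len + key_len
    return n_keys * per_entry * n_val


# ~3 billion records, 16-byte bucket names, 2-byte key IDs.
total = keydir_bytes(3_000_000_000, bucket_len=16, key_len=2)
print(f"{total / 1e9:.0f} GB of keydir RAM across the cluster")
```

Dividing the result by the number of nodes gives a per-server figure. The
estimate covers only the keydir, not the page cache or the Erlang VM.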
filters. It would also be great to have a dynamic component to key
filtering by putting a limited amount (say, 64 bits) of information
from each bucket into memory that could be accessed by the key filters,
even if the consistency of the information weren't guaranteed.
2. Secondary indices also look good for my use case. Features that I'd
like to see: ability to plug in user-specified analyzers, indexer
support for integer values, ability to read the index directly (for
histogramming and linking), ability to delete an index.
3. Python client: seems to throw errors in most tests with riak 0.14.2
under erlang 13. This does not inspire confidence.
4. Bucket and key listing via the HTTP interface hang for me on 0.14.2
under erlang 13, even with only 2 or zero buckets. This does not
inspire confidence, either.
5. I was able to find patches for 0.14.2 to run under erlang 14 that
enabled riak to compile, but it had run-time errors related to
NIFs when I ran the python tests, perhaps because I made the
NIF compat header files all be the same. /riak/tests worked,
but I gave up on 0.14.2.
6. It took me some searching to find the source head, although later
I realized it was on the top menu at the basho site. Perhaps
it would be a good thing to put in the install instructions for
the curious and menu-challenged. No problems building, bucket
and key listings work, but the python client tests still throw
lots of errors. Basic operations in python seem to work, though.
7. Building: "make" and "make install" > "make rel", IMHO.
8. Filesystem layout: It's pointless to argue about filesystem layout,
but modifying the build to put things where my distro (gentoo) and
I think they should go seems more daunting than it ought to be.
There were several barricades here, from the version number and
the separate filesystem layout in the erts directory to uncertainties
about whether choices in app.config are compile-time or run-time
or both. "make devrel" ought to give some clues, but it doesn't
separate out things which need to be shared from those that don't.
9. Docs: Depending on which page I stumbled across first, it seemed that
I should start by either doing "make rel" or "make devrel". The latter
starts riak at a different port than the former and I wasted
some time before I found the correct port. It would be good to
point the new user to a riak browser so that they can see a different
view of the changes they are making than the one they get from reading
Applied Modern Physics Group (P-21), MS D454
Los Alamos National Laboratory
Los Alamos, NM 87545