Riak and SEC Filings

Elias Levy fearsome.lucidity at gmail.com
Tue Nov 8 15:19:11 EST 2011

On Tue, Nov 8, 2011 at 7:15 AM, <riak-users-request at lists.basho.com> wrote:

> Date: Tue, 8 Nov 2011 07:08:59 -0500
> From: Hector Castro <hectcastro at gmail.com>
> I'm currently in the process of evaluating solutions to index the contents
> of ~1TB of SEC (Securities and Exchange Commission) documents.  File sizes
> vary from a few KB to a couple hundred KB.  I started evaluating Riak
> first because ease of setting up and expanding a cluster are primary
> requirements (ElasticSearch is also probably going to get evaluated, along
> with Solr).
> Below I have a few specific questions that I was hoping people could help
> with:
>        * In going through the search querying documentation, I haven't
> found a way to extract a section of a result containing matches.  Something
> similar to Google's search results page where you see an excerpt of the
> webpage contents that match your query.  Is something like this built-in so
> that it doesn't have to be done by the application?
>        * Given that the documents total ~1TB of storage (not including the
> generated indexes), does something like decreasing the n_val make sense?
>  Mostly the documents are bulk inserted on a daily or weekly basis; other
> than that all of the operations are read-only.
> Other than these specific questions, if anyone can provide general insight
> on issues that would arise from a dataset like this within Riak, please
> feel free to mention them.

I would opt for ElasticSearch (ES).  ES is document sharded, whereas Riak
Search (RS) is term sharded.  The latter means your query time is bounded by
the performance of the worst term within a single vnode.  In our experience
this has been a problem.  RS also has some rough edges.  E.g. the Solr API
does not expose a timeout parameter, which means that if your query takes
longer than the hardcoded timeout you'll have to modify your query or resort
to feeding search inputs into a MapReduce (MR) job, which does not expose
the Solr API options.  Those options have issues of their own: the sort
option sorts after slicing, not before, which makes it useless for paging
results in any order other than by score.  I think you'll find that ES,
being based on Lucene, will give you more tokenizer options.
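
To make the sort-after-slicing problem concrete, here is a minimal Python
sketch.  It is not Riak Search code: the function names, the hit records and
the "filed" field are all made up for illustration, and it only models the
order in which the two operations are applied to a score-ranked result list.

```python
# Toy model only: hypothetical names and data, not the Riak Search API.

def sort_after_slice(hits, key, start, rows):
    """Take the page from the score-ranked list first, then sort that page."""
    page = hits[start:start + rows]
    return sorted(page, key=key)

def sort_before_slice(hits, key, start, rows):
    """Sort the whole result set first, then take the page."""
    return sorted(hits, key=key)[start:start + rows]

# Hits as a search engine would return them: ordered by score, descending.
hits = [
    {"id": "a", "score": 0.9, "filed": "2011-03-01"},
    {"id": "b", "score": 0.8, "filed": "2009-06-15"},
    {"id": "c", "score": 0.7, "filed": "2010-01-20"},
    {"id": "d", "score": 0.6, "filed": "2008-11-02"},
]

def by_date(hit):
    return hit["filed"]

# First page of two results, requested in filing-date order.
page1_after = [h["id"] for h in sort_after_slice(hits, by_date, 0, 2)]
page1_before = [h["id"] for h in sort_before_slice(hits, by_date, 0, 2)]

print(page1_after)   # ['b', 'a']: the two best-scoring hits, re-ordered
print(page1_before)  # ['d', 'b']: the two earliest filings overall
```

With sort-after-slice, every page is still chosen by score and only
re-ordered internally, so walking the pages never yields a globally
date-ordered listing.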

RS is great for adding some search to data stored within Riak, but if you
want a search engine, use something more specialized like ES.

Elias Levy