Riak and SEC Filings
hectcastro at gmail.com
Tue Nov 8 07:08:59 EST 2011
I'm currently evaluating solutions to index the contents of ~1TB of SEC (Securities and Exchange Commission) documents. File sizes range from a few KB to a couple hundred KB. I started with Riak because ease of setting up and expanding a cluster is a primary requirement (ElasticSearch will probably be evaluated as well, along with Solr).
I have a few specific questions that I was hoping people could help with:
* In going through the search querying documentation, I haven't found a way to extract the section of a result that contains the matches, similar to Google's search results page, where you see an excerpt of the page contents that match your query. Is something like this built in, so that it doesn't have to be done by the application?
* Given that the documents total ~1TB of storage (not including the generated indexes), does something like decreasing the n_val make sense? The documents are mostly bulk-inserted on a daily or weekly basis; other than that, all operations are read-only.
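To put rough numbers on the n_val question, here is a back-of-the-envelope sketch in Python. It assumes raw storage scales linearly with the number of replicas and ignores the generated indexes and any backend overhead, so the real footprint would be larger. The function name is made up for illustration; 3 is Riak's default n_val.

```python
def replicated_storage_tb(dataset_tb, n_val):
    """Approximate raw cluster storage: one full copy of the data per replica.

    Assumption: no index, metadata, or backend overhead is counted.
    """
    return dataset_tb * n_val


# Compare the default n_val of 3 against reduced replica counts
# for the ~1TB document set.
for n_val in (3, 2, 1):
    total = replicated_storage_tb(1.0, n_val)
    print(f"n_val={n_val}: ~{total:.1f} TB of raw storage")
```

The saving is linear in n_val, but so is the loss of redundancy, which is the trade-off I'm unsure about for a mostly read-only dataset.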
Beyond these specific questions, if anyone can offer general insight into issues that might arise with a dataset like this in Riak, please feel free to mention them.
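For reference on the excerpt question, this is roughly the application-side fallback I would like to avoid. It is a minimal sketch in plain Python that does not use any Riak API: it assumes the application has already fetched the full document text and knows the query term, and the function name and the radius parameter are hypothetical.

```python
import re


def excerpt(text, term, radius=40):
    """Return a short window of text around the first match of term.

    The match is case-insensitive and literal (term is escaped).
    Returns None if the term does not occur in the text.
    """
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    # Clamp the window to the bounds of the document.
    start = max(match.start() - radius, 0)
    end = min(match.end() + radius, len(text))
    # Mark truncation on either side with an ellipsis.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


filing = "The company reported net income of $5 million."
print(excerpt(filing, "net income", radius=5))
```

Doing this per result means pulling every matching document back to the application, which is why a built-in option would be preferable.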