Riak and SEC Filings

Hector Castro hectcastro at gmail.com
Tue Nov 8 07:08:59 EST 2011


I'm currently evaluating solutions to index the contents of ~1TB of SEC (Securities and Exchange Commission) documents.  File sizes range from a few KB to a couple hundred KB.  I started with Riak because ease of setting up and expanding a cluster is a primary requirement (ElasticSearch and Solr will probably be evaluated as well).

Below I have a few specific questions that I was hoping people could help with:

	* In going through the search querying documentation, I haven't found a way to extract the section of a result that contains the matches.  Something similar to Google's search results page, where you see an excerpt of the webpage contents that matches your query.  Is something like this built in, so that it doesn't have to be done by the application?
	* Given that the documents total ~1TB of storage (not including the generated indexes), does decreasing the n_val make sense?  The documents are bulk inserted on a daily or weekly basis; other than that, all operations are read-only.
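
To make the first question concrete, here is a minimal sketch of the kind of application-side excerpt extraction I'm hoping to avoid writing. It is plain Python and does not use anything from Riak or Riak Search; the function name `excerpt` and the `width` parameter are hypothetical, and it only handles a single literal search term.

```python
import re


def excerpt(text, term, width=60):
    """Return a window of `text` around the first case-insensitive match of `term`.

    Hypothetical helper for illustration only; returns None when there is no match.
    """
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    # Keep up to `width` characters of context on either side of the match.
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    # Mark truncation so the reader can tell the excerpt is partial.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

The application would have to fetch each matching document and run something like this over its body, which is the extra work I'd like the search layer to handle if it can.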

Other than these specific questions, if anyone can provide general insight on issues that would arise from a dataset like this within Riak, please feel free to mention them.



