High volume data series storage and queries

Pablo Chacin pablochacin at gmail.com
Tue Aug 9 03:13:21 EDT 2011


From the Riak Search documentation:

> Search queries use the same syntax as Lucene, and support most Lucene
> operators including term searches, field searches, boolean operators,
> grouping, lexicographical range queries, and wildcards (at the end of a
> word only).

Besides that, and this is something I'm looking into right now, we would also
need some geographical queries.
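
For instance, a per-source time range could be expressed as a Lucene-style
query over a zero-padded timestamp field (the field names here are only
illustrative, and a lexicographical range only behaves like a time range if
the timestamps sort correctly as strings), e.g.:

    source_id:S123 AND timestamp:[20110801000000 TO 20110809235959]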


On Tue, Aug 9, 2011 at 3:20 AM, Paul O <pcotec at gmail.com> wrote:

> Pablo, is Riak Search able to do range queries?
>
> On Mon, Aug 8, 2011 at 5:17 PM, Pablo Chacin <pablochacin at gmail.com> wrote:
>
>> I'm facing a similar (though not as extreme) use case. I'm also considering
>> a similar strategy, but I was thinking about using Riak Search instead of an
>> RDBMS for the secondary indexes.
>>
>> On Mon, Aug 8, 2011 at 10:25 PM, Paul O <pcotec at gmail.com> wrote:
>>
>>> Hi Jeremiah,
>>>
>>> This is for a yet-to-exist system, so the existing data characteristics
>>> are not that important.
>>>
>>> The volume of data would be something like this: an average of 10 events per
>>> second per source, meaning about 320 million events per source per year, for
>>> tens of thousands of sources, potentially hundreds of thousands.
>>>
>>> Data retention policy would be in the range of years, probably 5 years.
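>>>
>>> A quick back-of-envelope check of those figures (the bytes-per-event number
>>> is a made-up assumption, only there to get an order of magnitude):
>>>
>>>     SECONDS_PER_YEAR = 365 * 24 * 3600
>>>
>>>     events_per_source_year = 10 * SECONDS_PER_YEAR   # ~315 million
>>>     sources = 50_000                                 # "tens of thousands"
>>>     retention_years = 5
>>>     bytes_per_event = 50                             # assumed payload size
>>>
>>>     total_events = events_per_source_year * sources * retention_years
>>>     print("%.1e events" % total_events)              # ~7.9e+13
>>>     print("%.0f TB raw" % (total_events * bytes_per_event / 1e12))  # ~3900 TB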
>>>
>>> Most of the figures above are averages; some sources might be sampled
>>> even hundreds of times per second. There is also a layer that creates
>>> aggregates for "regressive granularity" (a la RRD), but that is a bit less of
>>> a concern (i.e. the same strategy I'm describing could be used for storing
>>> the aggregates).
>>>
>>> The strategy I've described tries to make the most common query (a time
>>> range per source with a maximum number of elements) predictable and as
>>> performant as possible. That is, for any range I know that at most three
>>> batches need to be read from Riak (or equivalent), so if reading a batch
>>> takes 20 ms and the initial query takes 10 ms, I can predictably respond to
>>> most such requests in under 100 ms.
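>>>
>>> A minimal sketch of that budget, using the figures above as assumptions:
>>>
>>>     index_query_ms = 10      # assumed cost of the initial index lookup
>>>     batch_read_ms = 20       # assumed cost of reading one batch from Riak
>>>     max_batches = 3          # worst case for a single time-range query
>>>
>>>     worst_case_ms = index_query_ms + max_batches * batch_read_ms
>>>     print(worst_case_ms)     # 70 -> under the 100 ms target, with head-room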
>>>
>>> So as long as I can benchmark the individual aspects of the strategy, I hope
>>> to get a predictable query cost and an idea of how to grow the system.
>>>
>>> As for the read-to-write ratio, I don't have an exact estimate (the
>>> system will be generic, and consuming applications will be built on top of
>>> it), but the system is expected to be a lot more write-intensive than
>>> read-intensive. Most data might go completely unused, while some data might
>>> be rather "hot", so additional caching might be implemented later; for now
>>> I'm trying to design the underlying system so that at least some performance
>>> characteristics are computable.
>>>
>>> Does this clarify things or confuse them further?
>>>
>>> Regards,
>>>
>>> Paul
>>>
>>> On Mon, Aug 8, 2011 at 3:32 PM, Jeremiah Peschka <
>>> jeremiah.peschka at gmail.com> wrote:
>>>
>>>> It sounds like a potentially interesting use case.
>>>>
>>>> The questions that immediately enter my head are:
>>>> * How much data do you currently have?
>>>> * How much data do you plan to have?
>>>> * Do you have a data retention policy? If so, what is it? How do you
>>>> plan to implement it?
>>>> * What's the anticipated rate of growth per day? Week? Year?
>>>> * What type of queries will you have? Is it a fixed set of queries? Is
>>>> it a decision support system?
>>>> * What does your read to write ratio look like?
>>>>
>>>> Your plan to supplement Riak with a hybrid system isn't that out of whack;
>>>> it's very doable.
>>>>
>>>> You can certainly do the type of querying you've described through
>>>> careful choice of key names, sorting in memory, and only using the first N
>>>> data points in a given MapReduce query result. The main reason not to
>>>> perform range queries in Riak is that they'll result in full key-space scans
>>>> across the Riak cluster. If you're using Bitcask as your backend then it's
>>>> an in-memory scan; otherwise you're doing a much more costly scan from disk.
>>>> And, since key names are hashed as they are partitioned across the cluster,
>>>> you're not going to get the benefit of sequential disk-scan performance like
>>>> you might get with a traditional database.
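>>>>
>>>> A minimal sketch of what that careful key naming could look like (the
>>>> hourly bucketing and key format below are assumptions for illustration,
>>>> not an existing Riak convention): each key embeds the source ID and a
>>>> coarse, zero-padded time bucket, so a time-range query can compute the
>>>> handful of keys it needs up front instead of scanning the key space.
>>>>
>>>>     from datetime import datetime
>>>>
>>>>     BUCKET_SECONDS = 3600  # one object per source per hour (assumed granularity)
>>>>
>>>>     def keys_for_range(source_id, t1, t2):
>>>>         """All keys a [t1, t2] query must fetch -- no key listing needed.
>>>>
>>>>         Batches are written under the same "<source>:<zero-padded bucket
>>>>         start>" keys, so reads can recompute them directly.
>>>>         """
>>>>         start = int(t1.timestamp()) // BUCKET_SECONDS * BUCKET_SECONDS
>>>>         end = int(t2.timestamp())
>>>>         return ["%s:%012d" % (source_id, t)
>>>>                 for t in range(start, end + 1, BUCKET_SECONDS)]
>>>>
>>>>     # A ~2 hour window spans three hourly batches; each key is then fetched
>>>>     # with a plain GET and the values trimmed/sorted in memory, as described.
>>>>     print(keys_for_range("S123", datetime(2011, 8, 8, 10, 30),
>>>>                          datetime(2011, 8, 8, 12, 15)))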
>>>>
>>>> The only thing that worries me is the phrase "should grow more than what
>>>> a 'vanilla' RDBMS would support". Are you thinking 1TB? 10TB? 50TB? 500TB?
>>>> I'm trying to get a handle on what size and performance characteristics
>>>> you're looking for before diving into how to look at your system vs. saying
>>>> "Hell if I know, does someone else on the list have a good idea?"
>>>>
>>>> ---
>>>> Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
>>>> Microsoft SQL Server MVP
>>>>
>>>> On Aug 8, 2011, at 11:21 AM, Paul O wrote:
>>>>
>>>> > Hello Riak enthusiasts,
>>>> >
>>>> > I am trying to design a solution for storing time series data coming
>>>> from a very large number of potentially high-frequency sources.
>>>> >
>>>> > I thought Riak could be of help, though based on what I read about it
>>>> I can't use it without some other layer on top of it.
>>>> >
>>>> > The problem is that I need to be able to do range queries over this
>>>> data, by source. Hence, I want to be able to say "give me the first N data
>>>> points for source S between time T1 and time T2."
>>>> >
>>>> > I need to store this data for a rather long time, and the expected
>>>> volume should grow more than what a "vanilla" RDBMS would support.
>>>> >
>>>> > Another thing to note is that I can restrict the number of data points
>>>> to be returned by a query, so no query would return more than MaxN data
>>>> points.
>>>> >
>>>> > I thought about doing this the following way:
>>>> >
>>>> > 1. Bundle the time series data in batches of MaxN, to ensure that any
>>>> query would require reading at most two batches. The batches would be stored
>>>> inside Riak.
>>>> > 2. Store the start-time, end-time, size and Riak batch ID in a MySQL
>>>> (or PostgreSQL) DB.
>>>> >
>>>> > My thinking is that such a strategy would allow me to persist the data in
>>>> Riak and grow linearly with it, while the index would be kept in an RDBMS
>>>> for fast range queries.
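>>>> >
>>>> > Concretely, the read path of that strategy could look like the minimal
>>>> sketch below (SQLite and a plain dict stand in for MySQL/PostgreSQL and
>>>> Riak only to keep it self-contained; the table layout and key format are
>>>> assumptions, not a settled design):
>>>> >
>>>>     import sqlite3
>>>>
>>>>     # SQLite and a dict stand in for MySQL/PostgreSQL and Riak in this sketch.
>>>>     index_db = sqlite3.connect(":memory:")
>>>>     index_db.execute("""CREATE TABLE batch_index (
>>>>                             source_id  TEXT,
>>>>                             start_time INTEGER,
>>>>                             end_time   INTEGER,
>>>>                             size       INTEGER,
>>>>                             riak_key   TEXT)""")
>>>>     riak = {}  # riak_key -> list of (timestamp, value) points
>>>>
>>>>     def store_batch(source_id, points):
>>>>         """Persist one batch of up to MaxN points and index its time span."""
>>>>         key = "%s:%d" % (source_id, points[0][0])
>>>>         riak[key] = points  # in production: a PUT through a Riak client
>>>>         index_db.execute("INSERT INTO batch_index VALUES (?, ?, ?, ?, ?)",
>>>>                          (source_id, points[0][0], points[-1][0],
>>>>                           len(points), key))
>>>>
>>>>     def query(source_id, t1, t2, max_n):
>>>>         """First N points for a source in [t1, t2]: index lookup, then batch reads."""
>>>>         rows = index_db.execute(
>>>>             "SELECT riak_key FROM batch_index "
>>>>             "WHERE source_id = ? AND start_time <= ? AND end_time >= ? "
>>>>             "ORDER BY start_time", (source_id, t2, t1)).fetchall()
>>>>         points = []
>>>>         for (key,) in rows:             # at most a couple of batches per query
>>>>             batch = riak[key]           # in production: a GET from Riak
>>>>             points.extend(p for p in batch if t1 <= p[0] <= t2)
>>>>             if len(points) >= max_n:
>>>>                 break
>>>>         return points[:max_n]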
>>>> >
>>>> > Does it sound sensible to use Riak this way? Does this make you
>>>> laugh/cry/shake your head in disbelief? Am I overlooking something from Riak
>>>> which would make all this much better?
>>>> >
>>>> > Thanks and best regards,
>>>> >
>>>> > Paul