Is Riak suitable for a small-record write-intensive billion-records application?
yassen.tis at gmail.com
Fri Oct 19 01:59:20 EDT 2012
[Not sure if this went to the list or only to the last poster, so here
it is again]
Dmitri, Guido, Jeremiah, Reid, Pavel, Les, Daniil -- thank you guys!
The Riak community's responsiveness is amazing. I have never seen
anything like it.
Now to the question of Riak applicability for our scenario.
* The most important requirement is that the system shall never ever
go down or under-perform. We should be able to add/remove nodes
easily. Two or three nodes are planned for the start and I expect them
to be enough, but in case they are not, we should be able to add
another one easily. (I guess this is where Riak shines.)
* The per-record data overhead does not frighten me -- if I push
the estimates to their extreme limits, this results in disk space
consumption of several hundred gigabytes, which does not seem to be a
problem (provided both Riak and the file system can handle that).
* Lookups: these are mainly primary key lookups of the sort "is the
key there"; no range selects are needed. It would be nice if we could
query by time range, as records are supposed to carry a couple of
timestamps, but that's not mandatory.
* Consistency: the case when a request comes with a primary key that
already exists will be rare; the case when a request comes with the
same key within seconds is very unlikely, but unfortunately
theoretically possible. If this happens, then having two records with
the same key would be very bad and I guess I need to resolve this
somehow. Whatever the solution, it needs to be symmetric, that is, all
nodes must be equivalent.
We are prepared to write some code (golang if appropriate) but it
should be about a week's work, as we don't have more time. I am
thinking of a sort of self-made in-memory hash at the application
level, which takes a quick lock, checks for the key's existence,
does a conditional insert, then unlocks. Done properly, this could be
quick on good hardware with plenty of RAM (I hope).
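To make the idea concrete, here is a minimal sketch in Go of the
lock / check / conditional insert / unlock step. The names KeySet,
NewKeySet and InsertIfAbsent and the key "order-42" are made up for
illustration; nothing here is a Riak API. It only guards against
duplicates inside a single process, so it does not by itself cover
the case where the same key reaches two different nodes.

```go
package main

import (
	"fmt"
	"sync"
)

// KeySet is a hypothetical in-process set of primary keys.
// A single mutex guards the map; this is an illustration only.
type KeySet struct {
	mu   sync.Mutex
	keys map[string]struct{}
}

// NewKeySet returns an empty, ready-to-use KeySet.
func NewKeySet() *KeySet {
	return &KeySet{keys: make(map[string]struct{})}
}

// InsertIfAbsent takes the lock, checks whether key is present,
// inserts it if not, and unlocks. It returns true only for the
// caller that actually inserted the key.
func (s *KeySet) InsertIfAbsent(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.keys[key]; exists {
		return false
	}
	s.keys[key] = struct{}{}
	return true
}

func main() {
	set := NewKeySet()
	results := make([]bool, 8)

	// Eight concurrent requests arrive with the same key.
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = set.InsertIfAbsent("order-42")
		}(i)
	}
	wg.Wait()

	winners := 0
	for _, ok := range results {
		if ok {
			winners++
		}
	}
	fmt.Println("winners:", winners) // prints "winners: 1"
}
```

Only the caller that gets true back would go on to write the record;
the others would treat the request as a duplicate.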
> Do you have requirements around query performance - e.g. a read should take 200ms?
The whole cluster must be able to process up to a hundred requests per
second. Nominal load is expected to be a few tens of requests per
second. There is more to be done with the data, which will be
time-consuming (<= 500ms). Roughly, a throughput of up to 50-60
requests per second per node seems to be enough.
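As a rough sanity check of these numbers, a back-of-the-envelope
calculation in Go. The helper names nodesNeeded and inFlight are made
up for illustration, and it assumes the 500ms figure is the total
time one request occupies a node, which is a simplification.

```go
package main

import (
	"fmt"
	"math"
)

// nodesNeeded returns how many nodes are required to serve
// clusterRate requests/s if each node handles perNodeRate requests/s.
func nodesNeeded(clusterRate, perNodeRate float64) int {
	return int(math.Ceil(clusterRate / perNodeRate))
}

// inFlight applies Little's law: the average number of requests
// being processed at once equals arrival rate times time per request.
func inFlight(ratePerSec, latencySec float64) float64 {
	return ratePerSec * latencySec
}

func main() {
	fmt.Println(nodesNeeded(100, 50)) // prints 2
	fmt.Println(nodesNeeded(100, 60)) // prints 2
	fmt.Println(inFlight(60, 0.5))    // prints 30
}
```

So at peak load two nodes would suffice with no headroom, and under
the stated assumption each node would have to keep about 30 requests
in progress at once.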