Data modeling a write-intensive comment storage cluster

fxmy wang fxmywc at gmail.com
Mon Jan 27 01:27:07 EST 2014


Thanks for the response, Jeremiah.



> > Then here are my questions:
> > 1) To get better write throughput, is it right to set w=1?
>
> This will improve perceived throughput at the client, but it won't improve throughput at the server.

Thank you for clarifying this for me :D

> > 2) What's the best way to query these comments? In this use case, I don't need to retrieve all the comments in one bucket, just the latest few hundred comments (if there are that many), ordered by the time they were posted.
> >
> > Right now I'm thinking of using link-walking and keeping track of the latest comment so I can trace backwards to get the latest 500 comments (for example). When a new comment comes in, point its link to the old latest comment, then update the latest-comment mark.
> >
>
> I wouldn't use link-walking. IIRC this uses MapReduce under the covers. You could use a single key to store the most recent comment.

What's bad about MapReduce?
There will be another cache layer on top of the cluster, so read
operations will be relatively infrequent. That's why I chose
link-walking.

> You can get the most recent n keys using secondary index queries on the $bucket index, sorting, and pagination.

I'm not sure what you mean here =.=
How can I query the most recent n keys using 2i? Should I put a
timestamp (say, truncated to the hour) in a 2i on incoming comments,
and then query the 2i by hour segment? This seems a little blind,
because some videos could go a long time before being commented on
again. Querying by time segment seems like shooting in the dark to
me :\
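
To make the "shooting in the dark" worry concrete, here is a minimal
sketch of one way around it: index each comment by an integer
timestamp and widen the queried window backwards until enough keys
turn up. This is a plain in-memory model, not Riak client code;
TimestampIndex, latest_keys, first_window and oldest are hypothetical
names, and the range lookup only stands in for a 2i range query.

```python
import bisect


class TimestampIndex:
    """In-memory stand-in for an integer secondary index (timestamp -> key)."""

    def __init__(self):
        self._entries = []  # kept sorted as (timestamp, key) tuples

    def add(self, timestamp, key):
        bisect.insort(self._entries, (timestamp, key))

    def range(self, start, end):
        """Return keys whose timestamp lies in [start, end], oldest first."""
        lo = bisect.bisect_left(self._entries, (start, ""))
        hi = bisect.bisect_right(self._entries, (end, chr(0x10FFFF)))
        return [key for _, key in self._entries[lo:hi]]


def latest_keys(index, now, n, first_window=3600, oldest=0):
    """Double the window backwards from `now` until n keys are found.

    Assumes n >= 1. Stops widening once the window reaches `oldest`.
    """
    window = first_window
    while True:
        start = max(oldest, now - window)
        keys = index.range(start, now)
        if len(keys) >= n or start == oldest:
            return keys[-n:][::-1]  # newest first
        window *= 2


index = TimestampIndex()
index.add(100, "c1")
index.add(200, "c2")
index.add(90000, "c3")
recent = latest_keys(index, now=100000, n=2)
```

In this model the number of range queries grows with the logarithm of
the gap since the last comment, so a video that has been quiet for a
long time costs a few extra round trips rather than one query per
hour.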

And the docs say the list-keys operation should not be used in
production, so that's a no-go either :\


> > So in the scenario above, is it possible that one client writes on nodeA and modifies the latest-mark, while another client on nodeB does not yet see the change and so points its link to the old comment, resulting in a "branch" in the line?
> > If this could happen, what can be done to avoid it? Are there any better ways to store and query those comments? Any reply is appreciated.
>
> You can avoid siblings by serializing all of your writes through a single writer. That's not a great idea since you lose many of Riak's benefits.
> You could also use a CRDT with a register type. These tend toward the last writer.
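
The "branch" scenario from the question can be reproduced with a small
single-process simulation. It is only a sketch: the dict stands in for
the bucket, the "latest" entry is the hypothetical latest-mark key,
and the mark is assumed to resolve as last-write-wins rather than
producing siblings.

```python
# One existing comment, c0, which the latest-mark currently points at.
store = {"c0": {"prev": None}, "latest": "c0"}


def read_latest():
    return store["latest"]


def write_comment(key, prev_seen):
    """Store the comment linked to what the client saw, then move the mark."""
    store[key] = {"prev": prev_seen}
    store["latest"] = key


def walk_back():
    """Follow prev links from the latest-mark to the oldest comment."""
    chain, key = [], store["latest"]
    while key is not None:
        chain.append(key)
        key = store[key]["prev"]
    return chain


# Both clients read the mark before either of them writes.
seen_by_a = read_latest()
seen_by_b = read_latest()

write_comment("c1", seen_by_a)  # client A links c1 -> c0
write_comment("c2", seen_by_b)  # client B links c2 -> c0 as well

chain = walk_back()
```

Walking back from the mark visits c2 and then c0; c1 is stored but no
longer reachable, which is the lost write mentioned in the reply.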

My goal is to form a kind of single-line relationship, ordered by
timestamp, across the keys under highly concurrent write pressure.
Through this relationship I can easily pick out the last few hundred
or few thousand comments.
As Jeremiah said, serializing all writes through a single writer
avoids siblings entirely. Note that we don't have key clashes here:
every comment has a unique key. What we want is the single-line
relationship. So how about this:

Multiple erlang-pb clients just do the writes and don't care about
the lining up.
Post-commit hooks notify one special globally registered process
(which should be running in the Riak cluster?) that "here comes a new
comment, line it up when appropriate".
Is this feasible? And if it is, how should I prepare for the cluster
partition & rejoin scenario when the network fails?
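
The proposal above can be sketched as many writers feeding one
sequencer that alone maintains the line. This is a rough illustration
under stated assumptions, not a Riak implementation: the queue stands
in for the post-commit notification, the thread stands in for the
globally registered process, and sequencer, writer and run_demo are
hypothetical names.

```python
import queue
import threading


def sequencer(notifications, links, state):
    """Single consumer: the only code that touches the latest-mark."""
    while True:
        key = notifications.get()
        if key is None:  # sentinel: no more comments
            break
        links[key] = state["latest"]  # new comment points at the old latest
        state["latest"] = key


def writer(notifications, keys):
    """Stand-in for a client write followed by a post-commit notification."""
    for key in keys:
        notifications.put(key)


def run_demo(n_writers=4, per_writer=50):
    notifications = queue.Queue()
    links, state = {}, {"latest": None}
    seq = threading.Thread(target=sequencer, args=(notifications, links, state))
    seq.start()
    writers = [
        threading.Thread(
            target=writer,
            args=(notifications, [f"w{w}-c{i}" for i in range(per_writer)]),
        )
        for w in range(n_writers)
    ]
    for t in writers:
        t.start()
    for t in writers:
        t.join()
    notifications.put(None)
    seq.join()

    # Walk the line back from the latest comment.
    chain, key = [], state["latest"]
    while key is not None:
        chain.append(key)
        key = links[key]
    return chain


chain = run_demo()
```

Every comment ends up on the line exactly once, in the order the
sequencer received the notifications, which is not necessarily
timestamp order. The sketch does not cover the partition case: if each
side of a split ran its own sequencer, the two lines would still need
to be merged on rejoin.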

> The point is that you need to decide how you want to deal with this type of scenario - it's going to happen. In a worst case; you lose a write briefly.

Hopefully the method above could avoid this :)

Everyone, please share your thoughts. _(:3JZ)_

B.R.



