Slow performance using linkwalk, help wanted

Karsten Thygesen karthy at
Tue Nov 9 05:01:46 EST 2010


OK, we will use a larger ringsize next time and will consider a data reload.

Regarding the metrics: the servers are dedicated to Riak use and it not used for anything else. They are new HP servers with 8 cores each and 4x146GB 10K RPM SAS disks in a contatenated mirror setup. We use Solaris with ZFS as filesystem and I have turned off atime update in the data partition.

The pool is built as such:

  pool: pool01
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Tue Oct 26 21:25:05 2010

        NAME          STATE     READ WRITE CKSUM
        pool01        ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c0t0d0s7  ONLINE       0     0     0
            c0t1d0s7  ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            c0t2d0    ONLINE       0     0     0
            c0t3d0    ONLINE       0     0     0

errors: No known data errors

so it is as fast as possible. 

However - we use the ZFS default blocksize, which is 128Kb - is that optimal with bitcask as backend? It is rather large, but what is optimal with bitcask?

The cluster is 4 servers with gigabit connection located in the same datacenter on the same switch. The loadbalancer is a Zeus ZTM, which does quote a few http optimizations including extended reuse of http connections and we usually see far better response times using the loadbalancer than using a node directly.

When we run the test, each riak node is only about 100% cpu loaded (which on solaris means, that it only uses one of the 8 cores). We have seen spikes in the 160% area, but everything below 800% is not cpu bound. So all-in-all, the cpuload is between 5 and 10%.

Regarding IO, we see about 30 operations/sec pr. disk and we can expect around 300 ops/sec, so again - the disks is only about 10% loaded.

So something else is the reason here, and we are quite lost. The current numbers makes Riak useless in our setup, so we need to get to the bottom of this...

Any ideas?

Best regards,

On Nov 8, 2010, at 23:27 , Kevin Smith wrote:

> The ring size constrains the parallelism for certain operations, such as MapReduce. Link walking is turned into a map phase internally. A small ring size can cause a stampede effect as many bucket/key pairs get hashed to the same vnode. 64 is a decent ring size but I'd personally opt for a larger ring size in the 128 or 256 range. The reason for this is a) it spreads out bucket/key pairs around the cluster in the case of large #'s of documents and b) provides room to expand the cluster in the future. Unfortunately, changing the ring size is fairly invasive and would require a data reload.
> However, your overall performance seems slow to me. Do these timings reflect connecting via the load balancer? If so, could you re-run the write and link walk timings connecting directly to the cluster? It would also be good to know what the system metrics (load, memory usage, etc) on the nodes look like during the test.
> --Kevin
> On Nov 8, 2010, at 3:44 PM, Jan Buchholdt wrote:
>> Kevin -
>> The allow_mult is set to false. I'm quite sure that we doesn't omit the old entries vclock in the update. A typical vclock for an Person entry that have been updated 363 times (adding 363 links) have a length of 581 characters.
>> We haven't changed the number of partitions in the cluster. 64 is the number. What would you recommend considering that we have about 5 million people with 120 million documents?
>> Another information is that the first time I do a link walk (using curl) on a total idle cluster it takes 2.71 second for a person with 363 documents. If I repeat the request it takes 319 milliseconds. I would expect that the performance would be almost the same.
>> If I run my performance test with 20 treads, that randomly pick a Person from 5 millions, the minimum time is 2.8s, average 6.7s, 90% 8.9s and max 12.4s.
>> Would changing the ring_creation_size changing the read time to values near your test performance? Is there any way to change the ring_creation_size whiteout destroying our data? It takes 1-2 days to bootstrap all our data. Our write is down to about 500 documents/second. A bit disappointing but good enough for our application.
>> --
>> Jan Buchholdt
>> Software Pilot
>> Trifork A/S
>> Cell +45 50761121
>> On 2010-11-08 17:51, Kevin Smith wrote:
>>> Jan -
>>> I've run some tests using a 8 GB, 4-core Linux box I had handy along with my MBP as a client using riak-java-client over HTTP. For the test I configured a user record as you described linked to 250 1KB entries in a separate bucket named "documents". I spun up 5 Java threads to simulate 5 concurrent users. Each thread performed the link walk from the user to the documents 2500 times. From that I was able to observe the follow performance (all times in milliseconds):
>>> Average runtime: 124
>>> 99th percentile: 220
>>> 99.5th percentile: 263
>>> 99.9th percentile: 949
>>> The large difference between the 99.5th and 99.9th seems to correlate to the beginning of the run so I think those times might reflect the time required for Java's server JIT to fully kick in as well as GC times to stabilize.
>>> I was able to reduce performance by triggering "vector clock explosion". Setting a bucket's "allow_mult" value to true and then overwriting existing entries with new values while omitting the old entries' vector clock information causes the object's vector clock data to bloat which will impact read times. Is there any chance this is occurring in your application?
>>> Another possibility is the number of partitions in your cluster is not large enough to provide good parallelization for your workload. What's the value of ring_creation_size in your cluster's app.config? Riak will run with a default ring size of 64 partitions if the entry isn't present.
>>> --Kevin
>>> On Nov 8, 2010, at 9:45 AM, Jan Buchholdt wrote:
>>>> Kevin
>>>> We are using HTTP, (have tried PB without any performance gain) and
>>>> using riak-java-client as client lib.
>>>> --
>>>> Jan Buchholdt
>>>> Software Pilot
>>>> Trifork A/S
>>>> Cell +45 50761121
>>>> On 2010-11-08 14:20, Kevin Smith wrote:
>>>>> Jan -
>>>>> Which protocol (HTTP or protocol buffers) and client lib are you using?
>>>>> --Kevin
>>>>> On Nov 8, 2010, at 6:36 AM, Jan Buchholdt wrote:
>>>>>> We are evaluating Riak for a project, but having a hard time making it fast enough for our need.
>>>>>> Our model is very simple and looks like this:
>>>>>> ---------------------                         * ---------------------
>>>>>> |       Person      | ------------------------>    |   Document        |
>>>>>> ---------------------                           ---------------------
>>>>>> We have a set of persons and each person can have many documents.
>>>>>> Our typical queries are:
>>>>>> Get an overview of all the persons documents. This query returns the person along with a subset of data from all the persons documents.
>>>>>> Get document by id.
>>>>>> Our requirements are that these quires should be performed under in under 100millis when we have 10 requests per second or less load.
>>>>>> The size of the data:
>>>>>> A document is approximately 1 kb
>>>>>> No data for a persons except the personidentifier
>>>>>> Around 6 million persons.
>>>>>> Each person has from from 0 to a couple of thousand documents.
>>>>>> All in all we have 120 mio documents.
>>>>>> Most persons don't have more than 1 to 10 documents, but then we have some few "heavy" persons having 500 to 1000 documents.
>>>>>> Riak setup:
>>>>>> 4 Nodes.
>>>>>> Hardware configuration for each node:
>>>>>> HP ProLiant DL360 G7
>>>>>> 18 gb ram
>>>>>> SAS discs
>>>>>> Intel(R) Xeon(R) CPU E5620 @ 2.40GHz Proc 1
>>>>>> Solaris 10 update 9
>>>>>> We use the default bitcask storage engine
>>>>>> We replicate data to 3 machines when it is written.
>>>>>> Reads are read from just one machine
>>>>>> We tried implementing our datamodel using Riak links as described below:
>>>>>> Persons are stored in a person bucket using their person identifier as key
>>>>>> /person/
>>>>>> {personid}
>>>>>> Documents are saved in another bucket
>>>>>> /document/
>>>>>> {documented}
>>>>>> At each person we store links to the persons documents.
>>>>>> We are having problems with the query fetching all the documents for a person.  Reading all the documents for a person is done using a link walk. The linkwalk start reading all the document keys using the personid. It then fetches all documents.
>>>>>> For persons with 1 - 5 documents the response times are often over 100 mills. And for the "heavy" persons with many documents response times are several seconds. But we are very new to Riak and are probably using a wrong approach.
>>>>>> Below are our thoughts (having almost no experience with Riak):
>>>>>> The chosen datamodel is good for writes. Writing a new document results in 3 operations against Riak. Writing the document using its id as key. Reading the Person to get all the persons document links. Append the new document's key to the persons links and write back the person.
>>>>>> Reading, using linkwalk, is slow because it is expensive to fetch many documents even though the linkwalk can read their keys right away by reading the links for the person. Even though we have 4 nodes and linkwalks are parallelized many documents need to be retrieved from one node. Having to fetch for example 100 documents on one node (one disc) is expensive. We do not know how data is stored but are afraid Riak is doing a lot of disk seeks.
>>>>>> We are considering another more denormalized approach where we write all the documents for a person in one "blob". But then we are afraid our writes become slow, because when adding a new document the blob must be read, the new document inserted and the blob written back.
>>>>>> We could really need some input. Is our assumptions wrong? (we have not yet dug into the problems). Is there a good datamodel for our requirements? etc?.
>>>>>> We haven't looked at Riak search at all. Maybe it could solve some of our problems.
>>>>>> --  --
>>>>>> Jan Buchholdt
>>>>>> Software Pilot
>>>>>> Trifork A/S
>>>>>> Cell +45 50761121
>>>>>> _______________________________________________
>>>>>> riak-users mailing list
>>>>>> riak-users at
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> riak-users at
> _______________________________________________
> riak-users mailing list
> riak-users at

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 1919 bytes
Desc: not available
URL: <>

More information about the riak-users mailing list