Riak CS/Stanchion troubleshooting (Retrieval of user record)

Kazuhiro Suzuki kaz at basho.com
Wed Nov 25 05:14:13 EST 2015


Hi,

> I think that I need to extend my Riak cluster with more nodes to increase performance.

Adding nodes can be a simple solution if most or all nodes in your
cluster are simply over capacity. However, it would be better to
confirm that first by reading the logs and monitoring system resources.

> I think that it is not haproxy's timeouts issue. Am I right?

It depends on what you want to see in the logs. If you prefer to see
an obvious timeout error message in the logs when the Riak cluster
slows down, you should set the client timeout to the same value as the
server timeout. If not, you will see a disconnected error in the logs
instead of a timeout error, as you did. IIRC, 'cD' in your tcplogs
means that haproxy closed the connection due to a client timeout.

BTW, haproxy's docs say:

> In TCP mode (and to a lesser extent, in HTTP mode), it is highly recommended that the
> client timeout remains equal to the server timeout in order to avoid complex situations to debug.
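
As a rough sketch of what that would look like for the Riak PB proxy
in your config (the 60s value is only an assumption carried over from
your current backend setting, not a recommendation; the right number
depends on your slowest expected request):

    frontend riak
        bind    172.18.108.170:8087
        mode    tcp
        option  tcplog
        # assumption: raised from 30s to match "timeout server" below
        timeout client 60s
        default_backend     riak_backend

    backend riak_backend
        mode    tcp
        balance roundrobin
        # same value as "timeout client" in the frontend
        timeout server 60s

Your other options and server lines would stay as they are. The same
mismatch (30s client, 60s server) is in your riakcs frontend/backend
pair.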


> I want to mention again that I have about 1000 rps to Riak CS, average object size is 10 Kb.

That seems to fit the use case of Riak rather than Riak CS, if you
don't need the S3 API and multi-tenancy. Riak CS is designed for large
objects (over a few MB) and for multi-tenancy, and it adds some
performance overhead to provide those features.


On Thu, Nov 19, 2015 at 8:04 PM, Vladyslav Zakhozhai
<v.zakhozhai at smartweb.com.ua> wrote:
> Hi,
>
> Sorry for my long silence. Kazuhiro, thank you for your answer. The situation
> is clearer now. I think that I need to extend my Riak cluster with more nodes
> to increase performance. The reason for my opinion is:
>
> Nov 19 11:42:40 localhost haproxy[24678]: 172.18.103.31:49608
> [19/Nov/2015:11:42:40.137] riak riak_backend/viper 3/5/113 1471 --
> 8191/2594/2594/138/0 0/0
> Nov 19 11:42:41 localhost haproxy[24678]: 172.18.108.170:44517
> [19/Nov/2015:11:41:42.264] riak riak_backend/serpent 1/0/58806 5982 cD
> 8191/2849/2849/155/0 0/0
> Nov 19 11:59:46 localhost haproxy[24678]: 172.18.102.39:42919
> [19/Nov/2015:11:42:14.566] riak riak_backend/mussurana 1/0/1052250 1484789
> -- 3134/2888/2888/154/0 0/0
> Nov 19 12:07:26 localhost haproxy[24678]: 172.18.40.2:44946
> [19/Nov/2015:11:42:14.508] riak riak_backend/rattler 1/0/1511814 2471638 cD
> 3172/2888/2888/161/0 0/0
> Nov 19 12:17:56 localhost haproxy[24678]: 172.18.103.30:58654
> [19/Nov/2015:11:42:40.141] riak riak_backend/mamba 3/1/2116572 3383878 cD
> 2988/2886/2886/166/0 0/0
> Nov 19 12:23:55 localhost haproxy[24678]: 172.18.40.4:59089
> [19/Nov/2015:11:41:39.831] riak riak_backend/eggeater 1/0/2535854 4109579 CD
> 3020/2888/2888/153/0 0/0
> Nov 19 12:38:54 localhost haproxy[24678]: 172.18.40.4:37536
> [19/Nov/2015:11:41:47.533] riak riak_backend/cobra 1/0/3427457 3387298 --
> 2983/2886/2886/159/0 0/0
> Nov 19 12:50:37 localhost haproxy[24678]: 172.18.102.39:51870
> [19/Nov/2015:11:41:49.413] riak riak_backend/lora 1/0/4128262 6445878 --
> 2989/2889/2889/164/0 0/0
>
> I think that it is not haproxy's timeouts issue. Am I right?
>
> Regarding the HAProxy config, I have the following config for Riak PB:
> frontend riak
>     bind    172.18.108.170:8087
>
>     mode    tcp
>     option  tcplog
>     option  contstats
>
>     timeout client 30s
>
>     default_backend     riak_backend
>
> backend riak_backend
>     mode    tcp
>     balance roundrobin
>     option  tcpka
>     option  srvtcpka
>     option  httpchk GET /ping
>
>     timeout server 60s
>
>     server rinkhals rinkhals.pleiad.uaprom:8087 weight 1 maxconn 1024 check port 8090
>     server chuckwalla chuckwalla.pleiad.uaprom:8087 weight 1 maxconn 1024 check port 8090
>  and so on...
>
> And config for Riak CS:
> frontend riakcs
>     bind    193.34.169.1:80
>     mode    http
>     option  contstats
>     option  httplog
>     option  http-server-close
>     timeout client      30s
>
>     default_backend riakcs_backend
>
> backend riakcs_backend
>     mode http
>     balance roundrobin
>
>     option httpchk GET /riak-cs/ping
>     option redispatch
>     retries 3
>
>     timeout server 60s
>     timeout connect 60s
>     timeout http-request 60s
>
>     server rinkhals rinkhals.pleiad.uaprom:8080 weight 1 maxconn 1024 check port 8000
>     server chuckwalla chuckwalla.pleiad.uaprom:8080 weight 1 maxconn 1024 check port 8000
>     and so on...
>
> And default section:
> defaults
>         log     global
>         option  dontlognull
>         retries 3
>         option redispatch
>         maxconn 8192
>         timeout connect 5000
>         timeout client 4h
>         timeout server 4h
>         balance leastconn
>
> I want to mention again that I have about 1000 rps to Riak CS, average
> object size is 10 Kb.
>
> Thank you.
>
>
> On Mon, Nov 16, 2015 at 4:01 AM Kazuhiro Suzuki <kaz at basho.com> wrote:
>>
>> Hi,
>>
>> haproxy's timeout settings often cause disconnected errors on a Riak
>> CS deployment under high workload. termination_state [1] in tcplog [2]
>> tells you whether a timeout happened.
>>
>> > 2015-11-13 13:13:09.514 [error]
>> > <0.11264.1387>@riak_cs_wm_common:maybe_create_user:222 Retrieval of user
>> > record for s3 failed. Reason: disconnected
>>
>> This means Riak CS failed to read user data from Riak for
>> authentication due to a disconnected error.
>>
>> > Riak CS adds, removes, gets properties through Stanchion service. Am I
>> > right? I can't exactly understand where is my bottleneck - Riak, Riak CS or
>> > Stanchion.
>>
>> Stanchion is mainly used only to update/delete user and bucket
>> data. To inspect a node, Riak S2/CS 2.1 introduced new metrics,
>> including various latencies and counters, which help to identify
>> the bottleneck.
>>
>> > When we need authenticated access for reading object from bucket do we
>> > need Stanchion? If not I can't understand why I had a lot of error during
>> > getting objects from Riak CS.
>>
>> Authentication is always necessary, but the read request for the
>> user data needed for auth is issued from Riak CS to Riak directly,
>> not through Stanchion.
>>
>> > P. S. Sometimes when there is some issues with Riak CS - Stanchion
>> > connectivity I need to restart Riak CS.
>>
>> Riak CS 1.5.0 has a connection pool leak problem [3]. You might be
>> hitting that issue...
>>
>> [1]: https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#8.5
>> [2]: https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#8.2.2
>> [3]:
>> http://docs.basho.com/riakcs/latest/cookbooks/Riak-CS-Release-Notes/#Riak-CS-1-5-2
>>
>> On Sat, Nov 14, 2015 at 2:04 AM, Vladyslav Zakhozhai
>> <v.zakhozhai at smartweb.com.ua> wrote:
>> >
>> > Hello.
>> >
>> > I have Riak CS cluster with 18 nodes. On each node there is Riak CS and
>> > Riak
>> > service and one Stanchion node.
>> >
>> > Versions:
>> > Riak 1.4.12
>> > Riak CS 1.5.0
>> > Stanchion 1.5.0
>> >
>> > Riak CS and Riak are placed behind HAProxy balancers:
>> >
>> > WAN -> HAProxy -> Riak CS nodes -> HAProxy -> Riak nodes.
>> > and
>> > Stanchion -> HAProxy -> Riak
>> >
>> > Today, due to a spike of traffic load (about 1000 rps) on the cluster,
>> > 50% of Riak CS returned HTTP 500 and 503 (querying the /riak-cs/ping
>> > resource was also not successful).
>> >
>> > In Riak CS logs I've seen the following messages:
>> >
>> > 2015-11-13 13:13:09.514 [error]
>> > <0.11264.1387>@riak_cs_wm_common:maybe_create_user:222 Retrieval of user
>> > record for s3 failed. Reason: disconnected
>> >
>> > In Riak CS logs I see the following:
>> > 2015-11-13 17:31:52.995 [error] <0.11254.6534> Lager event handler
>> > error_logger_lager_h exited with reason
>> >
>> > {'EXIT',{{badmatch,["/buckets/uaprom-image/objects/272547384_cid1322007_pid183135512-26a7c1f3.jpg",{error,{error,{badmatch,{error,closed}},[{webmachine_request,recv_unchunked_body,3,[{file,"src/webmachine_request.erl"},{line,471}]},{webmachine_request,call,2,[{file,"src/webmachine_request.erl"},{line,193}]},{wrq,stream_req_body,2,[{file,"src/wrq.erl"},{line,121}]},{riak_cs_wm_object,handle_normal_put,2,[{file,"src/riak_cs_wm_object.erl"},{line,341}]},{riak_cs_wm_common,accept_body,2,[{file,...},...]},...]}},...]},...}}
>> >
>> > I suspect that there was a problem between Riak CS - Stanchion or
>> > Stanchion - Riak. I have no clear idea how to troubleshoot Stanchion.
>> > The main reason is the following. Stanchion works fine, the service is
>> > up (answers the ping command). But it is very laconic: there is almost
>> > nothing in the console and error logs (even with debug log level).
>> >
>> > Riak CS adds, removes, gets properties through Stanchion service. Am I
>> > right? I can't exactly understand where is my bottleneck - Riak, Riak CS
>> > or Stanchion.
>> >
>> > When we need authenticated access for reading object from bucket do we
>> > need
>> > Stanchion? If not I can't understand why I had a lot of error during
>> > getting
>> > objects from Riak CS.
>> >
>> > Thank you in advance.
>> >
>> > P. S. Sometimes when there is some issues with Riak CS - Stanchion
>> > connectivity I need to restart Riak CS.
>> >
>> >
>> > _______________________________________________
>> > riak-users mailing list
>> > riak-users at lists.basho.com
>> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> >
>>
>>
>>
>> --
>> Kazuhiro Suzuki | Basho Japan KK



-- 
Kazuhiro Suzuki | Basho Japan KK



