Riak CS/Stanchion troubleshooting (Retrieval of user record)

Vladyslav Zakhozhai v.zakhozhai at smartweb.com.ua
Thu Nov 19 06:04:10 EST 2015


Hi,

Sorry for my long silence. Kazuhiro, thank you for your answer. The
situation is clearer now. I think I need to extend my Riak cluster with
more nodes to increase performance. My reasoning is based on these HAProxy
log entries:

Nov 19 11:42:40 localhost haproxy[24678]: 172.18.103.31:49608 [19/Nov/2015:11:42:40.137] riak riak_backend/viper 3/5/113 1471 -- 8191/2594/2594/138/0 0/0
Nov 19 11:42:41 localhost haproxy[24678]: 172.18.108.170:44517 [19/Nov/2015:11:41:42.264] riak riak_backend/serpent 1/0/58806 5982 cD 8191/2849/2849/155/0 0/0
Nov 19 11:59:46 localhost haproxy[24678]: 172.18.102.39:42919 [19/Nov/2015:11:42:14.566] riak riak_backend/mussurana 1/0/1052250 1484789 -- 3134/2888/2888/154/0 0/0
Nov 19 12:07:26 localhost haproxy[24678]: 172.18.40.2:44946 [19/Nov/2015:11:42:14.508] riak riak_backend/rattler 1/0/1511814 2471638 cD 3172/2888/2888/161/0 0/0
Nov 19 12:17:56 localhost haproxy[24678]: 172.18.103.30:58654 [19/Nov/2015:11:42:40.141] riak riak_backend/mamba 3/1/2116572 3383878 cD 2988/2886/2886/166/0 0/0
Nov 19 12:23:55 localhost haproxy[24678]: 172.18.40.4:59089 [19/Nov/2015:11:41:39.831] riak riak_backend/eggeater 1/0/2535854 4109579 CD 3020/2888/2888/153/0 0/0
Nov 19 12:38:54 localhost haproxy[24678]: 172.18.40.4:37536 [19/Nov/2015:11:41:47.533] riak riak_backend/cobra 1/0/3427457 3387298 -- 2983/2886/2886/159/0 0/0
Nov 19 12:50:37 localhost haproxy[24678]: 172.18.102.39:51870 [19/Nov/2015:11:41:49.413] riak riak_backend/lora 1/0/4128262 6445878 -- 2989/2889/2889/164/0 0/0

I think this is not an HAProxy timeout issue. Am I right?
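
As a reading aid for the log entries above: in HAProxy's tcplog format, the two-character field after the byte count is the termination state. The sketch below is in Python and is not part of the original message. It parses one tcplog line and decodes a subset of the state flags as described in the HAProxy documentation. The names `parse_tcplog`, `CAUSE`, and `PHASE` are illustrative, and the regular expression assumes the default tcplog layout.

```python
import re

# First character of the termination state: what ended the session (subset).
CAUSE = {
    "-": "normal completion",
    "C": "aborted by the client",
    "S": "aborted or refused by the server",
    "c": "client-side timeout expired",
    "s": "server-side timeout expired",
}

# Second character: what the session was doing when it ended (subset).
PHASE = {
    "-": "after end of transfer",
    "Q": "waiting in queue",
    "C": "waiting for the server connection",
    "D": "data phase",
}

# Assumed default tcplog layout:
# [accept_date] frontend backend/server Tw/Tc/Tt bytes state conn-counters queues
TCPLOG = re.compile(
    r"\[(?P<accept>[^\]]+)\]\s+(?P<frontend>\S+)\s+"
    r"(?P<backend>[^/\s]+)/(?P<server>\S+)\s+"
    r"(?P<tw>-?\d+)/(?P<tc>-?\d+)/(?P<tt>\+?\d+)\s+"
    r"(?P<bytes>\+?\d+)\s+(?P<state>\S\S)\s+"
    r"(?P<actconn>\d+)/(?P<feconn>\d+)/(?P<beconn>\d+)/"
    r"(?P<srv_conn>\d+)/(?P<retries>\+?\d+)\s+"
    r"(?P<srv_queue>\d+)/(?P<backend_queue>\d+)"
)


def parse_tcplog(line):
    """Parse one tcplog line; return a dict, or None if it does not match."""
    # Collapse whitespace so entries wrapped by a mail client still parse.
    match = TCPLOG.search(" ".join(line.split()))
    if match is None:
        return None
    state = match.group("state")
    return {
        "server": match.group("server"),
        "total_ms": int(match.group("tt")),
        "bytes_read": int(match.group("bytes")),
        "state": state,
        "cause": CAUSE.get(state[0], "other (see HAProxy docs)"),
        "phase": PHASE.get(state[1], "other (see HAProxy docs)"),
    }


if __name__ == "__main__":
    sample = (
        "Nov 19 11:42:41 localhost haproxy[24678]: 172.18.108.170:44517 "
        "[19/Nov/2015:11:41:42.264] riak riak_backend/serpent 1/0/58806 5982 cD "
        "8191/2849/2849/155/0 0/0"
    )
    print(parse_tcplog(sample))
```

Under this decoding, the `cD` entries correspond to a client-side timeout during the data phase, `CD` to a client abort during the data phase, and `--` to a normal completion.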

As for the HAProxy configuration, this is what I have for Riak PB:
frontend riak
    bind    172.18.108.170:8087

    mode    tcp
    option  tcplog
    option  contstats

    timeout client 30s

    default_backend     riak_backend

backend riak_backend
    mode    tcp
    balance roundrobin
    option  tcpka
    option  srvtcpka
    option  httpchk GET /ping

    timeout server 60s

    server rinkhals rinkhals.pleiad.uaprom:8087 weight 1 maxconn 1024 check port 8090
    server chuckwalla chuckwalla.pleiad.uaprom:8087 weight 1 maxconn 1024 check port 8090
    and so on...

And config for Riak CS:
frontend riakcs
    bind    193.34.169.1:80
    mode    http
    option  contstats
    option  httplog
    option  http-server-close
    timeout client      30s

    default_backend riakcs_backend

backend riakcs_backend
    mode http
    balance roundrobin

    option httpchk GET /riak-cs/ping
    option redispatch
    retries 3

    timeout server 60s
    timeout connect 60s
    timeout http-request 60s

    server rinkhals rinkhals.pleiad.uaprom:8080 weight 1 maxconn 1024 check port 8000
    server chuckwalla chuckwalla.pleiad.uaprom:8080 weight 1 maxconn 1024 check port 8000
    and so on...

And the defaults section:
defaults
        log     global
        option  dontlognull
        retries 3
        option redispatch
        maxconn 8192
        timeout connect 5000
        timeout client 4h
        timeout server 4h
        balance leastconn

I want to mention again that I have about 1000 rps to Riak CS, and the
average object size is 10 Kb.

Thank you.


On Mon, Nov 16, 2015 at 4:01 AM Kazuhiro Suzuki <kaz at basho.com> wrote:

> Hi,
>
> HAProxy's timeout settings often cause disconnected errors on a Riak
> CS deployment under high workload. The termination_state [1] field in
> tcplog [2] shows whether a timeout occurred.
>
> > 2015-11-13 13:13:09.514 [error]
> > <0.11264.1387>@riak_cs_wm_common:maybe_create_user:222 Retrieval of user
> > record for s3 failed. Reason: disconnected
>
> This means Riak CS failed to read user data from Riak for
> authentication due to a disconnected error.
>
> > Riak CS adds, removes, gets properties through Stanchion service. Am I
> > right? I can't exactly understand where my bottleneck is - Riak, Riak CS
> > or Stanchion.
>
> Stanchion is mainly used only to update/delete user and bucket data.
> To inspect a node, Riak S2/CS 2.1 introduced new metrics, including
> various latencies and counters, which help identify bottlenecks.
>
> > When we need authenticated access to read an object from a bucket, do we
> > need Stanchion? If not, I can't understand why I got so many errors while
> > getting objects from Riak CS.
>
> Authenticated access is always necessary but a read request of user
> data for auth is issued from Riak CS to Riak directly, not through
> Stanchion.
>
> > P. S. Sometimes when there are issues with Riak CS - Stanchion
> > connectivity I need to restart Riak CS.
>
> Riak CS 1.5.0 has a connection pool leak problem [3]. You might be
> hitting that issue...
>
> [1]: https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#8.5
> [2]: https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#8.2.2
> [3]: http://docs.basho.com/riakcs/latest/cookbooks/Riak-CS-Release-Notes/#Riak-CS-1-5-2
>
> On Sat, Nov 14, 2015 at 2:04 AM, Vladyslav Zakhozhai
> <v.zakhozhai at smartweb.com.ua> wrote:
> >
> > Hello.
> >
> > I have a Riak CS cluster with 18 nodes. Each node runs the Riak CS and
> > Riak services, and there is one Stanchion node.
> >
> > Versions:
> > Riak 1.4.12
> > Riak CS 1.5.0
> > Stanchion 1.5.0
> >
> > Riak CS and Riak sit behind HAProxy balancers:
> >
> > WAN -> HAProxy -> Riak CS nodes -> HAProxy -> Riak nodes.
> > and
> > Stanchion -> HAProxy -> Riak
> >
> > Today, due to a spike of traffic load (about 1000 rps) on the cluster, 50%
> > of Riak CS returned HTTP 500 and 503 (querying the /riak-cs/ping resource
> > was also unsuccessful).
> >
> > In Riak CS logs I've seen the following messages:
> >
> > 2015-11-13 13:13:09.514 [error]
> > <0.11264.1387>@riak_cs_wm_common:maybe_create_user:222 Retrieval of user
> > record for s3 failed. Reason: disconnected
> >
> > In Riak CS logs I see the following:
> > 2015-11-13 17:31:52.995 [error] <0.11254.6534> Lager event handler
> > error_logger_lager_h exited with reason
> > {'EXIT',{{badmatch,["/buckets/uaprom-image/objects/272547384_cid1322007_pid183135512-26a7c1f3.jpg",{error,{error,{badmatch,{error,closed}},[{webmachine_request,recv_unchunked_body,3,[{file,"src/webmachine_request.erl"},{line,471}]},{webmachine_request,call,2,[{file,"src/webmachine_request.erl"},{line,193}]},{wrq,stream_req_body,2,[{file,"src/wrq.erl"},{line,121}]},{riak_cs_wm_object,handle_normal_put,2,[{file,"src/riak_cs_wm_object.erl"},{line,341}]},{riak_cs_wm_common,accept_body,2,[{file,...},...]},...]}},...]},...}}
> >
> > I suspect there was a problem between Riak CS and Stanchion, or between
> > Stanchion and Riak. I have no clear idea how to troubleshoot Stanchion,
> > mainly because it works fine and the service is up (it answers the ping
> > command), but it is very laconic: there is almost nothing in the console
> > and error logs (even with the debug log level).
> >
> > Riak CS adds, removes, gets properties through Stanchion service. Am I
> > right? I can't exactly understand where my bottleneck is - Riak, Riak CS
> > or Stanchion.
> >
> > When we need authenticated access to read an object from a bucket, do we
> > need Stanchion? If not, I can't understand why I got so many errors while
> > getting objects from Riak CS.
> >
> > Thank you in advance.
> >
> > P. S. Sometimes when there is some issues with Riak CS - Stanchion
> > connectivity I need to restart Riak CS.
> >
> >
> > _______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >
>
>
>
> --
> Kazuhiro Suzuki | Basho Japan KK
>