Xmpp-websocket failures when many new connections arrive at once

While load testing with MalleusJitsificus, we noticed that websocket connections to /xmpp-websocket sometimes fail when a moderate number (>30) of new participants are connecting at about the same time.

Observations when running malleus.sh with --join-delay=100 --conferences=1 --participants=50:

  • A few of the participants fail with “org.openqa.selenium.WebDriverException: unknown error: net::ERR_CONNECTION_CLOSED”.

  • In nginx error logs, there are indications that prosody is sometimes not accepting connections from nginx. So out of the 50 participant connections, we might see 2 or 3 rejections with:

    ... recv() failed (104: Connection reset by peer) while proxying upgraded connection, client: 10.1.28.158, server: meet.mydomain, request: "GET /xmpp-websocket?room=loadtest0 HTTP/1.1", upstream: "http://127.0.0.1:5280/xmpp-websocket?prefix=&room=loadtest0", host: "meet.mydomain"
    
  • Nothing unusual in the prosody logs, as far as I can tell.

  • If I join the loadtest room manually at about the same time, I sometimes see the dreaded “Connection Error” page but with nothing in console logs. All works as usual if I reload the page after a few seconds.

  • CPU usage of prosody process remains relatively low throughout.

The errors don’t occur if we add a long delay between joins, e.g. --join-delay=3000, so the issue appears to be tied not to the number of active connections but to the rate of concurrent new connections.
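
For reference, the two invocations look roughly like this (the script path is assumed from a jitsi-meet-torture checkout; other flags omitted):

# Fails intermittently: 50 participants joining ~100 ms apart
./scripts/malleus.sh --conferences=1 --participants=50 --join-delay=100

# Succeeds: the same 50 participants, 3 s between joins
./scripts/malleus.sh --conferences=1 --participants=50 --join-delay=3000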

Any idea what could be amiss, or how I can investigate further?


Your prosody config? Are you using epoll?

Try setting the epoll network backend and

network_settings = {
  tcp_backlog = 511;
}

Does that change anything?

I see the default is 32:
https://prosody.im/doc/ports
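
One more thing to check (an assumption on my part, I haven’t verified your kernel settings): the kernel silently caps whatever backlog an application passes to listen() at net.core.somaxconn, which is often 128 on older kernels, so tcp_backlog = 511 only fully takes effect if that sysctl is at least as high. For example:

# See the current kernel cap on listen() backlogs
sysctl net.core.somaxconn

# Raise it if it is lower than the configured tcp_backlog
sudo sysctl -w net.core.somaxconn=1024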

Yup. Using epoll. I can see “Prosody is using the libevent epoll backend for connection handling” in logs.

Here’s an excerpt of my prosody.conf:

-- connection handling
use_libevent = true
network_backend = "epoll"
use_ipv6 = false

network_settings = {
  tcp_backlog = 511;
}

-- websockets
cross_domain_websocket = true;
consider_websocket_secure = true;

-- stream management (mod_smacks)
smacks_max_unacked_stanzas = 5;
smacks_hibernation_time = 60;
smacks_max_hibernated_sessions = 1;
smacks_max_old_sessions = 1;

-- per-connection rate limits (mod_limits)
limits = {
  c2s = {
    rate = "10kb/s";
  };
  s2sin = {
    rate = "30kb/s";
  };
}

Comment out use_libevent and see whether it changes anything.

Same issue with use_libevent commented out :frowning:

This really sounds like tcp_backlog not being applied… hmm.

Hmm, seems the docs are old; the default is actually 128.

Maybe enable debug logging and see what happens with the connections that get closed…
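
In prosody.cfg.lua that would be something like the following (a sketch; the log file paths are just examples), followed by a prosody restart:

-- Route debug-level messages to their own file (example paths)
log = {
  debug = "/var/log/prosody/prosody.debug.log";
  error = "/var/log/prosody/prosody.err";
}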

And you are on the latest prosody right?

I believe so. 0.11.9-1~bionic1

OK. Will be back on this in an hour. I will try to dig deeper into what prosody is doing.

It certainly does sound like tcp_backlog could be the culprit.

I’ll update when I know more. Appreciate your help. Thanks.

Did you try raising this rate parameter?

I did. Tripled it to 30kb/s and still saw a similar failure rate.

How about 300?

Good point, I was about to ask: can you test with rate = "512kb/s"?
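
That is, the limits block from the excerpt above would become something like:

limits = {
  c2s = {
    rate = "512kb/s";
  };
  s2sin = {
    rate = "30kb/s";
  };
}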

Thanks @gpatel-fr @damencho. Bumped it to 512kb/s. I was about to report success after launching 50 participants without errors, but then I reran the test and it started failing again :cry:

Not really sure what to make of this. I also tried restarting prosody between runs, with the same results.

I can consistently launch 40 participants with no errors, so we’re definitely hitting some threshold. Just need to figure out what/where …

tcp_backlog appears to be set on the socket correctly.

# ss -l | grep 5280
tcp    LISTEN   0       511      0.0.0.0:5280      0.0.0.0:*

And the 511 Send-Q value does change accordingly when I change tcp_backlog in prosody config.
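
To check whether the accept queue is actually overflowing during a run, I’m also going to watch the kernel’s cumulative listen-queue counters (assuming nstat and netstat are available on this box):

# Times a listen socket’s accept queue overflowed, and connections
# dropped as a result (cumulative counters since boot)
nstat -az TcpExtListenOverflows TcpExtListenDrops

# The same counters via the netstat summary
netstat -s | grep -i listen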

Haven’t found anything interesting in logs yet. Even with network_settings.debug = true.

Will keep looking.

In this case, try raising it again… 1024!

:smiley:

Happy to give it a go. Might even remove the limits module altogether to rule that out.
One would hope that if a 10kb/s limit allows 30 connections, a 512kb/s limit should surely be enough to handle 50…

Anyway, no harm in trying. Go big or go home. 2048!

Alas, still seeing errors with a 2048kb/s limit and with the limits module commented out entirely.