Xmpp-websocket failures when there are many new connections

Nothing suspicious in syslog?

I’m afraid not.

Maybe the limit is that jicofo doesn’t want to stress the bridge?

// The assumed time that an endpoint takes to start contributing fully to the load on a bridge. To avoid allocating
// a burst of endpoints to the same bridge, the bridge stress is adjusted by adding the number of new endpoints
// in the last [participant-rampup-interval] multiplied by [average-participant-stress].
participant-rampup-interval = 20 seconds
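
For anyone who wants to experiment with this: a sketch of how I’d expect the override to look in /etc/jitsi/jicofo/jicofo.conf, assuming the option sits under jicofo.bridge as in jicofo’s reference.conf (the 10-second value is just an example):

    // /etc/jitsi/jicofo/jicofo.conf -- sketch; assumes the option lives under
    // jicofo.bridge as in reference.conf; 10 seconds is an arbitrary example
    jicofo {
      bridge {
        // Shorter window => less of the "new endpoint" penalty added to bridge stress
        participant-rampup-interval = 10 seconds
      }
    }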

Looking back, AFAICT there is nothing in your posts about the number of JVBs you use in your testing.

I’m trying to get a baseline performance figure for the instance type we’re using, so I’ve reduced down to 1 JVB (standalone, not on the same host as prosody).

I haven’t really focused on JVB or Jicofo since the failures always occur when establishing the connection to xmpp-websocket (with a matching error in the nginx error logs, where it cannot relay the request to prosody).

Also, there are no other users on the deployment and JVB is a long way from being stressed (I’m monitoring packet rates, cpu, mem, …).

Perhaps I shouldn’t rule out JVB too soon. But as far as I can tell, it’s ticking along happily.

Do you have some modules in prosody doing network requests on client connections, or something like that? Maybe try disabling some modules … you may find a faulty one …

Good point. We have already disabled all the custom modules that make network calls, but it might also be worth disabling all but the essential modules to see if it makes a difference.

If that doesn’t work, I might just throw in the towel for today and start with a fresh instance tomorrow :smiley:

Still seeing xmpp-websocket connection failures even with all but essential modules disabled :frowning:

Maybe try playing with the buffer parameters for nginx websockets, like in this post?


Good shout @gpatel-fr. This does appear to be a limit on the nginx side, rather than prosody.

After a bunch of tweaks on the nginx side, I’ve now managed to launch ~110 concurrent new participants without errors.

The most telling evidence that this is an nginx connection limit rather than prosody is that the threshold falls drastically if I don’t use the load-test version of the UI; in my current setup, static resources are still served by nginx, so the full UI increases connections to nginx but not to prosody.

Will update with the config changes I’ve made once I’ve done some more tests and worked out which of the tweaks actually made the difference.


I don’t think I’ve got to the bottom of this yet, but I need to park it for now, so here’s a summary of where I think I’m at.

The issue: Connections were being rejected when there were lots of simultaneous new connections, e.g. lots of new participants joining at roughly the same time. This resulted in some participants failing to join.

Observations I found helpful:

  • Running malleus with the full web UI instead of --use-load-test significantly increased the error rate, which indicates that the bottleneck is more likely nginx (which also hosts the static resources for the app) than prosody.
  • Adding more CPUs to the host raises the threshold where errors start to occur, even though CPU usage for the nginx/prosody/jicofo processes was not very high when the errors occurred.
  • There are no failures if slight delays are added between participants joining. This means the issue here is the number of concurrent new connections the host can handle, rather than the total number of active connections.
    • This helps if we’re load testing with 1 conference and many participants, but it does not help with starting many conferences with few participants, because the --join-delay option in malleus injects delays between participants joining within each conference but not between the 1st participants of different conferences. In other words, if we start with --conferences=200 --participants=3, the first 200 participants will join at about the same time (see the invocation sketched after this list).
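
For illustration, the kind of malleus run I mean. Flag names below are from my memory of jitsi-meet-torture’s scripts/malleus.sh, so double-check them against your copy; the instance URL is a placeholder:

    # Sketch: 200 conferences x 3 participants against a hypothetical deployment.
    # --join-delay spaces out joins within a conference, but the first
    # participant of each of the 200 conferences still joins at ~the same time.
    ./scripts/malleus.sh \
        --conferences=200 \
        --participants=3 \
        --join-delay=1000 \
        --duration=120 \
        --use-load-test \
        --instance-url=https://meet.example.com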

The remedy: At the end of the day, it comes down to tuning nginx to make the most of the host, and provisioning enough resources to handle the anticipated load. Obvious in hindsight :smiley:

The following is what I did to raise the threshold where errors start to occur to a point high enough for me to complete my load testing.

A lot of the config tweaks below were based on this post. I’m yet to pinpoint which of them made the most difference, but they all appeared to have a positive impact.

  1. Increasing the number of connections each nginx worker will accept (sanity checks for this are sketched after this list).
    a. In /etc/nginx/nginx.conf

    worker_processes 1;
    # The default is "auto", which means one worker per proc on the host.
    # Setting this explicitly lets us leave the other procs free for prosody and jicofo
    
    worker_rlimit_nofile 30000;
    # Max number of open file descriptors per worker process.
    # I set this to worker_connections*2 on the assumption that a connection might require two
    # file descriptors in the case of proxy calls: one for upstream and one for downstream
    
    events {
        worker_connections 15000;
        # Max simultaneous connections per worker
    }
    

    b. Increase fd limits for nginx. In /etc/security/limits.conf, add:

    # this should match worker_rlimit_nofile set above
    # www-data will be the user nginx is running as.
    www-data        soft    nofile          30000
    www-data        hard    nofile          50000
    
  2. Increase proxy buffer sizes for /xmpp-websocket. In /etc/nginx/sites-available/jitsi.conf:

    # xmpp websockets
    location = /xmpp-websocket {
        proxy_pass http://127.0.0.1:5280/xmpp-websocket?prefix=$prefix&$args;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $http_host;
        tcp_nodelay on;
        # Added the following
        proxy_buffer_size 512k; 
        proxy_buffers 16 512k; 
        proxy_busy_buffers_size  512k;
    }
    
  3. Increase c2s limits in prosody:

    limits = {
      c2s = {
        rate = "1024kb/s";  -- increased from 10kb/s
      };
      s2sin = {
        rate = "30kb/s";
      };
    }
    
  4. Add more JMS instances in front of the same group of JVBs.

    • The intent of my exercise was to load test a single JVB with many small conferences. By running multiple shards (fronted by haproxy) that the same JVB connects to, I could multiply the number of conferences I could start simultaneously before failures start to occur.
  5. Don’t waste nginx connections on static resources (css, js, images, sound).

    • For load testing with malleus, using the load-test UI helps a lot.
    • In a large prod deployment, one would move these files to a CDN and reference them by changing /usr/share/jitsi-meet/base.html (sketched after this list).
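
To confirm the fd limits from step 1 actually took effect, a couple of sanity checks (a sketch; it assumes nginx workers run as www-data, as on a stock Debian/Ubuntu install):

    # Validate the config and restart nginx so the new limits are picked up
    sudo nginx -t && sudo systemctl restart nginx

    # Check the "Max open files" line for one of the worker processes
    pgrep -u www-data nginx | head -n1 | xargs -I{} grep 'open files' /proc/{}/limits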
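
And for step 5, the base.html change is roughly the following (the CDN URL is a placeholder):

    <!-- /usr/share/jitsi-meet/base.html: make the web app load its static
         assets from a CDN instead of this nginx (URL is hypothetical) -->
    <base href="https://cdn.example.com/jitsi-meet/" />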

So, nothing groundbreaking. But hopefully helpful as a starting point if anyone is experiencing similar issues.

:sweat_smile:


Doesn’t “unlimited_jids = {...}” (which is in the default config in current stable) have the same effect?

I have the following for this:

/etc/systemd/system/nginx.service.d/override.conf

[Service]
LimitNOFILE=32768
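
To apply the override:

    sudo systemctl daemon-reload
    sudo systemctl restart nginx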

Instead of limiting worker_processes (which is auto by default), maybe installing prosody on a separate server would be better


Not sure if unlimited_jids helps with websocket requests from participants.

I don’t think so either; unlimited_jids is for known JIDs (jicofo, jvb, jibri) while participants’ JIDs are random.
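
For context, the sort of entry I mean in the global section of prosody’s config; the domains below are placeholders for whatever your auth domain is:

    -- /etc/prosody/prosody.cfg.lua (global section) -- sketch with placeholder domains.
    -- Only the known component accounts can be exempted; participant JIDs
    -- cannot be listed here because they are random.
    unlimited_jids = {
        "focus@auth.meet.example.com",  -- jicofo
        "jvb@auth.meet.example.com",    -- videobridge
    }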

https://prosody.im/doc/modules/mod_limits

unlimited_jids | default: {} | Set of JIDs exempt from limits (added in 0.12)

IIUC unlimited_jids doesn’t work on 0.11.x either

Aha. That’s definitely something worth keeping in mind. Thanks.

It’s a bit surprising that it’s in the Jitsi docs then. Let’s see…

1b534392d (Kim Alvefur  2019-04-02 20:38:51 +0200 126) function module.add_host(module)
1b534392d (Kim Alvefur  2019-04-02 20:38:51 +0200 127)  local unlimited_jids = module:get_option_inherited_set("unlimited_jids", {});
1b534392d (Kim Alvefur  2019-04-02 20:38:51 +0200 128) 
e632119ff (Kim Alvefur  2019-04-02 21:22:20 +0200 129)  if not unlimited_jids:empty() then
1b534392d (Kim Alvefur  2019-04-02 20:38:51 +0200 130)          module:hook("authentication-success", function (event)
1b534392d (Kim Alvefur  2019-04-02 20:38:51 +0200 131)                  local session = event.session;
1b534392d (Kim Alvefur  2019-04-02 20:38:51 +0200 132)                  local jid = session.username .. "@" .. session.host;
1b534392d (Kim Alvefur  2019-04-02 20:38:51 +0200 133)                  if unlimited_jids:contains(jid) then
5d73586b4 (Kim Alvefur  2021-07-29 20:11:48 +0200 134)                          unlimited(session);
1b534392d (Kim Alvefur  2019-04-02 20:38:51 +0200 135)                  end
1b534392d (Kim Alvefur  2019-04-02 20:38:51 +0200 136)          end);
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 137) 
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 138)          module:hook("s2sout-established", function (event)
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 139)                  local session = event.session;
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 140)                  if unlimited_jids:contains(session.to_host) then
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 141)                          unlimited(session);
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 142)                  end
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 143)          end);
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 144) 
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 145)          module:hook("s2sin-established", function (event)
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 146)                  local session = event.session;
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 147)                  if session.from_host and unlimited_jids:contains(session.from_host) then
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 148)                          unlimited(session);
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 149)                  end
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 150)          end);
9dd9cb439 (Kim Alvefur  2021-07-29 20:16:11 +0200 151) 
1b534392d (Kim Alvefur  2019-04-02 20:38:51 +0200 152)  end

Looking at the dates, the 0.12 restriction could apply to the s2s part.

It should be working on 0.11; that’s why we have a copy in jitsi-meet: jitsi-meet/mod_limits_exception.lua at master · jitsi/jitsi-meet · GitHub


Instead of limiting worker_processes (which is auto by default), maybe installing prosody on a separate server would be better

Agreed with this. For a scalable setup there is absolutely no reason to funnel everything through nginx. Let nginx do the static files (or ditch nginx entirely and put them on S3 or similar) and handle the rest separately.

The only reason for the default “everything through nginx” approach is simplicity in small installations.
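
For example, clients can be pointed straight at a host dedicated to prosody via config.js. The hostnames below are placeholders, and the prosody host would need a valid certificate and its websocket/BOSH endpoints reachable:

    // /etc/jitsi/meet/meet.example.com-config.js -- sketch with placeholder hosts:
    // send XMPP traffic to a dedicated host instead of the nginx serving the UI
    var config = {
        // ...
        bosh: 'https://xmpp.meet.example.com/http-bind',
        websocket: 'wss://xmpp.meet.example.com/xmpp-websocket',
        // ...
    };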
