Hitting a hard limit around 600 participants, then users start dropping constantly. Any suggestions?

Ah! I had been so focused on watching the Jitsi and Prosody logs that I had stopped looking at the nginx ones:

==> /var/log/nginx/error.log <==
2021/05/26 00:49:19 [alert] 757#757: *14448 768 worker_connections are not enough while connecting to upstream, client: 3.141.39.164, server: lmtgt1.dev2dev.net, request: "GET /xmpp-websocket?room=loadtest49 HTTP/1.1", upstream: "http://127.0.0.1:5280/xmpp-websocket?prefix=&room=loadtest49", host: "lmtgt1.dev2dev.net", referrer: "https://lmtgt1.dev2dev.net/loadtest49"

==> /var/log/nginx/access.log <==
3.141.39.164 - - [26/May/2021:00:49:19 +0000] "GET /xmpp-websocket?room=loadtest49 HTTP/1.1" 500 600 "https://lmtgt1.dev2dev.net/loadtest49" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"

==> /var/log/nginx/error.log <==
2021/05/26 00:49:19 [alert] 757#757: *14449 768 worker_connections are not enough while connecting to upstream, client: 3.133.120.158, server: lmtgt1.dev2dev.net, request: "GET /xmpp-websocket?room=loadtest17 HTTP/1.1", upstream: "http://127.0.0.1:5280/xmpp-websocket?prefix=&room=loadtest17", host: "lmtgt1.dev2dev.net"

==> /var/log/nginx/access.log <==
3.133.120.158 - - [26/May/2021:00:49:19 +0000] "GET /xmpp-websocket?room=loadtest17 HTTP/1.1" 500 600 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"
18.223.241.112 - - [26/May/2021:00:49:19 +0000] "GET /loadtest41 HTTP/1.1" 200 20996 "https://lmtgt1.dev2dev.net/loadtest41" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"

Edited /etc/nginx/nginx.conf to raise worker_connections from 768 to 2000…
events {
    worker_connections 2000; # increased by Hawke for larger capacity scaling
    # multi_accept on;
}

Restarted nginx, restarted the load test…
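For reference, that restart step was basically the following (a minimal sketch, assuming a systemd-managed nginx as on stock Debian/Ubuntu):

sudo nginx -t                     # sanity-check the edited config first
sudo systemctl restart nginx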

That immediately got the rooms up to 10 attendees each (should be 12 in loadtest0 and 10 in the rest).

Saw the following errors now in nginx:

2021/05/26 00:54:36 [alert] 5460#5460: *10586 socket() failed (24: Too many open files) while connecting to upstream, client: 13.58.225.151, server: lmtgt1.dev2dev.net, request: "GET /colibri-ws/default-id/88005dd23b712f1b/c52b7bdc?pwd=ug4v1o6dc548rccjeu55t4efn HTTP/1.1", upstream: "http://127.0.0.1:9090/colibri-ws/default-id/88005dd23b712f1b/c52b7bdc?pwd=ug4v1o6dc548rccjeu55t4efn", host: "lmtgt1.dev2dev.net"
2021/05/26 00:54:36 [alert] 5460#5460: *10587 socket() failed (24: Too many open files) while connecting to upstream, client: 18.222.20.104, server: lmtgt1.dev2dev.net, request: "GET /colibri-ws/default-id/2c374845c031e5f2/86976c5e?pwd=4gdmuef8v6pqbfh71lnlqsam0p HTTP/1.1", upstream: "http://127.0.0.1:9090/colibri-ws/default-id/2c374845c031e5f2/86976c5e?pwd=4gdmuef8v6pqbfh71lnlqsam0p", host: "lmtgt1.dev2dev.net"
2021/05/26 00:54:36 [alert] 5460#5460: *10588 socket() failed (24: Too many open files) while connecting to upstream, client: 18.188.6.169, server: lmtgt1.dev2dev.net, request: "GET /colibri-ws/default-id/9849ca00fb60edb7/8617953d?pwd=50i3c8esh847ttrkndqh471mk0 HTTP/1.1", upstream: "http://127.0.0.1:9090/colibri-ws/default-id/9849ca00fb60edb7/8617953d?pwd=50i3c8esh847ttrkndqh471mk0", host: "lmtgt1.dev2dev.net"
2021/05/26 00:54:36 [crit] 5462#5462: accept4() failed (24: Too many open files)

Edited /etc/security/limits.conf to add the following:
nginx soft nofile 30000
nginx hard nofile 50000
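
One caveat worth noting (this depends on how nginx is started, so treat it as an assumption for this setup): /etc/security/limits.conf applies to PAM login sessions, so an nginx service started by systemd may not pick these values up. The usual way to raise the limit for the service itself is a systemd drop-in, roughly:

# created via: sudo systemctl edit nginx
# -> /etc/systemd/system/nginx.service.d/override.conf
[Service]
LimitNOFILE=50000

…followed by sudo systemctl daemon-reload && sudo systemctl restart nginx.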

Edited /etc/nginx/nginx.conf (again) to increase the worker settings and add an rlimit for open files based on the numbers above:

events {
    worker_connections 2000;
    # multi_accept on;
    worker_rlimit_nofile 300000
}

Rebooted.

Now seeing this in the nginx logs:

HTTP/1.1", upstream: "http://127.0.0.1:5280/xmpp-websocket?prefix=&room=loadtest81", host: "lmtgt1.dev2dev.net"
2021/05/26 00:59:08 [error] 5463#5463: *9913 recv() failed (104: Connection reset by peer) while proxying upgraded connection, client: 3.133.129.17, server: lmtgt1.dev2dev.net, request: "GET /xmpp-websocket?room=loadtest0 HTTP/1.1", upstream: "http://127.0.0.1:5280/xmpp-websocket?prefix=&room=loadtest0", host: "lmtgt1.dev2dev.net"
2021/05/26 01:04:47 [error] 5459#5459: *34 recv() failed (104: Connection reset by peer) while proxying upgraded connection, client: 96.79.202.21, server: lmtgt1.dev2dev.net, request: "GET /colibri-ws/default-id/d74ad537a372a336/0c18074b?pwd=6nrfkf9ohkbfj6mbkvhr7k2slo HTTP/1.1", upstream: "http://127.0.0.1:9090/colibri-ws/default-id/d74ad537a372a336/0c18074b?pwd=6nrfkf9ohkbfj6mbkvhr7k2slo", host: "lmtgt1.dev2dev.net"
2021/05/26 01:04:47 [error] 5459#5459: *30 recv() failed (104: Connection reset by peer) while proxying upgraded connection, client: 96.79.202.21, server: lmtgt1.dev2dev.net, request: "GET /colibri-ws/default-id/d74ad537a372a336/8831aa6c?pwd=6qfum0gc1uu9ubkl35fk1c3b2h HTTP/1.1", upstream: "http://127.0.0.1:9090/colibri-ws/default-id/d74ad537a372a336/8831aa6c?pwd=6qfum0gc1uu9ubkl35fk1c3b2h", host: "lmtgt1.dev2dev.net"
2021/05/26 01:05:18 [emerg] 591#591: unexpected "}" in /etc/nginx/nginx.conf:10

And Jitsi is running, but I can't connect to it via the web…

Ah, some things were in the wrong place, plus a missing semicolon. Cleaned up, nginx.conf now looks like this:

events {
    worker_connections 2000;
    # multi_accept on;
}

http {

    ##
    # Basic Settings
    ##

    sendfile on;
    tcp_nopush on;

Okay, nginx is working again, and no errors yet (before the next load test).
The Jitsi logs show no errors with the 2 laptop users connected.

Starting the load test of 950 (+2) users… on this single m5a.4xl instance (32 vCPU, 64 GB RAM) running all core Jitsi services, no add-ons…
Running the load, 10 participants per room; I can see that the 1 sender per room is sending video clearly and smoothly, and loadtest0 has 12 people because the two laptops are both sending audio and video successfully in that room… so far no users dropping.
Jitsi and nginx log files are still calm…
Video remains clear, smooth, steady…
…and the 5-minute load test ends without a glitch!

YES! SUCCESS AT LAST! (at least for this hurdle, onward to the next :stuck_out_tongue: ).

Thank you so very much for your help. Greatly appreciated!

I hope my overshare of step-by-step details here helps out any others who run into anything similar in the future.

Now I just need AWS to raise that limit from 1k to 5k spot instances, and then can try the 5k load test.

Meanwhile, I now need to increase the pressure with the 1k users: add more simultaneous video senders, add audio, add CC, add recording, etc. I can do a lot at this level for now. Onward and forward.
Thank you very much again @damencho. Marking this as solved shortly!

4 Likes

@rpgresearch This is an excellent breakdown. We are very interested in knowing how things go when you are further along.

1 Like

@rpgresearch
Congratulations, and thanks for sharing your experience.

What do you mean by load test?

real 950 users

or

a special tool or script that simulates concurrent users?

And it seems to be missing this line :slight_smile:

worker_rlimit_nofile 300000;
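
For reference, that directive belongs in the main (top-level) context of nginx.conf, outside the events block, and needs the trailing semicolon, roughly:

# /etc/nginx/nginx.conf (top level, next to worker_processes)
worker_rlimit_nofile 300000;

events {
    worker_connections 2000;
}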

1 Like

@iDLE A set of scripts combined with Malleus Jitsificus, plus 1-2 real users to observe (later test iterations will add other volunteers to fill out qualitative data).
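
For anyone curious what that looks like in practice, here is a hedged sketch of a Malleus Jitsificus run via the jitsi-meet-torture wrapper script (flag names follow its README, but check your checkout since they can change; the hub and target URLs are placeholders for this test rig):

./scripts/malleus.sh \
    --conferences=95 \
    --participants=10 \
    --senders=1 \
    --duration=300 \
    --room-name-prefix=loadtest \
    --hub-url=http://localhost:4444/wd/hub \
    --instance-url=https://lmtgt1.dev2dev.net

That roughly matches the ~950-user runs above: 95 conferences of 10 participants each, 1 video sender per room, for about a 5-minute test (assuming the duration flag is in seconds).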

Yes. Fortunately I discovered that while fiddling around, but thank you for following up.
I was able to break through the limit up to the 1,000 spot instance roof set by AWS, and Jitsi purred along nicely on that single system (though it was getting up there in CPU usage). I am definitely seeing the diminishing returns of vertical scaling, and the price-wise sweet spot at the higher end appears to be around c5.4xl (after that the price per hour doubles). This is all very useful information for planning and budgeting across different departments and use cases.
I have had a request in for a few weeks now to get that AWS spot instance limit raised, still waiting.
Meanwhile, in chatting with others, and thanks to a co-worker's suggestion to give it another shot, I took a chance and started trying to ramp up the number of nodes per instance again (attempts last year were too unreliable) by boosting the CPU and memory settings. Still unstable/unreliable at 4 nodes (Selenium 3) per instance, but 3 nodes per instance so far appears stable at 500 simulated users. Ramping up as far as that will go.
Later this week, after these all-in-one server baselines are as far as I can take them, I am starting work on learning/implementing/testing a simple scaling/“cluster” approach (later I will be trying out and learning the Octo and Kubernetes options and load testing those). Odds are you'll be seeing a lot more of me asking questions in the group soon. :slight_smile: Thanks again for the help and the friendly community, greatly appreciated!

1 Like

Still haven't gotten AWS to raise the 1,000 spot instance limit (they keep checking in with me every few days to say they are looking into it). Meanwhile, I'm trying to ramp up the number of nodes per instance. I raised the per-instance cpu=512 mem=1024 to cpu=1024 mem=4096, and the selenium_nodes up to 4. Unfortunately, at 4 it becomes unreliable for even a 200-user baseline test. 3 was a little unstable, but I raised the hub to cpu=2048 mem=4096 and doubled it, and so far 3 nodes per instance is stable at 400. (I'm trying to get to the target goal of 5,000 simulated users if possible, or at least as close as I can get this AWS environment to run reliably.) I have a mandate for a system setup that must support 5k reliably this summer, including closed captions, recording, etc. That is relatively easily doable. But then by this fall it needs to support 20k users with similar add-on loads, so I have a lot of ramping up to do. I am happy to share as much info as I'm allowed, and will keep folks updated, especially since I'm sure I'll be checking in as I scale up and run into the next bottleneck. :slight_smile: Happy Jitsi-ing!
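
For context, those cpu/mem values are Fargate task-size settings. A purely hypothetical task-definition fragment with the raised values might look like the following (the family name and container image are illustrative placeholders, not my actual setup):

{
  "family": "loadtest-selenium-node",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "4096",
  "containerDefinitions": [
    { "name": "selenium-node", "image": "selenium/node-chrome:3.141.59", "essential": true }
  ]
}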

Hi @rpgresearch Thank you for this extensive and excellent description, and for the help from the Jitsi team.
Can I ask: did you also test the maximum number of users per room? For example, 100-200 users with video/audio off.

Yes, I have been testing to refine the estimates for a capacity-and-costs calculator matrix with, so far, 250 different combinations of rooms, participants, with and without recording, with and without closed captions, etc., covering all-in-one instances versus a simple cluster (core + multiple JVBs) and more complex cluster setups. In the coming weeks I will be adding Octo, HAProxy, containerization, Kubernetes, etc. as additional configuration considerations.
Each week I am still filling in the load test results for the slots in each of these configurations. Here is a screenshot of just a small portion of this matrix:

As I continue to gather data, I modify the formulae accordingly to try to improve accuracy. Right now there is a lot of theoretical ballparking for the higher numbers, but the figures for smaller setups around ~1,000 participants are starting to solidify. I currently have the load testing working for the all-in-one server setup up to 2,500 users. Meanwhile I am building up infrastructure for simple and complex clusters to begin load testing those components individually, bit by bit. Anything specific you were needing, @Johan66? Happy Jitsi-ing!

I am very interested in your results. This is amazing work!!

Finally AWS raised my Fargate Spot instance limit from 1,000. They refused to raise it to 5,000; they only raised it to 2,500. They want me to instead try using multiple AZs, but that is also more expensive, so I am not sure whether I will be authorized to do so; we will see.

However, I managed to squeeze out a stable 2 users per instance with very maxed-out “hardware” settings (which is getting expensive, around $140/day when running load tests all day at this maxed level). 3 users per instance is still too unstable for reliable results even at maximum hardware. In theory I should be able to start ratcheting up from the current number. I found previously that at 2,500 users the Jitsi all-in-one is really struggling; 2,000 is more usable with the current settings. Cranking the Jitsi hardware up to 24x made no difference. Now that I can crank up the simulated users higher, I will be posting logs to see whether any further tweaks can be made to the all-in-one setup to squeeze it any higher.

Meanwhile I have been setting up various cluster configurations and will begin load testing those this upcoming sprint (the 2-week sprint begins this coming Wednesday).

I will be sure to keep sharing the numbers as I can.
Cheers!

5 Likes

What is the purpose of this, if you don’t mind me asking?

At 2,500 participants (or even 100+), I have to admit, I’d give up on video-conferencing and just focus on streaming, with maybe a text-chat solution.

Can you imagine needing to ensure all participants are muted, behaving, etc.?

I am fascinated by this account of using Jitsi at scale, but also rather confused by it.

@Lewiscowles1986 this is for a K-12 school system. That 2,500 (the goal is 5,000) is the total number of participants for the server, not for a single room.
The participants-per-room range being tested is between 10 and 200, with a variety of sender ratios (1 per room, 5 per room, 20 per room, no more than 75 video senders per room, but only one audio sender at a time). I also have to load test handling video/audio recording of each room and closed captions for each room (later). The average class size they have (we see it in their logs) is 45-65 students per room.
The teachers mandate that ALL STUDENTS MUST HAVE THEIR CAMERAS ON AND SENDING at all times so the teachers can see whether the students appear to be paying attention.
Only 1 audio speaker at a time.
In order to speak, a student should raise their hand, and then the teacher can unmute the student to allow them to speak.
Does that help clarify?
So this means a LOT of video senders, but not a lot of audio senders. The bottleneck is so many video senders.
The larger room sizes (200, 400, 1,000, 3,000) are for web events/conferences, where they will only have very few concurrent video senders (maybe a panel of 6-12 people at most). Everyone else needs to be in the room to be able to raise their hand, change their emojis, and chat in the text, and if they raise a hand to speak, the moderator unmutes the participant. So the very large 100+ participant rooms are more webinar-style and could use streaming options (though those would have to be self-hosted; they can't use public services like YouTube, as that would violate privacy policies and laws for minors on video), and they do not currently have an in-house streaming service.
I am evaluating options like Matrix-Synapse to address a lot more chat features, but they are not excited about “yet another platform” to have to set up and support, so we're supposed to do as much as possible with what is there, though Matrix hasn't been ruled out yet (hopefully we will know by August/September). Matrix would address a huge number of chat-related, file-related, bot, automation, and other needs. I've been using Matrix in my own communities for about a year now with Jitsi and very much like it personally.
Ultimately this has to be able to support a MINIMUM of 20,000 video senders during peak school hours, likely growing to 40-65k soon. Those loads are more appropriate for clustered setups, but for some of the tasks a sufficiently tuned all-in-one server might be sufficient and far less expensive, so I need to know the full range of options and configurations to meet the many different use cases they need addressed. Currently this is all being done with Zoom and Google Hangouts, so the in-house Jitsi setup (it can't be hosted elsewhere, only their AWS or their on-prem) needs to be price competitive, better quality, and more reliable. They would like to be completely moved to Jitsi by summer 2022. So I have a LOT of analysis, planning, R&D, testing, and release phases to work through.
Does that help clear things up? Cheers!

3 Likes

I'm very interested in the results for k8s clusters. I have a project of my own that will require an average of 50-100 users per conference, with room to scale in the future, but in a more webinar-like state (so only 1 user uses video, voice, and screenshare, plus a Jibri instance). And I also have the same problem as @rpgresearch regarding the laws for minors on video, so YouTube etc. are also out of the question.

And I must say that you're doing the community a great favor by publishing your analysis and all. This can help a lot of people. Keep up the great work :+1:

1 Like

For large total numbers of participants you’re always going to need some form of horizontal scaling.

Due to synchronisation and overheads around context-switching and caches, the gains you get in achievable packet rate from adding CPUs to a single JVB instance get smaller as the number of CPUs gets higher (i.e. adding CPUs to a single instance exhibits diminishing returns, an effect common to a lot of multithreaded software). You can help with this by choosing a concurrent GC for the JVM, but there’s synchronisation in JVB itself so you’ll always have this effect to some extent.
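
As a hedged illustration, the flags themselves are standard HotSpot options; where you set them depends on how your bridge is packaged (e.g. the package's config file or a systemd drop-in), so treat the placement as an assumption:

# G1 with a modest pause-time target:
-XX:+UseG1GC -XX:MaxGCPauseMillis=50
# or a fully concurrent collector on JDK 15+:
-XX:+UseZGC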

As a result, a single JVB instance won’t be able to handle more than a certain packet rate no matter how much extra hardware you throw at it.

If you are dead set on using larger servers rather than adding more servers, you can scale horizontally within a single server, by running multiple JVB instances per server. If you’re doing this, make sure you have enough RAM, consider an IP address per JVB to make ICE configuration simpler, and pin specific CPUs for each JVB to reduce context-switching overhead and cache contention. If you’re already familiar with them, a container system like containerd, systemd-nspawn or docker helps with simplifying the needed config and providing some structure to make ongoing maintenance easier. If you end up needing more than one server anyway (and at 20,000 video senders, you will) you may consider an orchestration system like k8s for those containers. By using Fargate, you’re basically doing this but outsourcing the management of the containers and the underlying servers to AWS, which is perfectly sensible but comes with some extra cost (especially in their egress traffic charges).
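
As a hedged sketch of the CPU-pinning idea under systemd (the jitsi-videobridge2 unit name matches the Debian package; running several bridges on one host means duplicating units and configs, which is where containers come in):

# /etc/systemd/system/jitsi-videobridge2.service.d/cpuset.conf
[Service]
CPUAffinity=0-15
# (older systemd versions may need the CPUs listed individually, e.g. CPUAffinity=0 1 2 3)

A systemctl daemon-reload plus a bridge restart applies it.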

With horizontal scaling 20k simultaneous users is not really a challenge; we use ~11k simultaneous users (150 conferences of 75 participants each, 7-10 conferences per JVB node depending on region) to test our automatic scale-out and scale-in and it works flawlessly. 20k would work exactly the same way, just with more servers.

1 Like

Hi @jbg

Do you use OCTO for this?

We do use Octo, but it's not really necessary for the described load test; with 75 participants per conference there's no need to split a conference across multiple bridges.

2 Likes

I think perhaps even more important than the CPU load and RAM is the bandwidth availability when using a single large server as opposed to multiple smaller servers. The first limiting factor in an SFU is the available bandwidth, which is why hosting multiple JVB instances on the same server (even with separate IPs) makes little sense if they're all still sharing the same bandwidth.

2 Likes

It depends. Most hosting providers oversell their bandwidth and market a meaningless “port speed” to you, in which case you're absolutely right: adding a second JVB on the same server is just going to contend for the limited bandwidth even more.

But, if you get proper connectivity (e.g. we use a bond of 2x 10GbE on our servers and are colocated at Internet exchange points, so plenty of upstream capacity is available), you will generally hit JVB’s limitations on how much CPU capacity it can make use of long before you’ll hit upstream bandwidth limitations. In this situation it absolutely makes sense to run multiple JVBs on the same server in order to make use of the resources available.

2 Likes

@jbg, thank you kindly for your feedback and suggestions. I very much appreciate you enumerating those options.
We are actively testing all of those configurations and comparing cost/performance for each one.

An example of some of the Jitsi architecture variants is listed here (the document is out of date relative to the internal doc, but it gives the general idea of the variance):

https://www2.techtalkhawke.com/news/jitsi-architecture-variants

Some projects I’m scoping will only need a moderately scalable single instance, maybe with some extra manual JVB instances or containerized/auto-scaled JVBs.

Some of the other projects will be able to use the AWS utilities, while other phases of the project are on-prem and so ZERO AWS features will be available.

Where the extra scaling is needed for the larger rooms with many senders, I will be testing out Octo.
For the main project (20k+ senders), I have already planned on bringing in HAProxy, Octo, and Kubernetes for the on-prem version.

The all-in-one testing is just to see how far that can be taken, for the smaller projects that don't want to deal with a lot of server instances or containerization, for a range of mid-size to larger projects that are okay with just a few core instances plus limited containerization, and for those happy to embrace a full k8s implementation.

Last year I ran a few conventions with around 20k simultaneous users, but those were very large rooms with very few senders, on only 4 servers. The challenges I have to enumerate in detail are:

  • 20k concurrent senders (heavy JVB load)
  • 10-100% concurrent rooms recording (heavy jibri load/container-count)
  • 10-100% concurrent closed captioning (heavy jigasi and vosk loads/container-count)

Thanks again for the helpful feedback and suggestions, always very much appreciated!
As always I will continue to share my findings in detail as much as I am allowed. Cheers!

1 Like