Jitsi-Meet (docker) for a large event - The aftermath! (Server sizing tips)

As promised, here is the result of our big event using our own Jitsi-Meet deployment.

We were expecting 2,300 participants over the day, with around 250 connected simultaneously at any given time. EVERYONE had to have their camera on, and there were around 15 rooms open at all times, each with one person sharing their screen (presentation) and webcam at the same time.
There were “breakout rooms” for small workshops, around 4 per main room. This meant that most main rooms spawned 4 additional rooms for about 20 minutes every hour, and most users forgot to turn off their camera in the main room, so they had 2 streams going out at the same time.

The setup:
All servers were Amazon EC2 t3a.xlarge instances:

  • AMD EPYC 7000 series clocked at 2.5 GHz
  • 4 vCores (the “v” is important here, see below)
  • 16 GB RAM
  • Up to 5 Gbps burst bandwidth

Main server:
Ubuntu 20.04
Jitsi-meet-docker with a custom setup (see the sketch below):
720p forced (simulcast disabled)
All last-n functions disabled (everyone should receive all streams)
P2P disabled
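
For reference, here is roughly what those overrides look like. This is only a sketch, not our exact files: it assumes a docker-jitsi-meet version that appends web/custom-config.js to the generated config.js (on older versions the same keys go directly into config.js in the config volume), and the path is a placeholder for wherever your CONFIG= points.

```bash
# Sketch: force 720p, disable simulcast, last-n and P2P on the dockerized web.
# ~/.jitsi-meet-cfg is assumed to be the CONFIG= directory from .env.
cat >> ~/.jitsi-meet-cfg/web/custom-config.js <<'EOF'
config.resolution = 720;
config.constraints = { video: { height: { ideal: 720, max: 720, min: 240 } } };
config.disableSimulcast = true;   // one 720p stream per sender
config.channelLastN = -1;         // no last-n limit: everyone receives all streams
config.p2p = config.p2p || {};
config.p2p.enabled = false;       // always route through the JVBs
EOF
```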

(5x) extra JVBs
Jitsi-videobridge2 on bare-metal Ubuntu 20.04
Communication between the bridges and the main instance used the internal Amazon network (using the private IP instead of the public website address… I was completely unable to make it work otherwise).
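
For anyone wiring up the same thing, each bare-metal bridge points at the main instance roughly like this. It is a sketch with placeholder addresses and secrets (paths are from the jitsi-videobridge2 Debian/Ubuntu package); the domain and JVB_AUTH_PASSWORD have to match what the docker .env generated.

```bash
# Sketch: register a bare-metal JVB with the main instance over the private network.
# 10.0.0.10 and meet.example.com are placeholders for our private IP / domain.
sudo tee -a /etc/jitsi/videobridge/sip-communicator.properties >/dev/null <<'EOF'
org.jitsi.videobridge.xmpp.user.shard.HOSTNAME=10.0.0.10
org.jitsi.videobridge.xmpp.user.shard.DOMAIN=auth.meet.example.com
org.jitsi.videobridge.xmpp.user.shard.USERNAME=jvb
org.jitsi.videobridge.xmpp.user.shard.PASSWORD=<JVB_AUTH_PASSWORD from .env>
org.jitsi.videobridge.xmpp.user.shard.MUC_JIDS=JvbBrewery@internal.auth.meet.example.com
org.jitsi.videobridge.xmpp.user.shard.MUC_NICKNAME=jvb-2
org.jitsi.videobridge.xmpp.user.shard.DISABLE_CERTIFICATE_VERIFICATION=true
EOF

sudo systemctl restart jitsi-videobridge2
```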

Reasons for this particular configuration:
We chose jitsi-meet-docker for the main instance for its flexibility and ease of deployment.
We chose jitsi-videobridge2 on bare metal for the additional JVBs because of the simplicity of the setup.

We disabled simulcast and last-n because they simply did not work in the docker version: everyone was getting 180p otherwise! That being said, I am still working on the issue with some nice folks here on the forum: https://community.jitsi.org/t/urgent-low-image-quality-after-update

We forced 720p because some users use virtual cameras and on-screen composition with a green screen for their presentations (instead of screen sharing, the webcam contains the presentation). We did not go up to 1080p because, for that particular event, the conferences were windowed inside an iframe on our website.

The result of this setup
Overall, the event went great, except for one particular moment in the morning when one “batch” of people was nearing the end of their conferences and another “batch” was joining.
This was our major issue of the day:
At that moment, the quality went bad: the main server was hit by 100% CPU usage and could not keep up. I was able to react fast, and the solution was to shut down the JVB container on the main server to lower the load and let the extra JVBs handle the video. Previous testing had shown that shutting down a JVB and automatically falling back to another one only causes a one-second glitch on the user end. The quality went back to normal.
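
For anyone wondering, “shutting down the JVB container” is just stopping that one service in the docker-jitsi-meet compose project while the rest keeps running (the directory is a placeholder):

```bash
# Stop only the local bridge; web, prosody and jicofo stay up,
# so conferences fail over to the external JVBs.
cd ~/docker-jitsi-meet
docker-compose stop jvb
```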

Upon further inspection, here are the causes:

1 - EC2 Virtual Cores:
While Amazon AWS EC2 servers are great general-purpose servers, they are not that good at constant workloads. As soon as an instance sits at constant CPU usage above 40-50%, there is a runaway effect that makes it slowly creep up to 100% and become unresponsive.
This seems to come from the fact that these vCores are virtual cores shared with other tenants, and that the T3/T3a family is burstable: each vCPU has a baseline (40% on a t3a.xlarge) and sustained usage above that eats into CPU credits.
We are actually looking at going back to OVH for a dedicated server, as that was giving us a lot more performance for a low monthly fee. In any case, 4 vCores are not enough for the XMPP server, Jicofo, etc. if you add the JVB to the same machine… Internet and internal network bandwidth, on the other hand, was more than sufficient.
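
If you want to check whether your own T3/T3a instances are burning through their burst budget during an event, the CPU credit balance is visible in CloudWatch; something like this works (instance ID, region and time window are placeholders):

```bash
# Sketch: watch the CPU credit balance of a burstable instance over the event.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2020-11-12T08:00:00Z \
  --end-time 2020-11-12T18:00:00Z \
  --period 300 \
  --statistics Average \
  --region us-east-1
```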

1b - Extra JVBs not getting used:
I had to reset the extra JVBs several times during the day because the main server seemed to become unaware that they were available to work with. The most I could ever get was load on 3 JVBs; the other 2 never got anything. Most of the time, the server would only use 2 and load one of them a lot more than the other… I am probably doing something wrong, but I can’t tell what!
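
One way to at least see which bridges are actually carrying traffic is each JVB’s REST stats endpoint. This assumes the REST API is enabled on the bridges (historically via the --apis=rest,xmpp startup option); the IPs below are placeholders for our 5 extra bridges.

```bash
# Sketch: poll every bridge to see which ones actually hold conferences.
for jvb in 10.0.0.11 10.0.0.12 10.0.0.13 10.0.0.14 10.0.0.15; do
  echo "== $jvb =="
  curl -s "http://$jvb:8080/colibri/stats" \
    | jq '{conferences, participants, bit_rate_download, bit_rate_upload}'
done
```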

Other minor issues

  • The “speaker stats” window shows “0” all the time; no stat is ever updated. From what I could read, this only affects the docker version of Jitsi-Meet… I have yet to sort this out!
  • YouTube video sharing was unreliable. I could not reproduce the problem on our side, but many users pointed it out… We are still digging through the data, but I suspect a browser / system issue… Some users were on macOS, and Catalina is a big pile of trouble for WebRTC. Some others were using Edge… Poor souls!
  • I kept getting errors related to Colibri… This will need further investigation. Something about ports.
  • When in full screen, if you pull out the chat window, the video frame moves away instead of resizing and does not return to its place… The window needs to be reset (resized, or the view mode changed) to go back to normal. This was reported here: https://github.com/jitsi/jitsi-meet/issues/7889 and is marked as fixed on the main project, but does not seem to be fixed in the docker repo.
  • Bandwidth estimation only works in one direction (upload, from what I understand)… That could be linked to the quality issue when using simulcast… We need to dig deeper into this! (see image below)
    [image attachment]

Hope this can help someone! I will post my detailed setup and config files later this week, as soon as I am finished “depersonalizing” them (removing sensitive server information).

@damencho, I am tagging you here in case this interests you… I don’t know who else on the dev team could be interested (I don’t even know who is in charge of the docker version nowadays!)


Great writeup.

Thanks for sharing!

This is pretty bad. If you were trying to sell Docker, you have not succeeded with me; I’ll keep my Linux containers :slight_smile:

Joking aside, in situations like this you could benefit from having 2 people available: when you have to solve urgent problems, you don’t have time to monitor the event to understand what the problems could be.
In this case you could have monitored Prosody to see if the connection was lost with the missing JVBs.
You have the Jicofo log, but possibly interesting info could have been available by setting logging to debug level, and after the event it’s too late for that.
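
As a rough idea of what that looks like on a dockerized main instance (run from the docker-jitsi-meet directory; the grep patterns are broad guesses, not exact log messages):

```bash
# Jicofo makes the bridge-selection decisions; watch its view of the bridges
docker-compose logs --tail=500 jicofo | grep -i "bridge"

# Prosody side: watch the JVB connections coming and going
docker-compose logs --tail=500 prosody | grep -iE "jvb|videobridge"
```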

Thanks for sharing your experience, always interesting.

We did pull up the JVB and Prosody logs several times, and even restarted things from time to time.
The extra JVBs would connect, but I could not see them disconnect anywhere… They were just not getting used. Then I would restart the service on the JVBs that had no sessions, and they would get some traffic. But if the traffic fell because we were between 2 “rush hours” and then ramped back up, I would always end up with only 2 JVBs loaded…
When I saw the load coming back, I would restart the JVBs on the machines that were not getting any, and could get them to take a bit of load… But not always.
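
For context, “restarting the service” on the bare-metal bridges is just the standard systemd restart; the unit name and log path below assume the jitsi-videobridge2 Debian/Ubuntu package.

```bash
# Restart a bridge that is registered but not taking any load
sudo systemctl restart jitsi-videobridge2

# Confirm it came back up and rejoined the brewery MUC
sudo systemctl status jitsi-videobridge2 --no-pager
sudo tail -n 30 /var/log/jitsi/jvb.log
```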

I will need to investigate a bit further. Maybe start reading about Octo or other ways to evenly distribute the load…
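
For what it’s worth, the Octo guides I have started reading boil down to a couple of properties: a bridge-selection strategy on the Jicofo side, and an Octo relay address plus a region on every JVB. This is only a sketch of what those guides describe, untested here, with placeholder addresses; the paths assume the Debian packages (the dockerized Jicofo keeps its sip-communicator.properties inside the config volume instead).

```bash
# Jicofo: enable a selection strategy that can spread a conference across bridges
sudo tee -a /etc/jitsi/jicofo/sip-communicator.properties >/dev/null <<'EOF'
org.jitsi.jicofo.BridgeSelector.BRIDGE_SELECTION_STRATEGY=RegionBasedBridgeSelectionStrategy
EOF

# Each JVB: open an Octo relay on the private network and declare a region
sudo tee -a /etc/jitsi/videobridge/sip-communicator.properties >/dev/null <<'EOF'
org.jitsi.videobridge.octo.BIND_ADDRESS=10.0.0.11
org.jitsi.videobridge.octo.PUBLIC_ADDRESS=10.0.0.11
org.jitsi.videobridge.octo.BIND_PORT=4096
org.jitsi.videobridge.REGION=main
EOF

sudo systemctl restart jitsi-videobridge2
```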

What do you call a Linux container?
Initially, the docker version of Jitsi-Meet is a lot more complicated to set up, in my opinion, as there is a lot of stuff that is not in the original example files and that I had to diagnose… But once I got it working, I really enjoyed it! It is a lot simpler to deploy, modify, redeploy, etc. I built a bash script that pushes my configuration back in 2 seconds, and it worked wonderfully.
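
As a rough idea of what such a redeploy script can look like (a generic sketch, not my actual script; directory names are placeholders):

```bash
#!/usr/bin/env bash
# Generic redeploy sketch for docker-jitsi-meet: push local config overrides
# back into the config volume and recreate the containers.
set -euo pipefail

DEPLOY_DIR="$HOME/docker-jitsi-meet"   # where docker-compose.yml and .env live
CONFIG_DIR="$HOME/.jitsi-meet-cfg"     # the CONFIG= directory from .env

# my-config/ is a placeholder for a local folder holding the customized files
cp my-config/custom-config.js "$CONFIG_DIR/web/custom-config.js"

cd "$DEPLOY_DIR"
docker-compose down
docker-compose up -d
```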

It’s Jicofo that makes the decisions here; Prosody is just an intermediary, and the JVBs simply receive the users that Jicofo sends them.

see here