Server load for many simultaneous meetings

I also feel like the platform no longer works for more than 3-4 people in a meeting. Disabling audio, video and 10 other settings just to get 15 people into a meeting, as some have proposed, is not a viable solution.

Setting channelLastN to 3 on an 8-core dedicated server with 16 GB RAM and a 10-gig network on AWS, just so a group of 20 people can meet, is not a video conferencing platform.

And I’m aware this is free and open source and nobody is obliged to provide any support. But it appears many people are running into issues just making this work.

1 Like

@damencho I sense great dissatisfaction on the part of serious users :wink:

I’ve been using Jitsi for 2 weeks (i.e. zero expertise, just a user) in a university setting…
Without much tinkering it definitely can handle more than 3-4 people.
It works for us with 12-16 people easily. Haven’t tried more…

I arrived at jitsi from Nextcloud Talk. That one can handle 3-4 people at most…

Hi, I can’t agree with you. We have Jitsi installed on 3 servers: the first runs the web frontend + jicofo + prosody, and the other two are JVB2 servers. All 3 servers are 8-10 year old Xeon boxes with 4-12 cores @ 2.3 or 2.5 GHz. In my experience Jitsi can handle over 200 people on one JVB2, but they must be spread over multiple rooms with a maximum of 10-13 people each. With 15+ people in one room, problems begin for us, and I’ve not found any solution so far. :frowning:

Kind regards,

Milan

@kallenp @Canton With all due respect, many, many people are having no trouble whatsoever getting their Jitsi installations to work. Especially given that you’ve said outright that you don’t have the skillset or experience to troubleshoot this, I’m a little baffled that you don’t also consider that it might be something you’ve done wrong, and instead are flinging accusations at the open source team.

What you’re describing looks at first glance as though you’ve got a network and/or I/O problem on your server somehow. Unfortunately there is no single “just look here” suggestion, as the problem could be in any of hundreds of little places.

I can confirm that we are running Jitsi without any real headache on a 4-core virtual machine with 8 GB of RAM. There’s no secret sauce going on. Make sure your hardware is reliable, then start with the quick install, which does everything for you and doesn’t require you to configure everything by hand, and report back, maybe? The way this thread was progressing, I can’t imagine anyone is going to be particularly inclined to help unless you change your approach!
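For reference, the quick install really does boil down to a few commands on Debian/Ubuntu. A rough sketch from memory, so double-check the repository and key details against the official quick-install guide:

echo 'deb https://download.jitsi.org stable/' | sudo tee /etc/apt/sources.list.d/jitsi-stable.list
wget -qO - https://download.jitsi.org/jitsi-key.gpg.key | sudo apt-key add -
sudo apt-get update && sudo apt-get -y install jitsi-meet   # prompts for hostname and certificate setup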

3 Likes

Debugging and finding the origin of server load issues is greatly simplified if you can instrument your server and produce a flame graph that highlights which part of the code is consuming all the CPU cycles.

The Netflix Technology Blog describes how Netflix worked to improve the JVM in order to produce Java mixed-mode flame graphs, a complete visualization of server-side CPU usage, enabled by the JDK option -XX:+PreserveFramePointer.

If you can provide a flame graph of your Jitsi Videobridge server while it is misbehaving, then everyone can look at it and contemplate what is causing the 500% CPU use and where to look for a fix.
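A rough sketch of how such a graph can be captured, assuming Brendan Gregg’s FlameGraph scripts and perf-map-agent are cloned next to where you run this, and that the JVB JVM was started with -XX:+PreserveFramePointer (the pgrep pattern is only illustrative):

JVB_PID=$(pgrep -f jitsi-videobridge)                   # locate the JVB Java process
sudo perf record -F 99 -g -p "$JVB_PID" -- sleep 60     # sample stacks for 60 seconds
./perf-map-agent/bin/create-java-perf-map.sh "$JVB_PID" # lets perf resolve JIT-compiled Java frames
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > jvb-flamegraph.svg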

3 Likes

Hi @voxter, do you really have no issues with 15 or more participants sending video in one room? What was your biggest room attendance?

Thank you,

Milan

It creates a funny graphical thing, but it’s still hard to understand.
Results from a bridge only handling 1 conference with 3 participants (without any issue):


And the SVG version (not allowed to be uploaded directly here): https://covid19.framadrop.org/lufi/r/P9MyqoPtha#0zvyWgVuahw1REQApzcZURbxt+/M4NjOmSr55SgduUY=

I’m not really sure CPU is the key to our issues. We see problems even when the CPU does not seem to be overloaded.

Edit:
Second test with 20 participants in 6 conferences, 15 video endpoints.
Linux load at ~1.00 (with 8 vCPUs, so nothing amazingly high).
During the capture I was seeing many logs like this in jvb.log:

TransportCcEngine.tccReceived#157: TCC packet contained received sequence numbers: 22926-23141. Couldn't find packet detail for the seq nums: 22926-23141. Latest seqNum was 21591, size is 1000. Latest RTT is 13.014161 ms.

SVG: https://covid19.framadrop.org/lufi/r/BgNUGo0rk0#xmqe6vq8NdG+aycCkyR5wyZD/mXxic3zhfsTu8+m7nM=

2 Likes

Thank you for generating the flame graphs. The SVG files are helpful because I can load them into the browser, point at a block and it will print the full function name. Super useful!

1st issue: swapper
Both graphs show that your system spends 20% of its time swapping.
All the rightmost pure red spikes are caused by Linux kernel swapping!
All the leftmost pure red spikes are waiting writes, thus possibly also related to swapping.
This is unexpected and is a root cause of poor performance! By tuning the system swappiness or simply disabling all swap
sudo swapoff -a
you should see reduced CPU usage and improved performance! :bulb:
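If you want to keep some swap around instead of turning it all off, a minimal sketch for lowering swappiness persistently (the sysctl.d file name here is arbitrary):

echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-jvb-swappiness.conf
sudo sysctl -p /etc/sysctl.d/99-jvb-swappiness.conf
# if you disable swap entirely, also remove/comment the swap entry in /etc/fstab so it stays off after a reboot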

2nd issue: logging!
In the top-left green hill we can see that 7% of CPU time is spent on logging.
org/jitsi/util/Logger:::Trace + org/jitsi/util/Logger:::Log
Maybe the logger can be tuned down, or logging possibly disabled entirely for the
org/jitsi/videobridge/Conference:::SendOut function. It is this function alone that causes most of the time spent logging. … It is actually quite odd to see Logger on the stack called from Conference SendOut, because SendOut itself does not log… https://github.com/jitsi/jitsi-videobridge/blob/master/src/main/java/org/jitsi/videobridge/Conference.java#L1170
So what we are seeing is that the Java JIT compiler has optimized things here…
Some serious contemplation needed :face_with_monocle: :thinking:
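In the meantime, a hedged sketch for lowering the global JVB log level, assuming the Debian package’s java.util.logging config at /etc/jitsi/videobridge/logging.properties (path and service name may differ on your install):

sudo sed -i 's/^\.level *=.*/.level=WARNING/' /etc/jitsi/videobridge/logging.properties   # root level usually defaults to INFO
sudo systemctl restart jitsi-videobridge2   # service name of the JVB2 Debian package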

Try attacking those two issues first, and then generate a new flame graph.

If you want to learn how to read flame graphs and how to pinpoint which function causes all the waits, then watch this presentation:

3rd issue: changed behaviour
By comparing the two SVG flame graphs you sent, I can clearly see that the server is having issues under high load: the second and third peaks of the big green hill, seen in the first graph under normal server load, are not seen at all under heavy load!
It is as if the program’s behavior has changed and ThreadPoolExecutor:::runWorker does not have time to do any java/util/concurrent/FutureTask:::run, and this causes issues beyond my imagination :exploding_head: :upside_down_face: :slight_smile:

5 Likes

Thanks @xranby for your time and help.
I’m surprised by the swapper usage. This server has more than enough RAM. To be sure, I’ve disabled swap as you suggested. But you will see in the following flame graphs that swapper is still present…

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7975         965        4532           0        2477        6710
Swap:             0           0           0

I’ve switched the log level to WARNING (instead of INFO), but it does not seem to change anything in the graph. And even though no logs are coming into jvb.log, we still see quite a significant share of Logger::log.

I’ve performed 7 different perf profiling runs.
During all these tests, I was in a 3-participant video conference handled by the bridge being profiled… and I did not notice any audio or video issue.
I will try to redo the test under really high load during the coming week.

| name | Linux load | Download | Upload | Conf | Participants | Sending audio | Sending video | jvb logs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| low-load.svg | 0.3 (0.04 per core) | 1.5 Mbps / 400 pps | 1 Mbps / 430 pps | 3 | 7 | 3 | 3 | no logs coming in jvb.log |
| light-load.svg | 0.8 (0.1 per core) | 5 Mbps / 1.5 kpps | 8 Mbps / 2.7 kpps | 3 | 11 | 8 | 9 | no logs coming in jvb.log |
| medium-load.svg | 2 (0.25 per core) | 22 Mbps / 3.9 kpps | 30 Mbps / 7.9 kpps | 5 | 19 | 13 | 17 | many "Suspiciously high rtt value", some "TCC packet contained received sequence" |
| medium-load-2.svg | 3 (0.375 per core) | 30 Mbps / 5.1 kpps | 39 Mbps / 9.8 kpps | 6 | 23 | 14 | 21 | many "Suspiciously high rtt value", some "TCC packet contained received sequence" |
| high-load.svg | 4.5 (0.56 per core) | 32 Mbps / 5.7 kpps | 43 Mbps / 9.8 kpps | 7 | 27 | 15 | 23 | many "Suspiciously high rtt value", some "TCC packet contained received sequence" |
| high-load-2.svg | 4.5 (0.56 per core) | 26 Mbps / 6.7 kpps | 22 Mbps / 12 kpps | 7 | 49 | 35 | 21 | no logs coming in jvb.log |
| high-load-3.svg | 4 (0.5 per core) | 35 Mbps / 7.7 kpps | 34 Mbps / 10 kpps | 12 | 43 | 32 | 25 | many "Unable to find encoding matching packet!" (~70 per second, all linked to the same participant) |

SVG file available here: https://nuage.hadoly.fr/s/YpyNzmjwktGjpiJ

2 Likes

Thank you, your server profiling and load numbers are super useful!

All 7 graphs you posted today look much healthier with reduced JVB logging and swap disabled.

Seeing a high CPU percentage for swapper calling do_idle inside the kernel is OK, as it just puts the CPU into power-saving mode when there is nothing to do.

Your new high-load-2 & 3 flame graphs show that it is a different part of JVB that causes the 6% of CPU spent on logging!
Lorg/jitsi/videobridge/cc/vp8/VP8AdaptiveTrackProjectionContext:::rewriteRtp

In high-load-3 we can see one additional percent of CPU spent on logging from:
Lorg/jitsi/nlj/transform/node/incoming/VideoParser:::transform

If the logging for these two functions can be tuned or disabled then the server may handle even more load :slight_smile:

Best would be to understand why these two functions want to log so much; it may be an underlying issue that needs to be looked into, and if that is fixed then no logging is needed at all.
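If those loggers follow the usual class-name convention, a sketch (an assumption, not verified against the JVB source) for silencing only the two noisy classes in the same logging.properties:

sudo tee -a /etc/jitsi/videobridge/logging.properties >/dev/null <<'EOF'
org.jitsi.videobridge.cc.vp8.VP8AdaptiveTrackProjectionContext.level=WARNING
org.jitsi.nlj.transform.node.incoming.VideoParser.level=WARNING
EOF
sudo systemctl restart jitsi-videobridge2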

I used the Ctrl-f search function to search for logging inside the flame graph. Searching makes all logger lines stand out in purple. The search function will also summarize total CPU time spent logging across all the JVB function branches.

Great, looking forward to seeing how it looks under really high load!

Do you know what system this participant uses? You should be able to see it by looking for the participant’s IP address in the nginx log.

We have found some Mac + Chrome clients with non-working hardware acceleration.
You can try disabling the hardware-accelerated H264 codec, because H264 still causes issues for some Chrome Mac users, and see if that changes anything in the server logs.

No idea; this is something we need to triage to find the root cause for.

I see this warning in my JVB logs as well when I connect using an old laptop that runs out of RAM and the client starts to swap. Tricks to reduce CPU and memory load on the client side fix these kinds of warnings for me and improve the experience on old client hardware.

By analyzing profile graphs on the client side I have found 4 top config tricks to reduce client load.

4 Likes

Hi @xranby @ronquilstr, I really appreciate your work, guys! I’ll try to add flame graphs from my installation later too. You have produced very useful information for the community!

Thank you!

Kind regards,

Milan

1 Like

Good day @bbaldino @damencho @voxter @jcfischer
Our infrastructure is absolutely fine. What is special? After installing Jitsi from the Debian unstable repositories, everything works perfectly, but not with the packages from the Debian stable repositories. The configuration was copied and tested on the same machine and the same hardware; the only things that were swapped are the Jitsi binaries. This has been verified at several locations and on various installations.

I apologize for the words addressed to the developers, but I do not agree with their statement.

If I want to drive, I don’t have to be a car mechanic or a designer, or an electrician, etc. :wink:

I consider our problem solved at this time.

Enjoy the rest of your day.

Just because you can drive a car does not mean that you are able to, or that you should, drive a car. That’s why we have driving schools where you learn how to do it before you do it ;-).

If the engine stalls, it’s the car’s fault, or the manufacturer’s, but never your own…

But back to the topic, I just needed to share my opinion…

Don’t stress :wink:

@migo @kallenp can you check if disabling the H264 codec makes the users that occasionally drop video on your servers become stable?

Can you also check, by looking at the nginx log, what operating system + browser the users that drop video are using?

Try disabling the hardware-accelerated H264 video format.
My experience is that not all web browsers support H264, and it is a common source of browser crashes or missing video if H264, in combination with WebRTC, is malfunctioning on the client for some reason. H264 is disabled on meet.jit.si, at least last time I checked. Disabling H264 makes Chrome on Mac happy when Firefox users join the meeting.

disableH264: true,

you may need to disable it in p2p mode as well

p2p: {
    disableH264: true,
}

Hi @xranby, thank you. I’ve disabled H264 support and will see what happens. Right now we have high usage, over 300 users online in 22 rooms, so it is not easy to isolate users with concrete problems. :frowning:

JVB is complaining a lot about:
2020-05-04 10:55:57.771 WARNING: [6045] [confId=ad48aefab6698d1d epId=58576179 gid=ff1dfc stats_id=Ciara-I1z conf_name=sebo] TransportCcEngine.tccReceived#157: TCC packet contained received sequence numbers: 54651-54656, 54729-54744, 54808-54809. Couldn’t find packet detail for the seq nums: 54651-54656, 54729-54744, 54808-54809. Latest seqNum was 56784, size is 1000. Latest RTT is 3176.397095 ms.
2020-05-04 10:55:57.783 WARNING: [6045] [confId=ad48aefab6698d1d epId=58576179 gid=ff1dfc stats_id=Ciara-I1z conf_name=sebo] TransportCcEngine.tccReceived#157: TCC packet contained received sequence numbers: 55114-55115, 55186-55208, 55264-55267. Couldn’t find packet detail for the seq nums: 55114-55115, 55186-55208, 55264-55267. Latest seqNum was 57087, size is 1000. Latest RTT is 777.956421 ms.
2020-05-04 10:55:57.786 WARNING: [6045] [confId=ad48aefab6698d1d epId=58576179 gid=ff1dfc stats_id=Ciara-I1z conf_name=sebo] TransportCcEngine.tccReceived#157: TCC packet contained received sequence numbers: 55402-55423, 55501-55502. Couldn’t find packet detail for the seq nums: 55402-55423, 55501-55502. Latest seqNum was 57198, size is 1000. Latest RTT is 1067.682251 ms.
2020-05-04 10:55:57.798 WARNING: [6042] [confId=ad48aefab6698d1d epId=58576179 gid=ff1dfc stats_id=Ciara-I1z conf_name=sebo] SendSideBandwidthEstimation.getRttMs#587: RTT suspiciously high (1068ms), capping to 1000ms.
2020-05-04 10:55:58.056 WARNING: [6047] [confId=ad48aefab6698d1d epId=58576179 gid=ff1dfc stats_id=Ciara-I1z conf_name=sebo] SendSideBandwidthEstimation.getRttMs#587: RTT suspiciously high (1287ms), capping to 1000ms.
2020-05-04 10:55:58.165 WARNING: [6054] [confId=dc4d16f432175822 epId=81e56500 gid=ffcb4d stats_id=Rosalee-ltt conf_name=historicbonesalignfortunately] TransportCcEngine.tccReceived#157: TCC packet contained received sequence numbers: 43757-43764. Couldn’t find packet detail for the seq nums: 43757-43764. Latest seqNum was 51897, size is 1000. Latest RTT is 16676.004517 ms.

Thank you,

Milan

Nice! Is that 300 users on one server, or the total across all three of your servers?

To find out what kind of operating system and browser a user who has issues is running (rough command-line sketch after the steps):

  1. look for disconnects and reconnects in the JVB log; the IP address of the user is found there.
  2. search for the same IP address in the nginx log; the operating system and browser used are logged there.
  3. if possible, test using similar hardware and see if you can reproduce the issue.
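A sketch of steps 1-2 on the command line (log formats vary between versions; conf_name=sebo is taken from your log above and 203.0.113.42 is just a placeholder IP):

grep -iE 'expire|disconnect|ice' /var/log/jitsi/jvb.log | grep 'conf_name=sebo'
grep '203.0.113.42' /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c   # User-Agent field of the default combined log format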

Positive RTT values such as this one can, in my experience, be solved by reducing CPU load on the client user interface. It could also be network issues, such as the end user sitting on slow Wi-Fi.

1 Like

Hi, it is the sum for two JVBs:

root@virt1:~/scripts# ./stats.sh
jitsi_participants: 159
jitsi_conferences: 13
jitsi_largest_conference: 31
jitsi_endpoints_sending_video: 33
jitsi_endpoints_sending_audio: 16
jitsi_receive_only_endpoints: 107
jitsi_threads: 676
jitsi_total_failed_conferences: 0
jitsi_total_partially_failed_conferences: 0

Dropnute pakety v JVB (dropped packets in JVB): 0 0
Unable to find encoding matching packet: 1825
Negative rtt: 4478
Resource temporarily unavailable: 0
Couldn’t find packet detail for the seq nums: 4264
Zatazenie (load): 10:02:25 up 31 days, 12:08, 2 users, load average: 4.28, 3.59, 2.73
Netstat: 138108 receive buffer errors 41 send buffer errors

and

root@BackupStorage:~/scripts# ./stats.sh
jitsi_participants: 159
jitsi_conferences: 11
jitsi_largest_conference: 32
jitsi_endpoints_sending_video: 24
jitsi_endpoints_sending_audio: 14
jitsi_receive_only_endpoints: 112
jitsi_threads: 622
jitsi_total_failed_conferences: 0
jitsi_total_partially_failed_conferences: 1

Dropnute pakety v JVB (dropped packets in JVB): 0 0
Unable to find encoding matching packet: 145
Negative rtt: 53
Resource temporarily unavailable: 0
Couldn’t find packet detail for the seq nums: 5030
Zatazenie (load): 10:03:57 up 22 days, 22:13, 2 users, load average: 3.94, 4.18, 4.17
Netstat: 153656 receive buffer errors 19 send buffer errors
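(For context: the jitsi_* numbers roughly correspond to what the JVB colibri stats endpoint reports; if the REST API is enabled they can be pulled directly, e.g. with the following sketch, which is not exactly what my stats.sh does:)

curl -s http://localhost:8080/colibri/stats | jq '{participants, conferences, largest_conference, threads}'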

I’ll look for disconnect messages in jvb log.

Milan

1 Like

Hi @xranby, what message exactly should I look for in jvb.log? I can’t find disconnect/reconnect messages. I was part of a 24-member room and enabling all the cameras ended badly. :frowning: Can I PM you our jvb.log so you can see whether you find something? Or part of it, like: cat /var/log/jitsi/jvb.log | grep "affected conference name"

Thank you,

Milan