Server load for many simultaneous meetings

Debugging and finding the origin of server load issues is greatly simplified if you can instrument your server and produce a flame graph that highlights which part of the code is consuming all the CPU cycles.

The Netflix Technology Blog describes how Netflix worked to improve the JVM in order to produce Java mixed-mode flame graphs, a complete visualization of server-side CPU usage, enabled by the JDK option -XX:+PreserveFramePointer.

If you can provide a flame graph of your Jitsi Videobridge server misbehaving, then everyone can look at it and contemplate what is causing the 500% CPU use and where to look to fix it.
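
For anyone who wants to produce such a graph, here is a rough sketch of one way to do it with Linux perf, perf-map-agent and Brendan Gregg's FlameGraph scripts. It assumes the JVB JVM was started with -XX:+PreserveFramePointer added to its Java options and that both helper repositories are checked out next to where you run the commands (adjust paths and the JVB PID to your setup):

# capture 60 seconds of stack samples across all CPUs at 99 Hz
sudo perf record -F 99 -a -g -- sleep 60

# write a symbol map for the running JVB process so Java frames get readable names
# (replace <jvb-pid> with the PID of the videobridge JVM)
perf-map-agent/bin/create-java-perf-map.sh <jvb-pid>

# fold the stacks and render the interactive SVG
sudo perf script > jvb.stacks
FlameGraph/stackcollapse-perf.pl jvb.stacks | FlameGraph/flamegraph.pl > jvb-flamegraph.svg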

3 Likes

Hi @voxter, have you had no issues with 15 or more participants sending video in one room? What was your biggest room attendance?

Thank you,

Milan

It creates a funny graphical thing, but it is still hard to understand.
Results from a bridge only handling 1 conference with 3 participants (without any issue):


And the svg version (not allowed to be uploaded directly here): https://covid19.framadrop.org/lufi/r/P9MyqoPtha#0zvyWgVuahw1REQApzcZURbxt+/M4NjOmSr55SgduUY=

I’m not really sure CPU is the key to our issues. We see issues even when the CPU does not seem to be overloaded.

Edit:
Second test with 20 participants in 6 conferences, 15 video endpoints.
Linux load at ~1.00 (with 8 vCPUs, so nothing amazingly high).
During the capture I was seeing many logs like this in jvb.log:

TransportCcEngine.tccReceived#157: TCC packet contained received sequence numbers: 22926-23141. Couldn't find packet detail for the seq nums: 22926-23141. Latest seqNum was 21591, size is 1000. Latest RTT is 13.014161 ms.

SVG: https://covid19.framadrop.org/lufi/r/BgNUGo0rk0#xmqe6vq8NdG+aycCkyR5wyZD/mXxic3zhfsTu8+m7nM=

2 Likes

Thank you for generating the flame graphs. The svg files are helpful because I can load them into the browser, point at a block and it will print the full function name, super useful!

1st issue: swapper
Both graphs show that your system spends 20% of its time swapping.
All the rightmost pure red spikes are caused by Linux kernel swapping!
All the leftmost pure red spikes are writes waiting, thus possibly also related to swapping.
This is unexpected and is a root of poor performance! By tuning the system swappiness, or simply disabling all swap with
sudo swapoff -a
you should see reduced CPU usage and improved performance! :bulb:
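
If you would rather keep some swap configured, lowering the swappiness is a gentler alternative; a minimal sketch, assuming a sysctl-based Linux distro (the value 10 and the file name are just examples):

# apply immediately
sudo sysctl vm.swappiness=10
# persist the setting across reboots
echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/90-swappiness.conf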

2nd issue: logging!
In the top left green hill we can see that 7% of CPU time is spent on logging:
org/jitsi/util/Logger:::Trace + org/jitsi/util/Logger:::Log
Maybe the logger can be tuned down, or logging possibly disabled entirely for the
org/jitsi/videobridge/Conference:::SendOut function. It is this function alone that causes most of the time spent logging. … It is actually quite odd to see Logger on the stack called from Conference SendOut, because SendOut itself does not log… https://github.com/jitsi/jitsi-videobridge/blob/master/src/main/java/org/jitsi/videobridge/Conference.java#L1170
So what we do see is that the Java JIT compiler has optimized things here…
Some serious contemplation is needed :face_with_monocle: :thinking:
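
If you want to experiment with muting just that logger, one possible (untested) tweak, assuming the JVB uses the stock java.util.logging configuration from the Debian package and that logger names follow the class names, is to raise the level for the Conference class in /etc/jitsi/videobridge/logging.properties:

# only log warnings and above from the Conference class
org.jitsi.videobridge.Conference.level = WARNING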

Try attacking those two issues first, and then generate a new flame graph.

If you want to learn how to read flame graphs and how to pinpoint which function causes all the waits, then watch this presentation:

3rd issue: changed behaviour
By comparing the two svg flame graphs you sent I can clearly see that the server is having issues under high load, because the second and third peaks of the big green hill, seen in the first graph under normal server load, are not seen at all under heavy load!
It is as if the program’s behaviour has changed and ThreadPoolExecutor:::runWorker does not have time to do any java/util/concurrent/FutureTask:::run, and this causes issues beyond my imagination :exploding_head: :upside_down_face: :slight_smile:

5 Likes

Thanks @xranby for your time and help.
I’m surprised by the swapper usage. This server has more than enough RAM. To be sure, I’ve disabled swap as you suggested. But you will see in the following flame graphs that swapper is still present…

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7975         965        4532           0        2477        6710
Swap:             0           0           0

I’ve switched the log level to WARNING (instead of INFO), but it does not seem to change anything in the graph. And even if no logs are coming into jvb.log, we still see quite a significant share of Logger::log.

I’ve performed 7 different perf profiling runs.
During all these tests, I was in a 3-participant video conference handled by the bridge under perf profiling… and I did not notice any audio or video issue.
I will try to redo the test under really high load during the coming week.

| name | Linux load | Download | Upload | Conf | Participants | Sending audio | Sending video | jvb logs |
|---|---|---|---|---|---|---|---|---|
| low-load.svg | 0.3 (0,04 per core) | 1,5Mbps 400pps | 1Mbps 430pps | 3 | 7 | 3 | 3 | No log coming in jvb.log |
| light-load.svg | 0.8 (0,1 per core) | 5Mbps 1,5kpps | 8Mbps 2,7kpps | 3 | 11 | 8 | 9 | No log coming in jvb.log |
| medium-load.svg | 2 (0,25 per core) | 22Mbps 3,9kpps | 30Mbps 7,9kpps | 5 | 19 | 13 | 17 | Many "Suspiciously high rtt value", some "TCC packet contained received sequence" |
| medium-load-2.svg | 3 (0,375 per core) | 30Mbps 5,1kpps | 39Mbps 9,8kpps | 6 | 23 | 14 | 21 | Many "Suspiciously high rtt value", some "TCC packet contained received sequence" |
| high-load.svg | 4.5 (0,56 per core) | 32Mbps 5,7kpps | 43Mbps 9,8kpps | 7 | 27 | 15 | 23 | Many "Suspiciously high rtt value", some "TCC packet contained received sequence" |
| high-load-2.svg | 4.5 (0,56 per core) | 26Mbps 6,7kpps | 22Mbps 12kpps | 7 | 49 | 35 | 21 | No log coming in jvb.log |
| high-load-3.svg | 4 (0,5 per core) | 35Mbps 7,7kpps | 34Mbps 10kpps | 12 | 43 | 32 | 25 | Many "Unable to find encoding matching packet!" (~70 per second, all linked to the same participant) |

SVG file available here: https://nuage.hadoly.fr/s/YpyNzmjwktGjpiJ

2 Likes

Thank you, your server profiling and load numbers are super useful!

All 7 graphs you posted today look much healthier with reduced jvb logging and disabled swap.

Seeing a high CPU percentage here for swapper calling do_idle inside the kernel is OK, as it puts the CPU into power-saving mode when the CPU has nothing to do.
[screenshot: flame graph detail showing swapper calling do_idle]

Your new flame graphs, high-load-2 & 3, show that it is a different part of the JVB that causes 6% of CPU to be spent logging!
Lorg/jitsi/videobridge/cc/vp8/VP8AdaptiveTrackProjectionContext:::rewriteRtp
[screenshot: flame graph detail showing rewriteRtp calling into the logger]

In high-load-3 we can see that there is one additional percentage point of CPU spent on logging, from:
Lorg/jitsi/nlj/transform/node/incoming/VideoParser:::transform

If the logging for these two functions can be tuned or disabled then the server may handle even more load :slight_smile:

It would be best to understand why these two functions want to log a lot; there may be an underlying issue that needs to be looked into, and if that is fixed then no logging is needed.

I used the Ctrl-f search function to search for logging inside the flame graph. Searching makes all logger lines stand out in purple. The search function will also summarize total CPU time spent logging across all the JVB function branches.

Great, looking forward to seeing how it looks under really high load!

Do you know what system this participant uses? You should be able to see it by looking for the participant’s IP address in the nginx log.

We have found some Mac + Chrome clients with non-working hardware acceleration.
You can try disabling the hardware-accelerated H264 codec, because H264 still causes issues for some Chrome users on Mac, and see if that changes anything in the server logs.

No idea; this is something we need to triage to find the root cause of.

I see this warning in my JVB logs as well when I connect using an old laptop that runs out of RAM and the client starts to swap. Tricks to reduce CPU and memory load on the client side fix these kinds of warnings for me and improve the experience on old client hardware.

By analyzing profile graphs on the client side I have found 4 top config tricks to reduce client load.

4 Likes

Hi @xranby @ronquilstr, I really appreciate your work, guys! I’ll try to add flame graphs from my installation later too. You have produced very useful information for the community!

Thank you!

Kind regards,

Milan

1 Like

Good day @bbaldino @damencho @voxter @jcfischer
Our infrastructure is absolutely fine. So what is special? After installing Jitsi from the Debian unstable repositories, everything works perfectly, but not from the Debian stable repositories. The configuration was copied and tested on the same machine with the same hardware; the only things that were exchanged are the Jitsi binaries. This has been verified at several locations and across various installations.

I apologize for the words addressed to the developers, but I do not agree with their statement.

If I want to drive, I don’t have to be a car mechanic or a designer, or an electrician, etc. :wink:

I consider our problem solved at this time.

Enjoy the rest of your day.

Just because you can drive a car, it does not mean that you are able to, or that you should. That’s why we have driving schools: you learn how to do it before you do it ;-).

If the engine stalls, it’s the car’s fault, or the manufacturer’s, but never your own…

But back to the topic; I just needed to share my opinion…

Don’t stress :wink:

@migo @kallenp can you check if disabling the H264 codec makes the users that occasionally drop video on your servers become stable?

Can you also check, by comparing with the nginx log, what operating system + browser the users that drop video are using?

Try disabling the hardware-accelerated H264 video format.
My experience is that not all web browsers support H264, and it is a common source of browser crashes or missing video when H264, in combination with WebRTC, is malfunctioning on the client for some reason. H264 is disabled on meet.jit.si, at least the last time I checked. Disabling H264 makes Chrome on Mac happy when Firefox users join the meeting.

disableH264: true,

you may need to disable it in p2p mode as well

p2p: {
    disableH264: true,
}

Hi @xranby, thank you. I’ve disabled H264 support and will see what happens. Right now we have high usage, over 300 users online in 22 rooms, so it is not easy to isolate users with concrete problems. :frowning:

JVB is complaining a lot about:
2020-05-04 10:55:57.771 WARNING: [6045] [confId=ad48aefab6698d1d epId=58576179 gid=ff1dfc stats_id=Ciara-I1z conf_name=sebo] TransportCcEngine.tccReceived#157: TCC packet contained received sequence numbers: 54651-54656, 54729-54744, 54808-54809. Couldn’t find packet detail for the seq nums: 54651-54656, 54729-54744, 54808-54809. Latest seqNum was 56784, size is 1000. Latest RTT is 3176.397095 ms.
2020-05-04 10:55:57.783 WARNING: [6045] [confId=ad48aefab6698d1d epId=58576179 gid=ff1dfc stats_id=Ciara-I1z conf_name=sebo] TransportCcEngine.tccReceived#157: TCC packet contained received sequence numbers: 55114-55115, 55186-55208, 55264-55267. Couldn’t find packet detail for the seq nums: 55114-55115, 55186-55208, 55264-55267. Latest seqNum was 57087, size is 1000. Latest RTT is 777.956421 ms.
2020-05-04 10:55:57.786 WARNING: [6045] [confId=ad48aefab6698d1d epId=58576179 gid=ff1dfc stats_id=Ciara-I1z conf_name=sebo] TransportCcEngine.tccReceived#157: TCC packet contained received sequence numbers: 55402-55423, 55501-55502. Couldn’t find packet detail for the seq nums: 55402-55423, 55501-55502. Latest seqNum was 57198, size is 1000. Latest RTT is 1067.682251 ms.
2020-05-04 10:55:57.798 WARNING: [6042] [confId=ad48aefab6698d1d epId=58576179 gid=ff1dfc stats_id=Ciara-I1z conf_name=sebo] SendSideBandwidthEstimation.getRttMs#587: RTT suspiciously high (1068ms), capping to 1000ms.
2020-05-04 10:55:58.056 WARNING: [6047] [confId=ad48aefab6698d1d epId=58576179 gid=ff1dfc stats_id=Ciara-I1z conf_name=sebo] SendSideBandwidthEstimation.getRttMs#587: RTT suspiciously high (1287ms), capping to 1000ms.
2020-05-04 10:55:58.165 WARNING: [6054] [confId=dc4d16f432175822 epId=81e56500 gid=ffcb4d stats_id=Rosalee-ltt conf_name=historicbonesalignfortunately] TransportCcEngine.tccReceived#157: TCC packet contained received sequence numbers: 43757-43764. Couldn’t find packet detail for the seq nums: 43757-43764. Latest seqNum was 51897, size is 1000. Latest RTT is 16676.004517 ms.

Thank you,

Milan

Nice. Is that 300 users on one server, or the total across all three of your servers?

To find out what kind of operating system and browser a user who has issues is running (see the command-line sketch after the list):

  1. look for disconnects and reconnects in the JVB log; the IP address of the user is found there.
  2. search for the same IP address in the nginx log; the operating system and browser used are logged there.
  3. if possible, test using similar hardware and see if you can reproduce the issue.
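
A rough command-line sketch of steps 1 and 2, assuming the default Debian log locations (the IP address below is just a placeholder):

# step 1: failed connectivity checks in the JVB log reveal the client IP in the candidate pair
grep "ConnectivityCheckClient.processTimeout" /var/log/jitsi/jvb.log | tail

# step 2: look that IP up in the nginx access log; the User-Agent shows the OS and browser
grep "192.168.1.123" /var/log/nginx/access.log | tail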

High positive RTT values such as these can, in my experience, be resolved by reducing CPU load on the client user interface. It could also be a network issue, such as the end user sitting on a slow wifi connection.

1 Like

Hi, it is the sum for the two JVBs:

root@virt1:~/scripts# ./stats.sh
jitsi_participants: 159
jitsi_conferences: 13
jitsi_largest_conference: 31
jitsi_endpoints_sending_video: 33
jitsi_endpoints_sending_audio: 16
jitsi_receive_only_endpoints: 107
jitsi_threads: 676
jitsi_total_failed_conferences: 0
jitsi_total_partially_failed_conferences: 0

Dropnute pakety v JVB: 0 0
Unable to find encoding matching packet: 1825
Negative rtt: 4478
Resource temporarily unavailable: 0
Couldn’t find packet detail for the seq nums: 4264
Zatazenie: 10:02:25 up 31 days, 12:08, 2 users, load average: 4.28, 3.59, 2.73
Netstat: 138108 receive buffer errors 41 send buffer errors

and

root@BackupStorage:~/scripts# ./stats.sh
jitsi_participants: 159
jitsi_conferences: 11
jitsi_largest_conference: 32
jitsi_endpoints_sending_video: 24
jitsi_endpoints_sending_audio: 14
jitsi_receive_only_endpoints: 112
jitsi_threads: 622
jitsi_total_failed_conferences: 0
jitsi_total_partially_failed_conferences: 1

Dropnute pakety v JVB: 0 0
Unable to find encoding matching packet: 145
Negative rtt: 53
Resource temporarily unavailable: 0
Couldn’t find packet detail for the seq nums: 5030
Zatazenie: 10:03:57 up 22 days, 22:13, 2 users, load average: 3.94, 4.18, 4.17
Netstat: 153656 receive buffer errors 19 send buffer errors

I’ll look for disconnect messages in jvb log.

Milan

1 Like

Hi @xranby, what message exactly should I look for in jvb.log? I can’t find disconnect/reconnect messages. I was part of a 24-member room and enabling all cameras ended badly. :frowning: Can I PM you our jvb.log if you can find something? Or part of it, like: cat /var/log/jitsi/jvb.log | grep " affected conference name"

Thank you,

Milan

In the JVB log, this is what I see when the connection is lost because a 13-year-old Dell client is overloaded (due to multitasking on the client side):
2020-05-04 06:21:30.852 INFO: [5789] [confId=9a6a956fc0a1ee71 gid=ff191b stats_id=Crawford-w8e conf_name=live ufrag=5417c1e7dr6qui epId=b6c56086 local_ufrag=5417c1e7dr6qui] ConnectivityCheckClient.processTimeout#857: timeout for pair: 192.168.1.1:10000/udp/host -> 192.168.1.123:59972/udp/prflx (stream-b6c56086.RTP), failing.
and then inside the nginx log I see the reconnect, the operating system in use and the browser version:
192.168.1.123 - - [04/May/2020:06:21:55 +0200] "POST /http-bind?room= HTTP/2.0" 200 243 "" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"

By combining the JVB log and the nginx log it should be possible to tell if some client OS + browser combinations are having more issues than others. When the JVB runs stably and clients disconnect, then the issue may be on the client end.

1 Like

Thank you, I suspected those messages. :slight_smile: But I can’t find those IPs in the nginx log. :frowning: It is logging only localhost and errors, :frowning: I need to change that first.

Can one or two participants with such a buggy browser/hardware/connection kill the whole room experience for everyone?

Look at the picture: when I asked everyone to enable cams, participants started lagging and got marked as inactive/broken (blue icon, top left). :frowning:

Thank you,

Milan

With 20-30 people all with video enabled inside one conference room, you are quickly approaching the limit of what is possible with SFU WebRTC videoconferencing.

When users are using tile view:
23 users all sending 160p video in tile view would, using a Selective Forwarding Unit (SFU, the videobridge), require
0.2 Mbit * 23 = a 4.6 Mbit download for every user.

23 users all sending 480p video would, using the SFU, require
0.7 Mbit * 23 = a 16 Mbit download for every user; people on a slow link may start to see reduced video resolution.

If some user connects with an HD webcam and that user’s web browser only sends the HD stream to the videobridge, then the videobridge forwards that HD stream to all users. Thus for each user with a browser that only sends HD, the download requirement increases by about 2 Mbit per such user. Users who can’t receive this much data will start to drop out.

Using full-screen mode is ideal for an SFU. The videobridge then only needs to send the video of the active speaker to all viewers. Each user watching in full-screen view then only needs to receive 0.7 Mbit for 480p.

The server itself will likely stay stable regardless of what happens.

If Jitsi were to extend the videobridge with an MCU, then the server requirements would likely increase while the bandwidth to the end users would be reduced for large conference rooms. Adding MCU support is not trivial; it might be possible to do with low latency by using gstreamer or OpenGL rendering on the server.

The upside of using an SFU is low latency and reduced server requirements:
it works great for up to 20 users with audio + video enabled,
it works great for 70 users using audio only,
it works great for 100 users with one speaker + video to many listeners & viewers.
Typical SFU latency stays below 50 ms, hence real-time!

Using an MCU adds latency and requires the server to do the mixing of all video streams.
The upside of an MCU is lower end-user download bandwidth.

5 Likes

Thanks for your analysis @xranby

The “unable to find encoding matching packet” warnings do not seem to be linked to a specific browser or platform.

My feeling about “Suspiciously high rtt value” was also that it is related to high CPU on the end-user side, but it was just a feeling. Thanks for confirming that.
I have already set the 480 resolution, start muted and disableAudioLevels.
I do not understand what DISABLE_VIDEO_BACKGROUND concretely does.
About disableH264, I fear that it could have a negative impact on small hardware configurations where H264 hardware decoding currently works well…
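
For reference, a sketch of where those knobs live on a Debian install, as far as I understand the config (option names quoted from memory, worth double-checking against the config files shipped with your version):

// inside /etc/jitsi/meet/<your-domain>-config.js
resolution: 480,
startAudioMuted: 10,       // participants joining after the 10th start audio-muted
disableAudioLevels: true,

// inside /etc/jitsi/meet/<your-domain>-interface_config.js
DISABLE_VIDEO_BACKGROUND: true,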

Today I discovered that the jvb, with the default deb package config, does some really intensive logging in /tmp/jvb-series.log.
I understand it could be useful for debugging, but as a default config I am a bit surprised!
… and it seems to be linked with the logging we see eating CPU in the flame graphs! :slight_smile:

In /etc/jitsi/videobridge/logging.properties I have changed FileHandler.level to OFF:

java.util.logging.FileHandler.level = OFF

And a new flame graph, under decent load (but not huge; the load was, sadly for this test, correctly balanced between the different jvbs!)

20200504-high-load.svg
Download: 23Mbps / 5,5kpps
Upload: 41Mbps / 10,8kpps
Participants: 36
Conferences: 4
Sending_audio: 11
Sending_video: 22
Linux load: 5 (=0,63 per core)


SVG File: https://nuage.hadoly.fr/s/YpyNzmjwktGjpiJ

I did not notice any audio quality issue in a conference running on this jvb.

Just to sum up the actions taken on the server side:

  • Disable swap
  • Default log level in logging.properties set to .level=WARNING
  • Disable time series logging by setting java.util.logging.FileHandler.level = OFF in logging.properties
5 Likes

Hi, @migo
Can you share your stats.sh script, please?

1 Like