Server load for many simultaneous meetings

Thanks for your analysis @xranby

The “unable to find encoding matching packet” message does not seem to be linked to a specific browser or platform.

I also had the feeling that “Suspiciously high rtt value” is related to high CPU usage on the end-user side, but it was just a feeling. Thanks for confirming that.
I have already set the resolution to 480, enabled start muted, and set disableAudioLevels.
I do not understand what DISABLE_VIDEO_BACKGROUND does concretely.
About disableH264, I fear that it could have a negative impact on low-end hardware where H264 hardware decoding currently works well…

Today I discovered that jvb, with the default deb package config, does some really intensive logging to /tmp/jvb-series.log.
I understand it could be useful for debugging, but I am a bit surprised to see it in the default config!
… and it seems to be linked to the logging that we see eating CPU on the flamegraphs! :slight_smile:

In /etc/jitsi/videobridge/logging.properties I have changed FileHandler.level to OFF:

java.util.logging.FileHandler.level = OFF

And here is a new flamegraph, under decent load (but not huge; sadly, the load was correctly balanced between the different jvbs!):

20200504-high-load.svg
Download: 23Mbps / 5,5kpps
Upload: 41Mbps / 10,8kpps
Participants: 36
Conferences: 4
Sending_audio: 11
Sending_video: 22
Linux load: 5 (=0,63 per core)


SVG File: https://nuage.hadoly.fr/s/YpyNzmjwktGjpiJ

I did not notice any audio quality issues in a conference running on this jvb.

Just to sum up the actions taken on the server side:

  • Disable swap
  • Default log level in logging.properties set to .level=WARNING
  • Disable time series logging by setting java.util.logging.FileHandler.level = OFF in logging.properties
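
The two logging.properties changes above could be scripted roughly as follows. This is only a minimal sketch: the helper name set_prop is made up for illustration, and it assumes plain "key = value" lines as in /etc/jitsi/videobridge/logging.properties.

```shell
#!/bin/bash
# set_prop FILE KEY VALUE
# Hypothetical helper: set "KEY = VALUE" in a Java .properties file.
# An active line for KEY is replaced in place; commented-out lines are left
# alone; if KEY is not present, the setting is appended at the end.
set_prop() {
  local file=$1 key=$2 value=$3 tmp
  tmp=$(mktemp) || return 1
  awk -v k="$key" -v v="$value" '
    {
      pos = index($0, "=")
      name = (pos > 0) ? substr($0, 1, pos - 1) : ""
      gsub(/^[ \t]+|[ \t]+$/, "", name)      # trim whitespace around the key
      if (pos > 0 && name == k) { print k " = " v; found = 1 }
      else print $0
    }
    END { if (!found) print k " = " v }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"  # cat keeps owner and permissions
  rm -f "$tmp"
}

# Example (run as root, then restart jitsi-videobridge):
#   set_prop /etc/jitsi/videobridge/logging.properties .level WARNING
#   set_prop /etc/jitsi/videobridge/logging.properties java.util.logging.FileHandler.level OFF
```

The key is compared as an exact string, so setting ".level" does not touch "java.util.logging.FileHandler.level".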

Hi, @migo
Can you share your stats.sh script please?


Hi, yes, it is very simple. I think you need to enable the REST API in jvb first:

#!/bin/bash
#exit 0
# Fetch the JVB statistics as JSON from the local REST API (3 s timeout).
XSTATS=$(curl -s -f -m 3 http://localhost:8080/colibri/stats)

# Print each selected statistic as "jitsi_<name>: <value>".
for STAT in participants conferences largest_conference endpoints_sending_video endpoints_sending_audio receive_only_endpoints threads total_failed_conferences total_partially_failed_conferences; do
  echo -e "\033[1;34mjitsi_${STAT}: \e[0;32m"$(echo "$XSTATS" | jq ".$STAT")
done

echo -e "\e[0;30m----------------"

# Count notable messages in the JVB log. The variables below are expanded
# unquoted on purpose, so padded or multi-line output collapses onto one line.

# "Dropnute pakety v JVB" = packets dropped in JVB (line count and word count)
j=$(grep Dropping /var/log/jitsi/jvb.log | wc -lw)
echo -e "\033[1;34mDropnute pakety v JVB:\e[0;32m "$j"\e[0;30m"

k=$(grep -c "Unable to find encoding matching packet!" /var/log/jitsi/jvb.log)
echo -e "\033[1;34mUnable to find encoding matching packet:\e[0;32m "$k"\e[0;30m"

l=$(grep -c "Negative rtt" /var/log/jitsi/jvb.log)
echo -e "\033[1;34mNegative rtt:\e[0;32m "$l"\e[0;30m"

m=$(grep -c Resource /var/log/jitsi/jvb.log)
echo -e "\033[1;34mResource temporarily unavailable:\e[0;32m "$m"\e[0;30m"

p=$(grep -c "Couldn't find packet detail for the seq nums:" /var/log/jitsi/jvb.log)
echo -e "\033[1;34mCouldn't find packet detail for the seq nums:\e[0;32m "$p"\e[0;30m"

# "Zatazenie" = system load
n=$(w | grep "load average:")
echo -e "\033[1;34mZatazenie:\e[0;32m "$n"\e[0;30m"

# UDP buffer error counters from the kernel
o=$(netstat -anus | grep "buffer errors")
echo -e "\033[1;34mNetstat:\e[0;32m "$o"\e[0;30m"

Milan


Hi @xranby, can you look at my flame graph please? It was captured under low to moderate load:

root@virt1:~/scripts# ./stats.sh
jitsi_participants: 119
jitsi_conferences: 6
jitsi_largest_conference: 35
jitsi_endpoints_sending_video: 7
jitsi_endpoints_sending_audio: 7
jitsi_receive_only_endpoints: 112
jitsi_threads: 520
jitsi_total_failed_conferences: 0
jitsi_total_partially_failed_conferences: 4

Dropnute pakety v JVB: 0 0
Unable to find encoding matching packet: 602398
Negative rtt: 1904
Resource temporarily unavailable: 0
Couldn’t find packet detail for the seq nums: 3084
Zatazenie: 14:52:51 up 32 days, 16:58, 2 users, load average: 2.02, 2.25, 1.91
Netstat: 145289 receive buffer errors 41 send buffer errors

Here it is: https://www.dropbox.com/sh/6jzvsjttrazdx2i/AAB8rmwZYztYCqcEKeySNZl0a?dl=0

flamegraph2

Thank you in advance,

Milan

Hi, thank you for the clarification. It seems the Jitsi Meet experience depends strongly on the link quality of the participants.
Is there more information somewhere about the WebRTC limits you mentioned? Is a video conference with 50 video participants possible with Jitsi? Is there any information about such limitations in the Jitsi software, @bbaldino @damencho?

Thank you,

Milan

If you start the room with audio and video muted, then you can host a large conference with some key speakers and invite people to enable their camera and microphone when asking questions.

The SFU end-user bandwidth limitation appears if 20-30 people enable their cameras at the same time.

Here is a report of a successful 120-participant Jitsi conference using the above strategy: Maximum number of participants on a meeting on meet.jit.si server - #47 by srinivas
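
To make the end-user bandwidth limit concrete, here is a back-of-the-envelope sketch. With an SFU, the bridge forwards one stream per active camera to every viewer, so the viewer's downlink grows with the number of cameras. The function name and the 200 kbps per-stream figure are assumptions for illustration only. Real bitrates depend on resolution, simulcast layer selection and any limit on the number of forwarded streams.

```shell
#!/bin/bash
# viewer_downlink_kbps STREAMS KBPS_PER_STREAM
# Rough estimate of the downlink one viewer needs when the bridge forwards
# STREAMS video streams at an assumed average bitrate each.
viewer_downlink_kbps() {
  local streams=$1 kbps_per_stream=$2
  echo $(( streams * kbps_per_stream ))
}

# Example with a hypothetical 200 kbps per forwarded stream:
#   viewer_downlink_kbps 25 200    # 25 cameras -> 5000 kbps per viewer
```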


Hi, thank you. We have been using this strategy from the beginning. :slight_smile: Our system is fine with 70 participants and one presenter, so in line with your estimates.

Milan


Great! Thanks for your script, I will try to use it. It may not be so easy, because we are currently using Docker images.

Good conclusions.
I will try these tweaks. Thanks to you, and big respect to @xranby.

In general the flamegraph looks OK, good profiling.
I have two main comments.

org/jitsi/videobridge/cc/vp8/VP8FrameProjection:::rewriteRtp spends quite a lot of time generating JSON: 2.38% of total CPU time on the machine, and a large share of the endpoint send path is slowed down by this. JVB engineers should look at this and check whether the JSON generation can be optimized here.
The top of the big green peak is zoomed in here to illustrate how large a portion of Endpoint:::Send is spent generating JSON.

The second observation is that now that the server runs more efficiently with logging disabled, the garbage collector's CPU usage becomes a viable new target for improving performance. The CMS garbage collector uses 3.15% of total CPU time (the yellow peak in your graph quoted below). You are in a position to start evaluating whether switching to a newer generation of garbage collectors, such as G1 or the very latest concurrent low-pause garbage collectors ZGC and Shenandoah, may remove this 3.15% of CPU usage currently spent on garbage collection with the ConcurrentMarkSweep (CMS) collector.
Page 53 of this slide deck shows some interesting comparisons of the different new JVM GCs: Choosing Right Garbage Collector to Increase Efficiency of Java Memor…

You can explore whether generating a “memory” flame graph gives clues on how to lower JVB's overall memory usage or reduce the need for garbage collection.

An off-CPU flame graph may also be interesting to see if a service request is blocked.

Flame Graphs - here is the latest flame graph research, which describes the different types of CPU, memory and off-CPU flame graphs.

I still do not have any clue why your system shows so many dropped packets.
Can you share your dmesg? Have you installed the latest firmware for your network card?
On my system, which uses a Realtek NIC, I had to install the firmware-realtek Debian package from the Debian contrib non-free repositories.

See Firmware - Debian Wiki for information about missing firmware


KUDOS @xranby! I’m learning a lot from you! Thank you! I wonder how well tuned your servers must be with your skills!

Tomorrow I will have a lesson with the class where 25 cameras make things freak out :slight_smile: and I’ll try to capture that epic fail with a flame graph. What else should I look at?

I’ll look at the suggestions and study the materials. :slight_smile:
Those netstat receive buffer errors are over 32 days of uptime; is that bad? I can confirm that the number of receive buffer errors increases when the server is under very high load. The NIC is:

88:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
88:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)

with Debian 10 default FW/drivers and the ixgbe module.

The second JVB server has 153656 receive buffer errors and 19 send buffer errors, but its NICs are:

01:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)
01:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)
07:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)
07:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)

with default FW/drivers and the igb module.

Thank you, kind regards,

Milan


Hi @xranby, I’m trying to capture a perf trace under high load, but it produced two jvb crashes:

jvb@virt1:/tmp/test$ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar net.virtualvoid.perf.AttachOnce 23869
Exception in thread "main" java.io.IOException: Premature EOF
    at sun.tools.attach.HotSpotVirtualMachine.readInt(HotSpotVirtualMachine.java:292)
    at sun.tools.attach.LinuxVirtualMachine.execute(LinuxVirtualMachine.java:199)
    at sun.tools.attach.HotSpotVirtualMachine.loadAgentLibrary(HotSpotVirtualMachine.java:58)
    at sun.tools.attach.HotSpotVirtualMachine.loadAgentPath(HotSpotVirtualMachine.java:88)
    at net.virtualvoid.perf.AttachOnce.loadAgent(AttachOnce.java:51)
    at net.virtualvoid.perf.AttachOnce.main(AttachOnce.java:34)

Have you had such an experience in the past?

All conferences were moved to the second JVB and could continue. With all rooms moved to one jvb, a nice high-load situation occurred for us:

root@BackupStorage:~/scripts# ./stats.sh
jitsi_bit_rate_upload: 148190
jitsi_bit_rate_download: 24249
jitsi_participants: 210
jitsi_conferences: 12
jitsi_largest_conference: 40
jitsi_endpoints_sending_video: 24
jitsi_endpoints_sending_audio: 17
jitsi_receive_only_endpoints: 182
jitsi_threads: 1020
jitsi_total_failed_conferences: 0
jitsi_total_partially_failed_conferences: 3

Dropnute pakety v JVB: 0 0
Unable to find encoding matching packet: 919
Negative rtt: 1669
Resource temporarily unavailable: 0
Couldn’t find packet detail for the seq nums: 406733
Zatazenie: 10:22:06 up 24 days, 22:31, 3 users, load average: 4.63, 4.59, 4.29
Netstat: 153656 receive buffer errors 19 send buffer errors

This is older HW 8C/16T E5540@2.53GHz.

Here is flame graph from this: https://www.dropbox.com/s/mhqlzdde65dv1h8/flamegraph3.svg?dl=0

It looks similar to me, but you can surely see the differences.

Thank you,

Milan

It happened to me once. It was the only time I forgot to delete the file /tmp/perf-XXXX.map before … might be related.
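
If the stale map file really is related, one way to avoid forgetting it would be to remove it right before attaching. This is a minimal sketch under two assumptions: the XXXX in the file name is the PID of the target JVM, and the helper name is made up here.

```shell
#!/bin/bash
# remove_stale_perf_map PID [DIR]
# Delete a leftover perf map file for the given JVM PID before attaching the
# profiling agent again. DIR defaults to /tmp (assumed location of the file).
remove_stale_perf_map() {
  local pid=$1 dir=${2:-/tmp}
  rm -f -- "$dir/perf-$pid.map"
}

# Example, followed by the attach command used earlier in this thread:
#   remove_stale_perf_map 23869
#   java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar net.virtualvoid.perf.AttachOnce 23869
```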

Hi, I’ll try to delete this file next time. It has happened to me 4 times already :frowning:

Thank you for the suggestion,

Milan

Hi @xranby, as promised I tested today with a larger video group. There were 17 users online in the room. When I asked everybody to turn their camera on, the system stalled for 30s and all participants got the blue inactive icon; I had tile view enabled. I took a perf report at that point in time and maybe it is interesting; it looks a bit different to me. After those 30s or so, all cameras came up and communication in the room worked with no problems.

Here it is: https://www.dropbox.com/s/5o0r8ydqez4qy90/flamegraph4.svg?dl=0

Can you look at it please?

After the lesson ended, we all moved to the meet.jit.si server to evaluate the difference between our installation and the official one. The first thing that stands out is that HD video was disabled by default and some users had only LD quality! I was surprised! All communication was working OK, of course.

@damencho @bbaldino have you implemented some logic that adjusts video quality according to the number of participants in a particular room? Thank you!

Next week I’ll try to organize a larger test group, e.g. 30-40 people, to run a comparison test on both installations, and I’ll collect more perf stats, hopefully with off-CPU and memory flame graphs.

Today's summary: 4 jvb crashes on production systems, failover works fine :slight_smile: and a lot of new questions.

Thank you for your time,

kind regards,

Milan


This crash happens when the profiling tool tries to connect. It is a JVM bug; there is nothing Jitsi can do to fix it.

Since your servers have been running continuously for 24 days and 32 days respectively, I tried to plot the number of dropped packets per minute for your two servers using the data you posted here in this thread.

The server with 4 NICs is running super stable now: not a single dropped packet during the last 8 days!

The server with 2 NICs was dropping some packets at the time of your last report; however, it has also been running stable for 6 days.
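
For anyone who wants to derive such a rate from their own stats.sh samples, here is a minimal sketch. The function names are made up for illustration, and it assumes the netstat "receive buffer errors" counter is cumulative since boot, so the rate between two samples is the counter difference divided by the uptime difference.

```shell
#!/bin/bash
# uptime_minutes DAYS HH:MM
# Convert an uptime such as "up 24 days, 22:31" into minutes.
uptime_minutes() {
  local days=$1 hh=${2%%:*} mm=${2##*:}
  # 10# forces base 10 so values like "08" are not parsed as octal
  echo $(( days * 1440 + 10#$hh * 60 + 10#$mm ))
}

# drops_per_minute PREV_COUNT PREV_MINUTES CUR_COUNT CUR_MINUTES
# Average number of new buffer errors per minute between two samples.
drops_per_minute() {
  awk -v c0="$1" -v t0="$2" -v c1="$3" -v t1="$4" 'BEGIN {
    if (t1 <= t0) exit 1                     # samples must be in time order
    printf "%.2f\n", (c1 - c0) / (t1 - t0)
  }'
}

# Examples with samples posted in this thread:
#   drops_per_minute 153656 "$(uptime_minutes 24 22:31)" 153656 "$(uptime_minutes 25 4:03)"
#   drops_per_minute 0 0 145289 "$(uptime_minutes 32 16:58)"   # average since boot
```

The first example compares two samples with an unchanged counter, so the rate for that interval is zero. The second gives the long-term average since boot, which hides whether the errors came in bursts.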

For the last two flame graphs, captured under a 200-user high load on one server, I have the same conclusions as last time: Server load for many simultaneous meetings - #124 by xranby

Your idea to reduce video quality as the number of users in a room goes up is a good one! I think we now need to focus on reducing resource usage on the client side; after that, your conference should scale up to many, many users with video enabled.

Thank you @xranby! Kudos again, I hope our discussion here will help other people too. :slight_smile: So from your point of view my jvb servers are running well, with minor tweaks possible on the garbage collector side? One more tweak would be to reduce the logging level further, to SEVERE, because generating WARNINGs like “Couldn’t find packet detail for the seq nums” 815347 times at the wrong moment can cause unnecessary load on the server. I think we won’t miss them as users, and when developers have time for us we can easily enable them again.

This server is under more stress today, because of the crashes of the first JVB server that I caused by perf capturing:

root@BackupStorage:~/scripts# ./stats.sh
jitsi_bit_rate_upload: 1566
jitsi_bit_rate_download: 975
jitsi_participants: 34
jitsi_conferences: 3
jitsi_largest_conference: 25
jitsi_endpoints_sending_video: 2
jitsi_endpoints_sending_audio: 1
jitsi_receive_only_endpoints: 29
jitsi_threads: 217
jitsi_total_failed_conferences: 0
jitsi_total_partially_failed_conferences: 16

Dropnute pakety v JVB: 0 0
Unable to find encoding matching packet: 1934
Negative rtt: 4988
Resource temporarily unavailable: 0
Couldn’t find packet detail for the seq nums: 815347
Zatazenie: 15:54:13 up 25 days, 4:03, 3 users, load average: 1.45, 1.06, 0.92
Netstat: 153656 receive buffer errors 19 send buffer errors

Now I’ll disable unneeded features in the UI as you suggested.

One more thing: do I understand correctly that enabling Octo allows sharing one conference between multiple JVBs and so enables large video conferences with many video participants, @damencho?

Or have I misunderstood things?

Thank you!

Kind regards,

Milan

Yes, that is correct. By distributing the conference over several bridges you can reduce the load and so get a bigger conference.


Thank you @damencho for your work! Now a conference will be limited only by the users' network connections and the CPU power needed to handle the UI, I assume, provided server resources are sufficient, of course.

Kind regards,

Milan

Hi @xranby, I’m trying to change the GC, but it seems the only one available is G1 :frowning:

Error: A fatal exception has occurred. Program will exit.
Unrecognized VM option 'UseShenandoahGC'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Unrecognized VM option 'UseZGC'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Are you interested in flame graphs with G1 enabled?
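
For reference, a quick way to check which collectors a given JVM accepts, without touching the jvb service, could look like the sketch below. The function name is made up. It relies on the JVM exiting with a non-zero status for an unrecognized -XX option, as in the errors above, and it unlocks experimental options because ZGC and Shenandoah were experimental in some JDK releases.

```shell
#!/bin/bash
# supported_gcs [JAVA_BIN]
# Print the candidate GC flags that the given java binary accepts.
# JAVA_BIN defaults to "java" on the PATH.
supported_gcs() {
  local java_bin=${1:-java} gc
  for gc in UseG1GC UseShenandoahGC UseZGC; do
    # "-version" makes the JVM start, print its version and exit; an unknown
    # flag makes it fail before that with a non-zero exit status.
    if "$java_bin" -XX:+UnlockExperimentalVMOptions "-XX:+$gc" -version >/dev/null 2>&1; then
      echo "$gc"
    fi
  done
}

# Example:
#   supported_gcs                      # uses "java" from the PATH
#   supported_gcs "$JAVA_HOME/bin/java"
```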

I wanted to remove the film strip from the UI, but this option has been removed now :frowning:

Thank you,

Milan