Server load for many simultaneous meetings

Nice. Is that 300 users on one server, or the total across all three of your servers?

To find out which operating system and browser a user who has issues is running:

  1. look for disconnects and reconnects in the JVB log; the IP address of the user is found there.
  2. search for the same IP address in the nginx log; the operating system and browser used are logged there (a shell sketch of this follows the list).
  3. if possible, test using similar hardware and see if you can reproduce the issue.
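
A minimal shell sketch of steps 1 and 2, assuming the default Debian log locations (/var/log/jitsi/jvb.log, /var/log/nginx/access.log), that nginx access logging is enabled, and that the ICE timeout lines look like the example further down in this thread:

#!/bin/bash
# Step 1: collect client IP addresses from ICE timeout/failure lines in the JVB log.
grep -oE 'processTimeout.* -> ([0-9]{1,3}\.){3}[0-9]{1,3}' /var/log/jitsi/jvb.log \
  | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}$' \
  | sort -u > /tmp/failing-ips.txt

# Step 2: for each such IP, print the User-Agent strings (OS + browser) that nginx saw.
while read -r ip; do
  echo "== $ip =="
  grep "^$ip " /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c
done < /tmp/failing-ips.txt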

Positive RTT values such as this one can, in my experience, be solved by reducing CPU load on the client user interface. It could also be a network issue, such as the end user sitting on slow Wi-Fi.


Hi, it is the sum for two JVBs:

root@virt1:~/scripts# ./stats.sh
jitsi_participants: 159
jitsi_conferences: 13
jitsi_largest_conference: 31
jitsi_endpoints_sending_video: 33
jitsi_endpoints_sending_audio: 16
jitsi_receive_only_endpoints: 107
jitsi_threads: 676
jitsi_total_failed_conferences: 0
jitsi_total_partially_failed_conferences: 0

Dropnute pakety v JVB: 0 0
Unable to find encoding matching packet: 1825
Negative rtt: 4478
Resource temporarily unavailable: 0
Couldn’t find packet detail for the seq nums: 4264
Zatazenie: 10:02:25 up 31 days, 12:08, 2 users, load average: 4.28, 3.59, 2.73
Netstat: 138108 receive buffer errors 41 send buffer errors

and

root@BackupStorage:~/scripts# ./stats.sh
jitsi_participants: 159
jitsi_conferences: 11
jitsi_largest_conference: 32
jitsi_endpoints_sending_video: 24
jitsi_endpoints_sending_audio: 14
jitsi_receive_only_endpoints: 112
jitsi_threads: 622
jitsi_total_failed_conferences: 0
jitsi_total_partially_failed_conferences: 1

Dropnute pakety v JVB: 0 0
Unable to find encoding matching packet: 145
Negative rtt: 53
Resource temporarily unavailable: 0
Couldn’t find packet detail for the seq nums: 5030
Zatazenie: 10:03:57 up 22 days, 22:13, 2 users, load average: 3.94, 4.18, 4.17
Netstat: 153656 receive buffer errors 19 send buffer errors

I'll look for disconnect messages in the JVB log.

Milan


Hi @xranby, what message exactly should I look for in jvb.log? I can't find any disconnect/reconnect messages. I was part of a 24-member room and enabling all the cameras ended badly. :frowning: Can I PM you our jvb.log so you can see if you find something? Or part of it, like: cat /var/log/jitsi/jvb.log | grep "affected conference name"

Thank you,

Milan

In the JVB log, this is what I see when a connection is lost because a 13-year-old Dell client is overloaded (due to multitasking on the client side):
2020-05-04 06:21:30.852 INFO: [5789] [confId=9a6a956fc0a1ee71 gid=ff191b stats_id=Crawford-w8e conf_name=live ufrag=5417c1e7dr6qui epId=b6c56086 local_ufrag=5417c1e7dr6qui] ConnectivityCheckClient.processTimeout#857: timeout for pair: 192.168.1.1:10000/udp/host -> 192.168.1.123:59972/udp/prflx (stream-b6c56086.RTP), failing.
and then in the nginx log I see the reconnect, the operating system in use and the browser version:
192.168.1.123 - - [04/May/2020:06:21:55 +0200] "POST /http-bind?room= HTTP/2.0" 200 243 "" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"

By combining the JVB log and the nginx log it should be possible to tell whether some client OS + browser combinations have more issues than others. When the JVB runs stably and clients still disconnect, the issues are likely on the client end.


Thank you, I suspected those messages. :slight_smile: But I can't find those IPs in the nginx log. :frowning: It is only logging localhost and errors, :frowning: I need to change that first.

Can one or two participants with such a buggy browser/hardware/connection kill the whole room experience for everyone?

Look at the picture: when I asked everyone to enable their cams, participants started lagging and were marked as inactive/broken (blue icon, top left). :frowning:

Thank you,

Milan

With 20-30 people all with video enabled inside one conference room, you are quickly approaching the limit of what is possible with SFU WebRTC videoconferencing.

When users are using tile view:
23 users all sending video in 160p tile view would, using a Selective Forwarding Unit (SFU, the videobridge), require
0.2 Mbit/s * 23 = roughly a 4.6 Mbit/s download for every user.

23 users all sending video at 480p would, using an SFU (videobridge), require
0.7 Mbit/s * 23 = roughly a 16 Mbit/s download for every user; people on a slow link may start to see reduced video resolution.

If some user connects with an HD webcam and that user's web browser only sends HD frames to the videobridge, then the videobridge forwards that HD stream to all users. So for each user whose browser only sends HD, every other participant's download requirement increases by about 2 Mbit/s. Users who can't receive this much data will start to drop out.

Using full-screen mode is ideal for an SFU. The videobridge then only needs to send the active speaker's video to all viewers. Each user watching in full-screen view then only needs to receive 0.7 Mbit/s for 480p.
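
A minimal arithmetic sketch of the estimates above, using the rough per-stream bitrates from this post (the figures are assumptions, not measurements):

#!/bin/bash
# Per-viewer SFU download: each viewer receives one stream per sending participant.
participants=23
for per_stream in 0.2 0.7 2.0; do   # ~160p tile, ~480p, HD (Mbit/s, rough figures)
  total=$(echo "$per_stream * $participants" | bc -l)
  printf "at %s Mbit/s per stream, each viewer downloads ~%.1f Mbit/s\n" "$per_stream" "$total"
done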

The server itself will likely still run stably regardless of what happens.

If Jitsi were to extend the videobridge with an MCU, the server requirements would increase while the bandwidth to end users would be reduced for large conference rooms. Adding MCU support is not trivial; it might be possible to do with low latency by using GStreamer or OpenGL rendering on the server.

The upside of using an SFU is low latency and reduced server requirements; it
works great for up to 20 users with audio + video enabled,
works great for 70 users using audio only,
works great for 100 users with one speaker + video to many listeners and viewers.
Typical SFU latency stays below 50 ms, hence real-time!

Using an MCU adds latency and requires the server to mix all the video streams.
The upside of an MCU is lower end-user download bandwidth.


Thanks for your analysis @xranby

The “unable to find encoding matching packet” message does not seem to be linked to a specific browser or platform.

My feeling about “Suspiciously high rtt value” was also that it is related to high CPU on the end-user side, but it was just a feeling. Thanks for confirming that.
I have already set the resolution to 480, start muted and disableAudioLevels.
I do not understand what DISABLE_VIDEO_BACKGROUND concretely does.
About disableH264, I fear it could have a negative impact on small hardware configurations where H.264 hardware decoding currently works well…
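
A small hedged sketch of where these knobs usually live on a stock Debian install (the file paths are assumptions; check your own deployment):

# config.js options: resolution, disableAudioLevels, disableH264, start muted thresholds
grep -nE 'resolution|disableAudioLevels|disableH264|startAudioMuted|startWithAudioMuted' /etc/jitsi/meet/*-config.js
# DISABLE_VIDEO_BACKGROUND is an interface_config.js option
grep -n 'DISABLE_VIDEO_BACKGROUND' /usr/share/jitsi-meet/interface_config.js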

Today I discovered that the JVB, with the default deb package config, does some really intensive logging to /tmp/jvb-series.log.
I understand it could be useful for debugging, but as a default config I am a bit surprised!
… and it seems to be linked to the logging work we see eating CPU in the flame graphs! :slight_smile:

In /etc/jitsi/videobridge/logging.properties I have changed FileHandler.level to OFF:

java.util.logging.FileHandler.level = OFF

And here is a new flame graph, under decent load (though not huge; sadly for this test, the load was correctly balanced between the different JVBs!)

20200504-high-load.svg
Download: 23Mbps / 5,5kpps
Upload: 41Mbps / 10,8kpps
Participants: 36
Conferences: 4
Sending_audio: 11
Sending_video: 22
Linux load: 5 (=0,63 per core)


SVG File: https://nuage.hadoly.fr/s/YpyNzmjwktGjpiJ

I did not notice any audio quality issues on a conference running on this JVB.

Just to sum up the actions taken on the server side (a shell sketch of applying them follows the list):

  • Disable swap
  • Default log level in logging.properties set to .level=WARNING
  • Disable time series logging by setting java.util.logging.FileHandler.level = OFF in logging.properties
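
A hedged shell sketch of applying these three changes, assuming the stock Debian jitsi-videobridge2 packaging (paths and service name should be verified first):

# 1. Disable swap for the running system (also remove/comment the swap entry in /etc/fstab).
swapoff -a
# 2. Lower the default log level.
sed -i 's/^\.level=.*/.level=WARNING/' /etc/jitsi/videobridge/logging.properties
# 3. Turn off the time-series FileHandler logging.
sed -i 's/^java\.util\.logging\.FileHandler\.level.*/java.util.logging.FileHandler.level = OFF/' \
  /etc/jitsi/videobridge/logging.properties
systemctl restart jitsi-videobridge2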

Hi, @migo
Can you share your stats.sh script please?


Hi, yes, it is very simple; you need to enable the REST API in the JVB first, I think (see the note after the script):

#!/bin/bash
#exit 0

# Pull the videobridge statistics from the colibri REST endpoint.
XSTATS=$(curl -s -f -m 3 http://localhost:8080/colibri/stats)

for STAT in participants conferences largest_conference endpoints_sending_video endpoints_sending_audio receive_only_endpoints threads total_failed_conferences total_partially_failed_conferences; do
  echo -e "\033[1;34mjitsi_${STAT}: \e[0;32m$(echo "$XSTATS" | jq ".$STAT")"
done

echo -e "\e[0;30m----------------"

# "Dropnute pakety v JVB" = dropped packets in the JVB (line and word count of "Dropping" messages).
j=$(grep Dropping /var/log/jitsi/jvb.log | wc -lw)
echo -e "\033[1;34mDropnute pakety v JVB:\e[0;32m" $j "\e[0;30m"

k=$(grep -c "Unable to find encoding matching packet!" /var/log/jitsi/jvb.log)
echo -e "\033[1;34mUnable to find encoding matching packet:\e[0;32m" $k "\e[0;30m"

l=$(grep -c "Negative rtt" /var/log/jitsi/jvb.log)
echo -e "\033[1;34mNegative rtt:\e[0;32m" $l "\e[0;30m"

m=$(grep -c Resource /var/log/jitsi/jvb.log)
echo -e "\033[1;34mResource temporarily unavailable:\e[0;32m" $m "\e[0;30m"

p=$(grep -c "Couldn't find packet detail for the seq nums:" /var/log/jitsi/jvb.log)
echo -e "\033[1;34mCouldn't find packet detail for the seq nums:\e[0;32m" $p "\e[0;30m"

# "Zatazenie" = system load.
n=$(w | grep "load average:")
echo -e "\033[1;34mZatazenie:\e[0;32m" $n "\e[0;30m"

# UDP buffer error counters.
o=$(netstat -anus | grep "buffer errors")
echo -e "\033[1;34mNetstat:\e[0;32m" $o "\e[0;30m"
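
A hedged note on the REST API the script assumes: on the Debian packages of this era, the colibri stats endpoint on localhost:8080 was usually enabled by adding rest to the videobridge APIs, e.g. JVB_OPTS="--apis=rest,xmpp" in /etc/jitsi/videobridge/config (an assumption; check the documentation for your JVB version). A quick check that the endpoint answers:

curl -s http://localhost:8080/colibri/stats | jq '.participants, .conferences'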

Milan


Hi @xranby, can you look at my flame graph please? It was done under low/moderate load:

root@virt1:~/scripts# ./stats.sh
jitsi_participants: 119
jitsi_conferences: 6
jitsi_largest_conference: 35
jitsi_endpoints_sending_video: 7
jitsi_endpoints_sending_audio: 7
jitsi_receive_only_endpoints: 112
jitsi_threads: 520
jitsi_total_failed_conferences: 0
jitsi_total_partially_failed_conferences: 4

Dropnute pakety v JVB: 0 0
Unable to find encoding matching packet: 602398
Negative rtt: 1904
Resource temporarily unavailable: 0
Couldn’t find packet detail for the seq nums: 3084
Zatazenie: 14:52:51 up 32 days, 16:58, 2 users, load average: 2.02, 2.25, 1.91
Netstat: 145289 receive buffer errors 41 send buffer errors

Here it is: https://www.dropbox.com/sh/6jzvsjttrazdx2i/AAB8rmwZYztYCqcEKeySNZl0a?dl=0

flamegraph2

Thank you in advance,

Milan

Hi, thank you for the clarification. The Jitsi Meet experience is strongly dependent on the link quality of the participants, it seems.
Is there more information somewhere about the WebRTC limits you mentioned? Is a video conference with 50 video participants possible with Jitsi? There is no info about such limitations in the Jitsi software, or is there, @bbaldino @damencho?

Thank you,

Milan

If you start the room with audio and video muted then you can host a large conference with some key speakers and invite people to enable the camera and microphone when asking questions.

The SFU end-user bandwidth limitation appears when 20-30 people enable their cameras at the same time.

Here is a report of a successful 120-participant Jitsi conference using the above strategy: Maximum number of participants on a meeting on meet.jit.si server - #47 by srinivas


Hi, thank you. We have been using this strategy from the beginning. :slight_smile: Our system is fine with 70 participants and one presenter, so that is in line with your estimates.

Milan


Great! Thanks for your script, I will try to use it. It might not be so easy, because we are currently using Docker images.

Good conclusions.
I will try these tweaks. Thanks to you, and big respect to @xranby.

In general the flame graph looks OK, good profiling.
I have two main comments.

org/jitsi/videobridge/cc/vp8/VP8FrameProjection:::rewriteRtp spends quite a lot of time generating JSON: 2.38% of total CPU time on the machine, and a large percentage of the endpoint send path is slowed down by this. The JVB engineers should look at this and check whether the JSON generation can be optimized here.
The top of the big green peak is zoomed in here to illustrate how large a portion of Endpoint:::Send is spent generating JSON.

The second observation is that now that the server runs more optimally, with logging disabled, the garbage collector's CPU usage becomes a new viable target for improving performance. The CMS garbage collector uses 3.15% of total CPU time (the yellow peak in your graph quoted below). You are in a position to start evaluating whether switching to a newer generation of garbage collectors, such as G1, or the very latest concurrent pauseless collectors ZGC and Shenandoah, can remove the 3.15% of CPU currently spent garbage collecting with the ConcurrentMarkSweep (CMS) collector.
Page 53 of this slide deck shows some interesting comparisons of the different new JVM GCs: Choosing Right Garbage Collector to Increase Efficiency of Java Memor…
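
A hedged sketch for checking which collector the running JVB actually uses before and after such a change (the pgrep pattern is an assumption; jcmd ships with the JDK and should be run as the same user as the JVB):

JVB_PID=$(pgrep -f jitsi-videobridge | head -n 1)
jcmd "$JVB_PID" VM.flags | tr ' ' '\n' \
  | grep -E 'UseConcMarkSweepGC|UseG1GC|UseZGC|UseShenandoahGC'
# Switching collectors means adding e.g. -XX:+UseG1GC (or -XX:+UseZGC on recent JDKs)
# to the JVB's JVM options; where those live depends on your packaging (assumption).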

You can explore whether generating a “memory” flame graph gives clues on how to lower the JVB's overall memory usage or remove the need to perform garbage collections.

An off-CPU flame graph may also be interesting, to see whether a service request is being blocked.

Flame Graphs - here is the latest flame graph research, describing the different types of CPU, memory and off-CPU flame graphs.

I still do not have any clue why your system shows so many dropped packets.
Can you share your dmesg? Have you installed the latest firmware for your network card?
On my system, using a Realtek NIC, I had to install the firmware-realtek Debian package from the Debian contrib/non-free repositories.

See Firmware - Debian Wiki for information about missing firmware
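
A hedged sketch of those checks (Debian 10 assumed):

dmesg | grep -iE 'firmware|ixgbe|igb'   # missing-firmware or NIC driver messages
netstat -anus | grep 'buffer errors'    # the UDP buffer-error counters quoted in this thread
dpkg -l | grep -i firmware              # firmware packages currently installed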


Kudos @xranby! I'm learning a lot from you! Thank you! I wonder how well your servers must be tuned, given your skills!

Tomorrow we will have a lesson with that class of 25 camera freaks :slight_smile: and I'll try to capture that epic fail with a flame graph. What else should I look at?

I'll look at the given suggestions and study materials. :slight_smile:
Those netstat receive buffer errors accumulated over 32 days of uptime, is that really that bad? I can confirm that the number of receive buffer errors increases when the server is under very high load. The NIC is:

88:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
88:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)

with Debian 10 default firmware/drivers and the ixgbe module.

The second JVB server has 153656 receive buffer errors and 19 send buffer errors, but its NICs are:

01:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)
01:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)
07:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)
07:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)

with default firmware/drivers and the igb module.

Thank you, kind regards,

Milan


Hi @xranby, I'm trying to catch a perf trace under high load, but it produced two JVB crashes:

jvb@virt1:/tmp/test$ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar net.virtualvoid.perf.AttachOnce 23869
Exception in thread “main” java.io.IOException: Premature EOF
at sun.tools.attach.HotSpotVirtualMachine.readInt(HotSpotVirtualMachine.java:292)
at sun.tools.attach.LinuxVirtualMachine.execute(LinuxVirtualMachine.java:199)
at sun.tools.attach.HotSpotVirtualMachine.loadAgentLibrary(HotSpotVirtualMachine.java:58)
at sun.tools.attach.HotSpotVirtualMachine.loadAgentPath(HotSpotVirtualMachine.java:88)
at net.virtualvoid.perf.AttachOnce.loadAgent(AttachOnce.java:51)
at net.virtualvoid.perf.AttachOnce.main(AttachOnce.java:34)

Have you had such an experience in the past?

All conferences were moved to the second JVB and could continue. While all rooms were on one JVB, a nice high-load situation occurred for us:

root@BackupStorage:~/scripts# ./stats.sh
jitsi_bit_rate_upload: 148190
jitsi_bit_rate_download: 24249
jitsi_participants: 210
jitsi_conferences: 12
jitsi_largest_conference: 40
jitsi_endpoints_sending_video: 24
jitsi_endpoints_sending_audio: 17
jitsi_receive_only_endpoints: 182
jitsi_threads: 1020
jitsi_total_failed_conferences: 0
jitsi_total_partially_failed_conferences: 3

Dropnute pakety v JVB: 0 0
Unable to find encoding matching packet: 919
Negative rtt: 1669
Resource temporarily unavailable: 0
Couldn’t find packet detail for the seq nums: 406733
Zatazenie: 10:22:06 up 24 days, 22:31, 3 users, load average: 4.63, 4.59, 4.29
Netstat: 153656 receive buffer errors 19 send buffer errors

This is older hardware: 8C/16T E5540 @ 2.53 GHz.

Here is flame graph from this: https://www.dropbox.com/s/mhqlzdde65dv1h8/flamegraph3.svg?dl=0

It looks similar to me, but you can surely spot differences.

Thank you,

Milan

It happened to me once. It was the only time I forgot to delete the file /tmp/perf-XXXX.map beforehand … it might be related.
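
A hedged one-liner for that: remove the stale perf map before re-attaching (perf-map-agent writes /tmp/perf-<pid>.map; 23869 was the JVB pid in the attach command quoted above):

rm -f /tmp/perf-23869.map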

Hi, I'll try to delete this file next time. It has happened to me 4 times already. :frowning:

Thank you for suggestion,

Milan