System load rises to very high levels and Jitsi Videobridge crashes

Hello,

We run multiple JVB instances in our setup. Sometimes, all of a sudden, the system load on one of the servers climbs to very high levels, around 500-600 (1-minute load average), and JVB crashes. During this time CPU usage stays normal, between 20% and 80%. We tried changing the OS from Ubuntu 18.04 to Ubuntu 20.04. We also tried changing the cloud service provider and doing fresh installations, but the issue remains the same. We have confirmed that this is not caused by a DDoS attack or anything similar. We have been facing this issue for the last few months. JVB runs on dedicated 4 vCPU instances with 8 GB RAM and 160 GB of disk.
Can anyone help us identify the issue?

The JVM crash log (from /var/crash) is available at the following URL:
https://jvb-logs-tmp.fra1.digitaloceanspaces.com/_usr_lib_jvm_java-8-openjdk-amd64_jre_bin_java.0.crash

The JVB error log is available at the following URL:
https://jvb-logs-tmp.fra1.digitaloceanspaces.com/jvb.log

Thanks in advance!
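
For anyone chasing the same symptom: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a load of 500-600 with moderate CPU usage may point to many threads being blocked rather than busy. Below is a minimal Python sketch, assuming a Linux /proc filesystem, that reads the load figures and counts threads per state. The function names are illustrative and not part of any Jitsi tooling.

```python
from collections import Counter
from pathlib import Path


def parse_loadavg(text):
    """Parse the content of /proc/loadavg into a dict.

    The file looks like: "0.20 0.18 0.12 1/80 11206"
    (1, 5 and 15 minute load, running/total tasks, last PID).
    """
    fields = text.split()
    running, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "running": int(running),
        "total": int(total),
    }


def task_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/task/<tid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so the state is the first field after the
    LAST closing parenthesis.
    """
    return stat_line[stat_line.rfind(")") + 1:].split()[0]


def count_task_states(proc_root="/proc"):
    """Count threads per scheduler state (R = runnable, D = uninterruptible, ...)."""
    counts = Counter()
    for stat_path in Path(proc_root).glob("[0-9]*/task/[0-9]*/stat"):
        try:
            counts[task_state(stat_path.read_text())] += 1
        except (OSError, IndexError):
            continue  # the thread exited while we were scanning
    return counts
```

Logging `parse_loadavg(Path("/proc/loadavg").read_text())` together with `count_task_states()` every few seconds would show whether a spike consists mostly of D-state or R-state threads, which narrows down where to look next.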

Your logs are full of ‘resource temporarily unavailable’ errors for SCTP connections (you know that SCTP is deprecated, right?). You could consider bandwidth problems instead of focusing on CPU and RAM. Those are important resources, but bandwidth is a key factor, probably the most important one here.

@gpatel-fr
Yes, SCTP is deprecated in the latest version; we are using a previous version. Bandwidth does not seem to be the issue, because each instance has 1 Gbps of outbound bandwidth available and we hardly reach 200 Mbps of outbound transfer. The inbound transfer rate is also very low. We tried updating to the latest JVB version, but then we ran into issues with our Jibri instances.

Not using the recommended version makes it difficult for you to get help from the Jitsi devs, which is the most advanced help you can hope to get for free on this forum.

About bandwidth, it’s a complex question. You may be more experienced in these matters than I am, but in case you are not, you have to understand that some hosters are, let’s say, crafty and use wiggle room in their wording. If you see 1 Gbps in the ‘network’ column, you may have to look for an asterisk pointing to a small-print notice saying that the quoted speed is the network adapter’s capability and should match the real bandwidth most of the time, but that when the system is loaded (from the hoster’s point of view, the system is the big server from which small VMs like yours are carved) the real bandwidth can fall to 200-300 Mbit/s. With a real-time system like Jitsi Meet, such features are a no-no. You would get at best big quality problems from time to time, and at worst trigger unexpected bugs in the software - Jitsi devs are not testing their software with crappy hardware, sad but that’s the way of developers :-).

Anyway, even if you have no bandwidth problems, the log messages seem to say that there are network problems. So you should monitor the network to see what happens when the server can’t push data out: lack of resources such as handles, MTU discrepancies on the network path, heavy use of TCP instead of UDP, or whatever.
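
One way to start that kind of monitoring is to watch the kernel's UDP counters, since the bridge's media traffic normally runs over UDP. The sketch below is only an illustration, assuming a Linux host that exposes /proc/net/snmp; which counters (if any) move during an incident is something to observe, not something this thread has established. The helper names are made up for the example.

```python
def parse_snmp(text, protocol="Udp"):
    """Parse one protocol section of /proc/net/snmp into {counter: value}.

    Each protocol appears as two lines with the same prefix: a line of
    counter names followed by a line of values, e.g.
        Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
        Udp: 100 2 0 90 0 5
    """
    prefix = protocol + ":"
    rows = [line.split()[1:] for line in text.splitlines()
            if line.startswith(prefix)]
    if len(rows) < 2:
        return {}  # protocol section not present
    names, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(names, values)}


def counter_deltas(before, after):
    """Return only the counters that changed between two snapshots."""
    return {name: after[name] - before[name]
            for name in after
            if name in before and after[name] != before[name]}
```

Taking a snapshot with `parse_snmp(open("/proc/net/snmp").read())` once a minute and printing `counter_deltas(previous, current)` would show, for example, whether `SndbufErrors` or `InErrors` climb just before the load spikes.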

@gpatel-fr
I am not very experienced when it comes to bandwidth. I said that bandwidth might not be the issue for these reasons:

  1. We tried switching to a different cloud service provider (using dedicated CPU instances in both cases). We also confirmed with the cloud provider’s technical support team that outbound bandwidth is not an issue. They claimed that they only throttle above 1 Gbps.
  2. The issue wasn’t there for the first 5 months. It started appearing all of a sudden in the existing setup. Initially we performed load testing with almost 3 times the current load, and everything worked fine at that time.

I think I should use some tool to check how much bandwidth they actually provide. Other than that, I will also try disabling SCTP. The thing is, from the logs I am not able to identify what the issue could be.
Thanks! :slight_smile:
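
As a starting point for that check, the traffic a server actually pushes can be derived from the interface byte counters in /proc/net/dev. This is a rough Python sketch, assuming a Linux host; the function names are invented for the example. Note that it only shows observed throughput, not the capacity the provider can deliver. Measuring capacity usually needs an active test between two hosts, for instance with a tool such as iperf3.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev into {interface: (rx_bytes, tx_bytes)}.

    The first two lines are headers. Each following line is
    "<name>: <8 receive fields> <8 transmit fields>", where the first
    field of each group is the byte counter.
    """
    counters = {}
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput_mbps(before, after, seconds):
    """Per-interface (rx_mbps, tx_mbps) between two parse_net_dev snapshots."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name in before:
            rx0, tx0 = before[name]
            rates[name] = ((rx1 - rx0) * 8 / seconds / 1e6,
                           (tx1 - tx0) * 8 / seconds / 1e6)
    return rates
```

Sampling `parse_net_dev(open("/proc/net/dev").read())` every few seconds and logging `throughput_mbps(previous, current, interval)` gives a timeline that can be lined up with the load spikes, to see whether outbound traffic drops or stalls when they happen.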

refer to this - ‘crashes in the bridge’…

Yes, I think this is happening because of SCTP. Someone else has also reported the problem here in the past.
I will move from SCTP to WebSockets and confirm here whether that fixes it.