Meetings keep crashing when 5 or more participants are active

Hi,

we’re experiencing massive problems with our self-hosted Jitsi installation and wanted to ask for help here. Out setup utilises ejabberd for XMPP and the newest videobridge2 on Debian and our own STUN server.

The problem:

Meeting with 5 or more people suddenly crash and cannot be recovered or reconnected to.

jicofo.log from around the crash:

Jicofo 2020-07-28 10:02:16.999 INFO: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Expiring channels for: xxxyyyxxxyyyxxx@conference.meet.example.com/29048190 on: Bridge[jid=jvbbrewery@conference.meet.example.com/bef086b7-8e0c-4b87-b67a-c94e42124ecc, relayId=null, region=null, stress=0.02]
Jicofo 2020-07-28 10:02:17.000 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.000 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.000 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.000 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.001 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.001 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.001 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.001 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.001 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.001 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.001 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.001 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.001 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.002 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.002 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.002 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:31.997 WARNING: [209] org.jitsi.jicofo.bridge.JvbDoctor.log() jvbbrewery@conference.meet.example.com/bef086b7-8e0c-4b87-b67a-c94e42124ecc health-check timed out, but will give it another try after: 5000

The jitsivideobridge2 process is okay during all of this, but the meeting just stops.

jvb.log from around a crash:

2020-07-28 10:02:03.708 WARNING: [259] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=551795307, SeqNum=14238, M=false, X=true, Ts=3346858253, encodings=
2020-07-28 10:02:03.708 WARNING: [260] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=551795307, SeqNum=14239, M=false, X=true, Ts=3346858253, encodings=
2020-07-28 10:02:03.708 WARNING: [262] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=551795307, SeqNum=14240, M=true, X=true, Ts=3346858253, encodings=
2020-07-28 10:02:03.709 WARNING: [262] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=1832852352, SeqNum=13704, M=true, X=true, Ts=3346858253, encodings=
2020-07-28 10:02:03.739 WARNING: [259] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=551795307, SeqNum=14241, M=false, X=true, Ts=3346861223, encodings=
2020-07-28 10:02:03.739 WARNING: [262] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=551795307, SeqNum=14242, M=true, X=true, Ts=3346861223, encodings=
2020-07-28 10:02:03.739 WARNING: [261] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=1832852352, SeqNum=13705, M=true, X=true, Ts=3346861223, encodings=
Got sctp association state update: 1
sctp is now up.  was ready? false
2020-07-28 10:02:21.061 SEVERE: [34] [confId=aa7710d3ccbe0c27 epId=72475673 gid=56278 stats_id=Sharon-AvJ conf_name=xxxyyyxxxyyyxxx] Endpoint.lambda$scheduleEndpointMessageTransportTimeout$7#982: EndpointMessageTransport still not connected.
2020-07-28 10:02:39.408 WARNING: [97032] [confId=aa7710d3ccbe0c27 gid=56278 stats_id=Lupe-nSx componentId=1 conf_name=xxxyyyxxxyyyxxx ufrag=bbirt1eea6tutn name=stream-80074878 epId=80074878 local_ufrag=bbirt1eea6tutn] MergingDatagramSocket.doRemove#349: Removing the active socket. Won't be able to send until a new one is elected.

Sometimes it is possible to reconnect to a meeting, but it will crash again. If any more information is needed, I can happily provide it too. I’m just not sure where to look for pointers anymore.

The issue was caused by the Jicofo health checks, which started to fail after a certain number of participants (often around 5-7) in a meeting. The error message corresponding to the issue was

Jicofo 2020-07-28 10:01:56.995 WARNING: [209] org.jitsi.jicofo.bridge.JvbDoctor.log() jvbbrewery@conference.meet.example.com/PRIMARY-BRIDGE health-check timed out, but will give it another try after: 5000
Jicofo 2020-07-28 10:02:16.996 WARNING: [209] org.jitsi.jicofo.bridge.JvbDoctor.log() Health check timed out for: jvbbrewery@conference.meet.example.com/PRIMARY-BRIDGE
Jicofo 2020-07-28 10:02:16.997 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() One of our bridges failed: jvbbrewery@conference.meet.example.com/PRIMARY-BRIDGE

Furthermore, we noticed that this comment in the source code seems to be related to the behaviour we see.

This might be due to the fact we’re using an external XMPP service (ejabberd) which adds latency in network communication.

We currently “fixed” our instance by disabling health checks in Jicofo by adding

org.jitsi.jicofo.health.ENABLE_HEALTH_CHECKS=false

to /etc/jitsi/jicofo/sip-communicator.properties.

Does anyone know what implications this might cause?