Meetings keep crashing when 5 or more participants are active

Hi,

we’re experiencing massive problems with our self-hosted Jitsi installation and wanted to ask for help here. Out setup utilises ejabberd for XMPP and the newest videobridge2 on Debian and our own STUN server.

The problem:

Meeting with 5 or more people suddenly crash and cannot be recovered or reconnected to.

jicofo.log from around the crash:

Jicofo 2020-07-28 10:02:16.999 INFO: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Expiring channels for: xxxyyyxxxyyyxxx@conference.meet.example.com/29048190 on: Bridge[jid=jvbbrewery@conference.meet.example.com/bef086b7-8e0c-4b87-b67a-c94e42124ecc, relayId=null, region=null, stress=0.02]
Jicofo 2020-07-28 10:02:17.000 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.000 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.000 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.000 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.001 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.001 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.001 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.001 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.001 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.001 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.001 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.001 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.001 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.002 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:17.002 WARNING: [100] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select initial bridge for participantRegion=null
Jicofo 2020-07-28 10:02:17.002 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-07-28 10:02:31.997 WARNING: [209] org.jitsi.jicofo.bridge.JvbDoctor.log() jvbbrewery@conference.meet.example.com/bef086b7-8e0c-4b87-b67a-c94e42124ecc health-check timed out, but will give it another try after: 5000

The jitsivideobridge2 process is okay during all of this, but the meeting just stops.

jvb.log from around a crash:

2020-07-28 10:02:03.708 WARNING: [259] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=551795307, SeqNum=14238, M=false, X=true, Ts=3346858253, encodings=
2020-07-28 10:02:03.708 WARNING: [260] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=551795307, SeqNum=14239, M=false, X=true, Ts=3346858253, encodings=
2020-07-28 10:02:03.708 WARNING: [262] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=551795307, SeqNum=14240, M=true, X=true, Ts=3346858253, encodings=
2020-07-28 10:02:03.709 WARNING: [262] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=1832852352, SeqNum=13704, M=true, X=true, Ts=3346858253, encodings=
2020-07-28 10:02:03.739 WARNING: [259] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=551795307, SeqNum=14241, M=false, X=true, Ts=3346861223, encodings=
2020-07-28 10:02:03.739 WARNING: [262] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=551795307, SeqNum=14242, M=true, X=true, Ts=3346861223, encodings=
2020-07-28 10:02:03.739 WARNING: [261] [confId=aa7710d3ccbe0c27 epId=37670487 gid=56278 stats_id=Cade-JqF conf_name=xxxyyyxxxyyyxxx] VideoParser.transform#58: Unable to find encoding matching packet! packet=RtpPacket: PT=100, Ssrc=1832852352, SeqNum=13705, M=true, X=true, Ts=3346861223, encodings=
Got sctp association state update: 1
sctp is now up.  was ready? false
2020-07-28 10:02:21.061 SEVERE: [34] [confId=aa7710d3ccbe0c27 epId=72475673 gid=56278 stats_id=Sharon-AvJ conf_name=xxxyyyxxxyyyxxx] Endpoint.lambda$scheduleEndpointMessageTransportTimeout$7#982: EndpointMessageTransport still not connected.
2020-07-28 10:02:39.408 WARNING: [97032] [confId=aa7710d3ccbe0c27 gid=56278 stats_id=Lupe-nSx componentId=1 conf_name=xxxyyyxxxyyyxxx ufrag=bbirt1eea6tutn name=stream-80074878 epId=80074878 local_ufrag=bbirt1eea6tutn] MergingDatagramSocket.doRemove#349: Removing the active socket. Won't be able to send until a new one is elected.

Sometimes it is possible to reconnect to a meeting, but it will crash again. If any more information is needed, I can happily provide it too. I’m just not sure where to look for pointers anymore.

The issue was caused by the Jicofo health checks, which started to fail after a certain number of participants (often around 5-7) in a meeting. The error message corresponding to the issue was

Jicofo 2020-07-28 10:01:56.995 WARNING: [209] org.jitsi.jicofo.bridge.JvbDoctor.log() jvbbrewery@conference.meet.example.com/PRIMARY-BRIDGE health-check timed out, but will give it another try after: 5000
Jicofo 2020-07-28 10:02:16.996 WARNING: [209] org.jitsi.jicofo.bridge.JvbDoctor.log() Health check timed out for: jvbbrewery@conference.meet.example.com/PRIMARY-BRIDGE
Jicofo 2020-07-28 10:02:16.997 SEVERE: [100] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() One of our bridges failed: jvbbrewery@conference.meet.example.com/PRIMARY-BRIDGE

Furthermore, we noticed that this comment in the source code seems to be related to the behaviour we see.

This might be due to the fact we’re using an external XMPP service (ejabberd) which adds latency in network communication.

We currently “fixed” our instance by disabling health checks in Jicofo by adding

org.jitsi.jicofo.health.ENABLE_HEALTH_CHECKS=false

to /etc/jitsi/jicofo/sip-communicator.properties.

Does anyone know what implications this might cause?

This is a feature to be able to detect unhealthy bridges and remove them from the pool.

We face the same issue. It appears every few weeks during idle time. Sadly it is also recognized once the next user wants to use the service.

Browsing similar issues we decided to disable the health checks as they seem to cause the crash:

2020-09-09 19:20:38.529 INFO: [18] Videobridge.createConference#255: create_conf, id=fa88a2ac2ac1ad68 gid=-1 logging=false
2020-09-09 19:20:38.558 INFO: [18] AbstractHealthCheckService.run#182: Performed a successful health check in PT0.03S. Sticky failure: false
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x000000075029b000, 196608, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 196608 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid914.log

After this happening, the videobrige is also down.

Hi,

how much RAM does your machine have? It seems it’s not enough in any case…

2 GB. So far we never had more than 5 users in a call.

The health-check crashes appeared when there was no load on the machine, so I assume it is unrelated to the maximum memory?

Both jicofo and jvb are configured to take by default up to 3GB of RAM. So on a machine with running both by default you would need ~8GB Ram. You can instruct jicofo and jvb to use less so you don’t get in this situation.


For jicofo you can set it in /etc/jitsi/jicofo/config

Thanks! Trying this now.

If it works, this is a better solution than disabling health-checks all together.