Frequent Problems with Prosody in a larger env

Hi everyone,

Thank you for all your work on Jitsi. It’s a great thing and has helped many people, especially during COVID.

We have recently been seeing issues with Prosody and have run out of ideas on what to do about them. From time to time Prosody starts consuming one full CPU core and stops responding to XMPP requests. All users then get disconnected, because Jicofo can no longer talk to the bridges either.

There are no real logs showing what leads to this situation, only symptoms, but I will paste them here:
Jicofo

Jicofo 2020-12-17 10:56:13.566 INFO: [3520] org.jitsi.jicofo.FocusManager.log() Exception while trying to start the conference
net.java.sip.communicator.service.protocol.OperationFailedException: Failed to join the room
        at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.joinAs(ChatRoomImpl.java:298)
        at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.join(ChatRoomImpl.java:209)
        at org.jitsi.jicofo.JitsiMeetConferenceImpl.joinTheRoom(JitsiMeetConferenceImpl.java:581)
        at org.jitsi.jicofo.JitsiMeetConferenceImpl.start(JitsiMeetConferenceImpl.java:404)
        at org.jitsi.jicofo.FocusManager.conferenceRequest(FocusManager.java:465)
        at org.jitsi.jicofo.FocusManager.conferenceRequest(FocusManager.java:419)
        at org.jitsi.jicofo.FocusManager.conferenceRequest(FocusManager.java:394)
        at org.jitsi.jicofo.xmpp.FocusComponent.handleConferenceIq(FocusComponent.java:337)
        at org.jitsi.jicofo.xmpp.FocusComponent.handleIQSetImpl(FocusComponent.java:228)
        at org.jitsi.xmpp.component.ComponentBase.handleIQSet(ComponentBase.java:362)
        at org.xmpp.component.AbstractComponent.processIQRequest(AbstractComponent.java:515)
        at org.xmpp.component.AbstractComponent.processIQ(AbstractComponent.java:289)
        at org.xmpp.component.AbstractComponent.processQueuedPacket(AbstractComponent.java:239)
        at org.xmpp.component.AbstractComponent.access$100(AbstractComponent.java:81)
        at org.xmpp.component.AbstractComponent$PacketProcessor.run(AbstractComponent.java:1051)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 15000ms (~15s). Waited for response using: AndFilter: (StanzaTypeFilter: Presence, OrFilter: (AndFilter: (FromMatchesFilter (ignoreResourcepart): reli1017@conference.meet.ffmuc.net, MUCUserStatusCodeFilter: status=110), AndFilter: (FromMatchesFilter (full): reli1017@conference.meet.ffmuc.net/focus, StanzaIdFilter: id=SQYdG-1318979, PresenceTypeFilter: type=error))).
        at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:111)
        at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:98)
        at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:260)
        at org.jivesoftware.smackx.muc.MultiUserChat.enter(MultiUserChat.java:355)
        at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:498)
        at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:444)
        at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.joinAs(ChatRoomImpl.java:240)
        ... 17 more

Jicofo 2020-12-17 10:56:13.732 SEVERE: [3593] org.jitsi.jicofo.AbstractChannelAllocator.log() jvbbrewery@internal.auth.meet.ffmuc.net/jvb9.meet.ffmuc.net - failed to allocate channels, will consider the bridge faulty: Timed out waiting for a response.
org.jitsi.protocol.xmpp.colibri.exception.TimeoutException: Timed out waiting for a response.
        at org.jitsi.impl.protocol.xmpp.colibri.ColibriConferenceImpl.maybeThrowOperationFailed(ColibriConferenceImpl.java:342)
        at org.jitsi.impl.protocol.xmpp.colibri.ColibriConferenceImpl.createColibriChannels(ColibriConferenceImpl.java:282)
        at org.jitsi.protocol.xmpp.colibri.ColibriConference.createColibriChannels(ColibriConference.java:112)
        at org.jitsi.jicofo.ParticipantChannelAllocator.doAllocateChannels(ParticipantChannelAllocator.java:111)
        at org.jitsi.jicofo.AbstractChannelAllocator.allocateChannels(AbstractChannelAllocator.java:271)
        at org.jitsi.jicofo.AbstractChannelAllocator.doRun(AbstractChannelAllocator.java:190)
        at org.jitsi.jicofo.AbstractChannelAllocator.run(AbstractChannelAllocator.java:150)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Jicofo 2020-12-17 10:47:56.376 WARNING: [49] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select bridge for participantRegion=ffmuc-de1
Jicofo 2020-12-17 10:47:56.376 SEVERE: [49] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-12-17 10:47:56.376 WARNING: [49] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select bridge for participantRegion=ffmuc-de1
Jicofo 2020-12-17 10:47:56.376 SEVERE: [49] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-12-17 10:47:56.377 WARNING: [49] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select bridge for participantRegion=ffmuc-de1
Jicofo 2020-12-17 10:47:56.377 SEVERE: [49] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-12-17 10:47:56.377 WARNING: [49] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select bridge for participantRegion=ffmuc-de1
Jicofo 2020-12-17 10:47:56.377 SEVERE: [49] org.jitsi.jicofo.JitsiMeetConferenceImpl.log() Can not invite participant -- no bridge available.
Jicofo 2020-12-17 10:47:56.377 WARNING: [49] org.jitsi.jicofo.bridge.BridgeSelectionStrategy.log() Failed to select bridge for participantRegion=ffmuc-de1

System stats:

Outage yesterday:

Outage today:

We already run with the epoll backend and the open-files limit raised to the maximum. As you can see, the setup runs quite stably until “whatever” happens.

These are our Prosody config files:

Any idea what could be going wrong?

Best and thank you.

Which version of Prosody do you use? And which version of jitsi-meet?

What is the number of participants when this happens?

Are you using BOSH or WebSockets?

Prosody version:
0.11.7-1~buster4

Jicofo version:
1.0-644-1

Jitsi-meet-web version:
1.0.4576-2

It sometimes happens at 600 or 1500 participants and sometimes at 2700, so there is no clear pattern.

We configured everything to use websockets.

OK, that is good. There are a few optimization fixes we worked on with the Prosody team that will be out in 0.11.8, but you can use them now by installing the latest build from https://packages.prosody.im/debian/pool/main/p/prosody-0.11/.

Also, the latest unstable contains several optimizations in the Prosody modules and the client that avoid a lot of unnecessary XMPP stanzas, so updating to the latest unstable will also reduce the load on Prosody considerably.

Prosody is single-threaded, so I would advise you to add the CPU usage of the prosody process to your monitoring. When it stays at 100% of its core for some time, you are seeing this problem.
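
In case it helps, here is a rough sketch of such a check. It is not what runs on meet.jit.si, just one way to do it on Linux: sample `/proc/<pid>/stat` twice and turn the difference in CPU ticks into a percentage of one core. The function names, the 95% threshold, and the "three samples in a row" rule are all assumptions made up for this example, and you have to supply the PID of the prosody process yourself.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split after the last ')' before indexing the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the process state (field 3 in proc(5)); utime and stime
    # are fields 14 and 15, which makes them indexes 11 and 12 here.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before: int, ticks_after: int,
                interval_s: float, ticks_per_s: int) -> float:
    """CPU usage as a percentage of one core over the sampling interval."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def sample_cpu_percent(pid: int, interval_s: float = 5.0) -> float:
    """Read /proc/<pid>/stat twice and return the CPU percentage in between."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, ticks_per_s)


def sustained_saturation(samples, threshold=95.0, min_consecutive=3):
    """True if at least min_consecutive samples in a row reach the threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

You would call `sample_cpu_percent(pid)` in a loop, keep the last few values, and alert when `sustained_saturation(...)` returns true. Because Prosody is single-threaded, the value should top out at roughly 100, so a long run near that ceiling is the signal to watch for.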

We have seen that on meet.jit.si and worked on improving it ^. There are a few more optimizations I’m currently working on.

Do you recommend any specific nightly version of Prosody and/or Jitsi? Or is it just "latest is greatest"? :)

Thank you very much! :)

There was a fix for excessive presence messages in big conferences in nightly118, but yes, take 119 (there was a small fix after 118) and stick with it.
Here is the changelog for the changes that went in after 0.11.7: https://hg.prosody.im/trunk/log?rev=only(0.11%2C+0.11.7)

Just one more question: did you tune muc_room_cache_size or change the storage_backend from memory to internal for meet.jit.si? Does it make any sense to do so?

So muc_room_cache_size is basically how many concurrent rooms you will allow on that shard … it is something like 10000 for meet.jit.si, and everything is set to memory there, because we don’t want anything written to disk for those. Anything other than memory may result in I/O operations, so there is no point in changing that.
Make sure your Prosody also has enough memory …
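
A quick way to keep an eye on that, again only as a sketch under assumptions: on Linux the resident memory of a process is reported as `VmRSS` in `/proc/<pid>/status`, and the total RAM as `MemTotal` in `/proc/meminfo`, both in kB. The helper names below are made up for this example, and which fraction counts as "too much" is for you to decide.

```python
def parse_kib_field(text: str, key: str) -> int:
    """Return the numeric kB value of a 'Key:   12345 kB' style line."""
    for line in text.splitlines():
        if line.startswith(key + ":"):
            # The line looks like 'VmRSS:   812345 kB'; the value is the
            # second whitespace-separated token.
            return int(line.split()[1])
    raise ValueError(f"{key} not found")


def rss_fraction(pid: int) -> float:
    """Resident memory of the process as a fraction of total system RAM."""
    with open(f"/proc/{pid}/status") as f:
        rss_kib = parse_kib_field(f.read(), "VmRSS")
    with open("/proc/meminfo") as f:
        total_kib = parse_kib_field(f.read(), "MemTotal")
    return rss_kib / total_kib
```

Calling `rss_fraction(pid)` with the PID of the prosody process gives a number between 0 and 1 that you can graph next to the CPU usage.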

That makes sense, thank you :).