Prosody Load issue with more than 45 JVBs

Hi Everyone,

I am performing some tests with Jitsi in my own environment trying to figure out the maximum number of JVBs a single shard can handle.

Our test environment consist of the following :

1- A single Shard Instance with Prosdy and Jicofo + Nginx | AWS c5.4xlarge 16 Cores 32G RAM

dpkg -s prosody | grep Version

Version: 0.11.2-1
Config-Version: 0.11.2-1

2- 45 JVBs connected from different AWS regions to the same SHARD

dpkg -s jicofo | grep Version

Version: 1.0-563-1

The issue is that when I went beyond 45 JVBs, prosody CPU went to 100% everything slowed down and I started to notice some errors about allocated JVBs channels in the logs :

Jicofo:

Jicofo 2020-05-14 08:27:37.439 SEVERE: [1543] org.jitsi.jicofo.AbstractChannelAllocator.log() jvbbrewery@internal.auth.MYDomain.com/us-east-1-jvb-01 - failed to allocate channels, will consider the bridge faulty: Timed out waiting for a response.
org.jitsi.protocol.xmpp.colibri.exception.TimeoutException: Timed out waiting for a response.
	at org.jitsi.impl.protocol.xmpp.colibri.ColibriConferenceImpl.maybeThrowOperationFailed(ColibriConferenceImpl.java:400)
	at org.jitsi.impl.protocol.xmpp.colibri.ColibriConferenceImpl.createColibriChannels(ColibriConferenceImpl.java:315)
	at org.jitsi.protocol.xmpp.colibri.ColibriConference.createColibriChannels(ColibriConference.java:115)
	at org.jitsi.jicofo.ParticipantChannelAllocator.doAllocateChannels(ParticipantChannelAllocator.java:146)
	at org.jitsi.jicofo.AbstractChannelAllocator.allocateChannels(AbstractChannelAllocator.java:250)
	at org.jitsi.jicofo.AbstractChannelAllocator.doRun(AbstractChannelAllocator.java:169)
	at org.jitsi.jicofo.AbstractChannelAllocator.run(AbstractChannelAllocator.java:143)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

JVB :

> 2020-05-14 08:26:39.428 INFO: [21] org.ice4j.ice.harvest.SinglePortUdpHarvester.<init>: Initialized SinglePortUdpHarvester with address *********:10000/udp
SCTP JNI load: Linux OS detected
SCTP lib loaded
=====>: org_jitsi_modified_sctp4j_SctpJni.c calling init
=====>: org_jitsi_modified_sctp4j_SctpJni.c about to set SCTP_DEBUG_ALL
2020-05-14 08:26:40.319 INFO: [21] AbstractHealthCheckService.run#171: Performed a successful health check in PT1.006824S. Sticky failure: false
2020-05-14 08:26:43.417 WARNING: [22] [hostname=MYDomain.com id=california1] MucClient.lambda$getConnectAndLoginCallable$7#643: [MucClient id=california1 hostname=MYDomain.com] error connecting
org.jivesoftware.smack.SmackException: javax.net.ssl.SSLException: Unsupported or unrecognized SSL message
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets(XMPPTCPConnection.java:1076)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.access$300(XMPPTCPConnection.java:1000)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader$1.run(XMPPTCPConnection.java:1016)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: javax.net.ssl.SSLException: Unsupported or unrecognized SSL message
	at java.base/sun.security.ssl.SSLSocketInputRecord.handleUnknownRecord(SSLSocketInputRecord.java:439)
	at java.base/sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:184)
	at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:108)
	at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1144)
	at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1055)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:395)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.proceedTLSReceived(XMPPTCPConnection.java:810)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.access$1200(XMPPTCPConnection.java:151)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets(XMPPTCPConnection.java:1071)
	... 3 more
2020-05-14 08:26:43.418 WARNING: [28] org.jivesoftware.smack.AbstractXMPPConnection.callConnectionClosedOnErrorListener: Connection XMPPTCPConnection[not-authenticated] (0) closed with error
java.net.SocketException: Connection or inbound has closed
	at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:798)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
	at java.base/java.io.BufferedReader.read1(BufferedReader.java:210)
	at java.base/java.io.BufferedReader.read(BufferedReader.java:287)
	at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2992)
	at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)
	at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
	at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
	at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets(XMPPTCPConnection.java:1248)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.access$300(XMPPTCPConnection.java:1000)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader$1.run(XMPPTCPConnection.java:1016)
	at java.base/java.lang.Thread.run(Thread.java:834)
2020-05-14 08:26:43.419 WARNING: [28] [hostname=MYDomain.com id=california1] MucClient$1.connectionClosedOnError#295: Closed on error:
java.net.SocketException: Connection or inbound has closed
	at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:798)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
	at java.base/java.io.BufferedReader.read1(BufferedReader.java:210)
	at java.base/java.io.BufferedReader.read(BufferedReader.java:287)
	at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2992)
	at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)
	at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
	at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
	at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets(XMPPTCPConnection.java:1248)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.access$300(XMPPTCPConnection.java:1000)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader$1.run(XMPPTCPConnection.java:1016)
	at java.base/java.lang.Thread.run(Thread.java:834)
2020-05-14 08:26:49.312 INFO: [21] Videobridge.createConference#320: create_conf, id=d96faa13d226ce99 gid=null logging=false
2020-05-14 08:26:49.353 INFO: [21] AbstractHealthCheckService.run#171: Performed a successful health check in PT0.041714S. Sticky failure: false
2020-05-14 08:26:53.440 WARNING: [22] [hostname=MYDomain.com id=california1] MucClient.lambda$getConnectAndLoginCallable$7#643: [MucClient id=california1 hostname=MYDomain.com] error connecting
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 5000ms (~5s). While waiting for establishing TLS
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:93)
	at org.jivesoftware.smack.SynchronizationPoint.checkForResponse(SynchronizationPoint.java:272)
	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWait(SynchronizationPoint.java:157)
	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWaitOrThrow(SynchronizationPoint.java:128)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectInternal(XMPPTCPConnection.java:908)
	at org.jivesoftware.smack.AbstractXMPPConnection.connect(AbstractXMPPConnection.java:383)
	at org.jitsi.xmpp.mucclient.MucClient.lambda$getConnectAndLoginCallable$7(MucClient.java:638)
	at org.jitsi.retry.RetryStrategy$TaskRunner.run(RetryStrategy.java:193)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
2020-05-14 08:26:53.462 INFO: [45] org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets: XMPPTCPConnection[not-authenticated] (0) received closing </stream> element. Server wants to terminate the connection, calling disconnect()
2020-05-14 08:26:58.461 WARNING: [40] [hostname=MYDomain.com id=california1] MucClient$1.reconnectionFailed#326: Reconnection failed: 
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 5000ms (~5s). While waiting for establishing TLS
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:93)
	at org.jivesoftware.smack.SynchronizationPoint.checkForResponse(SynchronizationPoint.java:272)
	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWait(SynchronizationPoint.java:157)
	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWaitOrThrow(SynchronizationPoint.java:128)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectInternal(XMPPTCPConnection.java:908)
	at org.jivesoftware.smack.AbstractXMPPConnection.connect(AbstractXMPPConnection.java:383)
	at org.jivesoftware.smack.ReconnectionManager$2.run(ReconnectionManager.java:289)
	at java.base/java.lang.Thread.run(Thread.java:834)
2020-05-14 08:26:59.312 INFO: [21] Videobridge.createConference#320: create_conf, id=2b47314e74c63249 gid=null logging=false
2020-05-14 08:26:59.323 INFO: [21] AbstractHealthCheckService.run#171: Performed a successful health check in PT0.011603S. Sticky failure: false

Prosody :

|May 14 07:28:53 c2s5613fc6e7510|info|Client disconnected: handshake too long|
|---|---|---|
|May 14 07:29:35 c2s5613fb726680|info|Client disconnected: handshake too long|
|May 14 07:30:18 c2s5613fa129890|info|Client disconnected: handshake too long|
|May 14 07:32:52 c2s5613f828f090|info|Client disconnected: handshake too long|
|May 14 07:33:08 c2s5613fa2bc9e0|info|Client disconnected: handshake too long|
|May 14 07:33:40 c2s5613fad1b030|info|Client disconnected: handshake too long|
|May 14 07:34:13 c2s5613f9e7fb50|info|Client disconnected: handshake too long|
|May 14 07:34:33 c2s5613fa192230|info|Client disconnected: handshake too long|
|May 14 07:36:12 c2s5613fcbca6f0|info|Client disconnected: handshake too long|
|May 14 07:36:54 c2s5613fcb0d570|info|Client disconnected: handshake too long|
|May 14 07:37:36 c2s5613fd854300|info|Client disconnected: handshake too long|
|May 14 07:38:08 c2s5613f8749ac0|info|Client disconnected: handshake too long|
|May 14 07:38:49 c2s5613fbe3f420|info|Client disconnected: handshake too long|
|May 14 07:40:29 c2s5613f8923450|info|Client disconnected: handshake too long|
|May 14 07:41:01 c2s5613fab03790|info|Client disconnected: handshake too long|
|May 14 07:41:29 c2s5613f75219c0|info|Client disconnected: handshake too long|

Observations :

I was under the impression that prosody is multi-threaded, However I noticed that it is using a single core I am not sure if there’s a specific config paramater to change that.

Once I removed the newely added JVBs and went back to the 45 JVBs I started with prosody CPU went down to below 50% and the environment started to get back to normal.

Question:
1- Is there any limitations to the number of JVBs a single SHARD can handle ?
2- Is there any fine-tunning configuration parametrs I should take into considration to handle such load ?
3- Any solution or suggestion regarding the above warnnings and errors ?

Thanks in advance.

Best Regards,
Mohamed Abada

i don’t know but there is a thing called 'light threads" that are definitely not the same as multiprocessing. Searching a bit I have found this an aborted attempt but where was cited as motivation that ‘Prosody on the other hand is pretty well architectured but single threaded’.
This said you should find errors in Prosody log if this was the reason

Thanks @gpatel-fr
@damencho any idea what is happenning here ?

Any support here ?