Prosody Load issue with more than 45 JVBs

Hi Everyone,

I am performing some tests with Jitsi in my own environment trying to figure out the maximum number of JVBs a single shard can handle.

Our test environment consist of the following :

1- A single Shard Instance with Prosdy and Jicofo + Nginx | AWS c5.4xlarge 16 Cores 32G RAM

dpkg -s prosody | grep Version

Version: 0.11.2-1
Config-Version: 0.11.2-1

2- 45 JVBs connected from different AWS regions to the same SHARD

dpkg -s jicofo | grep Version

Version: 1.0-563-1

The issue is that when I went beyond 45 JVBs, prosody CPU went to 100% everything slowed down and I started to notice some errors about allocated JVBs channels in the logs :

Jicofo:

Jicofo 2020-05-14 08:27:37.439 SEVERE: [1543] org.jitsi.jicofo.AbstractChannelAllocator.log() jvbbrewery@internal.auth.MYDomain.com/us-east-1-jvb-01 - failed to allocate channels, will consider the bridge faulty: Timed out waiting for a response.
org.jitsi.protocol.xmpp.colibri.exception.TimeoutException: Timed out waiting for a response.
	at org.jitsi.impl.protocol.xmpp.colibri.ColibriConferenceImpl.maybeThrowOperationFailed(ColibriConferenceImpl.java:400)
	at org.jitsi.impl.protocol.xmpp.colibri.ColibriConferenceImpl.createColibriChannels(ColibriConferenceImpl.java:315)
	at org.jitsi.protocol.xmpp.colibri.ColibriConference.createColibriChannels(ColibriConference.java:115)
	at org.jitsi.jicofo.ParticipantChannelAllocator.doAllocateChannels(ParticipantChannelAllocator.java:146)
	at org.jitsi.jicofo.AbstractChannelAllocator.allocateChannels(AbstractChannelAllocator.java:250)
	at org.jitsi.jicofo.AbstractChannelAllocator.doRun(AbstractChannelAllocator.java:169)
	at org.jitsi.jicofo.AbstractChannelAllocator.run(AbstractChannelAllocator.java:143)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

JVB :

> 2020-05-14 08:26:39.428 INFO: [21] org.ice4j.ice.harvest.SinglePortUdpHarvester.<init>: Initialized SinglePortUdpHarvester with address *********:10000/udp
SCTP JNI load: Linux OS detected
SCTP lib loaded
=====>: org_jitsi_modified_sctp4j_SctpJni.c calling init
=====>: org_jitsi_modified_sctp4j_SctpJni.c about to set SCTP_DEBUG_ALL
2020-05-14 08:26:40.319 INFO: [21] AbstractHealthCheckService.run#171: Performed a successful health check in PT1.006824S. Sticky failure: false
2020-05-14 08:26:43.417 WARNING: [22] [hostname=MYDomain.com id=california1] MucClient.lambda$getConnectAndLoginCallable$7#643: [MucClient id=california1 hostname=MYDomain.com] error connecting
org.jivesoftware.smack.SmackException: javax.net.ssl.SSLException: Unsupported or unrecognized SSL message
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets(XMPPTCPConnection.java:1076)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.access$300(XMPPTCPConnection.java:1000)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader$1.run(XMPPTCPConnection.java:1016)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: javax.net.ssl.SSLException: Unsupported or unrecognized SSL message
	at java.base/sun.security.ssl.SSLSocketInputRecord.handleUnknownRecord(SSLSocketInputRecord.java:439)
	at java.base/sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:184)
	at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:108)
	at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1144)
	at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1055)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:395)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.proceedTLSReceived(XMPPTCPConnection.java:810)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.access$1200(XMPPTCPConnection.java:151)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets(XMPPTCPConnection.java:1071)
	... 3 more
2020-05-14 08:26:43.418 WARNING: [28] org.jivesoftware.smack.AbstractXMPPConnection.callConnectionClosedOnErrorListener: Connection XMPPTCPConnection[not-authenticated] (0) closed with error
java.net.SocketException: Connection or inbound has closed
	at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:798)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
	at java.base/java.io.BufferedReader.read1(BufferedReader.java:210)
	at java.base/java.io.BufferedReader.read(BufferedReader.java:287)
	at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2992)
	at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)
	at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
	at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
	at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets(XMPPTCPConnection.java:1248)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.access$300(XMPPTCPConnection.java:1000)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader$1.run(XMPPTCPConnection.java:1016)
	at java.base/java.lang.Thread.run(Thread.java:834)
2020-05-14 08:26:43.419 WARNING: [28] [hostname=MYDomain.com id=california1] MucClient$1.connectionClosedOnError#295: Closed on error:
java.net.SocketException: Connection or inbound has closed
	at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:798)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
	at java.base/java.io.BufferedReader.read1(BufferedReader.java:210)
	at java.base/java.io.BufferedReader.read(BufferedReader.java:287)
	at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2992)
	at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)
	at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
	at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
	at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets(XMPPTCPConnection.java:1248)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.access$300(XMPPTCPConnection.java:1000)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader$1.run(XMPPTCPConnection.java:1016)
	at java.base/java.lang.Thread.run(Thread.java:834)
2020-05-14 08:26:49.312 INFO: [21] Videobridge.createConference#320: create_conf, id=d96faa13d226ce99 gid=null logging=false
2020-05-14 08:26:49.353 INFO: [21] AbstractHealthCheckService.run#171: Performed a successful health check in PT0.041714S. Sticky failure: false
2020-05-14 08:26:53.440 WARNING: [22] [hostname=MYDomain.com id=california1] MucClient.lambda$getConnectAndLoginCallable$7#643: [MucClient id=california1 hostname=MYDomain.com] error connecting
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 5000ms (~5s). While waiting for establishing TLS
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:93)
	at org.jivesoftware.smack.SynchronizationPoint.checkForResponse(SynchronizationPoint.java:272)
	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWait(SynchronizationPoint.java:157)
	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWaitOrThrow(SynchronizationPoint.java:128)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectInternal(XMPPTCPConnection.java:908)
	at org.jivesoftware.smack.AbstractXMPPConnection.connect(AbstractXMPPConnection.java:383)
	at org.jitsi.xmpp.mucclient.MucClient.lambda$getConnectAndLoginCallable$7(MucClient.java:638)
	at org.jitsi.retry.RetryStrategy$TaskRunner.run(RetryStrategy.java:193)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
2020-05-14 08:26:53.462 INFO: [45] org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketReader.parsePackets: XMPPTCPConnection[not-authenticated] (0) received closing </stream> element. Server wants to terminate the connection, calling disconnect()
2020-05-14 08:26:58.461 WARNING: [40] [hostname=MYDomain.com id=california1] MucClient$1.reconnectionFailed#326: Reconnection failed: 
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 5000ms (~5s). While waiting for establishing TLS
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:93)
	at org.jivesoftware.smack.SynchronizationPoint.checkForResponse(SynchronizationPoint.java:272)
	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWait(SynchronizationPoint.java:157)
	at org.jivesoftware.smack.SynchronizationPoint.checkIfSuccessOrWaitOrThrow(SynchronizationPoint.java:128)
	at org.jivesoftware.smack.tcp.XMPPTCPConnection.connectInternal(XMPPTCPConnection.java:908)
	at org.jivesoftware.smack.AbstractXMPPConnection.connect(AbstractXMPPConnection.java:383)
	at org.jivesoftware.smack.ReconnectionManager$2.run(ReconnectionManager.java:289)
	at java.base/java.lang.Thread.run(Thread.java:834)
2020-05-14 08:26:59.312 INFO: [21] Videobridge.createConference#320: create_conf, id=2b47314e74c63249 gid=null logging=false
2020-05-14 08:26:59.323 INFO: [21] AbstractHealthCheckService.run#171: Performed a successful health check in PT0.011603S. Sticky failure: false

Prosody :

|May 14 07:28:53 c2s5613fc6e7510|info|Client disconnected: handshake too long|
|---|---|---|
|May 14 07:29:35 c2s5613fb726680|info|Client disconnected: handshake too long|
|May 14 07:30:18 c2s5613fa129890|info|Client disconnected: handshake too long|
|May 14 07:32:52 c2s5613f828f090|info|Client disconnected: handshake too long|
|May 14 07:33:08 c2s5613fa2bc9e0|info|Client disconnected: handshake too long|
|May 14 07:33:40 c2s5613fad1b030|info|Client disconnected: handshake too long|
|May 14 07:34:13 c2s5613f9e7fb50|info|Client disconnected: handshake too long|
|May 14 07:34:33 c2s5613fa192230|info|Client disconnected: handshake too long|
|May 14 07:36:12 c2s5613fcbca6f0|info|Client disconnected: handshake too long|
|May 14 07:36:54 c2s5613fcb0d570|info|Client disconnected: handshake too long|
|May 14 07:37:36 c2s5613fd854300|info|Client disconnected: handshake too long|
|May 14 07:38:08 c2s5613f8749ac0|info|Client disconnected: handshake too long|
|May 14 07:38:49 c2s5613fbe3f420|info|Client disconnected: handshake too long|
|May 14 07:40:29 c2s5613f8923450|info|Client disconnected: handshake too long|
|May 14 07:41:01 c2s5613fab03790|info|Client disconnected: handshake too long|
|May 14 07:41:29 c2s5613f75219c0|info|Client disconnected: handshake too long|

Observations :

I was under the impression that prosody is multi-threaded, However I noticed that it is using a single core I am not sure if there’s a specific config paramater to change that.

Once I removed the newely added JVBs and went back to the 45 JVBs I started with prosody CPU went down to below 50% and the environment started to get back to normal.

Question:
1- Is there any limitations to the number of JVBs a single SHARD can handle ?
2- Is there any fine-tunning configuration parametrs I should take into considration to handle such load ?
3- Any solution or suggestion regarding the above warnnings and errors ?

Thanks in advance.

Best Regards,
Mohamed Abada

i don’t know but there is a thing called 'light threads" that are definitely not the same as multiprocessing. Searching a bit I have found this an aborted attempt but where was cited as motivation that ‘Prosody on the other hand is pretty well architectured but single threaded’.
This said you should find errors in Prosody log if this was the reason

Thanks @gpatel-fr
@damencho any idea what is happenning here ?

Any support here ?

Hello @Mohamed_Abada we saying about the same exact results. In fact we are running the same size server in AWS using K8s. We see that with 25 conferences with about 30 persons is our breaking point. Have you found any solutions?

Hi @aljen

I beleive your issue is a bit different than the one I ran into, in my case the problem was the number of JVBs connected to a single shard server, this I managed to fix by having more than one shard server.

In your case, can you clarify your setup a litle bit more ? Number of JVBs the size of your Shard instance and your JVB instance ? versions ? logs ?

@Mohamed_Abada Little offtopic, Once you have scaled, Are you performing any load tests on the jvb?. I would like to know how are you doing load tests?

Hello @Mohamed_Abada

In our testing we found we can run about 25 JVB with 30 Participants in each in a single Shard. We are running everything on AWS c5.4xlarge using K8s. When we do more people we begin to see Prosody have errors and everything slowing down. Our thinking is that we will only be able to run about 750 participants per Shard. If that is the case then we will be managing about 75 to 100 shards in each cluster. We thought we would be able to support more people per shard, but so far it becomes unstable when we are at about 900-1000 people per shard. In order to keep it stable we are running only 750 people per shard for now. Any thoughts?