Site slowing down on concurrent connections

We have set up our own self-hosted Jitsi deployment and found that our platform doesn't handle concurrent activity well.
First we scaled our JVBs and switched from data channels to WebSockets for the Prosody and JVB connections. After that everything was fine: with Jitsi Torture we could verify that our system supports as many concurrent users as we need in large conferences. But since real-world meetings are mostly not that big, we are seeing issues when the same number of users is spread across rooms of 3.
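For context, the WebSocket switch was done with the usual options; this is only a sketch, assuming a docker-jitsi-meet style layout and the placeholder hostname meet.example.com:

```js
// jitsi-meet config.js: use the Colibri WebSocket instead of the SCTP data channel,
// and talk to Prosody over WebSocket instead of BOSH
openBridgeChannel: 'websocket',
websocket: 'wss://meet.example.com/xmpp-websocket',
```

```hocon
# jvb.conf: enable Colibri WebSockets on the bridge
videobridge {
  websockets {
    enabled = true
    domain = "meet.example.com:443"
    tls = true
    server-id = "jvb1"
  }
}
```

(Prosody also needs its "websocket" module enabled so the XMPP WebSocket endpoint is served.)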

The problem is that when we load test and add our desired number of users with Torture, it takes around 10 minutes for all the Torture users to enter their rooms, and joining a room as a real person during that window, which normally takes under 5 seconds, takes 30 seconds to a minute.

Any ideas which part of the system might be misconfigured and what we could do to solve this issue?

P.S: We had a few incidents in production where Jicofo logged that it couldn't see the JVBs, and our stats show that all conferences were dropped and the users were kicked out at that time. That's why we assumed that either Jicofo or Prosody might not be able to process the load of user activity, which could affect the JVB health checks; and since Jicofo then assumes the JVBs are unavailable, it starts dropping the users.


It can be a problem on the Torture side; you need enough resources for each instance, a few cores and 4 to 8 GB of RAM.


We have set up Torture on servers with 320 cores and almost 1000 GB of memory in total, and have also increased the operating system's thread limit (the default was 7072, we set it to 100000).
No exceptions occur in the Torture instances with this setup.
The problem is that it takes 10 minutes to bring up 300 participants (~100 conferences, 3 participants per conference).
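For reference, the thread-limit change mentioned above was along these lines (a sketch for a typical Linux host; the exact keys and values to raise may differ per distribution):

```sh
# Raise the kernel-wide thread and PID limits (the defaults are easily exhausted
# by the many browser processes Torture spawns)
sysctl -w kernel.threads-max=100000
sysctl -w kernel.pid_max=100000

# Raise the open-file limit in the shell that launches the Torture instances
ulimit -n 65535
```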


This chart is constructed from JVB data (the colibri/stats endpoint).
As it shows, it took 10 minutes to reach 300 participants in 100 conferences.
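For anyone reproducing this, the counts come from polling each bridge's stats endpoint; a sketch, assuming the JVB REST API is enabled on its default port 8080:

```sh
# Pull the current conference/participant counts from a bridge
curl -s http://localhost:8080/colibri/stats | jq '{conferences, participants}'
```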

Also, in production, we have seen that with more than roughly 150 participants across 6 JVB instances, Prosody can't keep up with the videobridge health-check responses and doesn't deliver them to Jicofo, so suddenly Jicofo assumes that all videobridges are gone (crashed), cancels all of the conferences, and all of the participants are kicked out together.


As the chart shows, at 09:31:00 all of the conferences suddenly failed. We checked the logs: all of the videobridges were available and there were no disconnection errors for the videobridges in the Prosody logs at that time, but Jicofo logged that all videobridges had failed and were no longer connected, so it kicked everyone out and terminated all of the conferences.
We see this behaviour once a day. We switched to JVB WebSockets and Prosody WebSockets, increased the kernel thread limit, and so on, but the problem still occurs.
We appreciate any help.
Thanks

Are you running Prosody 0.11.7? When the problem occurs, is Jicofo able to join a room in Prosody?

* util.indexedbheap: Fix heap data structure corruption, causing some timers to fail after a reschedule (fixes [#1572](https://issues.prosody.im/1572))

There was a bug in Prosody where certain packets could get delayed, which results in timeouts in Jicofo, and it can then think the bridges have gone away.


Thanks for the reply.
No, we are using the dockerized version of Prosody from the docker-jitsi-meet project. docker-jitsi-meet installs Prosody 0.11.5.
I checked the Prosody log and there was no such error (the log level is set to info, not debug).
Should I upgrade Prosody?

There is no error in Prosody when this occurs. You will notice it as timeouts in the Jicofo logs, Jicofo being unable to join rooms due to timeouts, and, if health checks are running, those health checks failing.
From the way you describe the problem, this is what comes to my mind: you are hitting this bug.
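A quick way to look for those symptoms in a docker-jitsi-meet deployment (a sketch; the service name jicofo is assumed from the standard docker-compose file):

```sh
# Scan the Jicofo logs for timeouts and failing health checks
docker-compose logs jicofo | grep -iE "timeout|health"
```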


Thank you. I will upgrade Prosody and see what happens.
But for now, there is no Prosody 0.11.7 in


How can I find a Debian package of Prosody 0.11.7? I just found nightly builds at

thanks

Here: https://prosody.im/download/package_repository

I followed the instructions, but it installs
http://packages.prosody.im/debian stretch/main amd64 prosody amd64 0.11.5-1~stretch6
(i.e. version 0.11.5, not 0.11.7).

Yeah, I see there are no 0.11.6 and 0.11.7 packages for stretch… Not sure what you can do in this case…
We will be working on updating the Docker images to a newer Debian, but that involves other changes that need to be done first, and it will take time.

Alright, thanks for the reply.
For now, I think I have two solutions:
1- Use nightly builds.
2- Change the Dockerfile and use Ubuntu (groovy, bionic, …) instead of Debian Stretch, but that is a little bit difficult (I have tested it before), since some Lua packages are incompatible with some Ubuntu versions and some packages (those used in the Prosody Dockerfile of docker-jitsi-meet) are not available in some Ubuntu versions.
So I prefer to use the nightly builds. Do you know which nightly build has fixed the problem?
I think I have to use one of the latest files from

Is there any other solution to fix the problem temporarily?

Well, you can apply the fix by hand or with a diff… it is very simple: https://hg.prosody.im/trunk/rev/7d4c292f178e
Skip the change to the tests and apply only the one in util/indexedbheap.lua.
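A sketch of applying it on a Debian-based install, assuming Prosody's modules live under /usr/lib/prosody and that hgweb's raw-rev URL serves the plain diff (review the diff before applying):

```sh
cd /usr/lib/prosody
curl -s https://hg.prosody.im/trunk/raw-rev/7d4c292f178e -o indexedbheap.diff
# The changeset also touches a spec/ test file that is not shipped in the package;
# only the util/indexedbheap.lua hunk is needed, so skip the missing file if prompted.
patch -p1 --dry-run < indexedbheap.diff
patch -p1 < indexedbheap.diff
# then restart Prosody (or the prosody container)
```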


Very nice, thank you.
I can patch the file with the diff. This is much easier than the other solutions I suggested.

Have you scaled Prosody for meet.jit.si as well as the JVBs? It seems like the speed of the initial connection to a meeting on your platform is consistent, independent of the number of users on the platform.

@damencho
Do you have any idea? Thank you.

If you are asking about meet.jit.si: yes, of course it is scaled. We have multiple shards in each of the 7 regions where we run meet.jit.si, with hundreds of bridges to serve it. There are thousands of concurrent sessions and a single shard cannot handle that load.


We dug deeper into the issue and started generating MUC load against our Prosody. We added around 750 users in 50 conferences (15 participants per conference) using chat clients (the xmppjs library) and had the bots send messages.

The very odd thing here is that although the overall CPU usage of our 32-core server was low, one of the cores was busy (80% to 100% usage) the whole time, and since the bots were just connecting to Prosody and sending chat messages, neither Jicofo nor the JVBs were involved in the test.
So we concluded that our issue comes from Prosody getting very busy while only using one core to handle the load, and it obviously hits the limit of that core very quickly. Since Prosody is the heart of message passing in our system, users joining rooms and their overall experience are affected a lot. When the load test is not running we can join our rooms in less than 5 seconds, but while the tests run it takes more than 2 minutes to join the same room.
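For illustration, each bot in the test did roughly the following; this is only a sketch using the xmpp.js client library, and the hostnames, room names, and credentials are placeholders, not our actual script:

```js
// Minimal MUC chat bot sketch using @xmpp/client (the xmppjs library).
// All addresses and credentials below are placeholders.
const { client, xml } = require("@xmpp/client");

const xmpp = client({
  service: "wss://meet.example.com/xmpp-websocket",
  domain: "meet.example.com",
  username: "loadtest-bot-1",
  password: "secret",
});

const room = "room-1@conference.meet.example.com";

xmpp.on("online", async () => {
  // Join the MUC room
  await xmpp.send(
    xml("presence", { to: `${room}/bot-1` },
      xml("x", { xmlns: "http://jabber.org/protocol/muc" }))
  );
  // Send a groupchat message every few seconds
  setInterval(() => {
    xmpp.send(
      xml("message", { to: room, type: "groupchat" },
        xml("body", {}, "load test message"))
    );
  }, 3000);
});

xmpp.start().catch(console.error);
```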

Is there a way to force Prosody to use more than one core?




P.S1: We are using docker and docker-compose to spin up our stack.
P.S2: Here is the code we used to generate this load.

Prosody is by its nature single-threaded.


Can you try the same test with this version: prosody-0.11_1nightly111-1?
You can take it from here: https://packages.prosody.im/debian/pool/main/p/prosody-0.11/
Do you see any difference?
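A sketch of installing it on the Prosody host or image; the exact file name below is an assumption, so pick the actual stretch build of nightly111 from the pool listing:

```sh
# File name is illustrative; check the pool directory for the real one
wget https://packages.prosody.im/debian/pool/main/p/prosody-0.11/prosody-0.11_1nightly111-1~stretch_amd64.deb
dpkg -i prosody-0.11_1nightly111-1~stretch_amd64.deb
apt-get -f install   # pull in any missing dependencies
```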


We use the Prosody Docker image provided in docker-jitsi-meet (debian:stretch-slim); should we use this Debian package? That's the last Debian stretch package, the others are for Ubuntu.