Bandwidth problem. Maybe?

Hey guys!

I’m running a shard with multiple JVBs, with Octo and JWT tokens enabled. The versions are the latest stable. On average, I was able to host 200-300 users simultaneously, spread across ~25 different conferences distributed between the JVBs. In all conferences, most of the users are audio/video muted.

Things were going well, so I decided to increase the number of users (and, of course, the number of JVBs) on the shard. I was hosting ~450 users across ~40 conferences when all the conferences crashed.

The JMS is hosted on a 16 vCPU/16 GB server and the additional JVBs are on 8 vCPU/8 GB machines. The load on each server was very low, never exceeding 25%.

I think the problem is related to bandwidth. But that’s a topic I have very little knowledge of, so that’s where my question is. The servers are hosted on Google Cloud (E2 series), and from what I read in the docs, the total egress network traffic of each instance is capped at 7 Gbps. On the monitoring page (in the Google Cloud console), I could see that for about 2 hours the bandwidth on the JMS server was around 4-6 MiB/s; then there was a spike up to 12-14 MiB/s, and that’s when everything crashed.
The weird thing is that on previous days the bandwidth got close to 20 MiB/s on this same server and everything was fine.
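If it helps, the Cloud console numbers can be cross-checked on the host itself with something like this (just a sketch: sar comes from the sysstat package, and the interface name here is a guess for my setup):

# per-interface throughput (rxkB/s, txkB/s) sampled every 2 seconds
sar -n DEV 2 | grep -E 'IFACE|ens4'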

Considering that most users are audio/video muted the majority of the time, could this problem still be caused by insufficient bandwidth?

Apparently the logs are OK; I couldn’t identify anything unusual.

Thanks for your help.

What does “crashed” mean? What were the symptoms? What did the participants experience?

Oh, forgot to mention that. All participants were shown the message that something has gone wrong (I don’t remember the exact wording), and upon reloading the page, all they could see was a grey screen. After some time, the shard was “healthy” again. But at that point, I had already moved the users to another shard.

To dig further into what the problem was, the Jicofo logs would be helpful. One other thing: you may see the CPU usage as 25%, but Prosody is single-threaded, so it could be maxing out one core while the overall average looks low (100% of 1 core out of 4 shows up as 25%); you need to monitor that.
Also, which Prosody version do you use? The latest, 0.11.8, has a few performance optimisations. And you are using the latest stable Jitsi packages, I guess.
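To check that, watch whether a single core gets pinned while the average stays low; on a standard Linux box something like this will do (mpstat and pidstat come with the sysstat package, so this is just a sketch):

mpstat -P ALL 2                                  # per-core view; look for one core near 100%
pidstat -u -p "$(pgrep -f prosody | head -1)" 2  # CPU usage of the Prosody process itself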

Will try to get the logs.

In the meantime, just to clarify some things, if you please. Yes, the Prosody version is the latest, 0.11.8. So, is it normal for a setup and number of users like mine, with only one JMS, that Prosody could reach 100% of a core? I know we don’t know yet if that’s the case here, but is it a real possibility?

And to avoid that, one could have all the shards share the same JVBs: many shards, many Prosody instances. Is that correct to assume? Or even many shards, each with a limited number of JVBs and users.

[Edit]:
These are logs from the time it crashed… It’s just a snippet, but these lines kept repeating for about 10 minutes before the shard became healthy again:

Jicofo 2021-03-09 23:27:01.127 WARNING: [161] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/6726f9f8-bcf4-40e0-9f11-9b99c31cc16a health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:04.465 WARNING: [56] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/db00d66c-22ef-4b19-8200-2f1d5a76bf3a health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:04.890 WARNING: [100] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/92e93cf3-c73f-4e7b-a4ce-539d3bb25e11 health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:05.318 WARNING: [162] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/3f1c4dfc-8441-469d-9f9b-7d494cb6b2c2 health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:06.372 WARNING: [71] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/23067787-5da9-47b2-828d-f7adadcbb36b health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:07.005 WARNING: [163] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/20c22d25-a9cb-4b5c-aa06-42a545db42cd health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:08.300 WARNING: [164] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/2ef9d115-ed48-41ae-8794-e9b8266417de health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:17.628 WARNING: [70] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/5bf5a33b-0c47-4c50-af45-5eac94503219
Jicofo 2021-03-09 23:27:18.514 WARNING: [99] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/f3072dd1-ac1d-46eb-b76e-76b829138f55
Jicofo 2021-03-09 23:27:21.128 WARNING: [161] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/6726f9f8-bcf4-40e0-9f11-9b99c31cc16a
Jicofo 2021-03-09 23:27:24.466 WARNING: [56] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/db00d66c-22ef-4b19-8200-2f1d5a76bf3a
Jicofo 2021-03-09 23:27:24.890 WARNING: [100] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/92e93cf3-c73f-4e7b-a4ce-539d3bb25e11
Jicofo 2021-03-09 23:27:25.318 WARNING: [162] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/3f1c4dfc-8441-469d-9f9b-7d494cb6b2c2
Jicofo 2021-03-09 23:27:26.373 WARNING: [71] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/23067787-5da9-47b2-828d-f7adadcbb36b
Jicofo 2021-03-09 23:27:27.005 WARNING: [163] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/20c22d25-a9cb-4b5c-aa06-42a545db42cd
Jicofo 2021-03-09 23:27:28.300 WARNING: [164] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/2ef9d115-ed48-41ae-8794-e9b8266417de
Jicofo 2021-03-09 23:27:32.629 WARNING: [70] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/5bf5a33b-0c47-4c50-af45-5eac94503219 health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:33.515 WARNING: [169] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/f3072dd1-ac1d-46eb-b76e-76b829138f55 health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:36.128 WARNING: [161] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/6726f9f8-bcf4-40e0-9f11-9b99c31cc16a health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:39.466 WARNING: [56] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/db00d66c-22ef-4b19-8200-2f1d5a76bf3a health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:39.891 WARNING: [100] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/92e93cf3-c73f-4e7b-a4ce-539d3bb25e11 health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:40.318 WARNING: [162] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/3f1c4dfc-8441-469d-9f9b-7d494cb6b2c2 health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:41.373 WARNING: [71] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/23067787-5da9-47b2-828d-f7adadcbb36b health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:42.006 WARNING: [163] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/20c22d25-a9cb-4b5c-aa06-42a545db42cd health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:43.300 WARNING: [164] JvbDoctor$HealthCheckTask.doHealthCheck#246: jvbbrewery@internal.auth.my-meet.domain.com/2ef9d115-ed48-41ae-8794-e9b8266417de health-check timed out, but will give it another try after: 5000
Jicofo 2021-03-09 23:27:52.630 WARNING: [70] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/5bf5a33b-0c47-4c50-af45-5eac94503219
Jicofo 2021-03-09 23:27:53.515 WARNING: [169] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/f3072dd1-ac1d-46eb-b76e-76b829138f55
Jicofo 2021-03-09 23:27:56.129 WARNING: [161] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/6726f9f8-bcf4-40e0-9f11-9b99c31cc16a
Jicofo 2021-03-09 23:27:59.467 WARNING: [56] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/db00d66c-22ef-4b19-8200-2f1d5a76bf3a
Jicofo 2021-03-09 23:27:59.891 WARNING: [100] JvbDoctor$HealthCheckTask.doHealthCheck#271: Health check timed out for: jvbbrewery@internal.auth.my-meet.domain.com/92e93cf3-c73f-4e7b-a4ce-539d3bb25e11

This is when jvbs disconnected from the brewery room.

Gather some JVB logs as well.

I was only able to get logs from the JMS server… the additional JVBs are autoscaled.

This appeared around the time it crashed:

JVB 2021-03-09 23:26:57.259 SEVERE: [37] [hostname=my-meet.domain.com id=shard] MucClient$MucWrapper.setPresenceExtensions#758: Failed to send stanza:

Then it was followed by lots of these:

JVB 2021-03-09 23:27:00.035 WARNING: [4708] ColibriWebSocketServlet.createWebSocket#129: Received request for an nonexistent conference: 5ed7db0b6b612b81
JVB 2021-03-09 23:27:00.044 WARNING: [4564] ColibriWebSocketServlet.createWebSocket#129: Received request for an nonexistent conference: f650f7a852e165a5
JVB 2021-03-09 23:27:00.307 WARNING: [4696] ColibriWebSocketServlet.createWebSocket#129: Received request for an nonexistent conference: 278994eb0b70f15b
JVB 2021-03-09 23:27:00.440 WARNING: [4122] [confId=d33df1fe809094d8 epId=70de8d1a gid=14656 stats_id=Kale-NhE conf_name=wmd2dcrv8uey332nxbie@conference.my-meet.domain.com] EndpointMessageTransport.endpointMessage#526: Unable to find endpoint to send EndpointMessage to: 394c27fc

I don’t know if it’s relevant or not, but I’ve not enabled websockets.

Why? Maybe not connected? It seems like the network between the JVBs and Prosody was the problem…

Going to try to find out. Is there a way to tell how/why the JVBs lost the connection to Prosody? As in, something specific that can cause this?
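In the meantime I’ll grep the JVB log on the JMS for connection events, something like this as a rough filter (the exact messages vary by version, and the log path is the usual one from a Debian package install):

grep -iE 'mucclient|disconnect|reconnect|closed' /var/log/jitsi/jvb.log | tail -50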

As you pointed out earlier, the jvbs disconnected from the brewery room. All of them. I guess we’ve found the problem. Now I’ll try to figure out why the jvbs are randomly disconnecting.

@damencho Sorry to ping you directly, but since you’re the one who was helping me with information…

The issue described in the first post happened again, with the same messages in the logs. After some time, all the bridges get disconnected, and it takes a while longer for them to reconnect.

After some time investigating, I discovered a few unusual things, so I’m coming back to ask whether they could be the cause.

What I found is the following: when the videobridges are discovered by Jicofo, they’re added (‘Added new videobridge…’), then removed (‘A bridge left the MUC…’), then added again. This happens with every new JVB that comes up, autoscaled or not.

And the most important fact is that this behaviour only shows up when org.jitsi.jicofo.BridgeSelector.BRIDGE_SELECTION_STRATEGY is set to IntraRegionBridgeSelectionStrategy. With RegionBasedBridgeSelectionStrategy or SplitBridgeSelectionStrategy, the bridge is discovered and added by Jicofo just fine, without the “adding…, removing…, adding again…” cycle.

Haven’t had any conferences with RegionBasedBridgeSelectionStrategy set yet, but do you think this could be what was causing the issue?

That’s interesting, but I don’t think so… But what happens when they start losing the connection and timing out…?

Well, all the JVBs disconnect at the same time. So all users see the message that something has gone wrong, and when the page is reloaded, all they see is a grey screen.

One more important note: when the shard was hosting 200-300 users at most, things were apparently OK. Then I pushed the number to ~450 and that’s when the issue occurred. But yesterday, at the moment of the crash, the shard had fewer than 50 users.

At the time of writing this post, I’ve had a few conferences running with ~80 users for about 2 hours, and everything is fine so far. The only change I made was setting org.jitsi.jicofo.BridgeSelector.BRIDGE_SELECTION_STRATEGY to RegionBasedBridgeSelectionStrategy instead of IntraRegionBridgeSelectionStrategy.
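For reference, this is the line I changed (in my install it lives in Jicofo’s sip-communicator.properties; the path below is the usual Debian location and may differ on other setups):

# /etc/jitsi/jicofo/sip-communicator.properties
org.jitsi.jicofo.BridgeSelector.BRIDGE_SELECTION_STRATEGY=RegionBasedBridgeSelectionStrategy

Jicofo needs a restart to pick the change up.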

I’ll be back with more info as soon as I get more users on the shard.

You need to see what happens at the moment of the drop: what is the memory usage? Is the network between the JVBs and Prosody going down? What is the Prosody CPU usage at that time…? Stuff like that. Check the Prosody logs, the JVB logs, the nginx logs and syslog, and see which one starts showing a problem first…
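As a rough starting point, something along these lines at the moment of the drop (assuming a standard Debian package install; the log paths may differ on your setup):

free -m                                                   # memory pressure on the JMS
pidstat -u -p "$(pgrep -f prosody | head -1)" 2 5         # Prosody CPU around the event
tail -50 /var/log/prosody/prosody.err
grep -iE 'disconnect|closed|timeout' /var/log/jitsi/jvb.log | tail -50
grep -iE 'timed out|left the muc' /var/log/jitsi/jicofo.log | tail -50
tail -50 /var/log/nginx/error.log /var/log/syslog

Then compare the timestamps to see which component starts complaining first.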