We’re seeing some odd behavior with OCTO enabled in our PROD environment: when a participant hangs up, it takes up to a minute for them to actually leave the room as seen from the far side.
Note: this is happening regardless of the number of JVBs used.
This has nothing to do with OCTO. When the browser window unloads we send an XMPP presence unavailable; that is the signalling for a participant leaving, and it is broadcast to all current participants.
We updated that code 2-3 months ago because Chrome was changing how it behaves on window unload. I was thinking that you might be using an older version and got hit by that … but yours is pretty up to date.
Do you have any modifications in jitsi-meet that may be affecting this? We haven’t seen problems with it.
@damencho it is working fine in our development environment (a stand-alone server) with the same jitsi-meet code and the same versions, hence I suspect it might have something to do with OCTO and MUCs.
There must be another problem then. This is pure XMPP signalling and has nothing to do with jicofo and the bridges…
Maybe if you enable debug logging in prosody, reproduce the issue, and check the logs at the moment of closing the tab, you will find a clue.
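If it helps, here is a minimal sketch of turning on debug logging in prosody (the config path and log locations are assumptions typical of a Debian-style install; yours may differ):

```lua
-- /etc/prosody/prosody.cfg.lua (path may differ on your install)
-- Raise the log level to debug so BOSH session activity becomes visible
log = {
    debug = "/var/log/prosody/prosody.log"; -- everything, incl. BOSH request/response timing
    error = "/var/log/prosody/prosody.err";
}
```

After restarting prosody, watch the debug log while closing the tab to see when (and whether) the unavailable presence arrives.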
As @darylhutchings mentioned, the issue was actually due to the HAProxy timeout configuration.
Currently the lowest timeout we managed to set in HAProxy is 15s; lowering it further causes lots of timeout exceptions.
Tweaking the HAProxy configuration proved effective. However, at this point we’re not sure whether what we have in place is best practice, so it would be amazing if you could share some details about how HAProxy timeouts are configured for meet.jit.si, just to make sure we’re on the right track.
Also, any advice on how to ensure lower latency and the best performance out of the below architecture would be highly appreciated.
I’m not that familiar with HAProxy config, but the default BOSH timeout is 60 seconds. This means the BOSH session will time out on the server if there is no activity from the client. To catch no-network cases and some other communication failures, the client sends a ping every 10 seconds when there is no other activity, to ensure the client-to-server connection is alive.
I would expect HAProxy to follow those timeouts and sit a little above them, so that client and server can time out properly without their connection being interrupted.
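For reference, a sketch of where those server-side defaults live in prosody’s config (the values shown are prosody’s documented defaults, spelled out for clarity):

```lua
-- prosody.cfg.lua: server-side BOSH session timeout
bosh_max_inactivity = 60  -- seconds a BOSH session survives with no client activity (the default)

-- XEP-0199 ping support, so client keepalive pings get answered
modules_enabled = {
    -- ... other modules ...
    "ping";
}
```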
Well, the default timeouts are in the client and in prosody, so this will just break the connection earlier than expected rather than overriding anything. For example, the timeout in the client is hardcoded in the library and cannot be configured, as its value follows the spec. @Aaron_K_van_Meerten can give an update about HAProxy and BOSH timeouts.
To properly tunnel BOSH, it is important that every hop along the way has a timeout of over 60 seconds. We have tuned our ALBs, HAProxy, and nginx to a 90-second timeout, so that the final timeout reply is sure to come from the prosody server rather than from a hop timing out before then.
In addition, the ping module must be enabled in prosody, to ensure that clients regularly refresh their timeout. Since we have the ping module set to run every 10s, a client would have to miss 5 pings before being timed out.
I would not suggest a 15s HAProxy timeout, as that would only work for clients which are receiving reliable pings. It’s important that prosody be the server that replies with the timeout; otherwise prosody will not treat the client as disconnected, but will instead leave the participant in the room for the entire 60-second window.
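A minimal HAProxy sketch of that pattern (the values are illustrative, not the actual meet.jit.si configuration):

```
# haproxy.cfg: every hop's timeout sits above BOSH's 60s long-poll window,
# so the timeout reply always comes from prosody itself
defaults
    mode http
    option http-keep-alive
    timeout connect  5s
    timeout client  90s   # > 60s: client-side idle timeout
    timeout server  90s   # > 60s: covers the long-poll held open by prosody
```

The same 90s value would then be mirrored in nginx (proxy_read_timeout) and in the ALB idle timeout.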
As far as latency and performance goes, for a single shard deployment this looks about right. In our case we run a shard in every region as well, and when we’re using Octo we interconnect the bridges such that every bridge knows about every shard. Then the HAProxy selects the closest shard to the users, rather than simply letting Jicofo select the closest bridge. This helps optimize round-trip times for users in particular geos.
@Aaron_K_van_Meerten Thanks a lot for the details and explanation. However, if 90s is your timeout value across all hops except Prosody, then how come on meet.jit.si a participant’s thumbnail is removed instantly on the far side once that participant leaves the call or ends the session by closing the browser, considering that the BOSH timeout is set to 60s by default?
I understand that OCTO is disabled at the moment for meet.jit.si; however, if I understand correctly, it is still using the same environment, going through your ALBs, HAProxy, and nginx.
In our case, with the timeout configured to 15s, we are also relying on option http-keep-alive, which enables HTTP keep-alive mode on both the client and server sides. This provides the lowest latency on the client side and the fastest session reuse on the server side, which gives us the best performance so far from what I have tested over the last few days.
Our aim is to lower the time it takes to clear a participant’s thumbnail on the far side once that participant closes the browser or leaves the call. The closest we got to what meet.jit.si shows was with HAProxy’s option httpclose and Prosody’s bosh_max_inactivity = 2, which brought this time down to 2 seconds, at the expense of low quality; hence we decided to revert that and stick with the keep-alive option alongside a 15s timeout, which is working fine for us now.
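For concreteness, the configuration we settled on versus the one we reverted (fragments only; surrounding directives omitted):

```
# haproxy.cfg: what we kept
defaults
    mode http
    option http-keep-alive   # reuse connections on client and server sides
    timeout client 15s
    timeout server 15s

# What we reverted: 'option httpclose' here, plus in prosody.cfg.lua:
#   bosh_max_inactivity = 2
# That cleared thumbnails in ~2s but cost us quality.
```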
I don’t think using HTTP keep-alive has any effect on the need for the HAProxy timeout to be longer than the BOSH timeout Aaron mentioned. BOSH uses long polling, so a single HTTP request can take as long as 60 seconds to time out.