JVB Endpoints suspended due to insufficient bandwidth with only 5 to 15 users - AWS c5 2xl & 4xl instances, manual setup (not dockerized)

@rpgresearch Can you try the same scenario with the same people on meet.jit.si? If you can reproduce it, send me the meeting link and approximate time in a private message; that may help to debug it. Thanks.

Will do.

I have also, just to rule out any spikes in bandwidth, CPU, or RAM that we might not have been catching in CloudWatch, upgraded:
JMS from c5a.4xlarge to c5n.9xlarge
JVB up to c5n.4xlarge
Jibri up to c5a.4xlarge

That way there are plenty of resources available with the performance tuning tweaks.

I am still applying all of the tuning tweaks to this setup that I learned from load testing 2,500+ users on one server back in June, as per: Hitting hard limit around 600 participants, then start dropping constantly suggestions? - #11 by rpgresearch

In a few hours we’re trying a test with only USA people to rule out the India distance issue (since we’re not setting up a JVB in India, plus Octo, etc.), even though they all had good results on test.webrtc.org and similar tests.

We’re also trying to rule out any potential VPN factors.

If these tests rule out the other issues and we still see the same bandwidth problems after the changes are finished, then I’ll have them hit the public server and let you know how that goes. I’ll update here on how each test goes. Thanks @damencho !

Hmm, I notice they set both of these to the same value:
org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS=central.internal.uat.uat-aws-ediolivemeet.myedio.com

org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS=central.internal.uat.uat-aws-ediolivemeet.myedio.com

in /etc/jitsi/videobridge/sip-communicator.properties.

Is that going to cause issues? Shouldn’t LOCAL and PUBLIC be different (the AWS private IP and the AWS Elastic IP) to work correctly?

Yes, those need to be the actual IP addresses. The symptom will be being unable to establish media with the bridge. Maybe it also has the STUN mapping harvester setting, which overrides these?
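For reference, a minimal sketch of what the corrected settings might look like in /etc/jitsi/videobridge/sip-communicator.properties. The IP addresses below are placeholders, not your real ones, and the note about the STUN mapping harvester reflects my understanding that, if configured, it can discover the mapping automatically and take precedence over the static entries:

```properties
# AWS private IP of the instance (placeholder example value)
org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS=10.0.1.23
# AWS Elastic (public) IP attached to the instance (placeholder example value)
org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS=203.0.113.45
```

If org.ice4j.ice.harvest.STUN_MAPPING_HARVESTER_ADDRESSES is also set, the mapping it discovers may override these static values, so check for it too.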


I will put those on the tweak list next. Thanks!

We completed a USA-only participants meeting earlier today, including a range of the tweaks I made from the link above. We did not see a single insufficient-bandwidth error, though we had different issues we’ll address separately.

We’ll test again on this same setup in a few hours, this time including folks from India again, and if the insufficient-bandwidth issue returns, then we’ll assume it is the geographic distance. It is not in the scope of this particular setup to set up a localized JVB in India, plus Octo, etc. Most of the participants planned for this setup are on the East Coast.

However, we’re having other browser and OS compatibility issues, and some users are desyncing from the room, as per here: Participants get out of sync with other participants in same room, have to refresh browser (rejoin room) as workaround. What is proper fix? That was with the all-USA participants.

I’ll post what I find with the India participants in a few hours.

Regards.

Just to clear up any misunderstanding here, for participants using Chrome, the bandwidth estimation is done at the sender side (JVB) via transport-cc, which basically involves the browser reporting back to JVB which packets it received and which were lost. JVB does the estimation based on that reported loss. When it’s not sending max/HD already, it probes upwards a little from the current sending rate and checks if loss increases, and keeps iterating upwards until either it’s sending max rate or loss increases. When loss is observed it reduces the sending rate until loss drops off, and then starts to probe upward again. (There’s a bit more detail but that’s the general outline.)

So as long as transport-cc is supported, BWE doesn’t depend on the browser reporting bandwidth information, it’s all calculated on the JVB side based on the observed loss. In general if JVB has enough resources & bandwidth, any observed loss is genuine loss and the decision to lower BWE is appropriate. Even a fibre link sold as 1Gbit/1Gbit might (will) sometimes have congestion somewhere else in the path, especially if it’s sold without guarantees as most residential products are.
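The probe-up / back-off loop described above can be sketched roughly as follows. This is a toy illustration, not JVB’s actual algorithm; all constants and names here are made up for the example:

```python
# Toy sketch of a loss-driven send-side bandwidth estimator, loosely
# following the behaviour described above: probe upwards a little while
# loss stays low, back off multiplicatively when loss is observed.

MAX_RATE_KBPS = 2500.0    # "max/HD" sending rate (made-up value)
LOSS_THRESHOLD = 0.02     # loss fraction above which we back off
PROBE_FACTOR = 1.08       # probe a little above the current rate
BACKOFF_FACTOR = 0.85     # multiplicative decrease on loss

def next_rate(current_kbps: float, observed_loss: float) -> float:
    """Return the next sending rate given loss reported via transport-cc."""
    if observed_loss > LOSS_THRESHOLD:
        # Loss observed: reduce the sending rate until loss drops off.
        return max(100.0, current_kbps * BACKOFF_FACTOR)
    if current_kbps < MAX_RATE_KBPS:
        # No significant loss and not yet at max: probe upwards a little.
        return min(MAX_RATE_KBPS, current_kbps * PROBE_FACTOR)
    return current_kbps  # already sending at max rate

# Simulated feedback loop: a clean link, then a short congested spell.
rate = 500.0
for loss in [0.0, 0.0, 0.0, 0.10, 0.10, 0.0, 0.0]:
    rate = next_rate(rate, loss)
```

The key property to notice is the asymmetry: the decrease on loss is multiplicative and immediate, while the increase is a series of small probes, each of which has to be confirmed loss-free before the next one.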

Firefox is a different story because it doesn’t support transport-cc, so bandwidth estimation for Firefox uses REMB, which is estimated on the receiver (Firefox) side and is less accurate.


Thanks for sharing your knowledge. It brings a lot of clarity to the problem.

I don’t think it invalidates the idea that the default JVB config could overreact to small events, as the post I linked to hints, since in these cases the problem is Jitsi installs with no particular problems starting to display this insufficient-bandwidth message at about the same time.

Reading jvb/src/main/kotlin/org/jitsi/videobridge/cc/BandwidthProbing.kt, at my skill level and with my available time, is not conclusive at all for me, but I get the impression that raising the padding-period parameter could force JVB to wait longer before deciding that the bandwidth has changed for good. This could lead to frozen thumbnails for short periods, but that’s less noticeable than a black screen and scary messages.

For sure the algorithm can always be improved. As far as I can see the current send-side estimator is almost a verbatim port of the one in libwebrtc and thus in Chromium/Chrome.

Changing the probe period is a “micro” level change: it alters how quickly JVB can react to changes in available bandwidth. Making the period longer may cause it to lower BWE more slowly, but that’s not a good thing if the estimate is accurate, since a sudden increase in loss needs a quick reaction, otherwise consequences worse than a video quality reduction will follow (broken video stream, broken audio). It would also cause slower recovery after the loss stops.

If there is a problem with accuracy or sensitivity of BWE I think it would be better solved directly by adjusting the algorithm rather than by adjusting the probe period.

There is such a thing as Opus redundancy; as there is no magic, it comes at the cost of additional transmission delay, so I don’t expect video transmission problems to immediately impact sound - not if the Jitsi Videobridge design is any good, at least.
In tile mode anyway, a glitch in the display of one or two thumbnails will not be noticed, and if those tiles are silent, as most are, any sound impact would not be either; however, a thumbnail going dark is a disturbing visual event.

There is such a thing as Opus redundancy; as there is no magic, it comes at the cost of additional transmission delay, so I don’t expect video transmission problems to immediately impact sound - not if the Jitsi Videobridge design is any good, at least.

Opus RED does help, but it is disabled by default (although there is still FEC), and since it leads to higher bandwidth usage for the same effective bitrate, it can sometimes paradoxically make things worse on constrained or low-quality links.
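The bandwidth cost is easy to see with back-of-the-envelope arithmetic. This is an illustration, not Jitsi code; with one level of redundancy each packet also carries a copy of the previous frame, so the on-the-wire audio bitrate roughly doubles for the same effective bitrate:

```python
# Rough arithmetic (not Jitsi code): why audio redundancy costs bandwidth.
# With `levels` redundant copies per packet, the payload is sent
# (1 + levels) times, so the on-the-wire bitrate scales accordingly.

def redundant_bitrate_kbps(base_kbps: float, levels: int = 1) -> float:
    """Approximate on-the-wire audio bitrate with redundant copies."""
    return base_kbps * (1 + levels)

# A 32 kbps Opus stream with one redundant copy costs about 64 kbps on
# the wire - extra load that a congested link can ill afford.
```

(In practice the redundant copy can be encoded at a lower bitrate than the primary, so the real overhead is usually somewhat less than a full doubling, but the direction of the effect is the same.)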

There’s not much JVB can do, design-wise, to help in the case of link congestion, since in that situation the packets that are lost are effectively chosen at random, out in the network after JVB has sent them. All you can do is lower the sending bitrate to alleviate the congestion, which is exactly what BWE is designed to do.

In tile mode anyway, a glitch in the display of one or two thumbnails will not be noticed, and if those tiles are silent, as most are, any sound impact would not be either; however, a thumbnail going dark is a disturbing visual event.

I agree with this, and there are definitely numerous ways it could be improved. Faster recovery from transient congestion (or a way to ‘skip over’ very transient congestion without it affecting BWE) would be nice. Perhaps there is a better way to present a suspended stream than a black tile (which is almost universally hated by users in my experience). Perhaps additional temporal layers with a very low framerate could be considered, etc.

We tested again with the India team last night, and then with a mostly-USA group plus just a few in India this morning.

I had made about 30 different tweaks between the previous test and last night’s, and another tweak to the JVB AWS harvester between last night’s test and this morning’s (while trying to fix the users falling out of sync - they’re calling it “being lost in the 5th dimension” or a “parallel universe”, btw :slight_smile: Participants get out of sync with other participants in same room, have to refresh browser (rejoin room) as workaround. What is proper fix? - #2 by emrah ).

I can confirm that every time we have people from India join, I start seeing the BWE insufficient-bandwidth messages, even when they have plenty of bandwidth and their test.webrtc.org scores were all good. There is just something about them being in India (possibly in combination with their VPN). We’re trying to set up another server that doesn’t require the VPN, to rule that out. We don’t see the issues with the on-prem or VPN users in the USA (since my recent tweaks), but in both last night’s and this morning’s tests the India members generated those messages constantly.

We’ll be testing these more and working on getting better data next week, and I’ll report back later next week on findings. Cheers!

A side-effect of VPNs is often to block UDP traffic. If UDP is not allowed, traffic has to go through TCP, and it’s well known that real-time systems have big problems when you combine TCP with packet loss. Maybe your US users don’t have significant packet loss, so the TCP performance impact is not too bad.
With TCP, loss rates that look insignificant are not: 1/1000 is already a high rate of packet loss, and 1/100 will create major problems.
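The severity of “small” loss rates for TCP can be illustrated with the classic Mathis et al. approximation for steady-state TCP throughput under random loss, throughput ≤ (MSS / RTT) × (C / √p) with C ≈ 1.22. The sketch below is an illustration of that formula, not anything from this thread; the MSS and RTT values are assumed, picked to resemble a long intercontinental path:

```python
import math

# Mathis et al. approximation: an upper bound on steady-state TCP
# throughput given segment size (MSS), round-trip time, and loss rate.
#   throughput <= (MSS / RTT) * (C / sqrt(p)),  C ~ 1.22 for Reno-style TCP

def tcp_throughput_mbps(mss_bytes: float, rtt_s: float, loss: float) -> float:
    """Upper bound on TCP throughput (Mbit/s) for a given loss rate."""
    C = 1.22
    return (mss_bytes * 8 / rtt_s) * (C / math.sqrt(loss)) / 1e6

# With a 1460-byte MSS and an assumed 200 ms RTT (roughly a US <-> India
# path): at 1/1000 loss the bound is ~2.25 Mbit/s, and at 1/100 it drops
# to ~0.71 Mbit/s - which is why even "small" loss rates cripple
# TCP-tunnelled media.
```

That is before counting head-of-line blocking, which hurts real-time media over TCP even more than the raw throughput bound suggests.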

Our experience with India has been that you need JVBs in the country to reliably get decent quality. The international bandwidth provided to a lot of home and office connections in India is too low quality for reliable realtime usage. And yes, bad VPNs are a big problem. A decent VPN that allows UDP traffic and doesn’t add too much latency (e.g. something built with Wireguard) can be fine, but many VPNs aren’t.

Coming back to this after a weekend when I got bored and wrote a small application to show the estimated bandwidth for users in real time on the server. Two things astonished me:

  • the black screen was not actually a black screen: it seems that JVB now freezes the last image - I have never seen the ‘video suspended for insufficient bandwidth’ message. At least that was my experience. Maybe it’s a recent change (I’m using unstable on my test server). Possibly the JVB devs are changing undocumented behaviour of the software without telling us - devious people :slight_smile:

  • how slow recovery is compared to the drop: when something bad happens to the bandwidth, JVB is indeed very fast to drop the estimate below any value usable for video (100 kbps), and it raises the estimate very slowly in comparison. Once it gets back to something useful (300 kbps for me, with VP9), it begins to display a crapastic video and still raises the estimate slowly; I think it raises the framerate first, and once the raised framerate can justify a higher resolution, it switches to a higher resolution (and decreases the framerate). That’s the impression I got, at least.


Is there any progress on fixing this issue?

There isn’t a well-identified bug to fix here. So far, every case I’ve personally helped to diagnose has turned out to be a bandwidth limitation at the client or the server, and once that issue was resolved the problem went away. At AVStack we don’t see this issue, ever, unless the client has bandwidth limitations or connectivity that is somehow broken. That’s not to say there’s nothing to improve in JVB, but if you are having an issue, in my experience it’s almost certainly related to limited bandwidth or CPU somewhere, and not a bug in JVB.

I have often seen that Jitsi cannot estimate the bandwidth in the download direction, though it can for the upload. The browsers have always been Brave or Chrome.

Do you have any idea why that could be? Our Jitsi is in a DMZ and we access it both internally and externally via the public IP address (NAT). WebSockets and the bridge are reachable.

With BWE disabled, everything has worked fine so far.

I think one reason for no BWE in the download (bridge → user) direction would be if the user was not receiving any video. (It’s based on observed loss, so if no video is being sent then there’s no potential loss to observe.) Similar in the opposite direction if they’re not sending any.

Hm, this happened even though every participant received and sent video and audio. In this scenario videos are often suspended.