Streams corrupted after 15 mins, resulting in incorrect BWE

We’ve encountered, and determined the cause of, a JVB bug in which video streams become corrupted after about 15 minutes. The result is an incorrect bandwidth estimate (BWE), which in turn causes other streams to degrade.

The unresolved issue https://github.com/jitsi/jitsi-videobridge/issues/1648 is an example of this bug, but this bug is “wider” than the reported issue. It affects meet.jit.si, and (based on some of the community topics) I think some other users have been encountering this for a while.

The guidelines recommend posting here before submitting a bug report. Should I list the specifics here, or append to the existing reported issue, or create a new issue?

Perhaps detail your findings here first, so that others can contribute and help establish whether what you’re experiencing is in fact a bug. Either way, if it does turn out to be one, it will still have the right visibility for the team to act on it.


If you have evidence that it’s indeed that bug, please post a comment there with your findings.

I’ve included the summary here and added the specific details as a comment to the original bug report:
https://github.com/jitsi/jitsi-videobridge/issues/1648

Steps to reproduce

1. Join meet.jit.si with 5 people, at least one of them using Firefox or Safari (so that the call drops back to VP8).
2. Make sure all participants are in tile mode.
3. Shrink one participant’s window until the tiles are 2x2, so that one remote video stream is not displayed and becomes inactive.
4. Wait 20 minutes.
5. Reactivate that stream by enlarging the window, scrolling down, or switching to stage view (so that the reactivated stream is shown in the filmstrip).
6. The reactivated stream is now “corrupted”: its packets are reported as lost, the BWE drops, and the other video streams degrade.

Summary

After an endpoint’s stream has been suspended (due to bandwidth allocation, last-n or pagination) and later resumes, the RTP sequence number (SEQ) should continue incrementing without a gap. For VP9 packets, and for VP8 packets when no other endpoint is requesting that SSRC, the SEQ number correctly “pauses” and then “resumes” from the previous count. However, VP8 packets (when at least one other endpoint is requesting the SSRC) resume after a pause with a gap in the SEQ number, as though the counter had kept incrementing while the stream was paused.
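The expected behaviour can be sketched as a per-receiver translator that re-anchors the outgoing sequence number when a suspended stream resumes, so the receiver sees a contiguous sequence. This is a minimal, hypothetical illustration rather than JVB’s code: `SeqTranslator`, `on_suspend` and `on_forward` are invented names, and the sketch ignores reordering and retransmissions.

```python
SEQ_MOD = 1 << 16  # RTP sequence numbers are 16-bit


class SeqTranslator:
    """Hypothetical sketch: rewrites source sequence numbers for one receiver
    so that the forwarded stream stays contiguous across suspensions."""

    def __init__(self) -> None:
        self.offset = 0            # amount subtracted from source sequence numbers
        self.last_out = None       # last sequence number sent to this receiver
        self.resume_pending = False

    def on_suspend(self) -> None:
        # The receiver stops getting this SSRC; re-anchor when it resumes.
        self.resume_pending = True

    def on_forward(self, src_seq: int) -> int:
        if self.resume_pending and self.last_out is not None:
            # Choose the offset so the first resumed packet is last_out + 1,
            # hiding every packet the receiver never saw while suspended.
            self.offset = (src_seq - (self.last_out + 1)) % SEQ_MOD
        self.resume_pending = False
        out = (src_seq - self.offset) % SEQ_MOD
        self.last_out = out
        return out


t = SeqTranslator()
sent = [t.on_forward(s) for s in (100, 101, 102)]
t.on_suspend()
# 40,000 source packets are not forwarded while the stream is suspended.
resumed = [t.on_forward(s) for s in (40103, 40104)]
print(sent, resumed)  # [100, 101, 102] [103, 104]
```

In this model, the behaviour reported for VP8 corresponds to skipping the re-anchoring step on resume: the receiver would see 102 followed directly by 40103, a gap of 40,000 sequence numbers.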

If the pause was long enough (15-20 mins), the discontiguous SEQ number of the resumed stream appears to the WebRTC library as a replay attack, and all of the packets for that SSRC are discarded. The browser then reports all of those packets as lost (via the transport-wide congestion control (TCC) feedback in its RTCP), and the JVB’s TCC node reduces the BWE and suspends endpoints. After the problematic SSRC becomes inactive, the packet losses stop, the BWE increases, the allocator reactivates the endpoint, and lost packets are reported again; rinse and repeat.
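The 15-20 minute threshold fits how SRTP receivers reconstruct the packet index from a 16-bit sequence number. RFC 3711 (section 3.3.1) has the receiver guess the rollover counter so that a new packet lands as close as possible to the highest sequence number seen so far, so a forward jump of more than 2^15 is read as a packet from the past, which a replay check then rejects. The sketch below illustrates that guess and a simplified replay check. `estimated_delta`, `accepted` and `minutes_to_exceed_half` are hypothetical helpers, the 64-packet window is the RFC’s minimum (implementations may use a larger one), and the packet rates are assumed values, not measurements.

```python
SEQ_MOD = 1 << 16   # RTP sequence numbers are 16-bit
HALF = 1 << 15      # the rollover guess assumes jumps smaller than this
REPLAY_WINDOW = 64  # assumption: RFC 3711's minimum window; implementations may use more


def estimated_delta(s_l: int, seq: int) -> int:
    """Signed distance from s_l (highest sequence number received so far) to a
    new sequence number, using the rollover-counter guess of RFC 3711, 3.3.1.
    Positive means the packet is taken as newer, negative as older."""
    if s_l < HALF:
        if seq - s_l > HALF:
            return seq - s_l - SEQ_MOD  # guessed to belong to the previous rollover
        return seq - s_l
    if s_l - HALF > seq:
        return seq - s_l + SEQ_MOD      # guessed to belong to the next rollover
    return seq - s_l


def accepted(s_l: int, seq: int, window: int = REPLAY_WINDOW) -> bool:
    """Simplified replay check: packets that appear older than the window are
    dropped. Duplicate detection inside the window is omitted."""
    return estimated_delta(s_l, seq) > -window


def minutes_to_exceed_half(packets_per_second: float) -> float:
    """Approximate pause length after which the forwarded gap reaches 2^15."""
    return HALF / packets_per_second / 60


# Hypothetical gaps on a stream whose last delivered packet had SEQ 1000:
print(accepted(1000, 1000 + 20000))  # True: read as 20000 packets newer
print(accepted(1000, 1000 + 40000))  # False: read as 25536 packets older
# Assumed packet rates (not measured):
print(round(minutes_to_exceed_half(30), 1), round(minutes_to_exceed_half(36), 1))  # 18.2 15.2
```

Because rejected packets never advance `s_l`, the packets that follow on the same SSRC fall into the same “too old” range and are rejected as well, which is consistent with the whole SSRC being discarded.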

If the “corrupted” video belongs to the current speaker, you won’t see any video. If it is one of the thumbnails in stage view, the on-stage video will be degraded and some thumbnails might become inactive. If it is included in tile view, most tiles will show low frame rates and/or switch on and off every few seconds, without ever improving.

In larger meetings, if at least one person is viewing 5x5 tiles (so that every participant’s SSRC is being requested by at least one endpoint), then participants viewing stage mode will have approximately 6 videos active (in the filmstrip) and 19 videos inactive and therefore vulnerable to this issue. After 20 minutes, any of those participants who switches to tile mode will experience the issue. Alternatively, if one of the inactive participants speaks, they become a recent speaker and appear in the other participants’ filmstrips, which triggers the issue.
