[jitsi-dev] AV Sync/LipSyncHack Questions


#1

The VideoBridge by default seems to split user Audio and Video into two
separate MediaStreams so that they can be independently removed and
republished.

I know that you guys have produced the LipSyncHack, and my layman
understanding of this is that it produces blank VP8 video frames and
rewrites the RTP packets and sequence numbers to independently synchronize
the audio and video streams.

For one, I'm interested in exactly how the above effect is actually
achieved, and whether there is any documentation describing how it
works.

My second question is, will the audio and video frames first arrive from
the published MediaStreams in sync, or is it possible that there is a delay
between both that needs to be resolved?

I.e., I need to assume the Audio and Video streams that are independently
published start in sync, and track changes in their PTS to ensure they
remain in sync.

Can I rely on the fact that they begin in sync, given they are being
produced by two separate MediaStreams?

At the moment my process is to obtain the video PTS and the audio PTS
from each independent stream, zero them out, and render according to
those PTS values.
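
In rough Java, that looks something like the sketch below (made-up
names, not my real code, just to make the assumption explicit):

    // Simplified sketch: render each stream relative to the first RTP
    // timestamp it produces. This only works if the first audio frame and
    // the first video frame correspond to the same wall-clock instant.
    class NaiveRenderer {
        private long audioBasePts = -1;
        private long videoBasePts = -1;

        void onAudioFrame(long rtpTimestamp) {
            if (audioBasePts < 0) audioBasePts = rtpTimestamp;
            long ptsMs = (rtpTimestamp - audioBasePts) * 1000 / 48000; // 48 kHz audio clock
            // ... schedule audio playout at ptsMs ...
        }

        void onVideoFrame(long rtpTimestamp) {
            if (videoBasePts < 0) videoBasePts = rtpTimestamp;
            long ptsMs = (rtpTimestamp - videoBasePts) * 1000 / 90000; // 90 kHz video clock
            // ... schedule video playout at ptsMs ...
        }
    }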

Is there any other consideration I should make with RTCP, or with respect
to the LipSyncHack, in order to ensure these streams stay in sync?

My third question is: why, when the LipSyncHack is enabled, do I seem to
receive a couple of audio frames at the very start of the stream with an
RTP timestamp of '0', followed by what looks like a normal PTS?

Thanks in advance for any help understanding this,
    Jason Thomas


#2

Hi Jason,

The VideoBridge by default seems to split user Audio and Video into two separate MediaStreams so that they can be independently removed and republished.

I know that you guys have produced the LipSyncHack, and my layman understanding of this is that it produces blank VP8 video frames and rewrites the RTP packets and sequence numbers to independently synchronize the audio and video streams.

LipSyncHack is only necessary to work around a bug in WebRTC. WebRTC will attempt to synchronize the playback of an audio and a video stream if they have the same msid (or the same cname, I'm not sure which one it is). The bug is that there is no audio playback until the receiver receives frames on the video stream. So if the sender has turned off their camera, or the bridge is not forwarding video packets to the receiver, the audio isn't played back.

LipSyncHack works around this bug by injecting video frames, to make sure that some reach the receiver. It is not necessary for synchronizing streams in general.
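
Very roughly, the packet rewriting this implies looks like the sketch below (a conceptual sketch in Java, not the actual jitsi-videobridge code): once extra packets have been injected into the video stream, the sequence numbers of the real packets that follow need to be shifted so the receiver still sees a contiguous stream.

    // Conceptual sketch only (not the actual LipSyncHack implementation).
    class SequenceNumberShifter {
        // Number of packets injected into the stream so far.
        private int injectedCount = 0;

        void onPacketInjected() {
            injectedCount++;
        }

        // Sequence number to write on a real (non-injected) packet.
        int rewriteSequenceNumber(int originalSeqNum) {
            // RTP sequence numbers are 16-bit, so wrap around.
            return (originalSeqNum + injectedCount) & 0xFFFF;
        }
    }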

For one, I'm interested in exactly how the above effect is actually achieved, and whether there is any documentation describing how it works.

My second question is, will the audio and video frames first arrive from the published MediaStreams in sync, or is it possible that there is a delay between both that needs to be resolved?

I.e., I need to assume the Audio and Video streams that are independently published start in sync, and track changes in their PTS to ensure they remain in sync.

Can I rely on the fact that they begin in sync, given they are being produced by two separate MediaStreams?

At the moment my process is to obtain the video PTS and the audio PTS from each independent stream, zero them out, and render according to those PTS values.

Is there any other consideration I should make with RTCP, or with respect to the LipSyncHack, in order to ensure these streams stay in sync?

RTP streams have independent timestamps. In order to sync two streams (assuming they come from the same source), RTCP Sender Reports are used, which contain a mapping between an RTP timestamp and an NTP timestamp (which represents the sender's wall clock at the same instant).
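
In code, the mapping is simple arithmetic. A minimal sketch (assumed names, not any particular library's API):

    // Sketch with assumed names; real code must also handle 32-bit RTP
    // timestamp wraparound and NTP-format conversion.
    class RtpToWallClock {
        private final double clockRate;  // e.g. 48000 for audio, 90000 for video
        private long srRtpTimestamp;     // RTP timestamp from the last Sender Report
        private double srWallClockMs;    // NTP time from the same report, in milliseconds
        private boolean haveSr = false;

        RtpToWallClock(double clockRate) {
            this.clockRate = clockRate;
        }

        void onSenderReport(long rtpTimestamp, double ntpTimeMs) {
            srRtpTimestamp = rtpTimestamp;
            srWallClockMs = ntpTimeMs;
            haveSr = true;
        }

        // Sender wall-clock time (ms) at which this RTP timestamp was sampled,
        // or -1 if no Sender Report has been received yet.
        double wallClockMs(long rtpTimestamp) {
            if (!haveSr) return -1;
            return srWallClockMs + (rtpTimestamp - srRtpTimestamp) * 1000.0 / clockRate;
        }
    }

With one such mapping per stream, you can compute a sender wall-clock time for every audio and video frame and schedule playout so that frames with the same wall-clock time are presented together; that is what removes the arbitrary offset between the two streams' RTP timestamps.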

My third question is: why, when the LipSyncHack is enabled, do I seem to receive a couple of audio frames at the very start of the stream with an RTP timestamp of '0', followed by what looks like a normal PTS?

This seems weird to me, I don't know why it happens.

I hope this helps,
Boris



#3

Thanks Boris for the clarification, especially around the LipSyncHack; it
makes things a lot clearer.

I'll see if there is a way to hook into the RTCP sender reports inside
libwebrtc.

Cheers,

- Jason Thomas.

