Major audio problems/choppiness when using RecorderRtpImpl


#1

Because we have some unique requirements (we want to capture individual audio streams as well as RTCP SenderReports, so that we can do additional per-participant processing while also capturing the RTP/NTP time-sync information on each stream), we are trying to leverage the (now deprecated) RecorderRtpImpl. We finally got it to work end-to-end; however, we are running into some really significant audio issues.

The audio streams are getting persisted; however, the audio files are extremely choppy, almost as if a few milliseconds of audio had been inserted between every 20 ms frame. I am wondering whether this has something to do with one of the following:

  • The SilenceEffect and/or ActiveSpeakerDetector codecs/effects
  • The Muxer being used to output the stream
  • Threading/synchronization issues
  • JMF weirdness

When we disabled the SilenceEffect, the problem appeared to get a little better, but the quality of the saved audio streams is still quite bad (almost incomprehensible). I tried writing out the linear audio samples within the doProcess method of the SilenceEffect, just to see whether the problem exists further upstream in the JMF graph. However, the audio we got was way worse, producing a WAV file roughly 10x larger than the file normally generated via the DataSink. This seems really strange, because when we log the Buffers coming into doProcess I don't see any dropped frames, and the RTP timestamps on each buffer advance by the expected 20 ms (960 samples / 1920 bytes). So what would explain the extremely stretched-out debug output generated by dumping the bytes of each Buffer/frame coming into the SilenceEffect?
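One thing I still want to rule out on the dump itself: JMF can call a codec's process()/doProcess() more than once with the same input Buffer (for example when the codec returns INPUT_BUFFER_NOT_CONSUMED), so a naive tap inside doProcess can record duplicated frames and come out much larger than the real stream. For reference, the tap we used was roughly the following simplified sketch (PcmDebugTap is just a name for it, assuming byte[] Buffers of 16-bit, 48000 Hz, mono PCM):

```java
import java.io.FileOutputStream;
import java.io.IOException;

import javax.media.Buffer;

// Hypothetical debug tap (not part of libjitsi): append the raw bytes of
// every Buffer passing through doProcess to a file, for inspection as
// headerless PCM (signed 16-bit, 48000 Hz, mono in our case).
public class PcmDebugTap {
    private final FileOutputStream out;

    public PcmDebugTap(String path) throws IOException {
        out = new FileOutputStream(path, true /* append */);
    }

    public void dump(Buffer buf) throws IOException {
        Object data = buf.getData();
        if (data instanceof byte[])
            out.write((byte[]) data, buf.getOffset(), buf.getLength());
    }
}
```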

Another theory we have is that the problem is caused by the Muxer created via JMF in the custom DataSink produced in RecorderRtpImpl. Because we wanted to adjust the Buffer timestamps in our DataSink, we needed a PushBufferDataSource, and therefore couldn't use the WAV muxer. So we set the ContentDescriptor to "raw" in order to get the RawBufferMux. I don't see how this should cause any issues (especially since there appear to be problems upstream), but I wonder whether it is somehow blocking, or whether the CircularBuffer is overwriting frames before they are written out to the file.
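For context, the wiring looks roughly like this (simplified from our DataSink code; RawMuxWiring is just a name for the sketch, and the Processor state transitions are elided):

```java
import javax.media.DataSink;
import javax.media.Manager;
import javax.media.MediaLocator;
import javax.media.Processor;
import javax.media.protocol.ContentDescriptor;
import javax.media.protocol.DataSource;

public class RawMuxWiring {
    /**
     * Wires a Processor to a file DataSink via the RAW content descriptor,
     * so that JMF selects the RawBufferMux (which exposes a
     * PushBufferDataSource whose Buffer timestamps we can rewrite).
     * State handling is elided: in real code, setContentDescriptor must be
     * called while the Processor is Configured, and getDataOutput only
     * after it has been Realized.
     */
    public static DataSink wire(Processor processor, String fileUrl)
            throws Exception {
        processor.setContentDescriptor(
                new ContentDescriptor(ContentDescriptor.RAW));
        // ... configure/realize transitions omitted ...
        DataSource output = processor.getDataOutput();
        DataSink sink =
                Manager.createDataSink(output, new MediaLocator(fileUrl));
        sink.open();
        sink.start();
        return sink;
    }
}
```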

Another theory is that there are synchronization issues, either causing inadvertent blocking or causing threads to somehow affect each other's data. This doesn't seem likely, however, since we really haven't changed the general flow of RecorderRtpImpl (it's essentially the same code, with just a few small changes related to the DataSink and to post-processing the timestamps).

My last theory is that RecorderRtpImpl is no longer compatible with the existing Jitsi codebase, and that these significant audio issues are somehow related to the older JMF code. We mentioned using RecorderRtpImpl on the Jitsi community call, and the feedback we got was that it should generally work in an "audio-only" context. We would use Jibri or Jigasi instead, but we need access to the RTPTranslator in order to intercept RTCP SenderReport packets, whose timing data we rely on for some custom synchronization and audio post-processing requirements within our application.

If there somehow is a significant issue with the now-deprecated RecorderRtpImpl, I wonder whether there might be a work-around? For instance, would it be possible to still intercept RTCP senderReports, but use a different approach for capturing streams? Could we possibly use the AudioMixerMediaDevice or AudioSilenceMediaDevice to capture individual streams (via the ReceiveStreamBufferListener) but still get access to RTCP SR packets to capture timing information?
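For what it's worth, the only thing we need from each SR is the NTP/RTP timestamp pair, and parsing that out of a raw RTCP packet is straightforward (RFC 3550 §6.4.1). A minimal sketch, assuming the byte[] starts at the header of a single SR (SenderReportInfo is a hypothetical name):

```java
import java.nio.ByteBuffer;

// Minimal parse of an RTCP Sender Report (PT = 200) to extract the
// NTP/RTP timestamp pair used for cross-stream synchronization.
public class SenderReportInfo {
    public final long ssrc;
    public final long ntpTimestamp; // 64-bit NTP time (MSW + LSW)
    public final long rtpTimestamp; // 32-bit RTP time at the same instant

    public SenderReportInfo(byte[] packet) {
        ByteBuffer buf = ByteBuffer.wrap(packet); // network byte order
        int header = buf.getInt();                // V/P/RC, PT, length
        int payloadType = (header >> 16) & 0xFF;
        if (payloadType != 200)
            throw new IllegalArgumentException("not an RTCP SR");
        ssrc = buf.getInt() & 0xFFFFFFFFL;        // sender SSRC
        ntpTimestamp = buf.getLong();
        rtpTimestamp = buf.getInt() & 0xFFFFFFFFL;
        // sender's packet/octet counts and report blocks follow
    }
}
```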

We have been spending weeks trying to get this to work reliably, so any help or feedback you can provide would be very appreciated!

Thanks in advance for all your help!

Best,

Paul

@damencho @Boris_Grozev


#2

After further debugging, this seems like it may be related to dropped packets. It happens even after I've disabled the SilenceEffect and the ActiveSpeakerDetector. That's surprising, given that we're only trying to capture audio and we've never noticed dropped packets before.

Could this somehow be related to the transcoding from Opus/rtp to Linear/48000? That doesn't seem likely, since we've been using Opus all along. Could there be something in the RecorderRtpImpl that is blocking upstream threads (which in turn causes packets to be dropped)?

In our tests using Jigasi and the RecorderImpl, we never noticed any dropped packets, so I'm wondering whether this is a side effect of the older JMF/FMJ code in the RecorderRtpImpl.

It would be really helpful if anyone has any advice on:

  • How to debug the root cause of the dropped packets.
  • Whether a simple workaround, such as increasing the jitter buffer size or switching to a different codec, might help.
  • Whether there’s a way to change the RecorderRtpImpl to use the newer neomedia approaches instead of the older JMF/FMJ style. For instance, could we use AudioSilenceMediaDevice/AudioMixerMediaDevice together with a ReceiveStreamBufferListener (see the sketch below)?
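To illustrate the last point, here is a sketch of the listener side, based on the ReceiveStreamBufferListener interface as it's used in Jigasi's transcription code; the package and the registration call on the stream/device may differ between libjitsi versions, and PerStreamCapture is just a placeholder name:

```java
import javax.media.Buffer;
import javax.media.rtp.ReceiveStream;

import org.jitsi.service.neomedia.ReceiveStreamBufferListener;

// Placeholder listener: route each received Buffer to per-SSRC handling,
// while RTCP SR interception (for the NTP/RTP timing) happens separately.
public class PerStreamCapture implements ReceiveStreamBufferListener {
    @Override
    public void bufferReceived(ReceiveStream receiveStream, Buffer buffer) {
        long ssrc = receiveStream.getSSRC() & 0xFFFFFFFFL;
        // e.g. look up a per-participant writer keyed by SSRC and append
        // buffer.getData() over [offset, offset + length) to it
        System.out.println(
                "SSRC " + ssrc + ": " + buffer.getLength() + " bytes");
    }
}
```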

Thanks again for all your help!

Best,

Paul


#3

Have you been able to identify where exactly the packets are being dropped?


#4

It seems like this stems from the transcoding process. Significantly increasing the jitter buffer helps a lot, but I think we may need to move the transcoding from Opus to linear audio into a post-processing step outside of Jitsi (effectively recording in Opus rather than in Linear/48000 audio).

We still want to keep the two effects in the RecorderRtpImpl active (the SilenceEffect and the ActiveSpeakerDetector). These currently require linear audio as their supported input format; however, from what I can tell, they should both be able to work with basically any format, right? The SilenceEffect is just looking for gaps in the RTP packet sequence (see the sketch below), and the ActiveSpeakerDetector could just use the audio-level data encoded in the RTP packets. Does this strategy make sense? Is there any problem with changing the ActiveSpeakerDetector's supported input format to Opus, or to any audio format?
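To make the first claim concrete: detecting gaps from RTP sequence numbers alone is format-agnostic, something like the sketch below (RtpGapDetector is a hypothetical name; whether the effect can then fill the gaps without decoding is the open question):

```java
// Count missing packets purely from RTP sequence numbers (mod 2^16),
// without ever looking at the payload format.
public class RtpGapDetector {
    private int lastSeq = -1;

    /** Returns the number of packets missing before this one. */
    public int onPacket(int seq) {
        int missing = 0;
        if (lastSeq >= 0) {
            int delta = (seq - lastSeq) & 0xFFFF; // wrap-around safe
            missing = Math.max(0, delta - 1);
        }
        lastSeq = seq;
        return missing;
    }
}
```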

Thanks for your help!


#5

Hey Paul,

Our active speaker detection only works with audio levels. I don't remember why we chose to calculate them ourselves instead of reading them from the RTP header extension, but that should be easy to do.
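Reading the level from the header extension is trivial once you know the negotiated extension ID for urn:ietf:params:rtp-hdrext:ssrc-audio-level: the RFC 6464 element is a single byte, with the voice-activity flag in the MSB and the level as -dBov in the low 7 bits (0 = loudest, 127 = silence). A sketch (SsrcAudioLevel is just a placeholder name):

```java
// Decode the one-byte client-to-mixer audio level element (RFC 6464).
public final class SsrcAudioLevel {
    /** @param extensionByte the single payload byte of the hdrext element */
    public static int levelDbov(byte extensionByte) {
        return extensionByte & 0x7F; // level, in -dBov
    }

    public static boolean voiceActivity(byte extensionByte) {
        return (extensionByte & 0x80) != 0; // the V flag
    }
}
```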

The silence effect that we have is different: it only works with raw audio. I don't know if doing the same is even possible in an encoded Opus stream (can you insert your own frame into the stream without breaking it?), but even if you could, you would be losing out on the FEC and PLC that the Opus decoder does. I think ideally you would save all the Opus frames you receive, together with their timestamps, and let the Opus decoder handle them later. But this is all just speculation on my part; I don't know whether it's supported by the .opus format, or what the recommendations are.
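If you went that route, even a simple length-prefixed dump of (timestamp, frame) pairs would do as an intermediate format; a post-processing step would re-read the frames, let libopus decode them (with FEC/PLC), and decide how to handle the gaps. To be clear, this is an ad-hoc sketch, emphatically not the .opus/Ogg container:

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Ad-hoc dump of received Opus frames with their RTP timestamps
// (48 kHz RTP clock), for decoding in a later post-processing step.
public class OpusFrameDump implements AutoCloseable {
    private final DataOutputStream out;

    public OpusFrameDump(String path) throws IOException {
        out = new DataOutputStream(new FileOutputStream(path));
    }

    public void write(long rtpTimestamp, byte[] opusFrame)
            throws IOException {
        out.writeLong(rtpTimestamp);
        out.writeInt(opusFrame.length);
        out.write(opusFrame);
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```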

Regards,

Boris


#6

I think we have code that measures audio levels in the case of a mixer. The mixer decodes the audio, mixes it, and pushes CSRC audio levels into the stream, so that the other participants can show audio levels even when they receive only a single mixed stream.
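For illustration, measuring a level from decoded 16-bit PCM boils down to an RMS calculation expressed in dBov, something like this rough sketch (the actual libjitsi code may differ in detail):

```java
// Approximate audio level of a block of signed 16-bit PCM samples,
// as dBov: 0 at full scale, increasingly negative toward silence.
public class AudioLevelMeter {
    public static double levelDbov(short[] samples) {
        double sumSquares = 0;
        for (short s : samples)
            sumSquares += (double) s * s;
        double rms = Math.sqrt(sumSquares / Math.max(1, samples.length));
        return 20 * Math.log10(Math.max(rms, 1) / 32768.0);
    }
}
```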