Implementing RecorderRtpImpl in NLJ

As we discussed on the last Jitsi Community Call, we are planning to port our JVB-based Recorder implementation (based on the older RecorderRtpImpl code) to use NLJ and Jitsi-RTP. Since the Recorder implementation relies on JMF/FMJ, we are looking for some direction on how the architecture will change (such as examples for refactoring JMF constructs to use the new NLJ / Jitsi-RTP constructs).

We appreciate any direction or advice so that we can start working towards migrating to NLJ.

Thanks in advance for your help,

Paul

@bbaldino @Boris_Grozev

Hey Paul,
Happy to help. Could you lay out the requirements you have for recording? The RecorderRtpImpl code is a bit before my time, but also it may be worthwhile to start fresh with what you guys need rather than think of things in terms of the old implementation (once we know what you need, we can still think about the new implementation in terms of the old, but as far as requirements I think it’d be good to start ‘fresh’).
For example, are you looking to record all audio and video streams separately? Or do you want an experience similar to what a participant in the call would see (mixed audio, active speaker switching), etc.

Hi @bbaldino.

How can I record all audio and video streams separately?
I need this for my job.

Thanks

It’s easy to create a Node which captures all media data and insert it into the receive pipeline for each endpoint, but what to do from there will depend on what sort of result you’re looking for (like I mentioned in the above post).

How would I do that? Could you guide me or point me to an example?

Take a look at the PcapWriter node for an example of something that grabs passing data. The receiver pipeline is built here. Adding a PcapWriter would just involve calling node(PcapWriter()) in that pipeline wherever you wanted it to be.
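
To make that concrete, a minimal ‘tap’ node might look something like the sketch below. The ObserverNode/PacketInfo names come from NLJ, but treat the exact signatures as assumptions rather than the real API.

```kotlin
import org.jitsi.nlj.PacketInfo
import org.jitsi.nlj.transform.node.ObserverNode

// Sketch only: ObserverNode/PacketInfo are the NLJ names, but treat the exact
// signatures here as assumptions. The node hands each passing packet to a
// callback and lets it continue down the pipeline untouched.
class RecorderTapNode(
    private val onPacket: (PacketInfo) -> Unit
) : ObserverNode("Recorder tap") {
    override fun observe(packetInfo: PacketInfo) {
        onPacket(packetInfo)
    }
}
```

Inserting it would then just be a matter of adding node(RecorderTapNode { /* hand off to the recorder */ }) at the point in the pipeline builder you want to tap, same as with PcapWriter.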

Hey Brian,

Thanks for your help with this. I agree that starting fresh would be the best approach. Our goal is to record all audio and video streams separately, but we are using the Synchronizer to capture RTCP SR packets so that we can synchronize the audio and video streams and correlate RTP timestamps in participant streams to JVB time. We are also capturing audio levels from RTP packets to support active speaker detection (with the goal of capturing speaker changes so that we can persist this metadata). We are currently transcoding audio streams from Opus to PCM, but we eventually want to remove this and avoid the overhead of transcoding. However, this was the original approach in the RecorderRtpImpl class, as there is a SilenceEffect that requires PCM audio (and I think the ActiveSpeakerDetector for the Recorder also relies on pulling the audio levels from the PCM samples rather than the RTP packet header).
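
For context, the correlation we need from the SRs boils down to arithmetic along these lines (a plain sketch with no library APIs; the clock rates are just the usual ones for Opus audio and video):

```kotlin
// Plain-arithmetic sketch of the SR-based correlation described above: an RTCP SR
// pairs an RTP timestamp with a wallclock instant for its SSRC, so later RTP
// timestamps from that SSRC can be mapped to wallclock time. (32-bit RTP timestamp
// wrap-around is ignored here for brevity.)
data class SrMapping(
    val wallclockMs: Long,   // wallclock time carried in the SR (NTP, converted to ms)
    val rtpTimestamp: Long,  // RTP timestamp carried in the same SR
    val clockRate: Int       // e.g. 48000 for Opus audio, 90000 for video
)

fun rtpToWallclockMs(rtpTimestamp: Long, sr: SrMapping): Long {
    val deltaTicks = rtpTimestamp - sr.rtpTimestamp
    return sr.wallclockMs + deltaTicks * 1000 / sr.clockRate
}
```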

We have also increased the jitterBuffer quite a bit for the recorder in order to reduce the risk of dropped packets (we aim to maximize the quality/stability of the recorded streams — even at the expense of increased latency). However, it seems like we can lose the last few seconds of a recording due to some packets still being stuck in the jitterBuffer at the time a recording ends. So, if you have any recommendations for preventing this type of scenario, that would be really helpful as well.

Thanks again for all your help and suggestions. I’ll take a look at the PcapWriter node to get started.

Hey Paul,
So I’ve got a couple thoughts on how to do this, and again I’m coming from a place pretty ignorant of the RecorderRtpImpl so sorry if some of it doesn’t make sense.

I’m thinking you’d have a central recorder class. You’d create one of these per conference and it’d be responsible for doing most of the heavy lifting of the recording (or, at least, be the main ‘controller’ of executing it–transcoding, writing to disk, etc.). You’d have a node implementation which gets inserted in the receive pipeline on the RTP path after decrypt: this gets you access to all the incoming media (you could do a single one for both audio and video, say after decrypt here, or separate ones for audio and video here and here). I think the jitter buffer would live in the node–but note it doesn’t have to delay the packets from moving through the rest of the pipeline. The node would be passed the instance of the recorder class so it could pass the data there. The recorder class could also subscribe to incoming RTCP from a Transceiver via its RtcpEventNotifier; this is how you can get the SRs.
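
Very roughly, I’m picturing something like this sketch; the PacketInfo/RtcpSrPacket types come from NLJ/jitsi-rtp, but how exactly you register the RTCP listener and index the streams is hand-waved here and should be treated as an assumption:

```kotlin
import org.jitsi.nlj.PacketInfo
import org.jitsi.rtp.rtcp.RtcpSrPacket

// Per-conference recorder sketch. The tap node(s) in each endpoint's receive
// pipeline call acceptMedia(); a listener registered on each Transceiver's
// RtcpEventNotifier calls acceptSr(). Exact registration APIs and types are
// assumptions here.
class ConferenceRecorder {
    fun acceptMedia(packetInfo: PacketInfo) {
        // Look up the per-SSRC state (jitter buffer, writer) and push the packet in.
    }

    fun acceptSr(sr: RtcpSrPacket) {
        // Store the RTP-timestamp <-> wallclock mapping for the sending SSRC,
        // used later to line the streams up.
    }
}
```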

You’ll want to make sure you do the various work that needs to be done in the proper thread pool context: jvb 2.0 has separate thread pools for CPU-bound, IO-bound and scheduled work.
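
For example (TaskPools here stands for the bridge’s holder of those executors; treat the exact location and pool names as assumptions), you’d hand disk writes to the IO pool rather than doing them on a pipeline thread:

```kotlin
import org.jitsi.videobridge.util.TaskPools

// Sketch: keep disk writes off the packet-processing threads by handing them to
// the bridge's IO-bound pool. The TaskPools location and pool names are assumed.
fun scheduleFlush(chunk: ByteArray, appendToRecording: (ByteArray) -> Unit) {
    TaskPools.IO_POOL.execute {
        appendToRecording(chunk)
    }
}
```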

As for the issue of losing the last few seconds of the recording: when the receiver is shut down we execute a teardown visitor through the pipeline, which synchronously calls stop on each node, so I think this should give you a chance to flush any packets that hadn’t been recorded yet.
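
So, if the recorder node is what holds buffered packets, overriding stop is a natural flush point. A sketch, building on the tap-node idea above; drainToRecorder is a hypothetical hand-off and the exact override mechanics are assumptions:

```kotlin
import org.jitsi.nlj.PacketInfo
import org.jitsi.nlj.transform.node.ObserverNode

// Sketch: stop() is invoked synchronously by the teardown visitor, so anything the
// node still has buffered can be drained there. drainToRecorder is a hypothetical
// hand-off to the recording code, and the exact override mechanics are assumptions.
class BufferingRecorderNode(
    private val drainToRecorder: (List<PacketInfo>) -> Unit
) : ObserverNode("Recorder tap") {
    private val buffered = mutableListOf<PacketInfo>()

    override fun observe(packetInfo: PacketInfo) {
        synchronized(buffered) { buffered.add(packetInfo) }
    }

    override fun stop() {
        synchronized(buffered) {
            drainToRecorder(buffered.toList())
            buffered.clear()
        }
        super.stop()
    }
}
```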

Hopefully that made sense and works as a starting point for the discussion.

Hey Brian,

Thank you so much for your suggestions on this. This all makes sense to me, and it sounds like this refactor should simplify and improve the current Recorder implementation significantly.

I’ll start scaffolding something based on your recommended approach here, and will reply back to this thread with any questions or issues that come up.

Thanks again for all your help!

Best,

Paul

Hey Paul, wanted to check in on how this was going for you guys. Running into any issues?

Hey Brian,

Thanks for following up! We haven’t run into any issues yet, but we’re going to be doing a bunch of tests later today — so I’m guessing we’ll have a few questions by Monday (and maybe we can discuss during the Jitsi Call, if that works).

Thanks again for all your help!

Best,

Paul

Hey Paul, I was thinking more about the endpoint solution, and it might be a bit awkward for doing the jitter buffer work, at least–it feels like it’d be cleaner to have that on the receive side (though maybe not impossible to do in the pseudo-endpoint scheme).

I think the main risk of the pipeline method is that it perhaps has a larger ‘surface’ than the pseudo-endpoint method, in the sense that it may be more likely to require changes in the event of code changes in JMT–but I’d consider most pieces in this area (the Node API, the way you tap into RTCP packets/events, etc.) reasonably stable and, if they were to change, I’d expect the changes to be mostly superficial.

The lack of dynamic Node injection is also maybe not the most elegant here, though I’d say it’s far from an impossible task: the hardest aspect of it is figuring out how to describe ‘where’ a Node should be inserted, but the visitor framework we already have wouldn’t make it too hard to implement one of a couple schemes (using a Node’s name or class type, for example). And, all that being said, always inserting the recorder node and only enabling it via state isn’t the worst thing.
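
The ‘enable via state’ variant is about as simple as it sounds, e.g. (again a sketch against the assumed ObserverNode shape):

```kotlin
import org.jitsi.nlj.PacketInfo
import org.jitsi.nlj.transform.node.ObserverNode

// Sketch of the "always inserted, enabled via state" option: the node sits in every
// receive pipeline but is a no-op until a recording is started. ObserverNode's
// exact shape is an assumption, as above.
class ToggleableRecorderNode(
    private val onPacket: (PacketInfo) -> Unit
) : ObserverNode("Recorder tap") {
    @Volatile
    var recording = false

    override fun observe(packetInfo: PacketInfo) {
        if (recording) onPacket(packetInfo)
    }
}
```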

Just some thoughts on things that came to mind, it’s possible there are other needs of the recorder I’m missing. Feel free to ask any questions here and we can get it figured out.

I’ve also been thinking about this and I came to the opposite conclusion :slight_smile:

How does the Conference approach make implementing a jitter buffer awkward?

The receive pipeline is part of JMT, and inserting Nodes into it from the bridge violates encapsulation. It is error-prone and might be broken by changes in JMT (when we work on JMT in the future, I would rather not have to consider how other users of the library might be modifying the pipeline). Also note that in order to tap into audio, video and RTCP, at least 3 Nodes will need to be inserted.

For the recording use-case there is no reason to plug into the packets in the middle of the receive pipeline. Receiving the packets after they have been processed by JMT is sufficient, and this can be easily done in Conference, which receives all of the packets after they have been processed. It is definitely cleaner this way, in the sense that it will be obvious what the code does.

Implementing an actual AbstractEndpoint might indeed be awkward, but it is not necessary. The recorder could be implemented as a PotentialPacketHandler, the same way that the OctoTentacle is:
https://github.com/jitsi/jitsi-videobridge/blob/master/src/main/java/org/jitsi/videobridge/Conference.java#L1103

That is, Conference would have a Recorder instance and would feed packets to it like it does to the tentacle. This would be the only modification to existing bridge code that is necessary; everything else would be separate recording code.

This solution will also just work with Octo.
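
As a sketch (the wants/send shape mirrors how the handlers in the linked code are fed, but treat the exact interface signatures and package as assumptions):

```kotlin
import org.jitsi.nlj.PacketInfo
import org.jitsi.videobridge.PotentialPacketHandler

// Sketch: Conference keeps a Recorder and offers it packets the same way it offers
// them to the OctoTentacle; packets arrive here after JMT has fully processed them.
// The interface package and exact signatures are assumptions.
class Recorder : PotentialPacketHandler {
    @Volatile
    var recording = false

    override fun wants(packet: PacketInfo): Boolean = recording

    override fun send(packet: PacketInfo) {
        // hand off to the separate recording code (buffering, writing, etc.)
    }
}
```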

Boris

Hey Brian and Boris,

Thank you both so much for all your feedback on these two different approaches! It’s actually really helpful to get a clearer sense of how these different strategies differ, their trade-offs, etc. I think the best approach would be for us to prototype both solutions and then see which is cleaner/more stable for our use-case.

A couple of quick questions: are there different performance characteristics between these two strategies? For instance, do we need to synchronize operations in the conference approach vs the JMT approach? Also, how might the jitter buffer work differ between the two?

Thanks again for all your help!

Best,

Paul

> I’ve also been thinking about this and I came to the opposite conclusion :slight_smile:

> How does the Conference approach make implementing a jitter buffer awkward?

I was thinking that the problem would be simplified if each jitter buffer were entirely independent, each living in its own receive pipeline, as opposed to having a group of jitter buffers which would have to be indexed. Doing them in the receive pipeline also means you get the scope of a single endpoint for free, though it’s not clear to me how important that would be. Maybe more important would be the ease with which they could access other stats about the transmitting endpoint (RTT, for example). I don’t think doing this with the Conference approach would be impossible; it’s just not as clear to me how it would flow.

> The receive pipeline is part of JMT, and inserting Nodes into it from the bridge violates encapsulation.

I don’t see it this way, just because I actually think the pipeline construction probably should be in the bridge, and JMT should just provide the ‘building blocks’ (the nodes) to put a pipeline together. It’s just that in practice this is easier to do in JMT (both because Kotlin makes it easier and because it may be difficult to construct from the higher level in the bridge, where the plumbing might be a pain). I see the point, though, given the current state.

> It is error-prone and might be broken by changes in JMT (when we work on JMT in the future, I would rather not have to consider how other users of the library might be modifying the pipeline). Also note that in order to tap into audio, video and RTCP, at least 3 Nodes will need to be inserted.

I wasn’t too worried about this because the Nodes are so self-contained and the basic premise of a node’s function (packets passing through it) is so unlikely to change that I figured, at worst, they’d be looking at only superficial tweaks to their node implementations and perhaps moving where they’re inserted. Also, we don’t need a Node for RTCP, as we already have the RTCP event notifier to access incoming RTCP.

> For the recording use-case there is no reason to plug into the packets in the middle of the receive pipeline. Receiving the packets after they have been processed by JMT is sufficient, and this can be easily done in Conference, which receives all of the packets after they have been processed. It is definitely cleaner this way, in the sense that it will be obvious what the code does.

I’m not sure I agree that it’s more obvious, but I generally think what you’ve said here is true. Again I’m a little worried that they may need access to more low-level data that’d be more readily accessible in the receiver, but this is speculation, really, as I’m not sure all of what’s needed. If it comes down to nothing but SRs (which we forward already) then it should be fine. If other stuff is needed then it gets a bit trickier.

> Implementing an actual AbstractEndpoint might indeed be awkward, but it is not necessary. The recorder could be implemented as a PotentialPacketHandler, the same way that the OctoTentacle is:
> https://github.com/jitsi/jitsi-videobridge/blob/master/src/main/java/org/jitsi/videobridge/Conference.java#L1103

Yes that’s a good idea.

> That is, Conference would have a Recorder instance and would feed packets to it like it does to the tentacle. This would be the only modification to existing bridge code that is necessary; everything else would be separate recording code.

> This solution will also just work with Octo.

Also a good point. The Octo case is worth thinking about in terms of how you guys want to handle it. Would a single bridge be charged with recording all participants (what if this bridge is removed from the call because all of its participants left, but other bridges are still in the call?), or should each bridge handle the recording of its local participants?

> Hey Brian and Boris,
>
> Thank you both so much for all your feedback on these two different approaches! It’s actually really helpful to get a clearer sense of how these different strategies differ, their trade-offs, etc. I think the best approach would be for us to prototype both solutions and then see which is cleaner/more stable for our use-case.

To be clear, the conference approach has, without a doubt, a much smaller ‘surface’ and is less likely to be affected by changes in the bridge/JMT, so I think Boris’ idea is good, even though I’m less concerned about the brittleness of the other approach than he is. I think your plan here is a good one. Try both a bit–or even just try the conference one first and see if it works out ok; if so, that’d be great (I’d imagine most of the core recording code would be reusable, so there wouldn’t be too much throwaway work if it didn’t).

> A couple of quick questions: are there different performance characteristics between these two strategies? For instance, do we need to synchronize operations in the conference approach vs the JMT approach? Also, how might the jitter buffer work differ between the two?

What is the end result you guys are looking for? Is it a file with all the audio and video? If so, then you’re going to have multiple threads at play at one point or another. I’d probably look at using a queue somewhere where you can safely add items from N threads and then define your own threading model on the other side of it.
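
Something along these lines, for example (plain JDK/Kotlin, nothing bridge-specific): any number of pipeline or Conference threads enqueue, and a single writer thread owns all the file access, so the writer side needs no extra synchronization:

```kotlin
import java.util.concurrent.LinkedBlockingQueue
import kotlin.concurrent.thread
import org.jitsi.nlj.PacketInfo

// Sketch: N producer threads enqueue; one writer thread owns the file, so no
// further locking is needed on the writing side.
class RecordingWriter(private val write: (PacketInfo) -> Unit) {
    private val queue = LinkedBlockingQueue<PacketInfo>()

    private val writerThread = thread(name = "recording-writer", isDaemon = true) {
        try {
            while (true) {
                write(queue.take()) // blocks until a packet is available
            }
        } catch (e: InterruptedException) {
            queue.forEach { write(it) } // stop() was called: drain what's left, then exit
        }
    }

    // Safe to call from any pipeline/Conference thread.
    fun enqueue(packetInfo: PacketInfo) {
        queue.offer(packetInfo)
    }

    fun stop() = writerThread.interrupt()
}
```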

-brian