Server-side background noise filter using Nvidia’s RTX Voice

Check out this short demo video: It uses AI to remove background noise.

It’s branded as “RTX Voice,” but it doesn’t actually need an RTX GPU. I suspect it would work well with the GPUs on popular cloud services, since they’re designed for AI workloads.

Here’s an article on how to implement this without an RTX GPU.

Not sure what the added latency is… might need to buffer a few video frames to keep AV sync.

This is not possible on the server side: it would mean the jvb has to decode and encode audio, whereas it is just a router. This needs to be implemented on the client side, in the browsers.

So it would be possible, but only if the bridge added audio decoding/encoding and accounted for the added latency by delaying the video to keep AV sync. Is that right?

We removed all encoding/decoding/mixing logic from the bridge when moving to jvb2, since it was unused, doesn’t scale, and doesn’t make sense for an SFU.

Just curious: without audio mixing, how is the conversation still so natural? I would have thought at least some mixing was essential for natural conversation, since people tend to overtalk each other at least a little.

Thinking about it more, my guess is that since talker A does hear interrupter B, A knows to pause and let B talk, even though a third-party listener C may not have heard B. Is that right?

Multiple streams: the jvb is a smart network router, and the clients do the decoding and playback.

Hi @damencho,

I checked the jitsi-media-transform documentation and it says “Jitsi Media Transform contains classes for processing and transforming RTP and RTCP packets”.

I also came across a diagram in the presentation video [https://www.youtube.com/watch?v=K63z-VrvU8Y]:

[diagram: jitsi-media-transform shown as part of the videobridge]

So now I understand that jitsi-media-transform is part of the videobridge and handles processing and transforming RTP packets.

Now the question is: is it possible to use jitsi-media-transform to get decoded audio and apply background noise cancellation?

Nope, jvb cannot decode or encode audio or video.

Thank you for the quick response.

Is the below flow possible?
jitsi-media-transform -> decode audio -> apply audio filter -> encode audio -> jitsi-media-transform

@bbaldino, is this possible? Thanks.

Yeah, this is definitely possible; you’d just have to write nodes for that functionality and insert them at the right place in the pipeline, so it would involve changes in jitsi-media-transform.
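For illustration, here’s a minimal sketch of what such an in-pipeline node could look like. All of the types here (AudioCodec, NoiseFilter, DenoiseNode) are hypothetical stand-ins, not actual jitsi-media-transform classes; the real pipeline passes PacketInfo objects through Node subclasses, so the real signatures would differ:

```kotlin
// Hypothetical sketch: AudioCodec, NoiseFilter and DenoiseNode are stand-ins,
// not real jitsi-media-transform classes. The real pipeline operates on
// PacketInfo objects flowing through Node subclasses.

interface AudioCodec {
    fun decode(payload: ByteArray): ShortArray  // compressed payload -> PCM samples
    fun encode(pcm: ShortArray): ByteArray      // PCM samples -> compressed payload
}

fun interface NoiseFilter {
    fun process(pcm: ShortArray): ShortArray    // e.g. an RNNoise-style denoiser
}

/** A transformer-style node: decode -> denoise -> re-encode the RTP audio payload. */
class DenoiseNode(
    private val codec: AudioCodec,
    private val filter: NoiseFilter,
) {
    fun transform(rtpPayload: ByteArray): ByteArray =
        codec.encode(filter.process(codec.decode(rtpPayload)))
}
```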

If you’re asking about doing this externally to jitsi-media-transform, then you’d have to write a node which forwards the data over a local socket or some other IPC mechanism for you to process, and a node which reads the processed data back. We don’t have anything like that in the code today, but it would be doable.
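A minimal sketch of what that could look like, assuming length-prefixed frames over a local TCP socket (again, nothing like this exists in the code today, and in a real pipeline the send and receive sides would be two separate nodes):

```kotlin
import java.io.DataInputStream
import java.io.DataOutputStream
import java.net.Socket

// Hypothetical sketch: ships audio payloads to an external denoiser process
// over a local TCP socket and reads the processed payloads back. In a real
// pipeline, send() and receive() would live in two separate nodes.
class ExternalAudioProcessor(host: String = "127.0.0.1", port: Int = 5555) {
    private val socket = Socket(host, port)
    private val output = DataOutputStream(socket.getOutputStream())
    private val input = DataInputStream(socket.getInputStream())

    /** Forwarding node: write one length-prefixed payload to the external process. */
    fun send(payload: ByteArray) {
        output.writeInt(payload.size)
        output.write(payload)
        output.flush()
    }

    /** Reading node: block until the processed, length-prefixed payload returns. */
    fun receive(): ByteArray {
        val len = input.readInt()
        return ByteArray(len).also { input.readFully(it) }
    }
}
```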


Thank you for the suggestions for both local and external processing.

I have one more query: if end-to-end encryption or SRTP is enabled, will jitsi-media-transform still be able to decrypt the payload so the audio can be processed?

If end-to-end encryption is enabled you need the key to decrypt, so the answer is no: you will not be able to process the audio payload. SRTP, on the other hand, is always enabled on WebRTC streams, and the bridge can decrypt it by default, because the secure connection is established between the endpoint and the bridge; the media then needs to be re-encrypted when it is sent on to another endpoint.


That’s the exact answer I was looking for. Thank you.

Let’s take the SRTP case. I see that jitsi-media-transform has nodes for decryption and encryption.

So the handling of different keys for different endpoints is already taken care of in jitsi-media-transform?

endpoint1 -> jitsi-media-transform [ decrypt(key1) -> media transform -> encrypt(key2) ] -> endpoint2
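To make the question concrete, here is a schematic of that flow. SrtpContext is a placeholder with fake XOR “crypto,” not a real SRTP implementation; it only marks where the two per-leg keys would apply:

```kotlin
// Schematic only: SrtpContext is a placeholder, not a real SRTP
// implementation; the XOR "cipher" just marks where per-leg keys apply.
class SrtpContext(private val key: ByteArray) {
    fun decrypt(packet: ByteArray): ByteArray = xorWithKey(packet)
    fun encrypt(payload: ByteArray): ByteArray = xorWithKey(payload)
    private fun xorWithKey(data: ByteArray) = ByteArray(data.size) { i ->
        (data[i].toInt() xor key[i % key.size].toInt()).toByte()
    }
}

/** Decrypt with endpoint1's key, transform, re-encrypt with endpoint2's key. */
fun relay(packet: ByteArray, from: SrtpContext, to: SrtpContext): ByteArray {
    val plain = from.decrypt(packet)  // decrypt(key1)
    val processed = plain             // media transform (e.g. the denoise node) here
    return to.encrypt(processed)      // encrypt(key2)
}
```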