Using Mozilla DeepSpeech for automatic speech-to-text

Hello,

I plan to integrate the Mozilla DeepSpeech project into Jigasi to enable automatic speech-to-text. I am aware of the work @Nik_V has done on enabling automatic speech-to-text with the Google Speech API.

My plan is to use the Java bindings provided by Mozilla DeepSpeech (https://deepspeech.readthedocs.io/en/v0.7.0/Java-API.html) and implement the transcription service interface that already exists in Jigasi. The bindings offer a streaming interface, which broadly matches what Jigasi needs.
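To make this concrete, here is a rough, untested sketch of how I expect to drive the bindings, based on the v0.7 Java API docs linked above (the model path is a placeholder):

```java
import org.mozilla.deepspeech.libdeepspeech.DeepSpeechModel;

public class DeepSpeechHello
{
    public static void main(String[] args)
    {
        // Placeholder path; the .pbmm acoustic model ships with each DeepSpeech release.
        DeepSpeechModel model = new DeepSpeechModel("/path/to/deepspeech-0.7.0-models.pbmm");

        // One second of silence: 16 kHz, 16-bit, mono, as the model expects.
        short[] audio = new short[16000];

        // Single-shot recognition; the streaming variant would use
        // createStream()/feedAudioContent()/finishStream() instead.
        System.out.println(model.stt(audio, audio.length));

        model.freeModel();
    }
}
```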

Any comments, feedback or ideas to help with this?

I also have a more specific question: the Java bindings of Mozilla DeepSpeech need 16-bit, mono raw audio sampled at 16 kHz (I assume raw means linear PCM encoding). I guess that the audio arriving here https://github.com/jitsi/jigasi/blob/master/src/main/java/org/jitsi/jigasi/transcription/TranscriptionRequest.java is Opus-encoded. Is there an established way in the Jitsi codebase to get raw output from the Opus audio? I might also need to resample it to 16 kHz.

If I recall correctly, the audio format should already be linear (you can verify by setting a breakpoint here: https://github.com/jitsi/jigasi/blob/309d62ad757b4492f537ab150e7f7fce72a0f0ae/src/main/java/org/jitsi/jigasi/transcription/Participant.java#L560)

However, looking at the DeepSpeech model, it requires an array of shorts. Have a look at https://github.com/jitsi/jitsi-webrtc-vad-wrapper/blob/master/src/main/java/org/jitsi/webrtcvadwrapper/audio/ByteSignedPcmAudioSegment.java, which is used to convert the audio to the 16-bit PCM required by the VAD we use to filter out silence. It does not, however, resample the 48 kHz audio to 16 kHz. I think it’s OK to simply replace int[] with short[].
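For illustration, here is a rough, untested sketch of both steps (it assumes the audio is 16-bit little-endian signed PCM, and the decimation is naive: a proper resampler would low-pass filter first to avoid aliasing):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class PcmUtil
{
    /** Converts 16-bit little-endian signed PCM bytes to shorts. */
    public static short[] toShorts(byte[] pcm)
    {
        short[] samples = new short[pcm.length / 2];
        ByteBuffer.wrap(pcm)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asShortBuffer()
            .get(samples);
        return samples;
    }

    /** Naive 48 kHz to 16 kHz downsampling: keep every 3rd sample. */
    public static short[] decimate48To16(short[] in)
    {
        short[] out = new short[in.length / 3];
        for (int i = 0; i < out.length; i++)
        {
            out[i] = in[i * 3];
        }
        return out;
    }
}
```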

Thanks @Nik_V! Very useful!

So should I assume that the audio is sampled at 48 kHz? Is there somewhere in the code where I can check that?

You should inspect the AudioFormat object at https://github.com/jitsi/jigasi/blob/309d62ad757b4492f537ab150e7f7fce72a0f0ae/src/main/java/org/jitsi/jigasi/transcription/Participant.java#L560; it will tell you the encoding, sample rate, etc. If I recall correctly, the audio is indeed 48 kHz.
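For example, a small helper like this could be dropped in at that line (untested; it assumes the javax.media.Buffer in scope there carries an AudioFormat, which is what I remember seeing):

```java
import javax.media.Buffer;
import javax.media.format.AudioFormat;

public final class FormatDebug
{
    /** Dumps the audio format of a JMF buffer, e.g. at the Participant breakpoint. */
    public static void dump(Buffer buffer)
    {
        AudioFormat format = (AudioFormat) buffer.getFormat();
        System.out.println("encoding=" + format.getEncoding()
            + " sampleRate=" + format.getSampleRate()
            + " sampleSizeInBits=" + format.getSampleSizeInBits()
            + " channels=" + format.getChannels());
    }
}
```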

One concern I have is the performance of the DeepSpeech model bindings. Does it perform well on CPU? How does that scale when there are, e.g., 10 participants being served at the same time?

Good question! I don’t know at the moment. It does work without a GPU, and the DeepSpeech team claims some great performance improvements in the latest version. I am new to both Jitsi and DeepSpeech, and it felt like using the Java bindings would be the easier way to start. Depending on how it goes, I will probably try the approach where DeepSpeech runs as a separate project.

Hi @Mircea_Moise
Have you worked on this further? I too plan to integrate DeepSpeech into Jigasi, but for some reason the current Jigasi build is failing. How are you planning to use the raw audio output?

The latest DeepSpeech version works really well on CPU, with not-so-noticeable improvements on GPU.

Hey! Sorry, I need to come back to the thread with a more in-depth description. Briefly, this is my progress:

  1. I struggled to get Jigasi connecting in the first place (this was even before transcription came into play). I managed to get that working; it was all about the Jigasi config. Let me know if you are also struggling with that and I can share more. My setup is slightly different: I run everything else in Docker and Jigasi from source.
  2. My plan changed a bit: I now intend to run DeepSpeech as a separate service integrated via WebSockets, mainly because WebSockets allow a real-time kind of interface (see the sketch after this list). Example here: https://github.com/mozilla/DeepSpeech-examples/tree/r0.7/web_microphone_websocket
  3. While I got Jigasi to work and wrote a custom transcription service that connects (implementing https://github.com/jitsi/jigasi/blob/master/src/main/java/org/jitsi/jigasi/transcription/TranscriptionService.java), I didn’t get Jigasi to actually receive the audio… I need to come back to it. I assume it’s a config setting I need to get right.
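For point 2, this is roughly the client side I have in mind, using the JDK’s built-in java.net.http.WebSocket (Java 11+). Untested sketch; the ws://localhost:4000/stt endpoint is made up and would be whatever the DeepSpeech service exposes:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.concurrent.CompletionStage;

public final class DeepSpeechWsClient
{
    private final WebSocket ws;

    public DeepSpeechWsClient(URI uri)
    {
        // Transcription results arrive as text frames on the listener.
        this.ws = HttpClient.newHttpClient()
            .newWebSocketBuilder()
            .buildAsync(uri, new WebSocket.Listener()
            {
                @Override
                public CompletionStage<?> onText(
                    WebSocket webSocket, CharSequence data, boolean last)
                {
                    System.out.println("transcript: " + data);
                    return WebSocket.Listener.super.onText(webSocket, data, last);
                }
            })
            .join();
    }

    /** Streams one chunk of 16 kHz, 16-bit mono PCM to the service. */
    public void send(short[] samples)
    {
        ByteBuffer buf = ByteBuffer.allocate(samples.length * 2)
            .order(ByteOrder.LITTLE_ENDIAN);
        buf.asShortBuffer().put(samples);
        ws.sendBinary(buf, true);
    }

    public static void main(String[] args)
    {
        DeepSpeechWsClient client =
            new DeepSpeechWsClient(URI.create("ws://localhost:4000/stt"));
        client.send(new short[1600]); // 100 ms of silence at 16 kHz
    }
}
```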

Let me know if I can help

Hi @Mircea_Moise
Thank you for getting back!

I’m having trouble getting the transcription to work. As described in the README, I go to the jitsi_meet_transcribe URI and find a CC button, but I can’t get it to work even after enabling it. I’ve made all the required changes in the sip-communicator.properties file. Can you help me out here?

I don’t think I can implement WebSockets from the bottom up right now, so my only option is to replace the Google Cloud Speech-to-Text API calls with DeepSpeech ones xD. Do tell me if you get it to work though.

I guess Jigasi gets the audio as an Opus stream (still not sure about that, I need to dig deeper), and the sample rate is definitely 48 kHz. I need to downsample that and use the DeepSpeech Java API.
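If it helps, this is roughly the glue I expect between the two, reusing the conversion helpers sketched earlier in the thread (untested; the model path is a placeholder and the calls follow the v0.7 Java API docs):

```java
import org.mozilla.deepspeech.libdeepspeech.DeepSpeechModel;
import org.mozilla.deepspeech.libdeepspeech.DeepSpeechStreamingState;

public final class TranscribePipeline
{
    private final DeepSpeechModel model =
        new DeepSpeechModel("/path/to/deepspeech-0.7.0-models.pbmm"); // placeholder
    private final DeepSpeechStreamingState stream = model.createStream();

    /** Handles one 48 kHz, 16-bit mono PCM chunk coming out of Jigasi. */
    public void onAudioChunk(byte[] pcm48k)
    {
        short[] samples48k = PcmUtil.toShorts(pcm48k);            // byte[] -> short[]
        short[] samples16k = PcmUtil.decimate48To16(samples48k);  // 48 kHz -> 16 kHz
        model.feedAudioContent(stream, samples16k, samples16k.length);
    }

    /** Intermediate hypothesis, useful for live captions. */
    public String partial()
    {
        return model.intermediateDecode(stream);
    }

    /** Call when the participant stops speaking to obtain the final transcript. */
    public String finish()
    {
        return model.finishStream(stream);
    }
}
```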