Using Mozilla DeepSpeech for automatic speech-to-text

Hello,

I plan to integrate the Mozilla DeepSpeech project into Jigasi to enable automatic speech-to-text. I am aware of the work @Nik_V has done to enable automatic speech-to-text with the Google Speech API.

My plan is to use the Java bindings provided by Mozilla DeepSpeech (https://deepspeech.readthedocs.io/en/v0.7.0/Java-API.html) and implement the transcription service interface that already exists in Jigasi. The bindings offer a streaming interface, which broadly matches what Jigasi needs.

Any comments, feedback or ideas to help with this?

I also have a more specific question: the Java bindings of Mozilla DeepSpeech need 16-bit, mono raw audio sampled at 16 kHz (I assume raw means linear PCM encoding). I guess that the audio arriving here https://github.com/jitsi/jigasi/blob/master/src/main/java/org/jitsi/jigasi/transcription/TranscriptionRequest.java is Opus-encoded. Is there an established way in the Jitsi codebase to get raw output from Opus audio? I might also need to resample to 16 kHz.

If I recall correctly, the audio format should already be linear (verify by setting a breakpoint here: https://github.com/jitsi/jigasi/blob/309d62ad757b4492f537ab150e7f7fce72a0f0ae/src/main/java/org/jitsi/jigasi/transcription/Participant.java#L560)

However, looking at the DeepSpeech model, it requires an array of shorts. Have a look at https://github.com/jitsi/jitsi-webrtc-vad-wrapper/blob/master/src/main/java/org/jitsi/webrtcvadwrapper/audio/ByteSignedPcmAudioSegment.java, which converts the audio to the 16-bit PCM required by the voice activity detection (VAD) we use to filter out silence. It does not, however, resample the 48 kHz audio to 16 kHz. I think it’s OK to simply replace int[] with short[].
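
For reference, a minimal sketch of both steps (bytes to short[], then 48 kHz to 16 kHz) could look like the following. `PcmConverter` and its method names are hypothetical and not part of Jigasi or DeepSpeech. It assumes signed 16-bit mono PCM, and the averaging is only a crude low-pass filter, so a proper resampler may give better recognition quality.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for illustration; not part of Jigasi or DeepSpeech. */
final class PcmConverter {
    private PcmConverter() {}

    /** Interprets raw bytes as signed 16-bit samples in the given byte order. */
    static short[] toShorts(byte[] pcm, ByteOrder order) {
        short[] samples = new short[pcm.length / 2];
        ByteBuffer.wrap(pcm).order(order).asShortBuffer().get(samples);
        return samples;
    }

    /**
     * Downsamples by an integer factor (3 for 48 kHz -> 16 kHz) by averaging
     * each group of `factor` consecutive samples. Trailing samples that do
     * not fill a whole group are dropped.
     */
    static short[] downsample(short[] in, int factor) {
        short[] out = new short[in.length / factor];
        for (int i = 0; i < out.length; i++) {
            int sum = 0;
            for (int j = 0; j < factor; j++) {
                sum += in[i * factor + j];
            }
            out[i] = (short) (sum / factor);
        }
        return out;
    }
}
```

With 48 kHz input, something like `PcmConverter.downsample(PcmConverter.toShorts(bytes, ByteOrder.LITTLE_ENDIAN), 3)` would yield 16 kHz samples. The byte order here is an assumption and should be checked against the actual audio format.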

Thanks @Nik_V! Very useful!

So should I assume that the audio is sampled at 48 kHz? Is there somewhere in the code where I can check that?

You should inspect the AudioFormat object at https://github.com/jitsi/jigasi/blob/309d62ad757b4492f537ab150e7f7fce72a0f0ae/src/main/java/org/jitsi/jigasi/transcription/Participant.java#L560; it will tell you the encoding, sample rate, etc. If I recall correctly, the audio is indeed 48 kHz.
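
Rather than relying on the debugger alone, the same check could be done in code. The sketch below uses the JDK's `javax.sound.sampled.AudioFormat` purely as a stand-in, because the AudioFormat type Jigasi passes around (I believe it comes from the JMF/FMJ media API) is not in the JDK and its getters may differ. `FormatCheck` and its methods are hypothetical names.

```java
import javax.sound.sampled.AudioFormat;

/** Hypothetical helper; uses the JDK AudioFormat as a stand-in for Jigasi's. */
final class FormatCheck {
    private FormatCheck() {}

    /** True if the format is signed 16-bit linear PCM. */
    static boolean isLinear16Bit(AudioFormat format) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
            && format.getSampleSizeInBits() == 16;
    }

    /**
     * Returns the integer factor needed to go from the format's sample rate
     * down to targetRate (e.g. 48000 -> 16000 gives 3), or throws if the
     * source rate is not a whole multiple of the target rate.
     */
    static int decimationFactor(AudioFormat format, int targetRate) {
        int rate = Math.round(format.getSampleRate());
        if (rate < targetRate || rate % targetRate != 0) {
            throw new IllegalArgumentException(
                "Cannot reach " + targetRate + " Hz from " + rate
                    + " Hz with an integer factor");
        }
        return rate / targetRate;
    }
}
```

If the rate turns out not to be a whole multiple of 16 kHz (44.1 kHz, for example), simple decimation is not enough and a real resampler would be needed.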

One concern I have is the performance of the DeepSpeech model bindings. Do they perform well on a CPU? How does that scale when there are, e.g., 10 participants being served at the same time?

Good question! I don’t know at the moment. It does work without a GPU, and the DeepSpeech team claims some great performance improvements in the latest version. I am new to both Jitsi and DeepSpeech, and using the Java bindings felt like the easier way to start. Depending on how it goes, I will probably try the approach where DeepSpeech runs as a separate project.