Jigasi Transcription with Vosk - repeated lines

Hey everybody,

After some headaches (you might’ve seen my other posts), I finally have SIP and Transcription up and working. I was initially going to go ahead with Google Speech to Text but I saw the Vosk integration and wanted to try it out.

Good news, it’s working!

Bad news, I have repeated lines in the transcripts, here is an example:

Transcript of conference held at Jan 25, 2021 in room roomname@conference.meet.domain.com
Initial people present at 9:08:30 PM:

Transcript, started at 9:08:30 PM:
________________________________________________________________________________
<9:08:30 PM> User: joined the conference
<9:08:34 PM> User: testing 
<9:08:34 PM> User: testing testing 
<9:08:38 PM> User: hello 
<9:08:39 PM> User: hello this is 
<9:08:39 PM> User: hello this is user
<9:08:45 PM> User: why are you 
<9:08:45 PM> User: why are you doubling my 
<9:08:46 PM> User: why are you doubling my speech 
<9:08:51 PM> User left the conference
________________________________________________________________________________


End of transcript at Jan 25, 2021 9:08:53 PM

Any thoughts here?

You see this only in the txt dump of the transcripts on the Jigasi machine, right?

In the txt dump as well as the closed captions in the meeting

If you see the duplication in the UI too, I suspect Vosk is not sending the flag that marks a transcription as the final version.

So this is a Vosk problem? Any idea how “final version” works?

These are basically edits of previously returned transcriptions, and the final one is marked as such.
I have never run Vosk, so I don’t have any experience with it …
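
The sample dump at the top of the thread reads like successive editions of the same phrase, each one extending the previous line. As a stopgap for cleaning up an existing txt dump (not a fix for Jigasi itself), here is a minimal Python sketch that keeps only the last edition of each phrase. The line format is inferred from the sample above, and `collapse_interim_lines` is a hypothetical helper, not part of Jigasi or Vosk.

```python
import re

# Matches lines such as "<9:08:34 PM> User: testing ".
# The format is an assumption inferred from the sample dump in this thread.
LINE_RE = re.compile(r"^<(?P<time>[^>]+)> (?P<speaker>[^:]+): (?P<text>.*)$")


def collapse_interim_lines(lines):
    """Keep only the last line of each run where every line extends the previous one."""
    kept = []
    for line in lines:
        cur = LINE_RE.match(line)
        prev = LINE_RE.match(kept[-1]) if kept else None
        if cur and prev and cur["speaker"] == prev["speaker"]:
            prev_words = prev["text"].split()
            cur_words = cur["text"].split()
            # The previous line is an earlier edition if its words are a
            # prefix of the current line's words.
            if prev_words and cur_words[: len(prev_words)] == prev_words:
                kept[-1] = line
                continue
        kept.append(line)
    return kept
```

Note that this heuristic would also merge two genuinely separate utterances if the second happens to begin with the words of the first, so it is only a workaround until the final/interim handling is fixed at the source.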

Does Jigasi send anything to tell the server that it has reached end-of-file and wants the transcript returned? Or does it just send a constant stream of audio?

See here in post #3

If you continuously stream audio from the client, the only question you should care about is whether you received an interim or a final transcript, and whether it is empty or not. In general, if you see a result key in the response payload, it means you received a final transcript (which might be treated as a logical pause). That’s basically it. The server can’t predict whether a client is going to send more chunks for transcribing. So as long as your connection is still open, the response payload is the single source of truth.
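
To make the interim/final distinction concrete, here is a small Python sketch of how a client might sort incoming messages. It assumes the payload shapes I believe the Vosk websocket server uses: `{"partial": "..."}` for interim results and `{"result": [...], "text": "..."}` for final ones (an empty final result may carry only `text`). Only the result key is mentioned in the quote above, so treat the rest as an assumption; `classify_vosk_message` and `final_transcripts` are hypothetical names, not Jigasi or Vosk APIs.

```python
import json


def classify_vosk_message(raw):
    """Return (kind, text) where kind is "partial", "final", or "unknown"."""
    payload = json.loads(raw)
    if "partial" in payload:
        # Interim result: may still be revised by a later message.
        return "partial", payload["partial"]
    if "result" in payload or "text" in payload:
        # Final result: the service has decided it is done with this phrase.
        return "final", payload.get("text", "")
    return "unknown", ""


def final_transcripts(raw_messages):
    """Collect only the non-empty final results from a stream of messages."""
    finals = []
    for raw in raw_messages:
        kind, text = classify_vosk_message(raw)
        if kind == "final" and text.strip():
            finals.append(text.strip())
    return finals
```

A client that appends every message to the transcript, interim ones included, would produce exactly the kind of repeated, growing lines shown in the dump.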

It just sends audio; it’s the transcription service that decides when it is done with a phrase.

Maybe the Vosk implementation is not handling it correctly. You can look at the code or ask the developer who contributed it whether he is willing to take a look. This implementation was contributed, and we (the Jitsi team) do not use it at the moment.
What I’m sure of is that the Google one is working correctly and displaying captions in the UI correctly.

Thanks for your help. I appreciate it. I’ll see if I can track down the developer.

@Nickolay_Shmyrev are you able to shed some light on this issue?

I will check in the coming days and let you know. Please be patient, no need to open a dozen issues everywhere.

Thanks for looking into it.

Hi, did you get a chance to look into this yet?

@damencho

There is a possible fix to the duplicate string issue here:

Is there any way we can get this pushed to the official repo so I can test it? I’m not building Jitsi from source.

Sure, create a PR, thanks.

done

Hello. I had the same problem and wanted to know if you got any update on this duplication?