I'm working on a transcription module for jitsi-meet. I've come up with the
following design for the module. I would welcome any feedback or ideas for
The current idea is that all audio streams will be recorded, and after
recording is finished the whole audio file gets sent to a Sphinx4 HTTP
It's very hard to send some audio every x seconds because it's very
difficult in Sphinx4 to get good results from a short file, especially
when you cut off the audio mid-sentence.
We had the idea of merging audio files with the previously sent chunk, but
the synchronization was very difficult.
The idea is that you can plug in another (paid) service when it's required.
One issue I've not solved yet is getting the name of the person
belonging to a JitsiTrack. I did notice that a JitsiTrack has a
participantID. Can I use this id to get the name?
The module will have 3 main JavaScript "classes". The design I came up with
is as follows:
this file will record the audio streams from every user in the conference
will be called on event "TRACK_ADDED" in conference.js to give the new
JitsiTrack. If a recording session is already ongoing, it will also start
recording the new JitsiTrack
will be called on event "TRACK_REMOVED" in conference.js to tell that it's
no longer necessary to record that JitsiTrack. This might not be needed
because the stream will run out?
will start recording from all given JitsiTracks. Can only be called once.
will stop recording from all given JitsiTracks.
will return an array of bytes for each recorded JitsiTrack audio stream
will reset the recorder so startAudioRecording can be called again. This
will delete the arrays of a previous recording session.
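A rough sketch of how the recorder's state handling described above could look. Only startAudioRecording is a name from this design; the class name, the other method names and the injected createSink factory are assumptions made for illustration. The actual audio capture is left out: in a browser the sink might wrap something like MediaRecorder, here it is an in-memory stand-in.

```javascript
// Sketch only: every name except startAudioRecording is hypothetical.
class AudioRecorder {
  // createSink(track) returns { start(), stop(), getBytes() }.
  // stop() must be safe to call more than once.
  constructor(createSink) {
    this.createSink = createSink;
    this.active = new Map(); // track currently in the conference -> sink
    this.sinks = [];         // every sink of this session, in creation order
    this.state = "idle";     // "idle" -> "recording" -> "stopped"
  }

  addTrack(track) {
    if (this.active.has(track)) return;
    const sink = this.createSink(track);
    this.active.set(track, sink);
    this.sinks.push(sink);
    // A track added during a session starts recording immediately.
    if (this.state === "recording") sink.start();
  }

  removeTrack(track) {
    const sink = this.active.get(track);
    if (!sink) return;
    this.active.delete(track);
    if (this.state === "idle") {
      // Nothing was recorded yet, so the sink can be dropped.
      this.sinks.splice(this.sinks.indexOf(sink), 1);
    } else {
      // Stop capturing but keep the audio recorded so far.
      sink.stop();
    }
  }

  startAudioRecording() {
    if (this.state !== "idle") {
      throw new Error("already started; call reset() first");
    }
    this.state = "recording";
    for (const sink of this.sinks) sink.start();
  }

  stopAudioRecording() {
    if (this.state !== "recording") throw new Error("not recording");
    this.state = "stopped";
    for (const sink of this.sinks) sink.stop();
  }

  getByteArrays() {
    return this.sinks.map((sink) => sink.getBytes());
  }

  reset() {
    // Drop old audio and create fresh sinks for the tracks still present.
    for (const sink of this.sinks) sink.stop();
    const tracks = [...this.active.keys()];
    this.active.clear();
    this.sinks = [];
    this.state = "idle";
    for (const track of tracks) this.addTrack(track);
  }
}

// In-memory stand-in for a real audio sink, used to exercise the class.
function createMemorySink() {
  const bytes = [];
  let running = false;
  return {
    start() { running = true; },
    stop() { running = false; },
    write(byte) { if (running) bytes.push(byte); },
    getBytes() { return Uint8Array.from(bytes); },
  };
}
```

In this sketch a removed track keeps its audio until reset, so the question of whether the "TRACK_REMOVED" handler is needed only affects when capturing stops, not what gets returned.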
will be able to send the byte arrays to any speech-to-text service. Note
that it can take a service up to 2/3x the length of the given audio file to
give back the text
will send the byte array of one audio file in the way the chosen
speech-to-text service requires it. Will return a function which will be
called with the answer once it has been retrieved and parsed
will parse the output from the speech-to-text service to the right format
expected by transcriber.js
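To illustrate how another service could be plugged in, here is a minimal sketch of the split between sending and parsing. All names (TranscriptionService, send, sendRequest, formatResponse, FakeService) and the word format { word, begin, end } are assumptions, not part of the design above. The sketch also takes a callback as an argument, which is one possible reading of how the answer is handed back. The fake service answers locally and synchronously; a real one would do an HTTP request and answer later.

```javascript
// Sketch only: all class and method names here are hypothetical.
class TranscriptionService {
  // Sends one recording and calls callback(words) with the parsed answer.
  send(bytes, callback) {
    this.sendRequest(bytes, (rawResponse) => {
      callback(this.formatResponse(rawResponse));
    });
  }

  // Service-specific: deliver the bytes the way the service requires.
  sendRequest(bytes, onResponse) {
    throw new Error("sendRequest must be implemented by a subclass");
  }

  // Service-specific: convert the raw answer to the common word format.
  formatResponse(rawResponse) {
    throw new Error("formatResponse must be implemented by a subclass");
  }
}

// Stand-in service that answers locally instead of over HTTP.
class FakeService extends TranscriptionService {
  sendRequest(bytes, onResponse) {
    // Pretend the server replied with its own JSON layout.
    onResponse(JSON.stringify([{ w: "hello", s: 0, e: 400 }]));
  }

  formatResponse(rawResponse) {
    // Map the service's field names to the assumed common format.
    return JSON.parse(rawResponse).map(({ w, s, e }) => ({
      word: w,
      begin: s,
      end: e,
    }));
  }
}

// Usage: the caller only sees the common format.
new FakeService().send(new Uint8Array([1, 2, 3]), (words) => {
  console.log(words);
});
```

With this split, swapping Sphinx4 for a paid service would only mean writing a new subclass.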
will manage the whole transcription process
will tell the audioRecorder to start recording and note down the current
time. Can only be called once.
will tell the audioRecorder to stop recording, note down the current time,
get the byteArrays and send them to the transcriptionService. Can only be
Can only be called when startTranscribing has fired. Will note down the event,
the name of who did it and the time. At the end these events will be merged
into the transcription
Supported events will be user joined, left, (un)muted audio/video and chat
will be a method which will be fired when the transcriptionService has
given back all the answers for all the given byte arrays. It will then
start the merging
will merge all the audio transcriptions into a single text string as well as
merge the events
will be a method which gets the transcription once the result has been
retrieved from the server and completely merged
will return the transcript. Throws an error when onTranscriptionReady() has
not fired yet
will reset the retrieved events and the audioRecorder
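The merging step mentioned above could be sketched roughly as follows. The function name, the data shapes and the output layout are all assumptions: each speaker's words are taken to carry a begin offset in milliseconds from the start of the recording, and each event an absolute timestamp that is made relative using the noted start time.

```javascript
// Sketch only: mergeTranscription and its data shapes are hypothetical.
// speakers: [{ name, words: [{ word, begin }] }], begin in ms from start
// events:   [{ time, name, event }], time as an absolute timestamp in ms
function mergeTranscription(startTime, speakers, events) {
  const entries = [];
  for (const { name, words } of speakers) {
    for (const { word, begin } of words) {
      entries.push({ time: begin, name, text: word, isEvent: false });
    }
  }
  for (const { time, name, event } of events) {
    entries.push({ time: time - startTime, name, text: event, isEvent: true });
  }
  // Order everything on one timeline.
  entries.sort((a, b) => a.time - b.time);

  const lines = [];
  let current = null; // the spoken line still being extended
  for (const entry of entries) {
    if (entry.isEvent) {
      lines.push(`[${entry.name} ${entry.text}]`);
      current = null;
    } else if (current && current.name === entry.name) {
      // Same speaker keeps talking: extend the current line.
      current.words.push(entry.text);
    } else {
      current = { name: entry.name, words: [entry.text] };
      lines.push(current);
    }
  }
  return lines
    .map((l) => (typeof l === "string" ? l : `${l.name}: ${l.words.join(" ")}`))
    .join("\n");
}

// Usage with made-up data:
const transcript = mergeTranscription(
  1000,
  [
    { name: "Alice", words: [
      { word: "hello", begin: 0 },
      { word: "everyone", begin: 500 },
      { word: "bye", begin: 3000 },
    ] },
    { name: "Bob", words: [{ word: "hi", begin: 1200 }] },
  ],
  [{ time: 3000, name: "Carol", event: "joined" }]
);
console.log(transcript);
// Alice: hello everyone
// Bob: hi
// [Carol joined]
// Alice: bye
```

Merging per word like this depends on the service returning word timings; if it only returns whole sentences, the same approach would work with one entry per sentence.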