[jitsi-dev] Design of transcription module


#1

Hello,

I'm working on a transcription module for jitsi-meet. I've come up with the
following design for the module. I would welcome any feedback or ideas for
improvement.
The current idea is that all audio streams will be recorded, and after
recording is finished the whole audio file gets sent to a Sphinx4 HTTP
server[0].
It's very hard to send some audio every x seconds because it's very
difficult for Sphinx4 to get good results from a short file, especially
when you cut off the audio mid-sentence.
We had the idea of merging audio files with the previously sent chunk, but
the synchronization was very difficult.
The idea is that you can plug in another (paid) service when required.

One issue I've not solved yet is getting the name of the person
belonging to a JitsiTrack. I did notice that a JitsiTrack has a
participantID. Can I use this ID to get the name?

The module will have 3 main JavaScript "classes". The design I came up
with is as follows:

### audioRecorder.js
this file will record the audio streams from every user in the conference

giveTrack(JitsiTrack)
will be called on the "TRACK_ADDED" event in conference.js to hand over the
new JitsiTrack. If a recording session is already ongoing, it will also
start recording the new JitsiTrack

notifyTrackStopped(jitsiTrack)
will be called on the "TRACK_REMOVED" event in conference.js to signal that
it's no longer necessary to record that JitsiTrack. This might not be
needed because the stream will run out?

startAudioRecording()
will start recording from all given JitsiTracks. Can only be called once.

stopAudioRecording()
will stop recording from all given JitsiTracks.

getByteArrays()
will return an array of bytes for each recorded JitsiTrack audio stream

reset()
will reset the recorder so startAudioRecording can be called again. This
will delete the arrays of a previous recording session.
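The interface above could be sketched roughly as follows. This is a minimal sketch, not the actual implementation: the internal Map and the "anything with getParticipantId() counts as a JitsiTrack" shape are assumptions, and actual capture of each track's audio stream (e.g. via MediaRecorder) is left out.

```javascript
// Minimal sketch of the audioRecorder interface described above.
// Assumption: a JitsiTrack is anything with a getParticipantId() method;
// real capture of the track's audio stream is elided.
class AudioRecorder {
  constructor() {
    this.chunks = new Map(); // participant id -> recorded byte chunks
    this.recording = false;
  }

  giveTrack(track) {
    this.chunks.set(track.getParticipantId(), []);
    // if a session is already ongoing, start recording this track too
  }

  notifyTrackStopped(track) {
    // stop appending to this track's chunks, but keep what was recorded
  }

  startAudioRecording() {
    if (this.recording) {
      throw new Error('already recording; call reset() first');
    }
    this.recording = true;
  }

  stopAudioRecording() {
    this.recording = false;
  }

  getByteArrays() {
    // one byte array per recorded JitsiTrack
    return Array.from(this.chunks.values());
  }

  reset() {
    this.recording = false;
    this.chunks.clear();
  }
}
```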

### transcriptionService.js
will be able to send the byte arrays to any speech-to-text service. Note
that it can take a service up to 2-3x the length of the given audio file to
give back the text

sendByteArray()
will send the byte array of one audio file in the way the chosen
speech-to-text service requires. Will return a function which will be
called with the answer once it has been retrieved and parsed

parseAnswer()
will parse the output from the speech-to-text service to the right format
expected by transcriber.js
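A possible sketch of this service wrapper is below. The `_post` helper and the `{ hypothesis }` response shape are purely hypothetical stand-ins for the real HTTP transport and the Sphinx4 server's actual output format; parseAnswer() is where each pluggable service would adapt its own response to what transcriber.js expects.

```javascript
// Minimal sketch of the transcriptionService interface. The _post helper
// and the { hypothesis } response shape are assumptions, not the real
// Sphinx4-HTTP-server API; a real version would POST the audio bytes.
class TranscriptionService {
  constructor(url) {
    this.url = url; // endpoint of the chosen speech-to-text service
  }

  // sends one audio file's bytes; callback fires with the parsed answer
  sendByteArray(byteArray, callback) {
    this._post(byteArray, raw => callback(this.parseAnswer(raw)));
  }

  // transport stub: a real implementation would HTTP POST to this.url
  _post(byteArray, done) {
    done({ hypothesis: '' });
  }

  // adapt the service-specific response to what transcriber.js expects
  parseAnswer(raw) {
    return { text: raw.hypothesis };
  }
}
```

Swapping in a paid service would then only mean subclassing and overriding `_post` and `parseAnswer`.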

### transcriber.js
will manage the whole transcription process

startTranscribing()
will tell the audioRecorder to start recording and note down the current
time. Can only be called once.

stopTranscribing()
will tell the audioRecorder to stop recording, note down the current time,
get the byteArrays and send them to the transcriptionService. Can only be
called once.

notifyEvent(event, name)
Can only be called after startTranscribing has fired. Will note down the
event, the name of who did it, and the time. At the end these events will
be merged into the transcription.
Supported events will be: user joined, user left, (un)muted audio/video,
and chat messages.

onRetrievedAllAnswers()
will be a method which fires when the transcriptionService has given back
the answers for all the given byte arrays. It will then start the merging.

mergeAnswers()
will merge all the audio transcriptions into a single text string, as well
as merge the events.

onTranscriptionReady(transcription)
will be a method which receives the transcription once the result has been
retrieved from the server and completely merged.

getTranscription()
will return the transcript. Throws an error when onTranscriptionReady() has
not fired yet

reset()
will reset the retrieved events and the audioRecorder
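To make the merging step concrete, here is one possible sketch of mergeAnswers. The `{ time, name, text }` / `{ time, name, event }` shapes are assumptions for illustration: per-speaker transcript fragments and the noted events, each stamped with an offset from the recorded start time, are sorted into a single chronological text.

```javascript
// Sketch of the merging step: per-speaker transcript fragments and noted
// events, each stamped with a time offset, sorted into one text.
// The { time, name, text / event } shapes are illustrative assumptions.
function mergeAnswers(fragments, events) {
  const lines = fragments
    .map(f => ({ time: f.time, line: `${f.name}: ${f.text}` }))
    .concat(events.map(e => ({ time: e.time, line: `<${e.name} ${e.event}>` })));
  lines.sort((a, b) => a.time - b.time);
  return lines.map(l => l.line).join('\n');
}
```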

Regards,

Nik

[0] = https://github.com/nikvaessen/Sphinx4-HTTP-server


#2

Hey Nik,

Hello,

I'm working on a transcription module for jitsi-meet. I've come up with the
following design for the module. I would welcome any feedback or ideas
for improvement.
The current idea is that all audio streams will be recorded, and after
recording is finished the whole audio file gets sent to a Sphinx4 HTTP
server[0].
It's very hard to send some audio every x seconds because it's very
difficult for Sphinx4 to get good results from a short file,
especially when you cut off the audio mid-sentence.
We had the idea of merging audio files with the previously sent chunk, but
the synchronization was very difficult.
The idea is that you can plug in another (paid) service when required.

One issue I've not solved yet is getting the name of the person
belonging to a JitsiTrack. I did notice that a JitsiTrack has a
participantID. Can I use this ID to get the name?

You can get the name from the JitsiParticipant object using getDisplayName(). And you can get the JitsiParticipant from the JitsiConference. From the global context you can use for example:
var displayName = APP.conference._room.getParticipantById(track.getParticipantId()).getDisplayName()

But you should export what you need and not use the global context.

I don't think it is good design for JitsiTrack to have getParticipantId() instead of getParticipant(). If you want to add getParticipant() and use that instead, that would be a better solution (and also arguably easier to implement), and we can accept it in the library (lib-jitsi-meet).
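The suggested accessor could look something like the sketch below. The `conference`/`ownerId` fields are assumptions about lib-jitsi-meet's internals, not its real structure; the point is only that the track resolves its own participant instead of every caller going through the conference.

```javascript
// Hypothetical sketch of the suggested getParticipant() accessor on
// JitsiTrack. The conference/ownerId field names are assumptions about
// lib-jitsi-meet internals, used here only for illustration.
class JitsiTrack {
  constructor(conference, ownerId) {
    this.conference = conference;
    this.ownerId = ownerId;
  }

  getParticipantId() {
    return this.ownerId;
  }

  // resolve the owning JitsiParticipant directly, so callers don't have
  // to reach the conference through a global context themselves
  getParticipant() {
    return this.conference.getParticipantById(this.getParticipantId());
  }
}
```

With that in place, `track.getParticipant().getDisplayName()` replaces the global-context lookup above.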

The module will have 3 main JavaScript "classes". The design I came up
with is as follows:

### audioRecorder.js
this file will record the audio streams from every user in the conference

giveTrack(JitsiTrack)
will be called on the "TRACK_ADDED" event in conference.js to hand over the
new JitsiTrack. If a recording session is already ongoing, it will also
start recording the new JitsiTrack

notifyTrackStopped(jitsiTrack)
will be called on the "TRACK_REMOVED" event in conference.js to signal that
it's no longer necessary to record that JitsiTrack. This might not be
needed because the stream will run out?

This is a minor point, but I think better names for these would be addTrack and removeTrack.

startAudioRecording()
will start recording from all given JitsiTracks. Can only be called once.

stopAudioRecording()
will stop recording from all given JitsiTracks.

Also, how about just using start() and stop()? I don't see how it could be ambiguous, so it's better to stick to simple names.

getByteArrays()
will return an array of bytes for each recorded JitsiTrack audio stream

reset()
will reset the recorder so startAudioRecording can be called again. This
will delete the arrays of a previous recording session.

### transcriptionService.js
will be able to send the byte arrays to any speech-to-text service. Note
that it can take a service up to 2-3x the length of the given audio file
to give back the text

sendByteArray()
will send the byte array of one audio file in the way the chosen
speech-to-text service requires. Will return a function which will
be called with the answer once it has been retrieved and parsed

To clarify: will it take a callback function as a parameter?

parseAnswer()
will parse the output from the speech-to-text service to the right
format expected by transcriber.js

### transcriber.js
will manage the whole transcription process

startTranscribing()
will tell the audioRecorder to start recording and note down the current
time. Can only be called once.

stopTranscribing()
will tell the audioRecorder to stop recording, note down the current
time, get the byteArrays and send them to the transcriptionService. Can
only be called once.

Maybe this should take a callback which will receive the transcript, once ready?

Again, maybe just start() and stop()?

notifyEvent(event, name)
Can only be called after startTranscribing has fired. Will note down the
event, the name of who did it, and the time. At the end these events will
be merged into the transcription.
Supported events will be: user joined, user left, (un)muted audio/video,
and chat messages.

How about addEvent(event, name)? Seems more straightforward, since what it will do is "add" an event to the transcript.

onRetrievedAllAnswers()
will be a method which fires when the transcriptionService has given back
the answers for all the given byte arrays. It will then start the merging.

mergeAnswers()
will merge all the audio transcriptions into a single text string, as well
as merge the events.

Are these two going to be part of the interface or just the implementation?

onTranscriptionReady(transcription)
will be a method which receives the transcription once the result has been
retrieved from the server and completely merged.

getTranscription()
will return the transcript. Throws an error when onTranscriptionReady()
has not fired yet

I would suggest to replace these two with a callback passed to stopTranscribing(). What do you think?
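The callback-based shape suggested here could look like the sketch below; the internals (waiting for all answers and merging) are stubbed, since only the API surface is under discussion.

```javascript
// Sketch of the suggested callback-based API: instead of polling
// getTranscription(), the caller hands stopTranscribing() a function
// that receives the finished transcript. Internals are stubbed.
class Transcriber {
  startTranscribing() {
    this.startTime = Date.now();
  }

  stopTranscribing(onTranscriptionReady) {
    this.stopTime = Date.now();
    // once every answer is back and merged, hand over the result
    this._mergeWhenDone(transcript => onTranscriptionReady(transcript));
  }

  // stub for "wait for all service answers, then merge them"
  _mergeWhenDone(done) {
    done('(merged transcript)');
  }
}
```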

Overall the design looks good to me! I'm very excited to see this at work :)

Regards,
Boris


On 27/07/16 09:55, Nik V wrote:


#3

Hey Nik,

Hello,

I'm working on a transcription module for jitsi-meet. I've come up with the
following design for the module. I would welcome any feedback or ideas
for improvement.
The current idea is that all audio streams will be recorded, and after
recording is finished the whole audio file gets sent to a Sphinx4 HTTP
server[0].
It's very hard to send some audio every x seconds because it's very
difficult for Sphinx4 to get good results from a short file,
especially when you cut off the audio mid-sentence.
We had the idea of merging audio files with the previously sent chunk, but
the synchronization was very difficult.
The idea is that you can plug in another (paid) service when required.

One issue I've not solved yet is getting the name of the person
belonging to a JitsiTrack. I did notice that a JitsiTrack has a
participantID. Can I use this ID to get the name?

You can get the name from the JitsiParticipant object using
getDisplayName(). And you can get the JitsiParticipant from the
JitsiConference. From the global context you can use for example:
var displayName =
APP.conference._room.getParticipantById(track.getParticipantId()).getDisplayName()

But you should export what you need and not use the global context.

I don't think it is good design for JitsiTrack to have getParticipantId()
instead of getParticipant(). If you want to add getParticipant() and use
that instead, that would be a better solution (and also arguably easier to
implement), and we can accept it in the library (lib-jitsi-meet).

I will try to find out how to implement getParticipant().

The module will have 3 main JavaScript "classes". The design I came up
with is as follows:

### audioRecorder.js
this file will record the audio streams from every user in the conference

giveTrack(JitsiTrack)
will be called on the "TRACK_ADDED" event in conference.js to hand over the
new JitsiTrack. If a recording session is already ongoing, it will also
start recording the new JitsiTrack

notifyTrackStopped(jitsiTrack)
will be called on the "TRACK_REMOVED" event in conference.js to signal that
it's no longer necessary to record that JitsiTrack. This might not be
needed because the stream will run out?

This is a minor point, but I think better names for these would be
addTrack and removeTrack.

startAudioRecording()
will start recording from all given JitsiTracks. Can only be called once.

stopAudioRecording()
will stop recording from all given JitsiTracks.

Also, how about just using start() and stop()? I don't see how it could be
ambiguous, so it's better to stick to simple names.

getByteArrays()
will return an array of bytes for each recorded JitsiTrack audio stream

reset()
will reset the recorder so startAudioRecording can be called again. This
will delete the arrays of a previous recording session.

### transcriptionService.js
will be able to send the byte arrays to any speech-to-text service. Note
that it can take a service up to 2-3x the length of the given audio file
to give back the text

sendByteArray()
will send the byte array of one audio file in the way the chosen
speech-to-text service requires. Will return a function which will
be called with the answer once it has been retrieved and parsed

To clarify: will it take a callback function as a parameter?

I didn't know about callback functions, but now that I've read up on them,
they seem to be exactly the kind of thing I wanted to use.

parseAnswer()
will parse the output from the speech-to-text service to the right
format expected by transcriber.js

### transcriber.js
will manage the whole transcription process

startTranscribing()
will tell the audioRecorder to start recording and note down the current
time. Can only be called once.

stopTranscribing()
will tell the audioRecorder to stop recording, note down the current
time, get the byteArrays and send them to the transcriptionService. Can
only be called once.

Maybe this should take a callback which will receive the transcript, once
ready?

Again, maybe just start() and stop()?

Callback would also make sense here.

notifyEvent(event, name)
Can only be called after startTranscribing has fired. Will note down the
event, the name of who did it, and the time. At the end these events will
be merged into the transcription.
Supported events will be: user joined, user left, (un)muted audio/video,
and chat messages.

How about addEvent(event, name)? Seems more straightforward, since what it
will do is "add" an event to the transcript.

onRetrievedAllAnswers()
will be a method which fires when the transcriptionService has given back
the answers for all the given byte arrays. It will then start the merging.

mergeAnswers()
will merge all the audio transcriptions into a single text string, as well
as merge the events.

Are these two going to be part of the interface or just the implementation?

These methods would be used internally to merge the transcripts. The
onRetrievedAllAnswers method would fire once the server has responded to
all POSTs, and would then handle all the merging steps.

onTranscriptionReady(transcription)
will be a method which receives the transcription once the result has been
retrieved from the server and completely merged.

getTranscription()
will return the transcript. Throws an error when onTranscriptionReady()
has not fired yet

I would suggest to replace these two with a callback passed to
stopTranscribing(). What do you think?

Yes, you're right, that would be better.

Overall the design looks good to me! I'm very excited to see this at work :)

Regards,
Boris

All name suggestions also make sense.

Regards,

Nik


On Wed, Jul 27, 2016 at 5:52 PM, Boris Grozev <boris@jitsi.org> wrote:

On 27/07/16 09:55, Nik V wrote: