Top n dominant speakers with DominantSpeakerIdentification in video-bridge

I looked at the class jitsi uses to select the dominant speaker, and it seems to roughly implement the Dominant Speaker Identification algorithm. I think it's really cool that they took the speech-middle-cut and false-switching problems into account.

Now I want the same thing for selecting the top n dominant speakers simultaneously (to forward from the video-bridge or media to other clients), rather than the top n loudest (just the n packets with the highest RMS), which suffers from the speech-middle-cut and false-switching problems described in the paper above. Does this idea sound crazy, or is it doable, and could it perform better than top n loudest? If so, how should I approach adapting the class?
Thanks in advance for any kind of help :heart:

** I have my own SFU media with my own audio packets, so I just need the idea and the algorithm.
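
To make the question more concrete, here is a rough sketch of the direction I have in mind. It is not jitsi's class and not the algorithm from the paper, just a simple stand-in: each speaker gets a smoothed score (so a short pause does not cut them mid-speech), and a new speaker only replaces the weakest selected one after clearly beating it for several consecutive frames (to avoid false switches). The class name `TopNSpeakerSelector`, the parameters `alpha`, `margin`, `hold_frames`, and the assumption that levels are normalised to [0, 1] are all my own invention for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TopNSpeakerSelector:
    """Hypothetical top-n selector with smoothing and hysteresis (not jitsi's class)."""
    n: int
    alpha: float = 0.2      # smoothing factor for the moving average (assumed value)
    margin: float = 1.5     # challenger must exceed margin * weakest selected score
    hold_frames: int = 5    # ...for this many consecutive frames before a swap
    scores: dict = field(default_factory=dict)
    selected: list = field(default_factory=list)
    streak: dict = field(default_factory=dict)

    def update(self, levels):
        """levels: {speaker_id: level in [0, 1]} for one frame. Returns selected ids."""
        # Exponential moving average: a brief pause lowers the score only gradually.
        ids = list(self.scores) + [s for s in levels if s not in self.scores]
        for sid in ids:
            prev = self.scores.get(sid, 0.0)
            self.scores[sid] = (1 - self.alpha) * prev + self.alpha * levels.get(sid, 0.0)

        # Unselected speakers, highest smoothed score first.
        candidates = sorted((s for s in self.scores if s not in self.selected),
                            key=lambda s: self.scores[s], reverse=True)

        # Fill free slots immediately with any active candidate.
        while len(self.selected) < self.n and candidates and self.scores[candidates[0]] > 0:
            self.selected.append(candidates.pop(0))
        if len(self.selected) < self.n or not candidates:
            return list(self.selected)

        # Hysteresis: count how long each candidate has clearly beaten the weakest slot.
        weakest = min(self.selected, key=lambda s: self.scores[s])
        for sid in candidates:
            if self.scores[sid] > self.margin * self.scores[weakest]:
                self.streak[sid] = self.streak.get(sid, 0) + 1
            else:
                self.streak[sid] = 0

        # At most one swap per frame, and only after a sustained lead.
        best = candidates[0]
        if self.streak.get(best, 0) >= self.hold_frames:
            self.selected[self.selected.index(weakest)] = best
            self.streak[best] = 0
            self.streak.pop(weakest, None)
        return list(self.selected)
```

With `n=2`, a single loud frame from a third participant does not displace either current speaker, while a participant who keeps talking after one of the two goes quiet takes over that slot a few frames later. What I don't know is whether replacing the moving average with the paper's per-speaker activity scores is the right way to generalise it to n speakers.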