Assignment Problem
I am working with two third-party services: Pyannote AI for Speaker Diarization and Whisper for Speech-To-Text.
The reason I am using Pyannote is that Whisper does not natively support speaker diarization (as far as I'm informed).
So what I need to do is match the speakers from Pyannote's diarization response with the segments from Whisper's transcription response. Since both responses provide time parameters (start and end), I am thinking of relying on those...
Here are some example responses (real case): https://pastebin.com/JDjTQf9s
Does any of you have any experience with similar "assignment" problems, or does any of you have any suggestion on what's the best way to approach this sort of problem...
looks similar to something i wrote many years ago. you can match between them using a simple query
data.start <= query.end && data.end >= query.start
would get you all overlapping segments.
for this case then taking the segment that overlaps the most is probably easiest
so start with transcriptions, and for each use that to find matching speaker segments, then pick the one that overlaps the most
I tried that approach, but it is not quite accurate... It really varies: when speakers overlap often and there are abrupt speaker changes, it fails to choose the correct one...
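For reference, a minimal sketch of the max-overlap assignment described above. The segment format (dicts with `start`/`end` in seconds, plus a `speaker` label on diarization segments) is an assumption, not the exact Pyannote/Whisper response shape:

```python
# Max-overlap assignment: for each Whisper transcript segment, pick the
# Pyannote speaker segment with the greatest temporal overlap.
# NOTE: the dict keys ("start", "end", "speaker", "text") are assumed here,
# not taken from the actual API responses.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two [start, end] intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript_segments, speaker_segments):
    """Attach a speaker label to each transcript segment by maximum overlap."""
    result = []
    for seg in transcript_segments:
        # data.start <= query.end and data.end >= query.start -> overlapping candidates
        candidates = [
            s for s in speaker_segments
            if s["start"] <= seg["end"] and s["end"] >= seg["start"]
        ]
        best = max(
            candidates,
            key=lambda s: overlap(seg["start"], seg["end"], s["start"], s["end"]),
            default=None,
        )
        result.append({**seg, "speaker": best["speaker"] if best else None})
    return result
```

Usage would look something like `assign_speakers(whisper_result["segments"], diarization_turns)`, where both lists have been normalized into the dict shape above first.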
This method is the best I've got so far...
but it is still not accurate enough
gave it a try, hard to do much with so limited data

my algo attempt
Hey, thanks for your help. I tried your algo and it is quite fast, but the method I provided seems to be more accurate when there are abrupt speaker changes and multiple overlaps.
Here you can find your code but with more data: https://pastebin.com/H1ehmNHQ
For example, in the case shown in these screenshots, it mis-selects Speaker_02 when Speaker_01 is actually speaking...
Is it even possible to handle such use-cases accurately without any machine-learning approach? Do you think your algo could be improved?



regarding the "limited data": that's how my app works, it processes a media stream in 1-minute chunks...
don't really have many ideas on this. LLMs are good at language, so maybe one could be used to untangle the mixed lines and assign them to different speakers
not sure how simpler heuristics would fare. like making it less likely to pick the same speaker as was used for the previous segment
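One way to sketch that heuristic: keep the max-overlap scoring, but subtract a small penalty when a candidate matches the previous segment's speaker. The segment format (dicts with `start`/`end`/`speaker`) and the `penalty` value are assumptions for illustration, not a tuned implementation:

```python
# Hypothetical variant of max-overlap assignment that slightly penalizes
# re-picking the previous segment's speaker, so abrupt speaker changes
# are easier to detect. `penalty` (in seconds of overlap) is an arbitrary
# knob, not a derived value.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two [start, end] intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_with_change_bias(transcript_segments, speaker_segments, penalty=0.2):
    prev_speaker = None
    result = []
    for seg in transcript_segments:
        best_speaker, best_score = None, float("-inf")
        for s in speaker_segments:
            score = overlap(seg["start"], seg["end"], s["start"], s["end"])
            if score <= 0:
                continue  # no temporal overlap at all
            if s["speaker"] == prev_speaker:
                score -= penalty  # discourage sticking with the same speaker
            if score > best_score:
                best_speaker, best_score = s["speaker"], score
        result.append({**seg, "speaker": best_speaker})
        prev_speaker = best_speaker or prev_speaker
    return result
```

Whether this helps or hurts depends heavily on how often real consecutive segments do share a speaker, so the penalty would need validating against labeled examples.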
found https://github.com/MahmoudAshraf97/whisper-diarization that looks to be the same problem at a quick glance