C
C#2w ago
Pan!cKk

Assignment Problem

I am working with two third-party services: Pyannote AI for speaker diarization and Whisper for speech-to-text. I am using Pyannote because Whisper does not natively support speaker diarization (as far as I know). What I need to do is match the speakers from Pyannote's diarization response with the segments from Whisper's transcription response. Since both responses provide time parameters (start and stop), I am thinking of relying on them. Here are some example responses (real case): https://pastebin.com/JDjTQf9s Does anyone have experience with similar "assignment" problems, or any suggestions on the best way to approach this sort of problem?
9 Replies
Sehra
Sehra2w ago
looks similar to something i wrote many years ago. you can match between them using a simple query: `data.start <= query.end && data.end >= query.start` would get you all overlapping segments. for this case, taking the segment that overlaps the most is probably easiest. so start with the transcriptions, and for each one find the matching speaker segments, then pick the one that overlaps the most
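A minimal sketch of that idea, for readers who want to see it spelled out. The `SpeakerTurn` record and the `OverlapMatcher` class are hypothetical names made up for illustration; the real Pyannote and Whisper response models will look different. Note that the `<=` / `>=` query above also matches intervals that only touch at a boundary, so this sketch filters on a strictly positive overlap instead.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one diarization entry (assumption, not the real Pyannote model).
public record SpeakerTurn(string Speaker, double Start, double End);

public static class OverlapMatcher
{
    // Length in seconds of the intersection of [aStart, aEnd] and [bStart, bEnd]; 0 if they are disjoint.
    public static double Overlap(double aStart, double aEnd, double bStart, double bEnd)
        => Math.Max(0, Math.Min(aEnd, bEnd) - Math.Max(aStart, bStart));

    // Speaker of the single turn that overlaps [start, end] the most, or null if no turn overlaps.
    public static string? BestSpeaker(IEnumerable<SpeakerTurn> turns, double start, double end)
        => turns
            .Select(t => (t.Speaker, Seconds: Overlap(start, end, t.Start, t.End)))
            .Where(x => x.Seconds > 0)
            .OrderByDescending(x => x.Seconds)
            .Select(x => x.Speaker)
            .FirstOrDefault();
}
```

Calling `OverlapMatcher.BestSpeaker(turns, segment.Start, segment.End)` once per transcription segment gives the "pick the one that overlaps the most" behaviour; it does nothing special for abrupt speaker changes inside a segment.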
Pan!cKk
Pan!cKkOP2w ago
I tried that approach, but it is not very accurate. Results vary: when speakers overlap often and there are abrupt speaker changes, it fails to choose the correct one. This method is the best I've got so far:
```csharp
private static List<Segment> AssignSpeakers(List<DiarizationModel> diarization, List<Segment> segments)
{
    foreach (var segment in segments)
    {
        string assignedSpeaker = null;
        double bestScore = 0.0;

        foreach (var entry in diarization)
        {
            // Length of the time range shared by the segment and the diarization entry.
            double overlapStart = Math.Max(segment.StartTime, entry.Start);
            double overlapEnd = Math.Min(segment.EndTime, entry.End);
            double overlap = Math.Max(0, overlapEnd - overlapStart);

            if (overlap <= 0)
                continue;

            // Base score: overlap in seconds plus the fraction of the entry that is covered.
            double score = overlap;
            double entryDuration = entry.End - entry.Start;
            double ratio = overlap / entryDuration;

            score += ratio;

            // Bonus when the boundaries line up to within one second.
            if (Math.Abs(segment.StartTime - entry.Start) < 1.0)
                score += 0.5;

            if (Math.Abs(segment.EndTime - entry.End) < 1.0)
                score += 0.5;

            if (score > bestScore)
            {
                bestScore = score;
                assignedSpeaker = entry.Speaker;
            }
        }

        // Fallback: nothing overlaps, so take the entry closest in time.
        if (assignedSpeaker == null)
        {
            double closestDistance = double.MaxValue;
            foreach (var entry in diarization)
            {
                double distance = Math.Min(Math.Abs(segment.StartTime - entry.End), Math.Abs(segment.EndTime - entry.Start));

                if (distance < closestDistance)
                {
                    closestDistance = distance;
                    assignedSpeaker = entry.Speaker;
                }
            }
        }

        segment.Speaker = assignedSpeaker;
    }

    return segments;
}
```
but it is still not accurate enough
Sehra
Sehra2w ago
gave it a try, hard to do much with such limited data
Sehra
Sehra2w ago
```csharp
foreach (var t in ts![0]["segments"])
{
    // For each transcription segment: clip every overlapping diarization entry
    // to the segment, sum the clipped durations per speaker, and take the largest.
    var best = ds!
        .Where(d => d.Start <= t.End && d.End >= t.Start)
        .Select(d => new { d.Speaker, Start = Math.Clamp(d.Start, t.Start, t.End), End = Math.Clamp(d.End, t.Start, t.End) })
        .GroupBy(d => d.Speaker)
        .Select(g => new { Speaker = g.Key, Total = g.Sum(x => x.End - x.Start) })
        .OrderByDescending(s => s.Total)
        .FirstOrDefault();
}
```
my algo attempt
Pan!cKk
Pan!cKkOP2w ago
Hey, thanks for your help. I tried your algo and it is quite fast, but the method I provided seems to be more accurate when there are abrupt speaker changes and multiple overlaps. Here is your code with more data: https://pastebin.com/H1ehmNHQ For example, in the case shown in these screenshots, it wrongly selects Speaker_02 when Speaker_01 is actually speaking. Is it even possible to handle such cases reliably without a machine-learning approach? Do you think your algo could be improved to be more accurate?
Pan!cKk
Pan!cKkOP2w ago
Regarding the "limited data": that's how my app works, it processes a media stream in 1-minute chunks.
Sehra
Sehra2w ago
don't really have many ideas on this. LLMs are good at language, so maybe one could be used to untangle the mixed lines and assign them to different speakers. not sure how simpler heuristics would fare, like making it less likely to pick the same speaker as was used for the previous segment
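A rough sketch of what that last heuristic could look like, building on the per-speaker overlap totals from the earlier snippet. Everything here is an assumption for illustration: the `Turn` and `TextSegment` records, the `PenalizedAssigner` class, and especially the `repeatPenalty` value of 0.8, which is an arbitrary starting point rather than a tuned number.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input shapes (assumptions, not the real Pyannote/Whisper models).
public record Turn(string Speaker, double Start, double End);
public record TextSegment(double Start, double End);

public static class PenalizedAssigner
{
    // Picks, per segment, the speaker with the largest total overlap, after scaling
    // the previous segment's speaker by repeatPenalty (values below 1 discourage repeats).
    public static List<string?> Assign(
        IReadOnlyList<Turn> turns,
        IReadOnlyList<TextSegment> segments,
        double repeatPenalty = 0.8)
    {
        var result = new List<string?>();
        string? previous = null;

        foreach (var seg in segments)
        {
            var best = turns
                .Select(t => (t.Speaker, Seconds: Math.Max(0, Math.Min(seg.End, t.End) - Math.Max(seg.Start, t.Start))))
                .Where(x => x.Seconds > 0)
                .GroupBy(x => x.Speaker)
                .Select(g => (Speaker: g.Key, Score: g.Sum(x => x.Seconds) * (g.Key == previous ? repeatPenalty : 1.0)))
                .OrderByDescending(x => x.Score)
                .Select(x => x.Speaker)
                .FirstOrDefault();

            result.Add(best);            // null when no turn overlaps the segment
            previous = best ?? previous; // keep the last known speaker across gaps
        }

        return result;
    }
}
```

The penalty only changes the outcome when two speakers have nearly equal overlap, which is the abrupt-change case described above. It will also bias against a speaker who really does talk across several consecutive segments, so it would need checking against real data before relying on it.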
Sehra
Sehra2w ago
found https://github.com/MahmoudAshraf97/whisper-diarization that looks to be the same problem at a quick glance