Assignment Problem
I am working with two third-party services: Pyannote AI for Speaker Diarization and Whisper for Speech-To-Text.
The reason I am using Pyannote is that Whisper does not natively support speaker diarization (as far as I'm informed).
So what I need to do is match the speakers from Pyannote's diarization response with the segments from Whisper's transcription response. Since both responses provide time parameters (start and end), I am thinking of relying on those...
Here are some example responses (real case): https://pastebin.com/JDjTQf9s
Does any of you have any experience with similar "assignment" problems, or does any of you have any suggestion on what's the best way to approach this sort of problem...
looks similar to something i wrote many years ago. you can match between them using a simple query
data.start <= query.end && data.end >= query.start
would get you all overlapping segments.
for this case then taking the segment that overlaps the most is probably easiest
so start with transcriptions, and for each use that to find matching speaker segments, then pick the one that overlaps the most
I tried that approach, but it is not quite accurate... It really varies: when speakers overlap often and there are abrupt speaker changes, it fails to choose the correct one...
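For reference, a minimal sketch of the max-overlap assignment described above. The segment format (dicts with `start`/`end` in seconds, plus a `speaker` label on diarization segments) is an assumption, not the exact Pyannote/Whisper response shape:

```python
# Max-overlap assignment: for each Whisper transcript segment, pick the
# Pyannote speaker segment with the greatest temporal overlap.
# NOTE: the dict keys ("start", "end", "speaker", "text") are assumed here,
# not taken from the actual API responses.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two [start, end] intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript_segments, speaker_segments):
    """Attach a speaker label to each transcript segment by maximum overlap."""
    result = []
    for seg in transcript_segments:
        # data.start <= query.end and data.end >= query.start -> overlapping candidates
        candidates = [
            s for s in speaker_segments
            if s["start"] <= seg["end"] and s["end"] >= seg["start"]
        ]
        best = max(
            candidates,
            key=lambda s: overlap(seg["start"], seg["end"], s["start"], s["end"]),
            default=None,
        )
        result.append({**seg, "speaker": best["speaker"] if best else None})
    return result
```

Usage would look something like `assign_speakers(whisper_result["segments"], diarization_turns)`, where both lists have been normalized into the dict shape above first.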
This method is the best I've got so far...
but it is still not accurate enough
gave it a try, hard to do much with so limited data

my algo attempt
Hey, thanks for your help. I tried your algo and it is quite fast, but the method I provided seems to be more accurate when there are abrupt speaker changes and multiple overlaps.
Here you can find your code but with more data: https://pastebin.com/H1ehmNHQ
For example, in the case shown in these screenshots, it mis-selects Speaker_02 when Speaker_01 is actually speaking...
Is it even possible to handle such use-cases accurately without any machine-learning approach? Do you think your algo could be improved?



regarding the "limited data": that's how my app works, it processes a media stream in 1-minute chunks...
don't really have many ideas on this. LLMs are good at language, so maybe one could be used to untangle the mixed lines and assign them to different speakers
not sure how simpler heuristics would fare. like making it less likely to pick the same speaker as was used for the previous segment
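One way to sketch that heuristic: keep the max-overlap scoring, but subtract a small penalty when a candidate matches the previous segment's speaker. The segment format (dicts with `start`/`end`/`speaker`) and the `penalty` value are assumptions for illustration, not a tuned implementation:

```python
# Hypothetical variant of max-overlap assignment that slightly penalizes
# re-picking the previous segment's speaker, so abrupt speaker changes
# are easier to detect. `penalty` (in seconds of overlap) is an arbitrary
# knob, not a derived value.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two [start, end] intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_with_change_bias(transcript_segments, speaker_segments, penalty=0.2):
    prev_speaker = None
    result = []
    for seg in transcript_segments:
        best_speaker, best_score = None, float("-inf")
        for s in speaker_segments:
            score = overlap(seg["start"], seg["end"], s["start"], s["end"])
            if score <= 0:
                continue  # no temporal overlap at all
            if s["speaker"] == prev_speaker:
                score -= penalty  # discourage sticking with the same speaker
            if score > best_score:
                best_speaker, best_score = s["speaker"], score
        result.append({**seg, "speaker": best_speaker})
        prev_speaker = best_speaker or prev_speaker
    return result
```

Whether this helps or hurts depends heavily on how often real consecutive segments do share a speaker, so the penalty would need validating against labeled examples.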
found https://github.com/MahmoudAshraf97/whisper-diarization that looks to be the same problem at a quick glance