RunPod • 5d ago
Weej

Avoiding hallucinations/repetitions when using the faster whisper worker ?

worker: https://github.com/runpod-workers/worker-faster_whisper

Hi everyone, as the title suggests, I'm running into an issue where the transcription occasionally repeats the same word or sentence, and once that happens it ruins the rest of the transcription from that point on. 90% of my use case is long audio recordings, ranging from 40 to 120 minutes. From what I've read this seems to be a fairly common Whisper issue, but I haven't found a consistent solution. Things I've tried so far:
- using large-v2 instead of large-v3
- enabling VAD

Other than that I haven't adjusted any parameters. Any help would be greatly appreciated! 🙏🙏🙏 :poddy:
nerdylive • 4d ago
I don't have a lot of experience with this, but you could try tweaking some of the Whisper settings. One thing worth trying: instead of transcribing the entire 40-120 minute file at once, split it into smaller chunks (e.g. 5-10 minute segments), transcribe each chunk individually, and concatenate the results. This can keep the model from getting lost in long files, though you may have to manually clean up the seams between chunks. Here's a rough sketch of the chunking approach:
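A minimal local sketch using the faster-whisper library directly (which this worker wraps, so the same idea applies to chunks you send to the endpoint). File names, model size, and the 10-minute chunk length are assumptions to adapt, not recommendations; pydub needs ffmpeg installed:

```python
# Rough sketch: chunked transcription with faster-whisper.
from faster_whisper import WhisperModel
from pydub import AudioSegment  # requires ffmpeg on the system

CHUNK_MS = 10 * 60 * 1000  # 10-minute chunks (assumption; tune for your audio)

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
audio = AudioSegment.from_file("long_recording.mp3")  # hypothetical input

texts = []
for start in range(0, len(audio), CHUNK_MS):  # len() is in milliseconds
    chunk = audio[start:start + CHUNK_MS]
    chunk.export("chunk.wav", format="wav")
    segments, _info = model.transcribe("chunk.wav", vad_filter=True)
    texts.append(" ".join(seg.text.strip() for seg in segments))

# Seams between chunks may need manual cleanup (duplicated or cut-off words).
print("\n\n".join(texts))
```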
Also check these transcribe() parameters:

temperature: Controls the randomness of the generated text. A lower temperature (e.g. 0.0 or 0.2) makes the output more deterministic. faster-whisper's default is a fallback schedule from 0.0 up to 1.0: it retries a segment at higher temperatures when decoding fails its quality checks. Pinning this too low can also cause issues, since it removes that fallback.

beam_size: Affects the search process during decoding. A larger beam_size (e.g. 5, the default) uses more memory but can sometimes improve accuracy.

patience: A beam-search parameter that can help with repetition. Values greater than 1 let the search keep exploring candidate sequences longer before committing to one. Experiment with values like 1.5 or 2.

best_of: When sampling with a non-zero temperature, this generates N candidate transcriptions (where N is the value of best_of) and returns the one with the highest average log probability.
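Putting those together, a hedged example of what the call might look like with the faster-whisper library directly. The values are starting points to experiment with, not known-good settings, and the input file name is a placeholder; the worker's request input exposes a similar set of options, but check the repo's schema:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "recording.mp3",              # hypothetical input file
    beam_size=5,                  # wider beam search: more memory, sometimes more accurate
    patience=1.5,                 # >1 lets beam search explore candidates longer
    best_of=5,                    # candidates per non-zero sampling temperature
    temperature=[0.0, 0.2, 0.4],  # fallback schedule: retry hotter if decoding fails
    vad_filter=True,              # skip long silences, a common trigger for loops
    condition_on_previous_text=False,  # don't feed prior text back in; a widely
                                       # suggested knob for runaway repetition
)

for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```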
