Curious if anyone working with speech systems has seen this:
When using ASR (Azure), I’m seeing cases where a mispronounced word is “corrected” by the model, e.g. a spoken “tanks” gets recognised as “thanks”.
This makes it hard to evaluate pronunciation because:
- the transcript looks correct
- phoneme signals aren’t reliable enough
Is this just expected behaviour with current ASR models, or are there ways people handle this (especially for detecting contrast errors like /θ/ vs /t/)?
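
To make the failure mode concrete, here is a minimal Python sketch of why a transcript-only check misses it. It makes no Azure calls: the ASR output is a hard-coded hypothetical string, and the `MINIMAL_PAIRS` table and helper names are illustrative assumptions of mine, not part of any SDK.

```python
import re

# Hypothetical minimal pairs for the /θ/ vs /t/ contrast (illustrative, not exhaustive).
MINIMAL_PAIRS = {
    "thanks": "tanks",
    "three": "tree",
    "thin": "tin",
    "thought": "taught",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def transcript_mismatches(reference, hypothesis):
    """Return (index, reference_word, recognised_word) where the words differ.

    Assumes both texts have the same number of words; a real check
    would need a proper alignment.
    """
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    return [(i, r, h) for i, (r, h) in enumerate(zip(ref, hyp)) if r != h]

def at_risk_words(reference):
    """Return reference words whose minimal-pair partner the ASR could silently 'correct'."""
    return [(w, MINIMAL_PAIRS[w]) for w in tokenize(reference) if w in MINIMAL_PAIRS]

reference = "Thanks for the three tickets"
# Hypothetical ASR output: the speaker said "tanks" and "tree",
# but the recogniser returned the expected words.
asr_output = "thanks for the three tickets"

print(transcript_mismatches(reference, asr_output))  # []
print(at_risk_words(reference))  # [('thanks', 'tanks'), ('three', 'tree')]
```

The transcript comparison reports no errors even though two words were mispronounced, so the most it can do is flag which words would need a separate, acoustic-level check.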