I wonder how OpenAI scraped subtitles from amara.org to compose the datasets for the Whisper AI transcription & translation models.
Here are a few filler lines that the AI resorts to writing when a piece of speech is unintelligible:
"Thanks for watching the video, see you next time!"
"Subscribe, give it a like, and turn on notifications!"
"Subtitles created by the amara.org community"
"Subtitles by [insert real person's name here]"
More importantly, why didn't OpenAI wipe the attributions of amara.org community authors from their datasets before training their Whisper models?