How good is the realtime API at understanding english word pronunciation and speaking accent? | OpenAI

pastel pagoda Nov 22, 2024, 7:24 PM

I am trying to evaluate audio samples and use the Realtime API to grade them against some criteria.
I tried it in the playground. It seems to have some understanding of speaking accent and, sometimes, of pronunciation, but at other times it appears to be hallucinating, and the results are not consistent.

Has anybody used it for a similar purpose and can share some tips & tricks, or knows how accurate and consistent it is?

karmic grove BOT Nov 22, 2024, 7:24 PM

Hi @pastel pagoda. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

The Realtime API from OpenAI is designed to produce audio with a text transcript that semantically matches the audio. However, there can be deviations between the text transcript and the voice output, such as minor turns of phrase or skipped verbalizations like blocks of code (Source: OpenAI Docs (https://platform.openai.com/docs/guides/realtime#input-and-output-transcription)).

Regarding English word pronunciation and speaking accent, the documentation covers the output side: the Realtime API offers a range of voices that can express different emotions and tones, which can be steered with prompts. These voices can be configured at the session level, and they respond to instructions on tone, accent, speed, and volume. The newer voices are more expressive and dynamic, which might help in achieving a more natural and energetic output (Source: OpenAI Docs (https://platform.openai.com/docs/guides/realtime#voices)). This describes how the model speaks; it does not state how accurately the model judges a speaker's accent or pronunciation.
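As a rough illustration of session-level configuration, here is a minimal sketch that builds a `session.update` event carrying a grading rubric in the instructions. The field names (`modalities`, `instructions`, `voice`, `temperature`) follow the Realtime API session settings as I understand them and should be checked against the current docs. The helper name `build_session_update`, the rubric text, and the idea that a lower temperature reduces variation between runs are assumptions, not something the documentation confirms for grading.

```python
import json


def build_session_update(rubric: str, voice: str = "alloy", temperature: float = 0.6) -> str:
    """Return a JSON string for a session.update event (hypothetical helper).

    The payload shape is an assumption based on the Realtime API docs;
    verify the field names before relying on it.
    """
    event = {
        "type": "session.update",
        "session": {
            # Ask for both text and audio output; text is what you would parse for a grade.
            "modalities": ["text", "audio"],
            # The grading criteria go into the session instructions.
            "instructions": rubric,
            "voice": voice,
            # Assumption: a lower temperature may make repeated grades vary less.
            "temperature": temperature,
        },
    }
    return json.dumps(event)


if __name__ == "__main__":
    # Example rubric, invented purely for illustration.
    rubric = (
        "You are grading spoken English. Rate pronunciation from 1 to 5 "
        "and reply with the number only."
    )
    print(build_session_update(rubric))
```

The resulting string would be sent over the session's WebSocket connection; that part is left out here because it depends on your client setup.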

However, the API's performance in understanding and grading pronunciation and accents might not be entirely consistent, as you have observed. This could be due to the inherent complexity of accurately capturing and evaluating nuanced aspects of speech, such as accents and pronunciation variations.

If you are experiencing inconsistencies or hallucinations, it might be beneficial to experiment with different prompts and voice configurations to see if they yield better results. Additionally, you might want to consider using a combination of the Realtime API with other tools or methods to cross-verify the accuracy of the evaluations.
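One way to put a number on the inconsistency is to grade the same sample several times and look at the spread of the scores. The sketch below assumes you already have some function that sends one audio sample for grading and returns a numeric score; `grade_once` is a hypothetical placeholder for that call, not part of any OpenAI library.

```python
import statistics
from typing import Callable, Dict, List, Union

Report = Dict[str, Union[float, List[float]]]


def consistency_report(grade_once: Callable[[bytes], float], audio: bytes, runs: int = 5) -> Report:
    """Grade the same audio `runs` times and summarize how much the scores vary.

    `grade_once` is a hypothetical callable wrapping whatever grading
    request you use; it must return a numeric score.
    """
    if runs < 2:
        raise ValueError("need at least 2 runs to measure variation")
    scores = [grade_once(audio) for _ in range(runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # Sample standard deviation: larger means less consistent grading.
        "stdev": statistics.stdev(scores),
        # Difference between the best and worst score across runs.
        "spread": max(scores) - min(scores),
    }


if __name__ == "__main__":
    # Stand-in grader that returns preset scores, for demonstration only.
    fake_scores = iter([3.0, 4.0, 5.0])
    report = consistency_report(lambda audio: next(fake_scores), b"", runs=3)
    print(report)
```

If the spread for one sample is as large as the difference you expect between a good and a poor speaker, the grades are probably too noisy to use on their own, and averaging several runs or cross-checking with another tool becomes more important.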

For more detailed guidance, you can refer to the Realtime API documentation (https://platform.openai.com/docs/guides/realtime#voices) to explore the available options and configurations. If you have specific criteria for grading, you might need to tailor your approach based on the voice and prompt settings that best align with your requirements.

