#Question on nova2-general vs. nova2-phonecall or -conversationalai

sand scroll
#

Hi,
Reading this doc, I see there are different nova2 models with specific use cases.

From an older post, it seems that the "primary difference is the data that was used to train the model," and that "[I] would need to try audio to determine which is better suited."

It makes sense that different audio was used (presumably with different audio quality, etc.), but does that mean the models were trained on comparable amounts of data?

For context, I currently use the -general model and am trying to find out whether other models (likely -phonecall or -conversationalai) would perform better, since our application is conversational AI over phone calls.
What I'm worried about is whether -general would still perform better overall because of the amount of data used to train each model.

Any tips or guidance would be much appreciated 🙏

A side question: do you have any guidance on measuring the quality of transcriptions?
I know Deepgram returns a confidence value, but I wonder whether there is a recommended way to measure transcription quality.

Thanks!

Deepgram Docs

Model options allows you to supply a model to use to process submitted audio.

night orchid BOT
#

Thanks for asking your question. Please be sure to reply with as much detail as possible so we can assist you efficiently.

#

It looks like we're missing some important information to help debug your issue. Would you mind providing us with the following details in a reply?

  • The programming language you are working in (e.g. JavaScript, Python).
  • The Deepgram product you are using (e.g. Speech to Text, Agent API).
  • A request ID that triggered your error or issue.

alpine nimbus BOT
#

Hi,

You're correct that the primary difference between the models is the data that was used to train them. Each model is optimized for a specific use case and the type of audio it's expected to handle. For example, the phonecall model is optimized for low-bandwidth phone-call audio, while the conversationalai model is optimized for use cases where a human is talking to an automated bot, such as IVR, a voice assistant, or an automated kiosk.

While the general model is designed to handle a wide range of audio types, it might not always be the best choice for specific use cases. If your application involves conversation AI over phone calls, testing with the phonecall or conversationalai models could potentially yield better results.

As for the amount of data used to train each model, the documentation doesn't provide specific details. However, it's important to note that the effectiveness of a model isn't solely determined by the amount of training data, but also by the relevance and quality of that data. A model trained on a large amount of irrelevant data might not perform as well as a model trained on a smaller amount of highly relevant data.

Regarding measuring the quality of transcriptions, the confidence value returned by Deepgram is a good starting point. It represents the model's confidence in the accuracy of the transcription. However, for a more comprehensive evaluation, you might want to consider additional metrics such as Word Error Rate (WER), which measures the number of errors (substitutions, deletions, and insertions) divided by the number of words in the reference transcription.
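To make the WER definition concrete, here is a minimal sketch in plain Python. It is not a Deepgram feature; the function name `word_error_rate` is made up for illustration, and it assumes both transcripts are plain strings that can be split on whitespace and compared case-insensitively.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (word-level Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One inserted word against a two-word reference gives a WER of 0.5.
print(word_error_rate("hello world", "hello there world"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, because the denominator is always the reference word count.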

I hope this helps! Let me know if you have any other questions.

sand scroll
#

Thanks for the prompt and detailed answer!
I didn't know about Word Error Rate, but after a quick search it seems like something I would have to implement on our side with a set of test data?
E.g. have test audio transcribed some other way (i.e. manually), compare that with the transcriptions returned by Deepgram, and calculate WER from the two?

#

In other words, I don't think there is a good way to measure transcription quality for live-streamed audio?

#

(unless I process the audio separately to measure WER)

alpine nimbusBOT
#

Correct, that would be something you'd calculate yourself either through manual transcription or by comparing with a service you know works with your specific audio.
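As a rough sketch of how such a self-calculated comparison could look once you have reference transcripts for a small test set: the code below pools errors across all files and ranks each model by overall WER. Everything here is an assumption for illustration. The helper names (`normalize`, `corpus_wer`, `rank_models`) are hypothetical, the dictionary keys stand in for whichever model options you test, and the transcripts are expected to be collected by you beforehand.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase and drop punctuation so formatting differences are not counted as errors."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance using a single rolling row."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            # diag is the previous row's value at j - 1; row[j] is still the previous row's value at j.
            diag, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, diag + (r != h))
    return row[-1]


def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Total errors divided by total reference words across the whole test set."""
    if len(references) != len(hypotheses):
        raise ValueError("need exactly one hypothesis per reference")
    errors = words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref = normalize(reference)
        errors += edit_distance(ref, normalize(hypothesis))
        words += len(ref)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


def rank_models(references: list[str], hypotheses_by_model: dict[str, list[str]]):
    """Return (model, WER) pairs sorted from lowest to highest WER."""
    scores = {m: corpus_wer(references, h) for m, h in hypotheses_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

Pooling errors over the whole set, rather than averaging per-file WER, keeps very short utterances from dominating the score.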

Just reading your post, I'd love to double-check your use case. Are you transcribing existing phone calls between a user and an AI voice agent?

sand scroll
#

No, our use case is live transcription via streaming.

#

So I don't think calculating WER would be straightforward for our live-call use case, unless we manually transcribe the calls separately.

Or do you have any suggestions for calculating WER on live-streamed audio?
I really like the idea of WER, but I'm not sure of the best way to get the total number of words in the reference transcript.