#behind the phone model
Hi @glacial forge. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
Based on the information provided in the knowledge sources, the best speech-to-text model for live transcription of phone calls from Twilio using Deepgram is the Nova-2-phonecall model.
According to the official Deepgram discussion (https://github.com/orgs/deepgram/discussions/594), "Nova-2-phonecall is your best bet for phonecall accuracy. Failing that, Nova-2 may work. Depends on the quality of audio."
When integrating Twilio with Deepgram for live transcription, you'll need to consider a few key points:
- Audio format: Twilio streams audio in mulaw format at an 8000 Hz sample rate. You'll need to specify this in your Deepgram configuration.
- Encoding: The audio from Twilio is base64 encoded, so you'll need to decode it before sending it to Deepgram.
- Streaming setup: You'll need to set up a WebSocket connection to receive the audio stream from Twilio and forward it to Deepgram.
Here's a basic example of how you might set up the Deepgram live transcription client for a Twilio phone call stream in Node.js:
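// `transcriber` is assumed to be an already-initialized Deepgram SDK client.
const deepgramLive = transcriber.transcription.live({
  punctuate: true,
  endpointing: true,
  language: "en-GB", // confirm this language is supported by the chosen model
  encoding: "mulaw", // Twilio's audio encoding
  sample_rate: 8000, // Twilio's sample rate in Hz
  model: "nova-2-phonecall"
});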
Remember to handle the base64 decoding of the Twilio audio payload before sending it to Deepgram:
case "media": {
  // Twilio sends the audio base64 encoded; decode it to raw mulaw bytes first.
  const buffer = Buffer.from(msg.media.payload, "base64");
  stream.write(buffer);
  break;
}
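As a slightly fuller sketch of the decoding step, the handler below takes one raw WebSocket message from Twilio and returns the decoded audio bytes, or null for events that carry no audio. The function name extractAudio and the sample payload are made up for illustration; the message shape (an "event" field and a base64 "media.payload") follows the snippet above.

```javascript
// Hypothetical helper: turn one raw Twilio Media Streams WebSocket message
// into decoded audio bytes, or null when the message carries no audio.
function extractAudio(rawMessage) {
  const msg = JSON.parse(rawMessage);
  // Only "media" events carry audio; other events (such as "start") do not.
  if (msg.event !== "media" || !msg.media || typeof msg.media.payload !== "string") {
    return null;
  }
  // The payload is base64 text; decode it to the raw mulaw bytes.
  return Buffer.from(msg.media.payload, "base64");
}

// Usage with a made-up payload (three arbitrary bytes, not real speech).
const sample = JSON.stringify({
  event: "media",
  media: { payload: Buffer.from([0xff, 0x7f, 0x00]).toString("base64") },
});
console.log(extractAudio(sample));
console.log(extractAudio(JSON.stringify({ event: "start" })));
```

Returning null for non-audio events keeps the caller simple: it forwards whatever is non-null to Deepgram and ignores the rest.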
For a more detailed implementation, you can refer to the Deepgram Twilio streaming guide (https://deepgram.com/learn/deepgram-twilio-streaming) which provides examples in both Python and Node.js.
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
@glacial forge Kapa.ai is still learning and improving, please let me know how I did by reacting below.
@dull gull what is endpointing: true? what's the difference of doing that as opposed to choosing a number
Let me check the knowledge base for you again. One moment! :books:
Endpointing is a feature in Deepgram's live transcription that helps detect pauses in speech and finalize transcripts accordingly. Let me explain the difference between using endpointing: true and specifying a number:
- endpointing: true: When you set endpointing: true, you're using Deepgram's default endpointing behavior. According to the Deepgram Endpointing documentation (https://developers.deepgram.com/docs/endpointing), this default setting will return transcripts after detecting 10 milliseconds of silence. It relies on a Voice Activity Detector (VAD) to monitor incoming audio and trigger when a sufficiently long pause is detected.
- Specifying a number: When you specify a number for endpointing (e.g., endpointing: 500), you're customizing the length of silence required to trigger an endpoint. This allows you to fine-tune the endpointing behavior based on your specific use case. For example, setting endpointing: 500 would wait until 500 milliseconds of silence has passed before finalizing and returning transcripts.
The key differences are:
- Flexibility: Specifying a number gives you more control over when endpoints are detected, allowing you to adjust for different speaking patterns or audio environments.
- Sensitivity: The default true setting (10 ms) is more sensitive to short pauses, which might be suitable for quick back-and-forth conversations. A higher number can be better for speakers who take longer pauses between phrases.
It's important to note that, as mentioned in the Deepgram discussion on endpointing (https://github.com/orgs/deepgram/discussions/177), longer endpointing times (e.g., >1000ms) can be problematic, especially with noisy audio like phone calls. In such cases, it's rare to have long periods of complete silence, which can prevent the endpointing from triggering.
For live transcription of phone calls, you might want to experiment with different endpointing values to find the best balance for your specific use case. Alternatively, for longer pauses, you could consider using the utterance_end_ms feature, which works based on detected silence between words in the transcript rather than in the audio signal itself.
Remember, regardless of the endpointing setting, Deepgram will still send back parts of finalized transcripts with is_final: true messages before the endpointing silence threshold is exceeded, to ensure you receive timely results for longer utterances as explained in this Deepgram discussion (https://github.com/orgs/deepgram/discussions/319).
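To make the interplay concrete, here is a minimal sketch of one way to stitch finalized pieces into whole utterances. It assumes each live message is a parsed object with a type field ("Results" or "UtteranceEnd"), is_final and speech_final flags, and the text at channel.alternatives[0].transcript; check those field names against the responses you actually receive. The name createUtteranceCollector is hypothetical.

```javascript
// Hypothetical accumulator: collects is_final pieces and emits one utterance
// when an endpoint (speech_final) or an UtteranceEnd message arrives.
function createUtteranceCollector(onUtterance) {
  let parts = [];

  function flush() {
    if (parts.length > 0) {
      onUtterance(parts.join(" "));
      parts = [];
    }
  }

  return function handleMessage(msg) {
    if (msg.type === "UtteranceEnd") {
      // Fallback for noisy audio where the silence-based endpoint never fires.
      flush();
      return;
    }
    if (msg.type !== "Results") return;
    const transcript = msg.channel?.alternatives?.[0]?.transcript ?? "";
    // Interim results (is_final: false) may still change, so they are skipped.
    if (msg.is_final && transcript.length > 0) parts.push(transcript);
    if (msg.speech_final) flush();
  };
}

// Usage with made-up messages.
const utterances = [];
const handle = createUtteranceCollector((text) => utterances.push(text));
const result = (transcript, is_final, speech_final) => ({
  type: "Results",
  is_final,
  speech_final,
  channel: { alternatives: [{ transcript }] },
});
handle(result("hello", false, false));      // interim, ignored
handle(result("hello there", true, false)); // finalized piece, kept
handle(result("how are you", true, true));  // endpoint reached, flushes
handle({ type: "UtteranceEnd" });           // nothing pending, no-op
console.log(utterances);
```

Flushing on either signal means a noisy call that never triggers endpointing still produces complete utterances once UtteranceEnd arrives.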
@dull gull so is this how you would have it:
encoding: 'mulaw',
sample_rate: '8000',
model: 'nova-2-phonecall',
smart_format: true,
interim_results: true,
//endpointing: 200,
endpointing: true,
utterance_end_ms: 1000,
vad_events: true,
Let me check the knowledge base for you again. One moment! :books:
Based on the information from the knowledge sources and your specific use case for transcribing Twilio phone calls, your configuration looks good with a few minor adjustments. Here's an analysis of your settings:
{
  encoding: 'mulaw',
  sample_rate: 8000,
  model: 'nova-2-phonecall',
  smart_format: true,
  interim_results: true,
  endpointing: true,
  utterance_end_ms: '1000',
  vad_events: true
}
Let's break it down:
- encoding and sample_rate: These are correct for Twilio audio streams, as mentioned in the Deepgram discussion (https://github.com/orgs/deepgram/discussions/127).
- model: 'nova-2-phonecall' is the recommended model for phone calls according to Deepgram's official response (https://github.com/orgs/deepgram/discussions/594).
- smart_format: This is fine to use; it helps improve readability of the transcript.
- interim_results: This is necessary, especially when using utterance_end_ms, as stated in the Deepgram documentation (https://developers.deepgram.com/docs/understanding-end-of-speech-detection#using-utteranceend).
- endpointing: Using true is fine; it will use the default 10 ms silence detection.
- utterance_end_ms: This is a good addition, especially for phone calls where background noise might interfere with regular endpointing. However, make sure to pass it as a string: '1000'.
- vad_events: This is fine to include if you want to receive Voice Activity Detection events.
A couple of notes:
- For sample_rate, you should pass it as a number (8000) rather than a string ('8000').
- You might want to consider adding punctuate: true for better readability.
Here's the slightly adjusted configuration:
{
  encoding: 'mulaw',
  sample_rate: 8000,
  model: 'nova-2-phonecall',
  smart_format: true,
  interim_results: true,
  endpointing: true,
  utterance_end_ms: '1000',
  vad_events: true,
  punctuate: true
}
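If you connect to the streaming endpoint directly instead of going through the SDK, these options travel as query parameters on the WebSocket URL. The sketch below shows how such a URL might be assembled; buildListenUrl is a hypothetical helper, and the option values are just the ones discussed in this thread. At the URL level a number and its string form serialize identically, but whether a given SDK version accepts both types is an assumption to check against that SDK's own type definitions.

```javascript
// Hypothetical helper: serialize live options into a streaming URL.
// Every value ends up as text in the query string.
function buildListenUrl(options) {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(options)) {
    params.set(key, String(value));
  }
  return `wss://api.deepgram.com/v1/listen?${params.toString()}`;
}

const url = buildListenUrl({
  encoding: "mulaw",
  sample_rate: 8000,
  model: "nova-2-phonecall",
  smart_format: true,
  interim_results: true,
  endpointing: 200, // a number sets the silence threshold in ms; true keeps the default
  utterance_end_ms: 1000,
  vad_events: true,
});
console.log(url);
```

Building the URL in one place also makes it easy to try several endpointing values while testing against real call audio.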