#How to specify output codec with Voice Agent?
Hi @short egret. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
You can control the Voice Agent’s output audio format via the audio.output section of the Settings message. This is where you effectively choose the “codec” (encoding), plus related media settings.
From the docs, the supported output fields are: encoding, sample_rate, bitrate (for compressed formats), and container.[Settings][Media I/O]
Example Settings payload using MP3 output:
{
"type": "Settings",
"audio": {
"input": {
"encoding": "linear16",
"sample_rate": 24000
},
"output": {
"encoding": "mp3",
"sample_rate": 24000,
"bitrate": 48000,
"container": "none"
}
},
"agent": {
"language": "en",
"listen": {
"provider": {
"type": "deepgram",
"model": "nova-3"
}
},
"think": {
"provider": {
"type": "open_ai",
"model": "gpt-4o-mini"
}
},
"speak": {
"provider": {
"type": "deepgram",
"model": "aura-2-thalia-en"
}
}
}
}
This pattern (setting audio.output.encoding and friends) is shown in the Voice Agent configuration and full examples.[Voice Agent settings][Full example]
So yes, specifying the output codec is supported via audio.output.encoding in the Settings message.
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
@short egret Kapa.ai is still learning and improving, please let me know how I did by reacting below.
@peak gate W (34925) VoiceAgent: Warning: {"type":"Warning","description":"Error making request to speak provider. Status code: 400 Bad Request. Body: {"err_code":"UNSUPPORTED_AUDIO_FORMAT","err_msg":"Unsupported audio format: container is not applicable when encoding=mp3.","request_id":"81fdd51c-5a5d-4c1a-855f-9e9edca162e3"}","code":"SPEAK_REQUEST_FAILED"}
Let me check the knowledge base for you again. One moment! :books:
The error is coming from your TTS output format: you’re requesting encoding=mp3 and specifying a container, which is not allowed.
From the TTS media output rules:
- For mp3, container is “Not Applicable”: you cannot set container at all.[Media output]
- The error text matches the documented case: “container is not applicable when encoding=[encoding]”.[TTS errors]
In a Voice Agent Settings message, fix your audio.output to remove container when using mp3:
{
"type": "Settings",
"audio": {
"input": {
"encoding": "linear16",
"sample_rate": 24000
},
"output": {
"encoding": "mp3",
"sample_rate": 24000,
"bitrate": 48000
// no "container" field here for mp3
}
},
"agent": {
"language": "en",
"listen": {
"provider": {
"type": "deepgram",
"model": "nova-3"
}
},
"think": {
"provider": {
"type": "open_ai",
"model": "gpt-4o-mini"
}
},
"speak": {
"provider": {
"type": "deepgram",
"model": "aura-2-thalia-en"
}
}
}
}
If you want to control the container (e.g., wav vs none), use an encoding that supports containers like linear16, mulaw, or alaw instead of mp3.[Audio formats]
ya bro, i specified none as a container
@peak gate i completely commented out the container, it is not appended as JSON and i still get the error
Let me check the knowledge base for you again. One moment! :books:
You’re running into a TTS format error, but the knowledge sources only show the generic cause: it’s triggered whenever the query parameters describe an unsupported combination of encoding, container, sample_rate, or bit_rate (e.g., “container is not applicable when encoding=mp3”).[TTS errors][Media output]
However, they do not explain why you’d still see that error if you’ve removed container from your Settings JSON. That means I can’t definitively say what’s wrong in your specific setup.
What you can double‑check against the docs:
- Voice Agent audio.output must match TTS rules
  For mp3:
  - encoding: "mp3" is allowed.
  - container must be omitted entirely (not "none").[Media output]
  - sample_rate is not configurable for mp3 (it’s fixed to 22050 Hz), so including a sample_rate may also cause an “unsupported audio format” error.[Media output]
  - bitrate must be one of the allowed values (32000 or 48000).[Media output]
  So the safest audio.output for MP3, per docs, is:
"audio": {
"input": {
"encoding": "linear16",
"sample_rate": 24000
},
"output": {
"encoding": "mp3",
"bitrate": 48000
// no container
// no sample_rate
}
}
- Confirm what is actually sent on the wire
The error body you pasted is exactly the documented error for an invalid combination of TTS query parameters.[TTS errors]
Since the docs don’t cover client‑side behavior, I can’t see whether your SDK or client might still be injecting container or sample_rate even if you commented them out in your code.
Given the limits of the docs, the only concrete, doc‑backed guidance I can give is:
- For encoding=mp3, send only encoding and a valid bitrate, and omit both container and sample_rate in audio.output.[Media output]
- If you still get the same error, it likely means something in your stack is still adding container or an invalid sample_rate/bit_rate, but the knowledge sources don’t provide enough detail to diagnose that path.
If you can share the exact JSON Settings payload as it’s actually sent over the WebSocket (redacting keys), I can help you compare it line‑by‑line against the documented valid combinations.
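One way to confirm what actually goes over the WebSocket is to scan the serialized payload string just before it is sent. The sketch below is a debugging aid only: it is a plain substring scan rather than a JSON parser, it assumes the output object contains no nested objects, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns a pointer just past the first "output" key in the serialized
 * Settings payload, or NULL if the key is absent. Hypothetical helper. */
static const char *find_output_object(const char *payload) {
    const char *p = strstr(payload, "\"output\"");
    return p ? p + strlen("\"output\"") : NULL;
}

/* True if `key` (given without quotes) appears as a quoted string between
 * the "output" key and the first closing brace after it. */
bool output_has_key(const char *payload, const char *key) {
    const char *start = find_output_object(payload);
    if (!start) return false;

    /* Assumption: audio.output holds only scalar fields, so the first '}'
     * after "output" closes the output object. */
    const char *end = strchr(start, '}');
    if (!end) end = start + strlen(start);

    /* Build the quoted form of the key, e.g. "container" with quotes. */
    char quoted[64];
    size_t klen = strlen(key);
    if (klen + 3 > sizeof quoted) return false;
    quoted[0] = '"';
    memcpy(quoted + 1, key, klen);
    quoted[klen + 1] = '"';
    quoted[klen + 2] = '\0';

    const char *hit = strstr(start, quoted);
    return hit != NULL && hit < end;
}
```

If the payload is built with cJSON, the string returned by cJSON_PrintUnformatted is what would be passed in; logging that same string also gives the exact JSON to compare against the documented combinations.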
@peak gate this is my payload:
static void sendSettingsMessage() {
  Serial.println("[VoiceAgent] Sending Settings message");
  cJSON* root = cJSON_CreateObject();
  cJSON_AddStringToObject(root, "type", "Settings");
  // Audio input/output
  cJSON* audio = cJSON_AddObjectToObject(root, "audio");
  cJSON* input = cJSON_AddObjectToObject(audio, "input");
  cJSON_AddStringToObject(input, "encoding", VA_INPUT_ENCODING);
  cJSON_AddNumberToObject(input, "sample_rate", VA_INPUT_SAMPLE_RATE);
  cJSON* output = cJSON_AddObjectToObject(audio, "output");
  cJSON_AddStringToObject(output, "encoding", VA_OUTPUT_ENCODING);
#ifdef VA_OUTPUT_BITRATE
  cJSON_AddNumberToObject(output, "bitrate", VA_OUTPUT_BITRATE);
#endif
  // #if VA_OUTPUT_MP3 == 0
  // cJSON_AddStringToObject(output, "container", "none");
  // cJSON_AddNumberToObject(output, "sample_rate", VA_OUTPUT_SAMPLE_RATE);
  // #endif
[...]
Let me check the knowledge base for you again. One moment! :books:
You’re hitting a documented TTS validation rule: MP3 does not support a container or a configurable sample_rate, and only specific bitrates are allowed.[Media output][TTS errors]
From your snippet, the generated audio.output looks like:
"output": {
"encoding": "<VA_OUTPUT_ENCODING>",
"bitrate": <VA_OUTPUT_BITRATE?>
// no container
// no sample_rate
}
To be valid for MP3 with Voice Agent, the JSON that actually goes over the WebSocket must satisfy:
- audio.output.encoding = "mp3"
- audio.output.bitrate = 32000 or 48000 (bits per second)
- No audio.output.container field at all
- No audio.output.sample_rate field at all (MP3 is fixed to 22050 Hz and not configurable)[Media output]
So the target payload should be:
"audio": {
"input": {
"encoding": "<VA_INPUT_ENCODING>",
"sample_rate": <VA_INPUT_SAMPLE_RATE>
},
"output": {
"encoding": "mp3",
"bitrate": 48000
}
}
Given that you still see:
Unsupported audio format: container is not applicable when encoding=mp3
the most likely explanations, based on the docs, are:
- Something in your code path still adds container or sample_rate before sending (e.g., another layer building or mutating the JSON).
- VA_OUTPUT_ENCODING is not actually "mp3" at runtime, and you’re hitting a different invalid combination that surfaces with that generic error message. The error text is exactly one of the documented messages for “unsupported audio format” when any of encoding, container, sample_rate, or bit_rate are incompatible.[TTS errors]
The knowledge sources don’t expose Voice Agent’s internal mapping from audio.output to the underlying TTS query, so I can’t see more than this: ensure that the final JSON on the wire has only encoding: "mp3" and a valid bitrate, with no container or sample_rate fields. If that still produces the same error, it’s behavior not covered in the docs and would need Deepgram support to inspect the request_id you’re seeing.
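The MP3 rules listed above can also be checked in firmware before the Settings message is sent. This is a minimal sketch under stated assumptions: the OutputSettings struct and check_mp3_output are hypothetical names, not part of any Deepgram SDK, and the check only encodes the rules quoted in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of audio.output as it appears on the wire.
 * A false has_* flag means the field is absent from the JSON. */
typedef struct {
    const char *encoding;
    bool has_container;
    bool has_sample_rate;
    bool has_bitrate;
    long bitrate;          /* bits per second; read only if has_bitrate */
} OutputSettings;

/* Returns NULL when the MP3 rules quoted in this thread hold, otherwise a
 * short description of the first rule that is violated. */
const char *check_mp3_output(const OutputSettings *o) {
    if (o->encoding == NULL || strcmp(o->encoding, "mp3") != 0)
        return "encoding is not \"mp3\"";
    if (o->has_container)
        return "container must be omitted for mp3";
    if (o->has_sample_rate)
        return "sample_rate must be omitted for mp3";
    /* Assumption: an absent bitrate is tolerated; the thread only says
     * that a bitrate, when sent, must be 32000 or 48000. */
    if (o->has_bitrate && o->bitrate != 32000 && o->bitrate != 48000)
        return "bitrate must be 32000 or 48000";
    return NULL;
}
```

Passing this check does not establish that the Voice Agent accepts the payload; it only rules out the documented invalid combinations on the client side.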
@peak gate give me the output settings for opus
Let me check the knowledge base for you again. One moment! :books:
Hey there, it looks like you've been talking with Kapa for a bit. If you need additional help, you should consider moving to #1115960287183511643 where our team can answer your query.
For Voice Agent, you configure Opus output via the audio.output section of the Settings message, and it must follow the same TTS media rules as Deepgram’s TTS API.
From the TTS media output docs, valid Opus settings are:[Media output settings]
- encoding: "opus"
- container: "ogg" (default and only supported for Opus)
- sample_rate: fixed to 48000 Hz (not configurable)
- bitrate: default 12000, allowed range >= 4000 and <= 650000 (bps)
So a Voice Agent Settings snippet for Opus output should look like:
{
"type": "Settings",
"audio": {
"input": {
"encoding": "linear16",
"sample_rate": 24000
},
"output": {
"encoding": "opus",
"container": "ogg",
"bitrate": 12000
// sample_rate is implied as 48000 for opus
}
},
"agent": {
"language": "en",
"listen": {
"provider": {
"type": "deepgram",
"model": "nova-3"
}
},
"think": {
"provider": {
"type": "open_ai",
"model": "gpt-4o-mini"
}
},
"speak": {
"provider": {
"type": "deepgram",
"model": "aura-2-thalia-en"
}
}
}
}
The general pattern (using audio.output.encoding, bitrate, and container) is documented in the Voice Agent settings and full configuration examples.[Voice Agent settings][Configure Voice Agent]
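The Opus constraints quoted above reduce to a small range check. The sketch below assumes a hypothetical helper named opus_output_ok and treats a NULL container as "field omitted, rely on the ogg default"; it checks only the TTS media rules listed here and says nothing about whether the Voice Agent accepts Opus.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when the arguments match the Opus rules quoted above:
 * container "ogg" (or omitted), no explicit sample_rate, and a bitrate
 * between 4000 and 650000 bits per second inclusive. */
bool opus_output_ok(const char *container, bool has_sample_rate, long bitrate) {
    /* Assumption: NULL means the container field is left out of the JSON. */
    if (container != NULL && strcmp(container, "ogg") != 0)
        return false;
    /* sample_rate is fixed at 48000 Hz for Opus, so it should not be sent. */
    if (has_sample_rate)
        return false;
    return bitrate >= 4000 && bitrate <= 650000;
}
```

For the payload shown above, opus_output_ok("ogg", false, 12000) evaluates to true.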
@peak gate i can 100% confirm that opus/mp3 codec is not supported by the voice agent api, at least it is not enabled for now
Let me check the knowledge base for you again. One moment! :books:
You’re correct to flag this, and there’s an important nuance:
- The underlying TTS service clearly supports mp3 and opus encodings, with the documented combinations (e.g., encoding=mp3 with no container; encoding=opus with container=ogg).[TTS encoding][TTS media formats]
- The Voice Agent docs only show examples using linear16 (and mention that audio.output.encoding/bitrate/container exist), but they do not explicitly confirm that all TTS encodings (like mp3/opus) are currently supported end‑to‑end in the Voice Agent pipeline.[Voice Agent settings][Configure Voice Agent]
Given that:
- You’re seeing UNSUPPORTED_AUDIO_FORMAT errors when trying valid mp3/opus combinations.
- The Voice Agent documentation does not explicitly state that mp3 or opus are supported output encodings for the agent, only that you can configure audio.output in general.
I have to acknowledge that, based on the docs alone, I cannot confirm that mp3 or opus are currently enabled/supported as Voice Agent output codecs, even though they are supported by the standalone TTS API. The behavior you’re observing is consistent with those encodings not being wired through (or not yet enabled) for Voice Agent.
So, from the available knowledge sources:
- Supported in TTS API: yes, mp3 and opus are documented.
- Explicitly supported in Voice Agent: not clearly documented; your runtime errors strongly suggest they are not available in Voice Agent right now.
If you need a definitive answer or enablement, the next step would be to contact Deepgram support with a request_id from one of those failing Voice Agent calls, since the public docs don’t go further than this.
@peak gate i will send a support request but i think i might switch to anthropic or openai
Let me check the knowledge base for you again. One moment! :books:
Based on the docs, the Voice Agent API is designed to work with multiple LLM providers, including OpenAI and Anthropic, via the agent.think.provider settings.[Agent config]
You don’t have to switch away from Deepgram entirely to use them—you can:
- Keep Deepgram for STT + TTS (where it’s working for you), and
- Plug in OpenAI or Anthropic just for the “think” step.
Example (OpenAI as the LLM):
"agent": {
"listen": {
"provider": {
"type": "deepgram",
"model": "nova-3"
}
},
"think": {
"provider": {
"type": "open_ai",
"model": "gpt-4o-mini",
"temperature": 0.7
}
},
"speak": {
"provider": {
"type": "deepgram",
"model": "aura-2-thalia-en"
}
}
}
Anthropic and other third‑party LLMs are supported via agent.think.provider.type plus a custom endpoint when required.[Agent config]
So if your main blocker is codec support in Voice Agent, you can still file the support ticket for that, and in parallel switch the LLM piece to OpenAI/Anthropic without rebuilding your whole stack.
@peak gate can i change the output sample rate or codec to opus for low-latency streaming?
Let me check the knowledge base for you again. One moment! :books:
Short answer: with the current Voice Agent docs, you should assume no for Opus, and limited control over output format.
What is documented:
- Voice Agent lets you configure output media via audio.output.encoding, audio.output.sample_rate, audio.output.bitrate, and audio.output.container in the Settings message.[Voice Agent settings][Media I/O]
- The examples and guidance for Voice Agent only show linear16 output (e.g., encoding: "linear16", sample_rate: 24000, container: "none").[Voice Agent settings][Full example]
For Text-to-Speech in general (outside Voice Agent):
- Deepgram TTS supports multiple encodings including linear16, mulaw, alaw, mp3, opus, flac, aac, with specific allowed combinations of encoding/container/sample rate/bitrate.[TTS encoding][TTS formats]
- For opus specifically, TTS uses encoding=opus, container=ogg, fixed sample_rate=48000, and configurable bitrate.[TTS formats]
However, the Voice Agent docs do not explicitly state that all of these TTS encodings (including opus) are available as audio.output.encoding options for the Voice Agent, and they only demonstrate linear16.[Voice Agent settings][Configure Voice Agent]
Given that:
- The knowledge sources don’t confirm Opus as a supported Voice Agent output codec.
- You’ve already observed UNSUPPORTED_AUDIO_FORMAT errors when trying mp3/opus with Voice Agent.
The safest, doc-backed guidance is:
- You can change output sample rate and encoding when using formats like linear16 (e.g., 8k/16k/24k/32k/48k for linear16).[TTS sample rate][TTS formats]
- The knowledge sources do not confirm that opus is currently supported as a Voice Agent output codec, so you should not rely on it for low-latency streaming with Voice Agent based on the available docs.
If you specifically need Opus for low-latency streaming, the docs don’t provide enough information to say it’s supported in Voice Agent today; you’d need to confirm with Deepgram support.
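If linear16 ends up being the practical choice, the sample rate is the main lever on stream bandwidth. The sketch below works through that arithmetic; the function names are hypothetical, the rate list is the one quoted in this thread, and mono output is an assumption rather than something the thread states.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sample rates this thread lists for linear16 output. */
static const long LINEAR16_RATES[] = { 8000, 16000, 24000, 32000, 48000 };

/* True if hz is one of the linear16 rates listed above. */
bool linear16_rate_allowed(long hz) {
    size_t n = sizeof LINEAR16_RATES / sizeof LINEAR16_RATES[0];
    for (size_t i = 0; i < n; i++) {
        if (LINEAR16_RATES[i] == hz) return true;
    }
    return false;
}

/* linear16 is 16-bit PCM, i.e. 2 bytes per sample per channel.
 * Assumption: pass channels = 1 for mono output. */
long linear16_bytes_per_second(long hz, int channels) {
    return hz * 2L * channels;
}
```

Under these assumptions, 24000 Hz mono works out to 48000 bytes per second (384000 bits per second), and dropping to 16000 Hz reduces that to 32000 bytes per second.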
@peak gate what is the average word/token per second?
Let me check the knowledge base for you again. One moment! :books:
The knowledge sources don’t provide any metric expressed as “words per second” or “tokens per second” for Voice Agent or TTS/STT.
What is documented:
- For TTS, there is a character throughput limit of 2400 characters per minute on the streaming TTS WebSocket, measured as characters sent to the socket.[TTS limits]
- For self‑hosted TTS benchmarking, throughput is measured as characters per second (input characters whose audio is generated per second), but no concrete numbers are given—only that the scripts report “X characters/second” at different concurrency levels.[TTS k6 benchmark]
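The documented limit can at least be converted into a per-second figure. The sketch below does that arithmetic with hypothetical function names. The characters-per-word value is the caller's own assumption, not a documented number, and the result is a ceiling on text sent to the socket, not a measured speaking or generation rate.

```c
/* Converts a per-minute character limit to characters per second. */
double chars_per_second(double chars_per_minute) {
    return chars_per_minute / 60.0;
}

/* Rough words-per-second figure derived from a character rate.
 * chars_per_word is an assumption supplied by the caller (for example,
 * an average word length plus one space). */
double approx_words_per_second(double cps, double chars_per_word) {
    return cps / chars_per_word;
}
```

With the 2400 characters per minute limit, chars_per_second(2400.0) gives 40 characters per second; assuming about 6 characters per word, that is roughly 6.7 words per second of input text.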