#How to specify output codec with Voice Agent?

short egret
#

How to specify output codec with Voice Agent? Is it supported?

peak gate BOT
#

Hi @short egret. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

#

You can control the Voice Agent’s output audio format via the audio.output section of the Settings message. This is where you effectively choose the “codec” (encoding), plus related media settings.

From the docs, the supported output fields are: encoding, sample_rate, bitrate (for compressed formats), and container.[Settings][Media I/O]

Example Settings payload using MP3 output:

{
  "type": "Settings",
  "audio": {
    "input": {
      "encoding": "linear16",
      "sample_rate": 24000
    },
    "output": {
      "encoding": "mp3",
      "sample_rate": 24000,
      "bitrate": 48000,
      "container": "none"
    }
  },
  "agent": {
    "language": "en",
    "listen": {
      "provider": {
        "type": "deepgram",
        "model": "nova-3"
      }
    },
    "think": {
      "provider": {
        "type": "open_ai",
        "model": "gpt-4o-mini"
      }
    },
    "speak": {
      "provider": {
        "type": "deepgram",
        "model": "aura-2-thalia-en"
      }
    }
  }
}

This pattern (setting audio.output.encoding and friends) is shown in the Voice Agent configuration and full examples.[Voice Agent settings][Full example]

So yes, specifying the output codec is supported via audio.output.encoding in the Settings message.
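
As a minimal sketch, the Settings message can be assembled programmatically so that only the audio.output block changes between experiments. The helper name build_settings is hypothetical, and the linear16 values passed in below are just a placeholder; swap in whichever output fields you intend to send.

```python
import json


def build_settings(output: dict) -> dict:
    """Assemble a Voice Agent Settings message around the given audio.output block."""
    return {
        "type": "Settings",
        "audio": {
            "input": {"encoding": "linear16", "sample_rate": 24000},
            "output": output,
        },
        "agent": {
            "language": "en",
            "listen": {"provider": {"type": "deepgram", "model": "nova-3"}},
            "think": {"provider": {"type": "open_ai", "model": "gpt-4o-mini"}},
            "speak": {"provider": {"type": "deepgram", "model": "aura-2-thalia-en"}},
        },
    }


# Placeholder output block; replace with the fields you want to test.
settings = build_settings(
    {"encoding": "linear16", "sample_rate": 24000, "container": "none"}
)
print(json.dumps(settings, indent=2))
```

Keeping the output block as a separate argument makes it easier to see exactly which fields differ when one configuration is accepted and another is rejected.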

Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:

#

@short egret Kapa.ai is still learning and improving, please let me know how I did by reacting below.

short egret
#

@peak gate W (34925) VoiceAgent: Warning: {"type":"Warning","description":"Error making request to speak provider. Status code: 400 Bad Request. Body: {"err_code":"UNSUPPORTED_AUDIO_FORMAT","err_msg":"Unsupported audio format: container is not applicable when encoding=mp3.","request_id":"81fdd51c-5a5d-4c1a-855f-9e9edca162e3"}","code":"SPEAK_REQUEST_FAILED"}

peak gate BOT
#

Let me check the knowledge base for you again. One moment! :books:

#

The error is coming from your TTS output format: you’re requesting encoding=mp3 and specifying a container, which is not allowed.

From the TTS media output rules:

  • For mp3, container is “Not Applicable” – you cannot set container at all.[Media output]
  • The error text matches the documented case: “container is not applicable when encoding=[encoding]”.[TTS errors]

In a Voice Agent Settings message, fix your audio.output to remove container when using mp3:

{
  "type": "Settings",
  "audio": {
    "input": {
      "encoding": "linear16",
      "sample_rate": 24000
    },
    "output": {
      "encoding": "mp3",
      "sample_rate": 24000,
      "bitrate": 48000
    }
  },
  "agent": {
    "language": "en",
    "listen": {
      "provider": {
        "type": "deepgram",
        "model": "nova-3"
      }
    },
    "think": {
      "provider": {
        "type": "open_ai",
        "model": "gpt-4o-mini"
      }
    },
    "speak": {
      "provider": {
        "type": "deepgram",
        "model": "aura-2-thalia-en"
      }
    }
  }
}

If you want to control the container (e.g., wav vs none), use an encoding that supports containers like linear16, mulaw, or alaw instead of mp3.[Audio formats]
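
A small sketch of the fix described above: drop the container key whenever the encoding is mp3, and leave other encodings untouched. The function name drop_inapplicable_container is hypothetical, and the sketch only covers the mp3 case quoted from the docs.

```python
def drop_inapplicable_container(output: dict) -> dict:
    """Return a copy of audio.output without "container" when encoding is mp3."""
    cleaned = dict(output)  # copy so the caller's dict is not mutated
    if cleaned.get("encoding") == "mp3":
        cleaned.pop("container", None)
    return cleaned


before = {"encoding": "mp3", "sample_rate": 24000, "bitrate": 48000, "container": "none"}
after = drop_inapplicable_container(before)
print(after)
```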

short egret
#

ya bro, i specified none as a container

#

@peak gate i completely commented out the container, it is not appended as JSON and i still get the error

peak gate BOT
#

You’re running into a TTS format error, but the knowledge sources only show the generic cause: it’s triggered whenever the query parameters describe an unsupported combination of encoding, container, sample_rate, or bit_rate (e.g., “container is not applicable when encoding=mp3”).[TTS errors][Media output]

However, they do not explain why you’d still see that error if you’ve removed container from your Settings JSON. That means I can’t definitively say what’s wrong in your specific setup.

What you can double‑check against the docs:

  1. Voice Agent audio.output must match TTS rules

    For mp3:

    • encoding: "mp3" is allowed.
    • container must be omitted entirely (not "none").[Media output]
    • sample_rate is not configurable for mp3 (it’s fixed to 22050 Hz), so including a sample_rate may also cause an “unsupported audio format” error.[Media output]
    • bitrate must be one of the allowed values (32000 or 48000).[Media output]

    So the safest audio.output for MP3, per docs, is:

    "audio": {
      "input": {
        "encoding": "linear16",
        "sample_rate": 24000
      },
      "output": {
        "encoding": "mp3",
        "bitrate": 48000
        // no container
        // no sample_rate
      }
    }
    
  2. Confirm what is actually sent on the wire

#

The error body you pasted is exactly the documented error for an invalid combination of TTS query parameters.[TTS errors]
Since the docs don’t cover client‑side behavior, I can’t see whether your SDK or client might still be injecting container or sample_rate even if you commented them out in your code.

Given the limits of the docs, the only concrete, doc‑backed guidance I can give is:

  • For encoding=mp3, send only encoding and a valid bitrate, and omit both container and sample_rate in audio.output.[Media output]
  • If you still get the same error, it likely means something in your stack is still adding container or an invalid sample_rate/bit_rate, but the knowledge sources don’t provide enough detail to diagnose that path.

If you can share the exact JSON Settings payload as it’s actually sent over the WebSocket (redacting keys), I can help you compare it line‑by‑line against the documented valid combinations.
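
The MP3 rules listed above can be turned into a quick pre-flight check. This is a sketch under the assumption that the rules are exactly the four quoted from the docs; the function name mp3_output_problems is hypothetical. Passing the check only means the block matches the documented TTS rules; it does not by itself show that Voice Agent accepts the encoding.

```python
ALLOWED_MP3_BITRATES = {32000, 48000}


def mp3_output_problems(output: dict) -> list:
    """List every way an audio.output block violates the documented MP3 rules."""
    problems = []
    if output.get("encoding") != "mp3":
        problems.append("encoding is not mp3")
    if "container" in output:
        problems.append("container must be omitted")
    if "sample_rate" in output:
        problems.append("sample_rate must be omitted")
    if output.get("bitrate") not in ALLOWED_MP3_BITRATES:
        problems.append("bitrate must be 32000 or 48000")
    return problems


print(mp3_output_problems({"encoding": "mp3", "bitrate": 48000}))
print(mp3_output_problems(
    {"encoding": "mp3", "bitrate": 48000, "container": "none", "sample_rate": 24000}
))
```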

short egret
#

@peak gate this is my payload:

static void sendSettingsMessage() {
  Serial.println("[VoiceAgent] Sending Settings message");

  cJSON* root = cJSON_CreateObject();
  cJSON_AddStringToObject(root, "type", "Settings");

  // Audio input/output
  cJSON* audio = cJSON_AddObjectToObject(root, "audio");
  cJSON* input = cJSON_AddObjectToObject(audio, "input");
  cJSON_AddStringToObject(input, "encoding", VA_INPUT_ENCODING);
  cJSON_AddNumberToObject(input, "sample_rate", VA_INPUT_SAMPLE_RATE);
  cJSON* output = cJSON_AddObjectToObject(audio, "output");
  cJSON_AddStringToObject(output, "encoding", VA_OUTPUT_ENCODING);

#ifdef VA_OUTPUT_BITRATE
  cJSON_AddNumberToObject(output, "bitrate", VA_OUTPUT_BITRATE);
#endif
  // #if VA_OUTPUT_MP3 == 0
  // cJSON_AddStringToObject(output, "container", "none");
  // cJSON_AddNumberToObject(output, "sample_rate", VA_OUTPUT_SAMPLE_RATE);
  // #endif
[...]

peak gate BOT
#

You’re hitting a documented TTS validation rule: MP3 does not support a container or a configurable sample_rate, and only specific bitrates are allowed.[Media output][TTS errors]

From your snippet, the generated audio.output looks like:

"output": {
  "encoding": "<VA_OUTPUT_ENCODING>",
  "bitrate": <VA_OUTPUT_BITRATE?>
  // no container
  // no sample_rate
}

To be valid for MP3 with Voice Agent, the JSON that actually goes over the WebSocket must satisfy:

  • audio.output.encoding = "mp3"
  • audio.output.bitrate = 32000 or 48000 (bits per second)
  • No audio.output.container field at all
  • No audio.output.sample_rate field at all (MP3 is fixed to 22050 Hz and not configurable)[Media output]

So the target payload should be:

"audio": {
  "input": {
    "encoding": "<VA_INPUT_ENCODING>",
    "sample_rate": <VA_INPUT_SAMPLE_RATE>
  },
  "output": {
    "encoding": "mp3",
    "bitrate": 48000
  }
}

Given that you still see:

Unsupported audio format: container is not applicable when encoding=mp3

the most likely explanations, based on the docs, are:

  1. Something in your code path still adds container or sample_rate before sending (e.g., another layer building or mutating the JSON).
  2. VA_OUTPUT_ENCODING is not actually "mp3" at runtime, and you’re hitting a different invalid combination that surfaces with that generic error message. The error text is exactly one of the documented messages for “unsupported audio format” when any of encoding, container, sample_rate, or bit_rate are incompatible.[TTS errors]

The knowledge sources don’t expose Voice Agent’s internal mapping from audio.output to the underlying TTS query, so I can’t see more than this: ensure that the final JSON on the wire has only encoding: "mp3" and a valid bitrate, with no container or sample_rate fields. If that still produces the same error, it’s behavior not covered in the docs and would need Deepgram support to inspect the request_id you’re seeing.
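
To check what is really on the wire, one option is to inspect the serialized text frame itself instead of the code that builds it. The sketch below (Python, with a hypothetical helper name output_keys_on_wire) parses the exact string that would be sent and lists the keys under audio.output. In the cJSON code above, the analogous step would be to print the string that is actually handed to the WebSocket client (for example the result of cJSON_PrintUnformatted) right before sending.

```python
import json


def output_keys_on_wire(wire_text: str) -> list:
    """Parse the exact text frame and list the keys present under audio.output."""
    message = json.loads(wire_text)
    return sorted(message.get("audio", {}).get("output", {}).keys())


settings = {
    "type": "Settings",
    "audio": {"output": {"encoding": "mp3", "bitrate": 48000}},
}
wire_text = json.dumps(settings)  # serialize once, then send this same string
print(wire_text)
print(output_keys_on_wire(wire_text))
```

If the printed keys include container or sample_rate, some other layer is adding them after the JSON is built.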

short egret
#

@peak gate give me the output settings for opus

oak bone BOT
#

Hey there, it looks like you've been talking with Kapa for a bit. If you need additional help, you should consider moving to #1115960287183511643 where our team can answer your query.

peak gate BOT
#

For Voice Agent, you configure Opus output via the audio.output section of the Settings message, and it must follow the same TTS media rules as Deepgram’s TTS API.

From the TTS media output docs, valid Opus settings are:[Media output settings]

  • encoding: "opus"
  • container: "ogg" (default and only supported for Opus)
  • sample_rate: fixed to 48000 Hz (not configurable)
  • bitrate: default 12000, allowed range >= 4000 and <= 650000 (bps)

So a Voice Agent Settings snippet for Opus output should look like:

{
  "type": "Settings",
  "audio": {
    "input": {
      "encoding": "linear16",
      "sample_rate": 24000
    },
    "output": {
      "encoding": "opus",
      "container": "ogg",
      "bitrate": 12000
    }
  },
  "agent": {
    "language": "en",
    "listen": {
      "provider": {
        "type": "deepgram",
        "model": "nova-3"
      }
    },
    "think": {
      "provider": {
        "type": "open_ai",
        "model": "gpt-4o-mini"
      }
    },
    "speak": {
      "provider": {
        "type": "deepgram",
        "model": "aura-2-thalia-en"
      }
    }
  }
}

The general pattern (using audio.output.encoding, bitrate, and container) is documented in the Voice Agent settings and full configuration examples.[Voice Agent settings][Configure Voice Agent]
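
A minimal sketch that builds the Opus output block from the rules listed above, rejecting bitrates outside the documented range. The function name opus_output is hypothetical, and the sketch only encodes the TTS media rules quoted here; it does not establish that Voice Agent accepts them.

```python
OPUS_MIN_BITRATE = 4000
OPUS_MAX_BITRATE = 650000


def opus_output(bitrate: int = 12000) -> dict:
    """Build an audio.output block for Opus; sample_rate is fixed, so it is omitted."""
    if not OPUS_MIN_BITRATE <= bitrate <= OPUS_MAX_BITRATE:
        raise ValueError(
            f"bitrate {bitrate} is outside {OPUS_MIN_BITRATE}..{OPUS_MAX_BITRATE} bps"
        )
    return {"encoding": "opus", "container": "ogg", "bitrate": bitrate}


print(opus_output())
print(opus_output(24000))
```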

short egret
#

@peak gate i can 100% confirm that opus/mp3 codec is not supported by the voice agent api, at least it is not enabled for now

peak gate BOT
#

You’re correct to flag this, and there’s an important nuance:

  • The underlying TTS service clearly supports mp3 and opus encodings, with the documented combinations (e.g., encoding=mp3 with no container; encoding=opus with container=ogg).[TTS encoding][TTS media formats]
  • The Voice Agent docs only show examples using linear16 (and mention that audio.output.encoding / bitrate / container exist), but they do not explicitly confirm that all TTS encodings (like mp3/opus) are currently supported end‑to‑end in the Voice Agent pipeline.[Voice Agent settings][Configure Voice Agent]

Given that:

  • You’re seeing UNSUPPORTED_AUDIO_FORMAT errors when trying valid mp3/opus combinations.
  • The Voice Agent documentation does not explicitly state that mp3 or opus are supported output encodings for the agent, only that you can configure audio.output in general.

I have to acknowledge that, based on the docs alone, I cannot confirm that mp3 or opus are currently enabled/supported as Voice Agent output codecs, even though they are supported by the standalone TTS API. The behavior you’re observing is consistent with those encodings not being wired through (or not yet enabled) for Voice Agent.

So, from the available knowledge sources:

  • Supported in TTS API: yes, mp3 and opus are documented.
  • Explicitly supported in Voice Agent: not clearly documented; your runtime errors strongly suggest they are not available in Voice Agent right now.

If you need a definitive answer or enablement, the next step would be to contact Deepgram support with a request_id from one of those failing Voice Agent calls, since the public docs don’t go further than this.
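
Until that is confirmed, a client can at least detect this failure and fall back to a format that the Voice Agent examples do show. The sketch below recognizes the Warning message shape pasted earlier in this thread (type, code, and the error code inside description) and picks a linear16 fallback. The helper names and the fallback values are assumptions, and how the new output is applied (for example by reconnecting and sending a fresh Settings message) is left out.

```python
# Assumed fallback: linear16 output with the same values used elsewhere in this thread.
FALLBACK_OUTPUT = {"encoding": "linear16", "sample_rate": 24000, "container": "none"}


def is_unsupported_format_warning(message: dict) -> bool:
    """True if a server message looks like the UNSUPPORTED_AUDIO_FORMAT warning."""
    return (
        message.get("type") == "Warning"
        and message.get("code") == "SPEAK_REQUEST_FAILED"
        and "UNSUPPORTED_AUDIO_FORMAT" in message.get("description", "")
    )


def next_output(current: dict, message: dict) -> dict:
    """Keep the current audio.output unless the server rejected its format."""
    if is_unsupported_format_warning(message):
        return dict(FALLBACK_OUTPUT)
    return current


warning = {
    "type": "Warning",
    "code": "SPEAK_REQUEST_FAILED",
    "description": 'Error making request to speak provider. Body: {"err_code":"UNSUPPORTED_AUDIO_FORMAT"}',
}
print(next_output({"encoding": "mp3", "bitrate": 48000}, warning))
```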

short egret
#

@peak gate i will send a support request but i think i might switch to anthropic or openai

peak gate BOT
#

Based on the docs, the Voice Agent API is designed to work with multiple LLM providers, including OpenAI and Anthropic, via the agent.think.provider settings.[Agent config]

You don’t have to switch away from Deepgram entirely to use them—you can:

  • Keep Deepgram for STT + TTS (where it’s working for you), and
  • Plug in OpenAI or Anthropic just for the “think” step.

Example (OpenAI as the LLM):

"agent": {
  "listen": {
    "provider": {
      "type": "deepgram",
      "model": "nova-3"
    }
  },
  "think": {
    "provider": {
      "type": "open_ai",
      "model": "gpt-4o-mini",
      "temperature": 0.7
    }
  },
  "speak": {
    "provider": {
      "type": "deepgram",
      "model": "aura-2-thalia-en"
    }
  }
}

Anthropic and other third‑party LLMs are supported via agent.think.provider.type plus a custom endpoint when required.[Agent config]

So if your main blocker is codec support in Voice Agent, you can still file the support ticket for that, and in parallel switch the LLM piece to OpenAI/Anthropic without rebuilding your whole stack.
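
A small sketch of that swap: replace only agent.think.provider and leave listen and speak as they are. The helper name with_think_provider is hypothetical, the "anthropic" type string is an assumption to be checked against the agent config docs, and "example-model" is a placeholder, not a real model name.

```python
import copy


def with_think_provider(agent: dict, provider_type: str, model: str) -> dict:
    """Return a copy of the agent config with only think.provider replaced."""
    updated = copy.deepcopy(agent)
    updated.setdefault("think", {})["provider"] = {"type": provider_type, "model": model}
    return updated


agent = {
    "listen": {"provider": {"type": "deepgram", "model": "nova-3"}},
    "think": {"provider": {"type": "open_ai", "model": "gpt-4o-mini", "temperature": 0.7}},
    "speak": {"provider": {"type": "deepgram", "model": "aura-2-thalia-en"}},
}

# Assumed provider type and placeholder model name; confirm both in the docs.
swapped = with_think_provider(agent, "anthropic", "example-model")
print(swapped["think"])
```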

short egret
#

@peak gate can i change the output sample rate or codec to opus for low-latency streaming?

peak gate BOT
#

Short answer: with the current Voice Agent docs, you should assume no for Opus, and limited control over output format.

What is documented:

  • Voice Agent lets you configure output media via audio.output.encoding, audio.output.sample_rate, audio.output.bitrate, and audio.output.container in the Settings message.[Voice Agent settings][Media I/O]
  • The examples and guidance for Voice Agent only show linear16 output (e.g., encoding: "linear16", sample_rate: 24000, container: "none").[Voice Agent settings][Full example]

For Text-to-Speech in general (outside Voice Agent):

  • Deepgram TTS supports multiple encodings including linear16, mulaw, alaw, mp3, opus, flac, aac, with specific allowed combinations of encoding/container/sample rate/bitrate.[TTS encoding][TTS formats]
  • For opus specifically, TTS uses encoding=opus, container=ogg, fixed sample_rate=48000, and configurable bitrate.[TTS formats]

However, the Voice Agent docs do not explicitly state that all of these TTS encodings (including opus) are available as audio.output.encoding options for the Voice Agent, and they only demonstrate linear16.[Voice Agent settings][Configure Voice Agent]

#

Given that:

  • The knowledge sources don’t confirm Opus as a supported Voice Agent output codec.
  • You’ve already observed UNSUPPORTED_AUDIO_FORMAT errors when trying mp3/opus with Voice Agent.

The safest, doc-backed guidance is:

  • You can change output sample rate and encoding when using formats like linear16 (e.g., 8k/16k/24k/32k/48k for linear16).[TTS sample rate][TTS formats]
  • The knowledge sources do not confirm that opus is currently supported as a Voice Agent output codec, so you should not rely on it for low-latency streaming with Voice Agent based on the available docs.

If you specifically need Opus for low-latency streaming, the docs don’t provide enough information to say it’s supported in Voice Agent today; you’d need to confirm with Deepgram support.
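
If linear16 is the only output you can rely on, the practical question for streaming is how much bandwidth each sample rate costs. The sketch below computes the raw bitrate of uncompressed 16-bit PCM for the sample rates listed above. Mono audio is an assumption here, and the function name is hypothetical.

```python
LINEAR16_RATES = (8000, 16000, 24000, 32000, 48000)
BITS_PER_SAMPLE = 16  # linear16 is 16-bit PCM


def linear16_bits_per_second(sample_rate: int, channels: int = 1) -> int:
    """Raw bitrate of uncompressed 16-bit PCM at the given sample rate."""
    if sample_rate not in LINEAR16_RATES:
        raise ValueError(f"unsupported linear16 sample rate: {sample_rate}")
    return sample_rate * BITS_PER_SAMPLE * channels


for rate in LINEAR16_RATES:
    bps = linear16_bits_per_second(rate)
    print(f"{rate} Hz -> {bps} bit/s ({bps // 8} bytes/s)")
```

Lowering the sample rate is the main lever for reducing bandwidth with linear16, at the cost of audio fidelity.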

short egret
#

@peak gate what is the average word/token per second?

peak gate BOT
#

The knowledge sources don’t provide any metric expressed as “words per second” or “tokens per second” for Voice Agent or TTS/STT.

What is documented:

  • For TTS, there is a character throughput limit of 2400 characters per minute on the streaming TTS WebSocket, measured as characters sent to the socket.[TTS limits]
  • For self‑hosted TTS benchmarking, throughput is measured as characters per second (input characters whose audio is generated per second), but no concrete numbers are given—only that the scripts report “X characters/second” at different concurrency levels.[TTS k6 benchmark]

There is no conversion in the docs from these character‑based metrics to “words per second” or “tokens per second,” and no average WPS/TPS values are published. So I can’t give a documented average word/token per second figure.
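
For a rough sense of scale only, the documented streaming limit can be converted by hand. The sketch below assumes about 6 characters per word (5 letters plus a space), which is not a figure from the docs, and the result is a ceiling on text sent to the socket, not an average speaking or generation rate.

```python
STREAMING_TTS_CHARS_PER_MINUTE = 2400  # documented streaming TTS limit
ASSUMED_CHARS_PER_WORD = 6             # assumption: ~5 letters plus a space

chars_per_second = STREAMING_TTS_CHARS_PER_MINUTE / 60
words_per_second = chars_per_second / ASSUMED_CHARS_PER_WORD

print(f"{chars_per_second} characters/second")
print(f"about {words_per_second:.2f} words/second under the stated assumption")
```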
