#XTTS

buoyant depot
#

So I have a bit of a dilemma... I spent 2 weeks getting XTTS set up for voice chat with my OpenClaw in Discord using a trained voice - all local. Worked fine on 2026.2.26.
I updated to 2026.3.2 and XTTS breaks because the TTS stuff was hardened between 2026.2.26 and 2026.3.2.
How can I get XTTS working now that the only TTS providers allowed are OpenAI, ElevenLabs, Edge TTS?

pulsar cedarBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

tranquil lindenBOT
#

As of 2026.3.2, that’s expected: TTS providers are now hard-enforced to openai | elevenlabs | edge (so there’s no way to select an xtts provider directly anymore). Docs: https://docs.openclaw.ai/tts

What does still work for “local XTTS with a trained voice” is to wrap XTTS behind an OpenAI- or ElevenLabs-compatible HTTP endpoint, and then point OpenClaw at it.

Recommended: XTTS → “OpenAI Audio API” shim (local)

OpenClaw’s OpenAI TTS supports a custom baseUrl (messages.tts.openai.baseUrl, and also under Discord voice overrides), and when you use a custom baseUrl it allows any model/voice strings.

Discord voice chat config (realtime /vc join): https://docs.openclaw.ai/channels/discord#voice-channels

{
  channels: {
    discord: {
      voice: {
        tts: {
          provider: "openai",
          openai: {
            apiKey: "local-dummy-key",
            baseUrl: "http://127.0.0.1:8000/v1", // no trailing slash
            model: "xtts",
            voice: "my_trained_voice"
          }
        }
      }
    }
  }
}

Your local service just needs to implement:

  • POST {baseUrl}/audio/speech
  • accept a JSON body with: model, input (text), voice, response_format (OpenClaw will usually ask for mp3 on Discord)
  • return raw audio bytes (e.g. audio/mpeg)
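
As a rough illustration of that contract, here is a minimal sketch using only the Python standard library. The synthesize function is a hypothetical placeholder for your XTTS inference, the port and the CONTENT_TYPES mapping are assumptions, and this is not how OpenClaw or any particular XTTS server implements it:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed mapping from response_format to MIME type for this sketch.
CONTENT_TYPES = {"mp3": "audio/mpeg", "wav": "audio/wav"}


def parse_speech_request(body: bytes) -> dict:
    """Extract the fields listed above from the JSON request body."""
    data = json.loads(body)
    if not isinstance(data, dict) or not data.get("input"):
        raise ValueError("request must be a JSON object with 'input' text")
    return {
        "model": data.get("model", ""),
        "input": data["input"],
        "voice": data.get("voice", ""),
        # mp3 is the default format of the OpenAI speech endpoint.
        "response_format": data.get("response_format", "mp3"),
    }


def synthesize(text: str, voice: str, fmt: str) -> bytes:
    """Hypothetical placeholder: run XTTS here and encode the result as fmt."""
    raise NotImplementedError("plug in XTTS inference and encoding")


class SpeechHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/audio/speech":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            req = parse_speech_request(self.rfile.read(length))
        except ValueError as exc:  # also covers json.JSONDecodeError
            self.send_error(400, str(exc))
            return
        fmt = req["response_format"]
        if fmt not in CONTENT_TYPES:
            self.send_error(400, f"unsupported response_format: {fmt}")
            return
        audio = synthesize(req["input"], req["voice"], fmt)
        self.send_response(200)
        self.send_header("Content-Type", CONTENT_TYPES[fmt])
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Start the shim; blocks until interrupted."""
    HTTPServer((host, port), SpeechHandler).serve_forever()
```

Calling serve() would expose the endpoint at http://127.0.0.1:8000/v1/audio/speech, which matches the baseUrl in the config above.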

Alternative: XTTS → ElevenLabs shim

Also possible via messages.tts.elevenlabs.baseUrl, but OpenClaw validates voiceId format for ElevenLabs, so it’s slightly more annoying to “fake” correctly.

If you want XTTS as a first-class provider again

That’d need an upstream change (new provider or a generic “http tts provider”). Best path is a GitHub feature request: https://github.com/openclaw/openclaw/issues

If you tell me whether you’re using Discord voice channels (/vc join) vs voice message attachments, I can point you at the exact config block(s) you need (channels.discord.voice.tts vs messages.tts).

buoyant depot
#

Would it be possible to give Claude Code access to OpenClaw's code to "patch" XTTS to work on top of update releases?

tranquil lindenBOT
#

You can’t really “give Claude Code access” through OpenClaw — Claude Code is a local CLI that runs on your machine (in whatever folder you point it at). So the practical workflow is:

  1. clone/fork openclaw/openclaw
  2. run Claude Code inside that repo to produce a patch
  3. you either:
     • keep a fork and rebase every release (pain), or
     • open an upstream PR / feature request (best long-term)

That said: I would not patch OpenClaw just to keep XTTS working if your goal is “survive updates”.

Better than patching: keep XTTS local via a stable HTTP shim

OpenClaw 2026.3.2 already supports pointing OpenAI TTS at a custom baseUrl (messages.tts.openai.baseUrl, and also channels.discord.voice.tts.openai.baseUrl). That means you can run XTTS locally behind an OpenAI /audio/speech compatible endpoint and you won’t have to chase upstream changes every time.

So: no fork, no patch, just config + your local shim.

If you do want a real “xtts” provider upstream

That’s a non-trivial change (provider enum/schema + config types + new TTS implementation + docs + tests). Totally doable, but it should land upstream as either:

  • a first-class xtts provider, or
  • a generic “HTTP TTS provider” (safer, more flexible)

If you want, paste what you were using on 2026.2.26 (was it a CLI exec, a local HTTP endpoint, etc.?) and I’ll tell you which upstream shape would be cleanest + lowest-maintenance.

buoyant depot
#

More context: OpenClaw runs on an Ubuntu VM on my Windows Host. It talks to Ollama for local models, and for TTS here is what my openclaw is providing as it helped set up the TTS:
Discord VC path: OpenClaw runs on Ubuntu VM, joins VC via /vc.
TTS provider in OpenClaw config: channels.discord.voice.tts.provider = "openai"
But it’s not OpenAI cloud TTS. OpenClaw is pointed at your Windows XTTS service using:
OPENAI_TTS_BASE_URL = http://<windows-host-ip>:8020/v1
OPENAI_API_KEY = local-xtts (dummy/local token pattern)
Voice/model payload used by OpenClaw:
openai.model = "xtts-local"
openai.voice = "eva"
Windows host service: FastAPI/uvicorn app (xtts_api.py) exposing OpenAI-compatible endpoint:
POST /v1/audio/speech
returns WAV bytes to OpenClaw
So comms flow is:

  1. Discord speech captured in Ubuntu/OpenClaw
  2. STT done locally (Whisper on Ubuntu)
  3. LLM response generated
  4. OpenClaw sends TTS request to http://<windows-ip>:8020/v1/audio/speech
  5. Windows XTTS returns WAV
  6. OpenClaw plays audio into Discord VC

Also, my openclaw is asking (possibly we're already covering this): does 2026.3.2 still honor OPENAI_TTS_BASE_URL overrides for Discord VC TTS, or force OpenAI cloud endpoints only?

tranquil lindenBOT
#

Yes — 2026.3.2 still honors OPENAI_TTS_BASE_URL for Discord VC TTS. It does not force OpenAI cloud endpoints.

What Discord VC does in 2026.3.2 is (literally):

  • merges channels.discord.voice.tts over messages.tts
  • resolves openai.baseUrl from:
    1. ...tts.openai.baseUrl (config), else
    2. OPENAI_TTS_BASE_URL (env), else
    3. the real OpenAI default
  • calls POST {baseUrl}/audio/speech with response_format: "mp3" when channel: "discord".
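
To make that precedence concrete, here is a small Python sketch of the lookup order. It is an illustration only, not OpenClaw's actual code: the deep-merge behaviour and the default URL constant are assumptions, and resolve_openai_base_url is a hypothetical name.

```python
import os

# Assumption: the stock OpenAI endpoint is the final fallback.
OPENAI_DEFAULT_BASE_URL = "https://api.openai.com/v1"


def deep_merge(base: dict, override: dict) -> dict:
    """Return base with override laid on top, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_openai_base_url(config: dict, env=os.environ) -> str:
    """Config value first, then OPENAI_TTS_BASE_URL, then the default."""
    messages_tts = config.get("messages", {}).get("tts", {})
    voice_tts = (
        config.get("channels", {}).get("discord", {}).get("voice", {}).get("tts", {})
    )
    # The Discord voice block wins over messages.tts where both set a key.
    tts = deep_merge(messages_tts, voice_tts)
    return (
        tts.get("openai", {}).get("baseUrl")
        or env.get("OPENAI_TTS_BASE_URL")
        or OPENAI_DEFAULT_BASE_URL
    )
```

With neither config nor env set, this sketch falls through to the cloud default, which is why an env var that never reaches the gateway process can silently send requests to the wrong place.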

So if your setup “worked before” but broke after the hardening, the biggest red flag in your description is this:

your XTTS service returns WAV bytes to OpenClaw

On Discord, OpenClaw is asking your shim for MP3 (response_format=mp3). If your shim ignores response_format and always returns WAV, you can get breakage (or silent failure / decode issues) depending on the playback pipeline and how strict it got in newer builds.

What I’d change first (no patch needed)

  1. Update xtts_api.py to honor response_format and return MP3 bytes when requested.
     • For example: generate WAV → transcode to MP3 (ffmpeg) → return audio/mpeg.
  2. Make the baseUrl explicit in config (avoids “env not loaded” surprises):

{
  channels: {
    discord: {
      voice: {
        tts: {
          provider: "openai",
          openai: {
            apiKey: "local-xtts",
            baseUrl: "http://<windows-host-ip>:8020/v1",
            model: "xtts-local",
            voice: "eva",
          },
        },
      },
    },
  },
}
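
For step 1, the WAV → MP3 part could look roughly like the sketch below. It assumes ffmpeg is on the PATH of the Windows host and was built with libmp3lame. The function names are hypothetical and would need wiring into your own xtts_api.py.

```python
import subprocess


def wants_mp3(response_format) -> bool:
    """Treat a missing response_format as mp3, the OpenAI API default."""
    return (response_format or "mp3").lower() == "mp3"


def build_ffmpeg_cmd(bitrate: str = "128k") -> list:
    """ffmpeg command that reads WAV on stdin and writes MP3 to stdout."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", "pipe:0",            # WAV bytes arrive on stdin
        "-f", "mp3",               # write an MP3 stream
        "-codec:a", "libmp3lame",  # assumes an ffmpeg build with LAME
        "-b:a", bitrate,
        "pipe:1",                  # MP3 bytes leave on stdout
    ]


def wav_to_mp3(wav_bytes: bytes) -> bytes:
    """Transcode in memory; raises CalledProcessError if ffmpeg fails."""
    result = subprocess.run(
        build_ffmpeg_cmd(), input=wav_bytes, capture_output=True, check=True
    )
    return result.stdout


def encode_response(wav_bytes: bytes, response_format):
    """Return (body, content_type); only mp3 and wav are handled here."""
    if wants_mp3(response_format):
        return wav_to_mp3(wav_bytes), "audio/mpeg"
    return wav_bytes, "audio/wav"
```

In a FastAPI handler you would pass the synthesized WAV and the request's response_format to encode_response, then return the body with the matching media type.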

One more gotcha: env vars not reaching the gateway

If you’re relying on OPENAI_TTS_BASE_URL via systemd/service startup, make sure the gateway process actually gets it (common failure mode after upgrades/reinstalls). Related doc section: https://docs.openclaw.ai/help/faq#env-vars-and-env-loading

If you paste the exact error/log line you see in 2026.3.2 (from openclaw logs --follow around the TTS attempt), I can tell you whether it’s (a) baseUrl not applied, (b) non-200 from your shim, or (c) audio decode/playback failing on the returned bytes.
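
You can also check cases (b) and (c) from the OpenClaw side without OpenClaw in the loop. The sketch below sends one request to the shim and sniffs the returned bytes. The Bearer header and the request payload mirror the OpenAI API convention and are assumptions about what your shim expects. The magic-byte checks are a heuristic, not a full format parser.

```python
import json
import urllib.request


def sniff_audio_format(data: bytes) -> str:
    """Guess the container from the first bytes of the response body."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:3] == b"ID3":
        return "mp3"  # MP3 with an ID3v2 tag
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"  # bare MPEG audio frame sync
    if data[:4] == b"OggS":
        return "ogg"
    return "unknown"


def probe_shim(base_url: str, model: str, voice: str, api_key: str = "local-xtts"):
    """POST one request; returns (status, content_type, sniffed_format)."""
    payload = json.dumps(
        {"model": model, "input": "test", "voice": voice, "response_format": "mp3"}
    ).encode()
    req = urllib.request.Request(
        f"{base_url}/audio/speech",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on non-2xx, which is case (b).
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read()
        return resp.status, resp.headers.get("Content-Type"), sniff_audio_format(body)
```

Running probe_shim("http://<windows-host-ip>:8020/v1", "xtts-local", "eva") from the Ubuntu VM and getting "wav" back while asking for mp3 would point at the format mismatch described above.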

buoyant depot
#

I gave up on XTTS. What I'm doing now instead is ElevenLabs. I want to use it in voice chat in Discord: I would speak, it would get processed by STT, then that gets sent to OpenAI, the reply is then sent to ElevenLabs, and the audio comes back to Discord (at least I think that's the workflow). I've updated OpenClaw to 2026.3.7. I can get /tts to work in general chat in Discord and I hear what I believe is my trained voice from ElevenLabs. However, if I join the bot to a voice channel and then try speaking, I receive no response. Am I supposed to configure STT from my end somehow? I can provide my openclaw.json if that would help.

Through troubleshooting with Codex in VS Code, I've gathered there is an issue report: "OpenClaw docs explicitly warn about Discord DAVE receive instability and point to upstream discord.js issue #11419." Would that mean what I'm trying to do is not possible? Codex recommended a Python sidecar for the STT, similar to what I had set up for XTTS before. I'm unsure if I should go that route or not. This was the recommendation: "If you must keep Discord VC: use a separate Discord voice bridge service (e.g., discord.py/py-cord sidecar) for STT/TTS and route turns to OpenClaw via API/chat-completions, instead of OpenClaw’s built-in Discord VC receiver."

It's possible I had Whisper set up before as well. However, all the processing was being handled on the Windows host back when I was using XTTS. I don't know if I should continue that if I go this route, or possibly run Whisper on the Ubuntu VM. The Windows host has more resources, and I do wish to have the quickest response times to process everything. What I don't understand is: I thought there was native OpenClaw support for what I'm trying to achieve already, and I'm unsure why I have not been able to get it to work so far.

tranquil lindenBOT
#

What you’re trying to do is supported (Discord VC → STT → agent → TTS → play back), but two common “gotchas” make it look like it’s dead even when /tts in text chat works:

1) Discord VC receive/STT is separate from “/tts in chat”

/tts working in a text channel only proves outbound TTS is configured.

For VC, the pipeline is:

  1. Discord voice packets received
  2. Opus decode → PCM
  3. write a temp WAV
  4. run audio transcription via tools.media.audio
  5. send transcript to the agent
  6. run TTS (your ElevenLabs config)
  7. play audio into VC

So yes: there is “native OpenClaw support”, but VC requires voice receive + transcription to be working on the gateway host.
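
To give a feel for steps 2 and 3, here is a tiny Python sketch that wraps raw PCM in a WAV container and reports its duration. It is only an illustration (OpenClaw is not written in Python), and the 48 kHz, stereo, 16-bit parameters are assumptions based on Discord's usual Opus output.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 48000, channels: int = 2,
               sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV header."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample, 2 = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Length of the clip, handy for spotting near-empty captures."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

A capture that decodes to a fraction of a second of audio is unlikely to transcribe to anything useful, so an empty transcript does not always mean the STT backend is broken.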

2) The #1 silent failure: missing opusscript (can’t decode voice)

OpenClaw’s Discord VC receiver currently decodes incoming voice with opusscript. If it’s not installed, the gateway will join the VC fine, but it will never get usable audio → no transcript → no reply.

Check your gateway logs while you speak:

  • if you see something like:
    discord voice: opusscript unavailable ... cannot decode voice audio
    that’s the reason you get “no response”.

Fix is simply: install opusscript where OpenClaw runs (Ubuntu VM). (How depends on whether you installed OpenClaw via global npm vs from-source pnpm — tell me which and I’ll give the exact command.)

3) The #2 silent failure: no transcription backend configured/available

VC transcription uses the same “media understanding (audio)” system as voice notes: tools.media.audio.

Docs: https://docs.openclaw.ai/nodes/audio

If you previously had Whisper running on Windows (sidecar), but now OpenClaw is on Ubuntu with no local whisper CLI/provider keys available, you’ll get transcription empty and the bot won’t respond.

Quick debugging trick: enable transcript echo so you can see whether STT is happening:

{
  tools: {
    media: {
      audio: {
        enabled: true,
        echoTranscript: true,
        // models: [...] // optional; if omitted it tries auto-detection (see docs)
      }
    }
  }
}

If you get no echoed transcript when speaking in VC, the STT side is the blocker.

4) About the DAVE/#11419 warning

That warning means VC receive can be flaky on some setups, not that it’s impossible. It becomes relevant if your logs show repeated decrypt failures / receive errors. But before building a python sidecar again, I’d validate the two basics above (opusscript + transcription working) — those are much more common.


What I need from you (2 things)

  1. While the bot is in VC, run openclaw logs --follow and speak for ~3 seconds, then paste the log lines that mention discord voice:.
  2. Paste just these config sections (not your whole file if you don’t want):
  • channels.discord.voice
  • channels.discord.voice.tts (or if you rely on messages.tts, that block)
  • tools.media.audio

With that, we can pinpoint whether you’re stuck at receive, decode, transcribe, agent, or tts playback.

#

That “ticket” link is almost certainly a phishing/scam URL (the weird ht tp :///\\\\@... obfuscation is a common trick to bypass filters). Do not click it.

Best next steps:

  • Use Discord’s Report message / Report spam flow (or notify server moderators) so it can be removed.
  • If the poster is a real community member, assume their account may be compromised until proven otherwise.

For the actual OpenClaw VC issue: you’re already in the right place — no external “ticket” is needed. If you paste the openclaw logs --follow lines that include discord voice: while someone speaks in VC, we can diagnose whether you’re stuck at opus decode (opusscript), STT (tools.media.audio), or DAVE receive.

buoyant depot
#

I've confirmed opusscript is installed in my runtime and there are no opusscript unavailable errors in logs.
Here is the openclaw.json config section for what you requested:

Requested config sections (redacted):

channels.discord.voice
{
  "enabled": true,
  "tts": {
    "provider": "elevenlabs",
    "elevenlabs": {
      "apiKey": "REDACTED",
      "voiceId": "c7aRb9qIoDCgl7IWc1OI",
      "modelId": "eleven_flash_v2_5"
    },
    "timeoutMs": 120000
  },
  "daveEncryption": true,
  "decryptionFailureTolerance": 120
}

messages.tts
{
  "provider": "elevenlabs",
  "elevenlabs": {
    "apiKey": "REDACTED",
    "voiceId": "c7aRb9qIoDCgl7IWc1OI",
    "modelId": "eleven_flash_v2_5"
  },
  "edge": {
    "enabled": false
  },
  "timeoutMs": 120000
}

tools.media.audio
{
  "enabled": true,
  "language": "en",
  "echoTranscript": true,
  "echoFormat": "📝 {transcript}",
  "models": [
    {
      "type": "provider",
      "provider": "openai",
      "model": "gpt-4o-mini-transcribe"
    }
  ]
}

tranquil lindenBOT
#

Good — if opusscript is installed and you’re not seeing the “opusscript unavailable” warning, the next issue is usually just lack of useful voice logs (because most of the VC pipeline logs are debug/verbose, not warn/error).

1) Turn file logs up to debug (so openclaw logs --follow actually shows VC capture/STT)

Per docs: https://docs.openclaw.ai/gateway/logging

Add (temporarily) to your openclaw.json:

{
  logging: {
    level: "debug"
  }
}

Restart the gateway, then run:

openclaw logs --follow

Now speak in VC for ~3 seconds and paste the lines containing discord voice: (you should start seeing things like capture start, capture ready, transcription ok/empty, tts ok/failed, etc).
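
If the debug output is too noisy, a small filter can keep only the voice lines and tag them with a rough pipeline stage. This is a sketch. The phrases come from the examples mentioned above, the exact wording in your build may differ, and the stage labels are my own.

```python
import sys

# Phrases mentioned in this thread; treat them as hints, not a log spec.
STAGE_HINTS = [
    ("opusscript unavailable", "decode"),
    ("capture start", "receive"),
    ("capture ready", "receive"),
    ("transcription empty", "transcribe"),
    ("transcription ok", "transcribe"),
    ("tts failed", "tts"),
    ("tts ok", "tts"),
]


def classify(line: str):
    """Return a stage label for a 'discord voice:' line, else None."""
    lowered = line.lower()
    if "discord voice:" not in lowered:
        return None
    for phrase, stage in STAGE_HINTS:
        if phrase in lowered:
            return stage
    return "other"


def main(stream=sys.stdin) -> None:
    """Print only voice lines, prefixed with the guessed stage."""
    for line in stream:
        stage = classify(line)
        if stage:
            print(f"[{stage}] {line.rstrip()}")
```

Save it as, say, vc_stage.py with a main() call at the bottom, then run openclaw logs --follow | python3 vc_stage.py while speaking in VC. The last stage that shows up is usually where to look.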

2) Please re-paste the config blocks (your message didn’t include them)

Can you paste these as code blocks?

  • channels.discord.voice
  • channels.discord.voice.tts (or confirm you rely on messages.tts)
  • tools.media.audio

tools.media.audio is especially important: if it’s disabled (or has no working auto-detected model/provider on the Ubuntu VM), VC will often log transcription empty and you’ll get exactly “no response” with no visible error unless debug is on.

Once you paste the discord voice: debug lines, we’ll know immediately whether you’re stuck at receive/decrypt, decode, transcription, agent reply, or tts playback.

#

Thanks — this config helps a lot. Your VC TTS side looks fine (ElevenLabs configured both in channels.discord.voice.tts and messages.tts).

The most likely blocker is actually VC STT, and your current STT config is not local:

tools.media.audio.models = [{ provider: "openai", model: "gpt-4o-mini-transcribe" }]

That means: for Discord voice chat to respond, your Ubuntu VM must have working OpenAI provider auth (e.g. OPENAI_API_KEY or an auth profile for provider openai). If you’re otherwise running “local (Ollama)” and don’t have OpenAI creds set up, the voice segments will transcribe to nothing → OpenClaw will log transcription empty (debug-only) and you’ll get exactly “no response”.
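
A quick way to sanity-check this from the VM is to compare the configured models against the environment the gateway actually runs with. The sketch below is only a rough check: the provider-to-env-var mapping is an assumption, and it cannot see credentials stored in an auth profile.

```python
import os

# Assumed env var per provider; extend for whatever providers you use.
PROVIDER_ENV_KEYS = {"openai": "OPENAI_API_KEY"}


def providers_missing_keys(audio_cfg: dict, env=os.environ) -> list:
    """List providers in tools.media.audio.models with no key in env."""
    missing = []
    for entry in audio_cfg.get("models", []):
        # Entries without a type are treated as provider entries here.
        if entry.get("type", "provider") != "provider":
            continue
        provider = entry.get("provider")
        env_key = PROVIDER_ENV_KEYS.get(provider)
        if env_key and not env.get(env_key) and provider not in missing:
            missing.append(provider)
    return missing
```

Passing it your tools.media.audio block and getting ["openai"] back would be consistent with the "transcription empty" symptom, unless the key is supplied some other way.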

Quick 2-step confirmation

  1. Send a voice note attachment in a normal Discord text channel (not VC).
     Because you have echoTranscript: true, if STT is working you should get:
     📝 <your transcript>
     • If you get no transcript, STT is the issue (not ElevenLabs / not VC).
  2. Temporarily set file logs to debug so we can see the VC pipeline:
     { logging: { level: "debug" } }
     Then restart the gateway → openclaw logs --follow → speak in VC → paste the discord voice: lines.

If you want STT to be local again (recommended)

Switch tools.media.audio.models to a local CLI (whisper-cpp / python whisper / sherpa). Docs + examples: https://docs.openclaw.ai/nodes/audio

If you tell me whether you have whisper-cpp (whisper-cli) installed on the Ubuntu VM (or prefer running Whisper on Windows again), I can suggest the cleanest tools.media.audio.models entry for your setup.

One more question: do you currently have an OpenAI API key configured anywhere for the gateway? If not, that OpenAI-transcribe model entry won’t ever work.

buoyant depot
#

Whisper seemed to be the missing link. I have been going back and forth with Codex ever since 3.7 was released to get voice chat working. How could I have shortcut this to know Whisper was what was missing? Is this in the documentation anywhere, so that anyone who wants to get voice working knows how to do it?

tranquil lindenBOT
#

The “shortcut” is to think of Discord VC as two independent subsystems that both have to work:

  1. Voice receive → Opus decode (that’s where opusscript matters)
  2. Audio → transcript (that’s tools.media.audio, i.e. Whisper / provider transcription)

If (2) isn’t working, VC looks like “the bot hears nothing” even though TTS in text chat works perfectly.

How you could have found it faster

  • Do an STT smoke-test before VC:
    • enable tools.media.audio.echoTranscript: true (you already did)
    • send a normal voice note attachment in a text channel/DM
    • if you don’t get 📝 <transcript>, then VC will never respond either (because VC uses the same transcription pipeline under the hood).

  • Turn on debug file logs when debugging VC:
    Set logging.level: "debug" temporarily and then watch for discord voice: lines like transcription empty/ok. Without debug, you often won’t see the “why”.

  • Run openclaw doctor when something “should work” but doesn’t — it can surface missing routing/config around transcription (tools.media.audio.models). (It won’t magically install Whisper, but it helps narrow down “STT isn’t configured/available”.)

Is this documented?

Yes, but it’s currently a bit indirect:

If you want to help the next person: this is a great docs issue/PR request — “Add a Prerequisites bullet list under Discord Voice Channels: opusscript + tools.media.audio (Whisper) required, plus a quick ‘echoTranscript’ smoke test.” You’d file that here: https://github.com/openclaw/openclaw/issues

If you want, paste the final “working” tools.media.audio block you landed on (CLI whisper vs provider), and I can suggest a clean minimal snippet to include in that docs request.

vital spear
#

@tranquil linden So if everything changed with a recent update, what is the recommended flow for 'local' discord voice chat?

tranquil lindenBOT
#

Don’t patch anything for “local Discord VC” — the recommended flow now is basically:

Recommended “local Discord voice chat” flow (2026.3.x)

1) Use OpenClaw’s native Discord VC

2) Make STT explicitly local (this is the usual missing piece)

Configure tools.media.audio to use a local transcription CLI (Whisper / whisper-cpp / sherpa, etc). Docs: https://docs.openclaw.ai/nodes/audio

If STT isn’t working, VC will look like “no response” even if TTS works in text chat.

3) LLM local (Ollama/etc)

Keep your agent model pointed at your local provider as you already do.

4) TTS “local” is still possible (despite the provider hardening)

TTS providers are now hard-limited to openai | elevenlabs | edge (docs: https://docs.openclaw.ai/tts), but “openai” supports a custom baseUrl — so you can run XTTS/Kokoro/LocalAI/etc behind an OpenAI-compatible /v1/audio/speech endpoint and point OpenClaw at it via: