So I have a bit of a dilemma... I spent 2 weeks getting XTTS set up for voice chat with my OpenClaw in Discord using a trained voice, all local. It worked fine on 2026.2.26.
I updated to 2026.3.2 and XTTS broke because the TTS provider handling was hardened between 2026.2.26 and 2026.3.2.
How can I get XTTS working now that the only TTS providers allowed are OpenAI, ElevenLabs, and Edge TTS?
#XTTS
As of 2026.3.2, that's expected: TTS providers are now hard-enforced to openai | elevenlabs | edge (so there's no way to select an xtts provider directly anymore). Docs: https://docs.openclaw.ai/tts
What does still work for "local XTTS with a trained voice" is to wrap XTTS behind an OpenAI- or ElevenLabs-compatible HTTP endpoint, and then point OpenClaw at it.
Recommended: XTTS → "OpenAI Audio API" shim (local)
OpenClaw's OpenAI TTS supports a custom baseUrl (messages.tts.openai.baseUrl, and also under Discord voice overrides), and when you use a custom baseUrl it allows any model/voice strings.
Discord voice chat config (realtime /vc join): https://docs.openclaw.ai/channels/discord#voice-channels
{
  channels: {
    discord: {
      voice: {
        tts: {
          provider: "openai",
          openai: {
            apiKey: "local-dummy-key",
            baseUrl: "http://127.0.0.1:8000/v1", // no trailing slash
            model: "xtts",
            voice: "my_trained_voice"
          }
        }
      }
    }
  }
}
Your local service just needs to implement:
POST {baseUrl}/audio/speech
- accept JSON like: model, input (text), voice, response_format (OpenClaw will usually ask for mp3 on Discord)
- return raw audio bytes (e.g. audio/mpeg)
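To make the request/response shape above concrete, here is a minimal sketch of such a shim using only the Python standard library. It is not the poster's xtts_api.py: `synthesize_wav` is a hypothetical stand-in that returns silence where a real XTTS call would go, the `/v1` prefix and port are taken from the example config, and MP3 encoding is left out, so this sketch only answers `response_format: "wav"`.

```python
import io
import json
import wave
from http.server import BaseHTTPRequestHandler, HTTPServer

# Formats this sketch can return; add "mp3": "audio/mpeg" once encoding is wired in.
CONTENT_TYPES = {"wav": "audio/wav"}


def parse_speech_request(body: bytes) -> dict:
    """Extract the fields of an OpenAI-style /audio/speech JSON body."""
    payload = json.loads(body)
    return {
        "model": payload.get("model", ""),
        "text": payload["input"],
        "voice": payload.get("voice", ""),
        "format": payload.get("response_format", "mp3"),
    }


def synthesize_wav(text: str, voice: str) -> bytes:
    """Hypothetical placeholder for the XTTS call: returns 0.1 s of silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(24000)   # assumed output rate for the placeholder
        wav.writeframes(b"\x00\x00" * 2400)
    return buf.getvalue()


class SpeechHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/audio/speech":
            self.send_error(404, "unknown path")
            return
        length = int(self.headers.get("Content-Length", 0))
        req = parse_speech_request(self.rfile.read(length))
        if req["format"] not in CONTENT_TYPES:
            self.send_error(400, "unsupported response_format in this sketch")
            return
        audio = synthesize_wav(req["text"], req["voice"])
        self.send_response(200)
        self.send_header("Content-Type", CONTENT_TYPES[req["format"]])
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)  # raw audio bytes, no JSON wrapper


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Run the shim until interrupted."""
    HTTPServer((host, port), SpeechHandler).serve_forever()
```

Calling `serve()` starts the shim on the address used in the config example. A production shim would swap the placeholder for the real synthesis call and add MP3 output.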
Alternative: XTTS → ElevenLabs shim
Also possible via messages.tts.elevenlabs.baseUrl, but OpenClaw validates voiceId format for ElevenLabs, so it's slightly more annoying to "fake" correctly.
If you want XTTS as a first-class provider again
That'd need an upstream change (new provider or a generic "http tts provider"). Best path is a GitHub feature request: https://github.com/openclaw/openclaw/issues
If you tell me whether you're using Discord voice channels (/vc join) vs voice message attachments, I can point you at the exact config block(s) you need (channels.discord.voice.tts vs messages.tts).
Would it be possible to give claude code access to openclaw's code to "patch" xtts to work on top of update releases?
You can't really "give Claude Code access" through OpenClaw: Claude Code is a local CLI that runs on your machine (in whatever folder you point it at). So the practical workflow is:
- clone/fork openclaw/openclaw
- run Claude Code inside that repo to produce a patch
- you either:
  - keep a fork and rebase every release (pain), or
  - open an upstream PR / feature request (best long-term)
That said: I would not patch OpenClaw just to keep XTTS working if your goal is "survive updates".
Better than patching: keep XTTS local via a stable HTTP shim
OpenClaw 2026.3.2 already supports pointing OpenAI TTS at a custom baseUrl (messages.tts.openai.baseUrl, and also channels.discord.voice.tts.openai.baseUrl). That means you can run XTTS locally behind an OpenAI /audio/speech compatible endpoint and you won't have to chase upstream changes every time.
So: no fork, no patch, just config + your local shim.
If you do want a real "xtts" provider upstream
That's a non-trivial change (provider enum/schema + config types + new TTS implementation + docs + tests). Totally doable, but it should land upstream as either:
- a first-class xtts provider, or
- a generic "HTTP TTS provider" (safer, more flexible)
If you want, paste what you were using on 2026.2.26 (was it a CLI exec, a local HTTP endpoint, etc.?) and I'll tell you which upstream shape would be cleanest + lowest-maintenance.
More context: OpenClaw runs on an Ubuntu VM on my Windows Host. It talks to Ollama for local models, and for TTS here is what my openclaw is providing as it helped set up the TTS:
Discord VC path: OpenClaw runs on Ubuntu VM, joins VC via /vc.
TTS provider in OpenClaw config: channels.discord.voice.tts.provider = "openai"
But it's not OpenAI cloud TTS: OpenClaw is pointed at your Windows XTTS service using:
OPENAI_TTS_BASE_URL = http://<windows-host-ip>:8020/v1
OPENAI_API_KEY = local-xtts (dummy/local token pattern)
Voice/model payload used by OpenClaw:
openai.model = "xtts-local"
openai.voice = "eva"
Windows host service: FastAPI/uvicorn app (xtts_api.py) exposing OpenAI-compatible endpoint:
POST /v1/audio/speech
returns WAV bytes to OpenClaw
So comms flow is:
- Discord speech captured in Ubuntu/OpenClaw
- STT done locally (Whisper on Ubuntu)
- LLM response generated
- OpenClaw sends TTS request to http://<windows-ip>:8020/v1/audio/speech
- Windows XTTS returns WAV
- OpenClaw plays audio into Discord VC
Also, my openclaw is asking (possibly we're already covering this): does 2026.3.2 still honor OPENAI_TTS_BASE_URL overrides for Discord VC TTS, or force OpenAI cloud endpoints only?
Yes, 2026.3.2 still honors OPENAI_TTS_BASE_URL for Discord VC TTS. It does not force OpenAI cloud endpoints.
What Discord VC does in 2026.3.2 is (literally):
- merges channels.discord.voice.tts over messages.tts
- resolves openai.baseUrl from:
  - ...tts.openai.baseUrl (config), else
  - OPENAI_TTS_BASE_URL (env), else
  - the real OpenAI default
- calls POST {baseUrl}/audio/speech with response_format: "mp3" when channel: "discord".
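The precedence described above can be sketched in a few lines of Python. This illustrates the described order only and is not OpenClaw's actual code: the function names are hypothetical, and whether the real merge is deep or shallow is an assumption (a deep merge is shown).

```python
import os

# The public OpenAI API base URL, used when nothing overrides it.
OPENAI_DEFAULT_BASE_URL = "https://api.openai.com/v1"


def deep_merge(base: dict, override: dict) -> dict:
    """Return base with override layered on top (assumed deep merge)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_openai_base_url(messages_tts: dict, voice_tts: dict, env=None) -> str:
    """Config wins over the env var, which wins over the OpenAI default."""
    env = os.environ if env is None else env
    tts = deep_merge(messages_tts, voice_tts)  # voice block overrides messages.tts
    return (
        tts.get("openai", {}).get("baseUrl")
        or env.get("OPENAI_TTS_BASE_URL")
        or OPENAI_DEFAULT_BASE_URL
    )
```

For the setup in this thread, an empty config with `OPENAI_TTS_BASE_URL` set would resolve to the Windows host URL, and an explicit `baseUrl` in config would take priority over it.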
So if your setup "worked before" but broke after the hardening, the biggest red flag in your description is this:
your XTTS service returns WAV bytes to OpenClaw
On Discord, OpenClaw is asking your shim for MP3 (response_format=mp3). If your shim ignores response_format and always returns WAV, you can get breakage (or silent failure / decode issues) depending on the playback pipeline and how strict it got in newer builds.
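One way to check for this mismatch without touching OpenClaw is to send the shim the same kind of request and look at the first bytes that come back. The sketch below assumes the shim follows the OpenAI /audio/speech shape; `probe_shim` and `sniff_audio_format` are hypothetical helper names, and the byte checks are a heuristic (RIFF/WAVE header for WAV, ID3 tag or MPEG frame sync for MP3), not a full parser.

```python
import json
import urllib.request


def sniff_audio_format(data: bytes) -> str:
    """Guess the container from magic bytes: 'wav', 'mp3', or 'unknown'."""
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:3] == b"ID3":  # MP3 file starting with an ID3v2 tag
        return "mp3"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"        # 11-bit MPEG audio frame sync
    return "unknown"


def probe_shim(base_url: str, model: str, voice: str, api_key: str = "local-xtts"):
    """POST a short MP3 request to the shim; return (Content-Type, sniffed format)."""
    body = json.dumps({
        "model": model,
        "input": "Testing one two three.",
        "voice": voice,
        "response_format": "mp3",
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/audio/speech",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.headers.get("Content-Type", ""), sniff_audio_format(response.read())
```

If `probe_shim("http://<windows-host-ip>:8020/v1", "xtts-local", "eva")` reports `wav` for a request that asked for `mp3`, the shim is ignoring `response_format`.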
What I'd change first (no patch needed)
- Update xtts_api.py to honor response_format and return MP3 bytes when requested.
  - For example: generate WAV → transcode to MP3 (ffmpeg) → return audio/mpeg.
- Make the baseUrl explicit in config (avoids "env not loaded" surprises):
{
  channels: {
    discord: {
      voice: {
        tts: {
          provider: "openai",
          openai: {
            apiKey: "local-xtts",
            baseUrl: "http://<windows-host-ip>:8020/v1",
            model: "xtts-local",
            voice: "eva",
          },
        },
      },
    },
  },
}
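For the first item (honoring response_format), a rough sketch of the WAV → MP3 step in Python might look like the following. It assumes ffmpeg is on the PATH of the machine running the shim and was built with an MP3 encoder; `wav_to_mp3` and `encode_response` are hypothetical helper names, not part of xtts_api.py.

```python
import subprocess

# Read WAV from stdin (pipe:0), write MP3 to stdout (pipe:1).
FFMPEG_WAV_TO_MP3 = [
    "ffmpeg", "-hide_banner", "-loglevel", "error",
    "-i", "pipe:0",
    "-f", "mp3",
    "pipe:1",
]


def wav_to_mp3(wav_bytes: bytes) -> bytes:
    """Transcode in memory; raises CalledProcessError if ffmpeg fails."""
    result = subprocess.run(
        FFMPEG_WAV_TO_MP3,
        input=wav_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout


def encode_response(wav_bytes: bytes, response_format: str):
    """Return (body, content_type) for the format the caller asked for."""
    if response_format == "wav":
        return wav_bytes, "audio/wav"
    if response_format == "mp3":
        return wav_to_mp3(wav_bytes), "audio/mpeg"
    raise ValueError(f"unsupported response_format: {response_format!r}")
```

In the endpoint, the synthesized WAV would go through `encode_response(wav, requested_format)` and the returned content type would be sent as the response header.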
One more gotcha: env vars not reaching the gateway
If you're relying on OPENAI_TTS_BASE_URL via systemd/service startup, make sure the gateway process actually gets it (common failure mode after upgrades/reinstalls). Related doc section: https://docs.openclaw.ai/help/faq#env-vars-and-env-loading
If you paste the exact error/log line you see in 2026.3.2 (from openclaw logs --follow around the TTS attempt), I can tell you whether it's (a) baseUrl not applied, (b) non-200 from your shim, or (c) audio decode/playback failing on the returned bytes.
I gave up on XTTS. What I'm doing now instead is ElevenLabs. I want to use it in voice chat in Discord: I would speak, it would get processed by STT, then that gets sent to OpenAI, the reply is then sent to ElevenLabs, and the audio comes back to Discord (at least I think that's the workflow). I've updated OpenClaw to 2026.3.7. I can get /tts to work in general chat in Discord and I hear what I believe is my trained voice from ElevenLabs. However, if I join the bot to a voice channel and try speaking, I receive no response. Am I supposed to configure STT from my end somehow? I can provide my openclaw.json if that would help.

Through troubleshooting with Codex in VS Code, I've gathered there is an issue report: "OpenClaw docs explicitly warn about Discord DAVE receive instability and point to upstream discord.js issue #11419." Would that mean what I'm trying to do is not possible? Codex recommended a Python sidecar for the STT, similar to what I had set up for XTTS before. I'm unsure if I should go that route or not. This was the recommendation: "If you must keep Discord VC: use a separate Discord voice bridge service (e.g., discord.py/py-cord sidecar) for STT/TTS and route turns to OpenClaw via API/chat-completions, instead of OpenClaw's built-in Discord VC receiver."

It's possible I had Whisper set up before as well. However, all the processing was being handled on the Windows host back when I was using XTTS. I don't know if I should continue that if I go this route, or possibly run Whisper on the Ubuntu VM. The Windows host has more resources, and I do wish to have the quickest response times to process everything. What I don't understand is: I thought there was native OpenClaw support for what I'm trying to achieve already, and I'm unsure why I have not been able to get it to work so far.
What you're trying to do is supported (Discord VC → STT → agent → TTS → play back), but two common "gotchas" make it look like it's dead even when /tts in text chat works:
1) Discord VC receive/STT is separate from "/tts in chat"
/tts working in a text channel only proves outbound TTS is configured.
For VC, the pipeline is:
- Discord voice packets received
- Opus decode → PCM
- write a temp WAV
- run audio transcription via tools.media.audio
- send transcript to the agent
- run TTS (your ElevenLabs config)
- play audio into VC
So yes: there is "native OpenClaw support", but VC requires voice receive + transcription to be working on the gateway host.
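To illustrate the "PCM → temp WAV" step in the pipeline above, here is a small Python sketch that wraps raw decoded samples in a WAV container so a transcription tool can read them. This is not OpenClaw's implementation: the 48 kHz stereo 16-bit defaults are an assumption based on Discord's usual voice format, and `write_temp_wav` is a hypothetical name.

```python
import tempfile
import wave


def write_temp_wav(pcm: bytes, sample_rate: int = 48000, channels: int = 2) -> str:
    """Write raw 16-bit little-endian PCM to a temporary WAV file; return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    # The wave writer finalizes the header first, then the file itself is closed.
    with tmp, wave.open(tmp, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # bytes per sample (16-bit)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return tmp.name
```

The returned path can be passed to whichever transcription CLI or provider is configured. The caller is responsible for deleting the file afterwards.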
2) The #1 silent failure: missing opusscript (can't decode voice)
OpenClaw's Discord VC receiver currently decodes incoming voice with opusscript. If it's not installed, the gateway will join the VC fine, but it will never get usable audio → no transcript → no reply.
Check your gateway logs while you speak:
- if you see something like:
discord voice: opusscript unavailable ... cannot decode voice audio
that's the reason you get "no response".
Fix is simply: install opusscript where OpenClaw runs (Ubuntu VM). (How depends on whether you installed OpenClaw via global npm vs from-source pnpm. Tell me which and I'll give the exact command.)
3) The #2 silent failure: no transcription backend configured/available
VC transcription uses the same "media understanding (audio)" system as voice notes: tools.media.audio.
Docs: https://docs.openclaw.ai/nodes/audio
If you previously had Whisper running on Windows (sidecar), but now OpenClaw is on Ubuntu with no local whisper CLI/provider keys available, you'll get transcription empty and the bot won't respond.
Quick debugging trick: enable transcript echo so you can see whether STT is happening:
{
  tools: {
    media: {
      audio: {
        enabled: true,
        echoTranscript: true,
        // models: [...] // optional; if omitted it tries auto-detection (see docs)
      }
    }
  }
}
If you get no echoed transcript when speaking in VC, the STT side is the blocker.
4) About the DAVE/#11419 warning
That warning means VC receive can be flaky on some setups, not that it's impossible. It becomes relevant if your logs show repeated decrypt failures / receive errors. But before building a Python sidecar again, I'd validate the two basics above (opusscript + transcription working); those are much more common.
What I need from you (2 things)
- While the bot is in VC, run openclaw logs --follow and speak for ~3 seconds, then paste the log lines that mention discord voice:.
- Paste just these config sections (not your whole file if you don't want):
  - channels.discord.voice
  - channels.discord.voice.tts (or if you rely on messages.tts, that block)
  - tools.media.audio
With that, we can pinpoint whether youâre stuck at receive, decode, transcribe, agent, or tts playback.
That "ticket" link is almost certainly a phishing/scam URL (the weird ht tp :///\\\\@... obfuscation is a common trick to bypass filters). Do not click it.
Best next steps:
- Use Discord's Report message / Report spam flow (or notify server moderators) so it can be removed.
- If the poster is a real community member, assume their account may be compromised until proven otherwise.
For the actual OpenClaw VC issue: you're already in the right place, and no external "ticket" is needed. If you paste the openclaw logs --follow lines that include discord voice: while someone speaks in VC, we can diagnose whether you're stuck at opus decode (opusscript), STT (tools.media.audio), or DAVE receive.
I've confirmed opusscript is installed in my runtime and there are no opusscript unavailable errors in logs.
Here is the openclaw.json config section for what you requested:
What I need from you (2 things)
- While the bot is in VC, run openclaw logs --follow and speak for ~3 seconds, then paste the log lines that mention discord voice:.
- Paste just these config sections (not your whole file if you don't want):
channels.discord.voice
channels.discord.voice.tts (or if you rely on messages.tts, that block)
tools.media.audio
Requested config sections (redacted):
channels.discord.voice
{
  "enabled": true,
  "tts": {
    "provider": "elevenlabs",
    "elevenlabs": {
      "apiKey": "REDACTED",
      "voiceId": "c7aRb9qIoDCgl7IWc1OI",
      "modelId": "eleven_flash_v2_5"
    },
    "timeoutMs": 120000
  },
  "daveEncryption": true,
  "decryptionFailureTolerance": 120
}
messages.tts
{
  "provider": "elevenlabs",
  "elevenlabs": {
    "apiKey": "REDACTED",
    "voiceId": "c7aRb9qIoDCgl7IWc1OI",
    "modelId": "eleven_flash_v2_5"
  },
  "edge": {
    "enabled": false
  },
  "timeoutMs": 120000
}
tools.media.audio
{
  "enabled": true,
  "language": "en",
  "echoTranscript": true,
  "echoFormat": "📝 {transcript}",
  "models": [
    {
      "type": "provider",
      "provider": "openai",
      "model": "gpt-4o-mini-transcribe"
    }
  ]
}
Good. If opusscript is installed and you're not seeing the "opusscript unavailable" warning, the next issue is usually just lack of useful voice logs (because most of the VC pipeline logs are debug/verbose, not warn/error).
1) Turn file logs up to debug (so openclaw logs --follow actually shows VC capture/STT)
Per docs: https://docs.openclaw.ai/gateway/logging
Add (temporarily) to your openclaw.json:
{
  logging: {
    level: "debug"
  }
}
Restart the gateway, then run:
openclaw logs --follow
Now speak in VC for ~3 seconds and paste the lines containing discord voice: (you should start seeing things like capture start, capture ready, transcription ok/empty, tts ok/failed, etc).
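Debug logs get noisy, so a small filter can help pull out just the voice lines and show which stages were reached. The Python sketch below is only an illustration: the stage markers are taken from the examples mentioned above, the exact wording of real log lines is an assumption, and `voice_lines`, `stages_seen`, and `main` are hypothetical names.

```python
import sys

# Stage phrases mentioned in this thread; real log wording may differ.
STAGE_MARKERS = (
    "capture start",
    "capture ready",
    "transcription ok",
    "transcription empty",
    "tts ok",
    "tts failed",
)


def voice_lines(lines, marker="discord voice:"):
    """Yield only the lines that belong to the Discord voice pipeline."""
    for line in lines:
        if marker in line:
            yield line.rstrip("\n")


def stages_seen(lines):
    """Return the stage markers that appear, in first-seen order."""
    seen = []
    for line in voice_lines(lines):
        for stage in STAGE_MARKERS:
            if stage in line and stage not in seen:
                seen.append(stage)
    return seen


def main():
    """Print voice lines from stdin as they arrive."""
    for line in voice_lines(sys.stdin):
        print(line, flush=True)
```

With a call to `main()` added at the bottom of a script (say `voice_filter.py`, a hypothetical filename), it could be used as `openclaw logs --follow | python voice_filter.py`. If `stages_seen` on a saved log stops at `capture ready` or shows `transcription empty`, the STT side is the likely blocker.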
2) Please re-paste the config blocks (your message didn't include them)
Can you paste these as code blocks?
- channels.discord.voice
- channels.discord.voice.tts (or confirm you rely on messages.tts)
- tools.media.audio
tools.media.audio is especially important: if it's disabled (or has no working auto-detected model/provider on the Ubuntu VM), VC will often log transcription empty and you'll get exactly "no response" with no visible error unless debug is on.
Once you paste the discord voice: debug lines, we'll know immediately whether you're stuck at receive/decrypt, decode, transcription, agent reply, or tts playback.
Thanks, this config helps a lot. Your VC TTS side looks fine (ElevenLabs configured both in channels.discord.voice.tts and messages.tts).
The most likely blocker is actually VC STT, and your current STT config is not local:
tools.media.audio.models = [{ provider: "openai", model: "gpt-4o-mini-transcribe" }]
That means: for Discord voice chat to respond, your Ubuntu VM must have working OpenAI provider auth (e.g. OPENAI_API_KEY or an auth profile for provider openai). If you're otherwise running "local (Ollama)" and don't have OpenAI creds set up, the voice segments will transcribe to nothing → OpenClaw will log transcription empty (debug-only) and you'll get exactly "no response".
Quick 2-step confirmation
- Send a voice note attachment in a normal Discord text channel (not VC).
  Because you have echoTranscript: true, if STT is working you should get:
  📝 <your transcript>
  - If you get no transcript, STT is the issue (not ElevenLabs / not VC).
- Temporarily set file logs to debug so we can see VC pipeline:
  { logging: { level: "debug" } }
  restart gateway → openclaw logs --follow → speak in VC → paste the discord voice: lines.
If you want STT to be local again (recommended)
Switch tools.media.audio.models to a local CLI (whisper-cpp / python whisper / sherpa). Docs + examples: https://docs.openclaw.ai/nodes/audio
If you tell me whether you have whisper-cpp (whisper-cli) installed on the Ubuntu VM (or prefer running Whisper on Windows again), I can suggest the cleanest tools.media.audio.models entry for your setup.
One more question: do you currently have an OpenAI API key configured anywhere for the gateway? If not, that OpenAI-transcribe model entry won't ever work.
Whisper seemed to be the missing link. I have been going back and forth with codex ever since 3.7 was released to get voice chat working. how could I have shortcut this to know whisper was what was missing? is this in documentation anywhere so that anyone who wants to get voice working knows how to do it?
The "shortcut" is to think of Discord VC as two independent subsystems that both have to work:
1. Voice receive → Opus decode (that's where opusscript matters)
2. Audio → transcript (that's tools.media.audio, i.e. Whisper / provider transcription)
If (2) isn't working, VC looks like "the bot hears nothing" even though TTS in text chat works perfectly.
How you could have found it faster
- Do an STT smoke-test before VC:
  - enable tools.media.audio.echoTranscript: true (you already did)
  - send a normal voice note attachment in a text channel/DM
  - if you don't get 📝 <transcript>, then VC will never respond either (because VC uses the same transcription pipeline under the hood).
- Turn on debug file logs when debugging VC:
  Set logging.level: "debug" temporarily and then watch for discord voice: lines like transcription empty/ok. Without debug, you often won't see the "why".
- Run openclaw doctor when something "should work" but doesn't. It can surface missing routing/config around transcription (tools.media.audio.models). (It won't magically install Whisper, but it helps narrow down "STT isn't configured/available".)
Is this documented?
Yes, but it's currently a bit indirect:
- Audio transcription (Whisper / provider STT) is documented here:
  https://docs.openclaw.ai/nodes/audio
  (including the auto-detection order: whisper CLIs first, then providers)
- Discord voice channels are documented here:
  https://docs.openclaw.ai/channels/discord#voice-channels
  (but it doesn't scream loudly enough that VC requires a working tools.media.audio transcription path.)
If you want to help the next person: this is a great docs issue/PR request: "Add a Prerequisites bullet list under Discord Voice Channels: opusscript + tools.media.audio (Whisper) required, plus a quick 'echoTranscript' smoke test." You'd file that here: https://github.com/openclaw/openclaw/issues
If you want, paste the final "working" tools.media.audio block you landed on (CLI whisper vs provider), and I can suggest a clean minimal snippet to include in that docs request.
@tranquil linden So if everything changed with a recent update, what is the recommended flow for 'local' discord voice chat?
Don't patch anything for "local Discord VC". The recommended flow now is basically:
Recommended "local Discord voice chat" flow (2026.3.x)
1) Use OpenClaw's native Discord VC
- Enable + join with /vc join (docs: https://docs.openclaw.ai/channels/discord#voice-channels)
- Make sure the gateway host has opusscript installed (required to decode inbound voice)
2) Make STT explicitly local (this is the usual missing piece)
Configure tools.media.audio to use a local transcription CLI (Whisper / whisper-cpp / sherpa, etc). Docs: https://docs.openclaw.ai/nodes/audio
If STT isn't working, VC will look like "no response" even if TTS works in text chat.
3) LLM local (Ollama/etc)
Keep your agent model pointed at your local provider as you already do.
4) TTS "local" is still possible (despite the provider hardening)
TTS providers are now hard-limited to openai | elevenlabs | edge (docs: https://docs.openclaw.ai/tts), but "openai" supports a custom baseUrl, so you can run XTTS/Kokoro/LocalAI/etc behind an OpenAI-compatible /v1/audio/speech endpoint and point OpenClaw at it via: