#Voice
Discord has two different “voice” things in OpenClaw:
- Voice channels (live conversation)
- Voice messages (a voice-note attachment with waveform preview)
1) Voice channels: how it works
When you run /vc join, OpenClaw joins the voice channel and then loops like this:
- Listen: Discord sends Opus audio packets when someone starts speaking.
- Segment: It records until ~1s of silence (so you get speech “chunks”).
- Decode → WAV: It decodes Opus → PCM and writes a temporary .wav segment.
- STT (transcription): It transcribes that WAV using the normal audio capability pipeline (same machinery as “here’s an audio file, transcribe this”).
- Agent prompt: It forwards the transcript into the agent session (prefixed with the speaker name when available).
- Reply → TTS: It takes the agent’s text reply, runs TTS, and plays the audio back into the voice channel.
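For intuition, here is a minimal sketch of the Segment and Decode → WAV steps. It is not OpenClaw's actual code: the function names are made up for illustration, the Opus decode is assumed to have happened already, and the 48 kHz stereo 16-bit format is an assumed default.

```python
import wave
from typing import List


def segment_by_silence(packet_times_ms: List[int], silence_ms: int = 1000) -> List[List[int]]:
    """Group packet arrival times into speech chunks, splitting on gaps >= silence_ms.

    Hypothetical helper: packets only arrive while someone is speaking,
    so a long gap between arrivals is treated as silence.
    """
    segments: List[List[int]] = []
    current: List[int] = []
    for t in packet_times_ms:
        # A gap of at least silence_ms closes the current chunk.
        if current and t - current[-1] >= silence_ms:
            segments.append(current)
            current = []
        current.append(t)
    if current:
        segments.append(current)
    return segments


def write_wav(pcm: bytes, path: str, sample_rate: int = 48000, channels: int = 2) -> None:
    """Write already-decoded 16-bit PCM to a WAV file (Opus decoding is out of scope here)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
```

With arrival times of 0, 20, 40, 1500 and 1520 ms, the 1460 ms gap splits the input into two chunks, and each chunk would become its own temporary WAV for transcription.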
A couple of practical details:
- If someone starts talking while the bot is speaking, it stops playback to avoid talking over people.
- The voice channel is mapped to a Discord “channel peer”, so typed messages in that channel share context with the voice session.
Control commands: /vc join, /vc leave, /vc status.
Docs: https://docs.openclaw.ai/channels/discord#voice-channels
2) Voice messages (attachments) are separate
That’s when the bot sends a voice note (waveform preview). Different pipeline; requires local audio file conversion, and typically needs ffmpeg/ffprobe on the gateway host.
Docs: https://docs.openclaw.ai/channels/discord#voice-messages
Setup requirements (voice channels)
- Enable native slash commands: channels.discord.commands.native
- Configure: channels.discord.voice
- Bot permissions in the voice channel: Connect + Speak
If you tell me your openclaw --version and what you want (auto-join? listen-only? which TTS voice?), I can point at the exact config knobs for your setup.
I don't have any audio capability configured
Discord voice chat isn’t using Discord’s built-in STT — OpenClaw records what it hears and then runs the normal “audio media understanding” transcription pipeline. That pipeline is configured under:
tools.media.audio (models / provider / CLI fallbacks)
So if you truly have no audio transcription available (no provider keys + no local whisper CLI), then:
- /vc join will still join the channel,
- but it won’t produce transcripts → the agent never gets a prompt → it won’t reply (and thus won’t speak).
What to do
You’ve got two easy paths:
A) Rely on auto-detection (minimal config)
- Make sure you did not set tools.media.audio.enabled: false
- Provide at least one supported provider API key or install a supported local CLI (Whisper-ish)
- Then voice transcription should “just start working” without specifying tools.media.audio.models.
B) Explicitly configure audio transcription (recommended)
Add something like this (model id depends on the provider you choose):
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "<provider>", model: "<transcription-model-id>" },
          // optionally add a CLI fallback here
        ],
      },
    },
  },
}
Docs (these are the canonical references):
- Audio / transcription config: https://docs.openclaw.ai/nodes/audio
- Media understanding overview (same system voice chat uses): https://docs.openclaw.ai/nodes/media-understanding
- Discord voice channels feature: https://docs.openclaw.ai/channels/discord#voice-channels
Quick sanity check
Run with verbose logs and watch for lines like:
- discord voice: opus decoder … (decode OK)
- discord voice: transcription ok … (STT OK)
If you paste your current tools.media.audio block (and whether you prefer “local CLI” vs “API provider”), I can point at the smallest working config for your setup.
I have no idea what to use. I don't have audio api keys
If you don’t have any audio/STT API keys, you’ve basically got 3 options for Discord voice chat:
Option A (recommended, zero keys): install a local Whisper CLI
OpenClaw’s Discord voice feature needs transcription (STT). Without STT it can join, but it can’t “understand” anything you say.
The easiest no-key setup is to install a CLI that OpenClaw auto-detects, especially:
- whisper-cli from whisper.cpp / whisper-cpp
  - Once whisper-cli is on PATH, OpenClaw will auto-detect it and use it for tools.media.audio automatically.
  - On macOS via Homebrew, OpenClaw even looks for a default tiny model at: /opt/homebrew/share/whisper-cpp/for-tests-ggml-tiny.bin
  - If you want better quality, you can set WHISPER_CPP_MODEL to a larger model file path.
Docs: https://docs.openclaw.ai/nodes/audio (see “Auto-detection”)
Option B (also zero keys, but heavier): Python Whisper
Install a whisper CLI (Python). OpenClaw will also auto-detect a whisper binary if it’s on PATH.
This is usually more annoying (Python deps) and can be slower, but it works.
Option C (no “API key”, but still cloud): Gemini CLI
If you’re okay logging into a Google account, OpenClaw can use the Gemini CLI as a transcription backend (it probes for a gemini binary).
This avoids “API keys”, but it’s still a hosted service + has the usual “use a non-critical account” caveats.
FAQ: https://docs.openclaw.ai/help/faq#how-do-i-set-up-gemini-cli-oauth
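If you're not sure which of these is already installed, a quick check on the gateway host might look like the sketch below. It only tests whether the three binary names mentioned above (whisper-cli, whisper, gemini) resolve on PATH and whether WHISPER_CPP_MODEL is set. It does not reproduce OpenClaw's own detection logic, and the function name is made up for illustration.

```python
import os
import shutil

# Binary names taken from the options above.
CANDIDATE_CLIS = ("whisper-cli", "whisper", "gemini")


def find_stt_clis(candidates=CANDIDATE_CLIS):
    """Map each candidate binary name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in candidates}


if __name__ == "__main__":
    for name, path in find_stt_clis().items():
        print(f"{name}: {path or 'not found on PATH'}")
    # Optional override for a larger whisper.cpp model file.
    print("WHISPER_CPP_MODEL =", os.environ.get("WHISPER_CPP_MODEL", "(unset)"))
```

Run it as the same user (and with the same PATH) the gateway runs under, otherwise the result can differ from what OpenClaw sees.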
What I’d do
If your gateway is on macOS: install whisper-cpp (you’ll likely be up fast).
If your gateway is on a Linux VPS: either compile/install whisper.cpp, or use Gemini CLI if you don’t want local CPU burn.
If you tell me what OS your gateway host is (macOS / Linux / Windows+WSL), I’ll point you at the simplest “do this, then restart gateway, then /vc join works” path for that OS.
I think I already have a whisper cli installed, what about TTS
If Whisper CLI is already working for you, that covers the STT side (turning voice → text).
For Discord voice channels, the bot “speaks back” using the normal OpenClaw TTS system:
- Base config: messages.tts
- Discord voice-channel override (optional): channels.discord.voice.tts (this one only affects voice-channel playback)
No API keys? Use Edge TTS (default, keyless)
Edge TTS requires no API key. As long as you didn’t disable it, it should work out of the box.
Minimal “make it explicit” config:
{
  messages: {
    tts: {
      provider: "edge",
      edge: {
        enabled: true,
        // voice: "..." // optional
      },
    },
  },
  channels: {
    discord: {
      voice: {
        tts: { provider: "edge" }, // optional override, but nice to be explicit
      },
    },
  },
}
How to test quickly (in Discord)
Docs note: Discord has a built-in /tts, so OpenClaw registers /voice as the native command there — but typing /tts ... also works.
Try one of these:
- /voice audio Hello from OpenClaw (native)
- /tts audio Hello from OpenClaw (text command)
If that produces an audio reply, then voice-channel playback should work too.
Docs: https://docs.openclaw.ai/tts and https://docs.openclaw.ai/channels/discord#voice-channels
tools.exec.ask controls when OpenClaw will stop and ask you for approval before running an exec command on a real machine (gateway host or a paired node).
It’s separate from tools.exec.security:
- security = what is allowed in principle (deny | allowlist | full)
- ask = when to prompt a human (off | on-miss | always)
Values
- off: never prompt.
  - If security=allowlist, anything not allowlisted is just blocked (no prompt).
  - If security=full, it’ll just run (no prompt).
- on-miss (default): prompt only when the allowlist doesn’t match.
  - Allowlisted commands run without bothering you.
  - Non-allowlisted commands trigger an approval flow (“approve once / always / deny”).
- always: prompt on every command, even if allowlisted.
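To make the interaction concrete, here is a rough sketch of the rules above as a small decision function. It is a reading of this description, not OpenClaw's implementation: the function name is hypothetical, security=deny is assumed to block everything, and for ask=on-miss the sketch prompts on any allowlist miss because how that interacts with security=full isn't spelled out here.

```python
def exec_decision(security: str, ask: str, allowlisted: bool) -> str:
    """Return "run", "prompt", or "block" for a single exec command (illustrative only)."""
    if security not in ("deny", "allowlist", "full"):
        raise ValueError(f"unknown security value: {security!r}")
    if ask not in ("off", "on-miss", "always"):
        raise ValueError(f"unknown ask value: {ask!r}")

    if security == "deny":
        return "block"  # assumption: deny means nothing is allowed in principle
    if ask == "always":
        return "prompt"  # prompt on every command, even if allowlisted
    if ask == "on-miss":
        # Allowlisted commands run; anything else triggers the approval flow.
        return "run" if allowlisted else "prompt"
    # ask == "off": never prompt.
    if security == "full" or allowlisted:
        return "run"
    return "block"  # security=allowlist and no match: blocked without a prompt
```

For example, exec_decision("allowlist", "off", False) gives "block", while exec_decision("allowlist", "on-miss", False) gives "prompt".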