#/vc with openclaw
Yes — Discord voice commands (/vc join|leave|status) are gated by the same Discord guild/channel allowlist as normal Discord group messages.
If you have channels.discord.groupPolicy: "allowlist" and you configure a per‑guild channels map, then only the channels listed there are allowed — and /vc join specifically checks the target voice channel you pick.
Example (allowlist exactly one voice channel for voice commands):
{
  channels: {
    discord: {
      commands: { native: true }, // needed for /vc
      groupPolicy: "allowlist",
      guilds: {
        "123456789012345678": {
          channels: {
            // key can be the channel ID (recommended) or the slug/name
            "234567890123456789": { allow: true }, // <- VOICE channel id
          },
        },
      },
      voice: { enabled: true },
    },
  },
}
Notes:
- If the voice channel isn’t in that allowlist, you’ll get an error like “<#...> is not allowlisted for voice commands.”
- To restrict who can run /vc, add users: [...] or roles: [...] under the guild/channel entry (same place).
- Docs: https://docs.openclaw.ai/channels/discord (see Access control and routing + Voice channels)
What's the fastest TTS setup for voice channels on a Mac M4? I'm getting very slow responses, with minutes between them, and I'm constantly playing catch-up.
Voice channels are a pipeline: (1) transcribe audio → (2) run your chat model → (3) TTS → (4) play audio. “Minutes behind / catching up” usually means you’re building a backlog in (1) or (2) (not just TTS).
First: what openclaw --version are you on?
Fastest “known good” setup (Discord voice on a Mac)
1) Force a fast transcription provider (this is the big one).
By default, audio transcription auto-detects and will happily pick a local whisper CLI if you have it installed — that’s often painfully slow for Discord voice because it runs constantly on lots of tiny segments.
Configure tools.media.audio.models explicitly, e.g.:
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" }
          // or Deepgram if you prefer: { provider: "deepgram", model: "nova-3" }
        ],
      },
    },
  },
}
Docs: https://docs.openclaw.ai/nodes/audio
2) Set a low-latency TTS provider specifically for voice channels via channels.discord.voice.tts (this overrides messages.tts for voice playback only):
{
  channels: {
    discord: {
      voice: {
        enabled: true,
        tts: {
          provider: "openai",
          openai: { model: "gpt-4o-mini-tts", voice: "alloy" },
        },
      },
    },
  },
}
Docs: https://docs.openclaw.ai/channels/discord and https://docs.openclaw.ai/tts
3) Make the bot keep its answers short in that voice channel (this reduces LLM time + TTS time + playback time). You can do this per channel:
{
  channels: {
    discord: {
      guilds: {
        "YOUR_GUILD_ID": {
          channels: {
            "YOUR_VOICE_CHANNEL_ID": {
              allow: true,
              systemPrompt: "Voice mode. Reply with 1 short sentence unless asked for detail.",
            },
          },
        },
      },
    },
  },
}
Quick debug to pinpoint where the minutes are going
Run openclaw logs --follow while reproducing. If you paste ~30 lines around the delay (redact secrets), we can usually tell whether you’re stuck in:
- transcription (slow/CLI auto-detect),
- model latency (too-slow chat model for realtime),
- or playback issues.
If you reply with:
- openclaw --version
- your channels.discord.voice + tools.media.audio config blocks (redacted)
…I can point to the exact knob that’ll fix the backlog in your setup.
What is the exact correct way to add OpenAI as a provider in openclaw.json for TTS? Specifically how do I reference the OPENAI_API_KEY from .env? Every format I try either fails with 'baseUrl required' or 'invalid input'. Can someone share a working example?
For TTS you do not “add OpenAI as a model provider” (that’s models.providers.*). You configure it under messages.tts (and for Discord voice channels specifically: channels.discord.voice.tts). OpenAI TTS does not need a baseUrl.
1) Put the key in the right .env
On macOS (especially when running as a daemon), the reliable place is:
~/.openclaw/.env (not your project folder)
OPENAI_API_KEY=sk-...
Docs: https://docs.openclaw.ai/help/environment
Then restart:
openclaw gateway restart
2) Working TTS config (uses OPENAI_API_KEY automatically)
You can omit apiKey entirely — it falls back to OPENAI_API_KEY from env.
Global auto‑TTS for replies:
{
  messages: {
    tts: {
      auto: "always",
      provider: "openai",
      openai: {
        model: "gpt-4o-mini-tts",
        voice: "alloy",
      },
    },
  },
}
Discord voice channels (what /vc uses):
{
  channels: {
    discord: {
      voice: {
        enabled: true,
        tts: {
          provider: "openai",
          openai: { voice: "alloy", model: "gpt-4o-mini-tts" },
        },
      },
    },
  },
}
Docs: https://docs.openclaw.ai/tts and https://docs.openclaw.ai/channels/discord
3) If you really want to reference the env var inside JSON5
Env-substitution works via ${VAR}:
{
  messages: {
    tts: {
      provider: "openai",
      auto: "always",
      openai: {
        apiKey: "${OPENAI_API_KEY}",
        model: "gpt-4o-mini-tts",
        voice: "alloy",
      },
    },
  },
}
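If it helps to see what ${VAR} substitution means in practice, here is a small Python sketch of the general idea. It is not OpenClaw's implementation: the function name substitute_env is made up for illustration, and raising on an unset variable is an assumption (the thread does not say how OpenClaw handles that case).

```python
import os
import re

# Matches ${NAME} where NAME looks like a typical environment variable name.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(text, env=None):
    """Replace each ${NAME} in text with env[NAME].

    ASSUMPTION: an unset variable raises KeyError. A real config loader may
    behave differently (for example, leave the placeholder untouched).
    """
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _VAR.sub(replace, text)


# Example with a fake key passed in explicitly instead of the real environment.
line = substitute_env('apiKey: "${OPENAI_API_KEY}"', {"OPENAI_API_KEY": "sk-example"})
print(line)
```

The practical takeaway is that the placeholder is only useful if the variable is actually visible to the process that loads the config, which is why the .env location in step 1 matters.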
If you paste the exact snippet that’s giving you “baseUrl required”, I can tell you which section you accidentally put it in (that error almost always comes from configuring a custom OpenAI-compatible provider under models.providers, not from TTS).
Got it! But why is the wait between my talking and OpenClaw responding so long? I asked what time it was and it took 1 min 4 seconds to answer.
That delay is basically the sum of the whole voice pipeline:
- Wait for you to stop talking (OpenClaw captures until it sees ~1s of silence)
- Decode Opus → WAV
- Transcribe (STT)
- Run the chat model
- TTS
- Play it back (and if anything is already “in flight”, it queues)
A “normal” round-trip for “what time is it?” should feel like a few seconds. ~60s usually means one of these is happening:
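To make the "sum of the pipeline" idea concrete, here is a minimal Python sketch that adds up per-stage timings and names the slowest stage. The stage names mirror the list above; the numbers are made-up placeholders chosen to total 64 seconds, not measurements from OpenClaw.

```python
# Hypothetical per-stage timings in seconds for one voice turn.
# These values are illustrative placeholders, not OpenClaw measurements.
stage_seconds = {
    "silence_wait": 1.0,
    "opus_to_wav": 0.1,
    "transcription": 55.0,
    "chat_model": 3.0,
    "tts": 1.5,
    "playback_queue": 3.4,
}


def summarize(stages):
    """Return (total seconds, name of the slowest stage)."""
    total = sum(stages.values())
    slowest = max(stages, key=stages.get)
    return total, slowest


total, slowest = summarize(stage_seconds)
print(f"total: {total:.1f}s, slowest stage: {slowest}")
```

In a breakdown like this one, a single stage accounts for almost all of the delay, so tuning the others would barely change the round-trip time.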
A) It’s not detecting “silence” quickly (open mic / background noise)
If the mic is hot and there’s constant noise, the “end after silence” trigger can take ages, so transcription doesn’t even start yet.
Fixes:
- Try push-to-talk (best quick test)
- Increase Discord input sensitivity / enable noise suppression
- Speak, then pause cleanly (no trailing words/noise)
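Here is a rough Python sketch of why background noise can delay the "end after silence" trigger. It uses a generic level-threshold detector, which is an assumption for illustration and not necessarily how OpenClaw or Discord decides that you stopped talking; the frame size, threshold, and levels are made up.

```python
def end_of_speech_index(frame_levels, threshold, silence_frames):
    """Return the index of the frame that completes the required run of
    quiet frames, or None if that run never happens."""
    quiet_run = 0
    for i, level in enumerate(frame_levels):
        # A loud frame resets the count; a quiet frame extends it.
        quiet_run = quiet_run + 1 if level < threshold else 0
        if quiet_run >= silence_frames:
            return i
    return None


# Illustrative values: with 20 ms frames, 50 quiet frames is about 1 s.
speech = [0.8] * 100                 # 2 s of talking
clean_pause = speech + [0.01] * 60   # followed by a quiet room
noisy_pause = speech + [0.2] * 60    # followed by fan/keyboard noise

clean_end = end_of_speech_index(clean_pause, threshold=0.05, silence_frames=50)
noisy_end = end_of_speech_index(noisy_pause, threshold=0.05, silence_frames=50)
print(clean_end, noisy_end)
```

With the clean pause the detector fires one second after speech ends; with steady noise above the threshold it never fires within the sample, which is the situation push-to-talk or noise suppression avoids.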
B) STT is using a slow local CLI (most common)
Audio transcription auto-detects and will prefer local whisper/whisper-cpp if present — which can be way too slow for realtime Discord voice.
Force a fast provider explicitly (so it doesn’t auto-pick a CLI):
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" }
          // or: { provider: "deepgram", model: "nova-3" }
        ],
      },
    },
  },
}
Docs: https://docs.openclaw.ai/nodes/audio
C) Backlog/queuing (you spoke again, or multiple people are talking)
Voice processing + playback are serialized per voice session, so if segments pile up, it “plays catch up”.
Fix: wait for it to finish speaking before talking again (and keep replies short).
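A quick Python simulation shows why a serialized session falls further behind over time. This is a generic single-worker queue model with made-up numbers, not OpenClaw's actual scheduler.

```python
def backlog_delays(arrival_times, service_seconds):
    """Simulate a single serialized worker.

    Returns, for each segment, the seconds from its arrival to the moment
    its processing finishes.
    """
    delays = []
    free_at = 0  # time at which the worker finishes its current segment
    for arrived in arrival_times:
        start = max(arrived, free_at)  # wait if the worker is still busy
        free_at = start + service_seconds
        delays.append(free_at - arrived)
    return delays


# Illustrative numbers only: a new segment every 5 s, each taking 12 s to handle.
arrivals = [0, 5, 10, 15, 20]
delays = backlog_delays(arrivals, service_seconds=12)
print(delays)
```

Each segment waits longer than the one before it because new audio arrives faster than it can be processed. Shorter replies and pausing between turns both reduce the per-segment cost or the arrival rate.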
Fast way to pinpoint which step is costing 60 seconds
Run this while reproducing:
openclaw logs --follow
Look for discord voice: lines like:
- capture ready (...)
- transcription ok (...)
- reply ok (...)
- tts ok (...)
- playback start / playback done
The timestamps between those tell you exactly where the minute is going.
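If you would rather not eyeball the timestamps, a small Python sketch like this can compute the gaps between stages. It assumes each log line starts with an ISO 8601 timestamp followed by a space; that layout is a guess, and the sample lines below are invented for illustration, so adjust the parsing to whatever openclaw logs actually prints.

```python
from datetime import datetime

# Stage markers named in this thread, in pipeline order.
MARKERS = ["capture ready", "transcription ok", "reply ok", "tts ok",
           "playback start", "playback done"]


def stage_gaps(lines):
    """Return [(previous marker, marker, seconds between them)] for one turn.

    ASSUMPTION: each line starts with an ISO 8601 timestamp followed by a
    space. The real log layout may differ; adapt the parsing as needed.
    """
    events = []
    for line in lines:
        if "discord voice:" not in line:
            continue
        stamp, _, rest = line.partition(" ")
        for marker in MARKERS:
            if marker in rest:
                events.append((datetime.fromisoformat(stamp), marker))
                break
    return [(a[1], b[1], (b[0] - a[0]).total_seconds())
            for a, b in zip(events, events[1:])]


# Made-up sample lines in the assumed format (not real OpenClaw output).
sample = [
    "2025-01-01T12:00:01 discord voice: capture ready (...)",
    "2025-01-01T12:00:56 discord voice: transcription ok (...)",
    "2025-01-01T12:00:59 discord voice: reply ok (...)",
    "2025-01-01T12:01:01 discord voice: tts ok (...)",
    "2025-01-01T12:01:01 discord voice: playback start",
    "2025-01-01T12:01:05 discord voice: playback done",
]
gaps = stage_gaps(sample)
for before, after, seconds in gaps:
    print(f"{before} -> {after}: {seconds:.0f}s")
```

The largest gap points at the stage to fix first; in this invented sample it is the step that ends at "transcription ok".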
If you paste a small redacted log chunk from one slow turn (just the discord voice: lines), I can tell you whether it’s silence detection, STT, model, or playback queue.