#Voice

small ravine
#

There is now support for voice channels on Discord, but how does voice chat work?

median lagoonBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

tawny patrolBOT
#

OpenClaw has two different Discord “voice” features:

  1. Voice channels (live conversation)
  2. Voice messages (a voice-note attachment with waveform preview)

1) Voice channels: how it works

When you run /vc join, OpenClaw joins the voice channel and then loops like this:

  • Listen: Discord sends Opus audio packets when someone starts speaking.
  • Segment: It records until ~1s of silence (so you get speech “chunks”).
  • Decode → WAV: It decodes Opus → PCM and writes a temporary .wav segment.
  • STT (transcription): It transcribes that WAV using the normal audio capability pipeline (same machinery as “here’s an audio file, transcribe this”).
  • Agent prompt: It forwards the transcript into the agent session (prefixed with the speaker name when available).
  • Reply → TTS: It takes the agent’s text reply, runs TTS, and plays the audio back into the voice channel.
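
The Segment and Decode → WAV steps above can be sketched roughly as follows. This is a minimal illustration, not OpenClaw's actual implementation: the function names (`segment_frames`, `pcm_to_wav_bytes`) are hypothetical, the 20 ms frame length and 48 kHz stereo 16-bit PCM format are assumptions, Opus decoding is left out, and a `None` entry stands in for "no packet arrived during this frame slot".

```python
import io
import wave
from typing import Iterable, Iterator, List, Optional

FRAME_MS = 20      # assumed duration of one decoded frame
SILENCE_MS = 1000  # ~1s of silence closes a segment, as described above


def segment_frames(
    frames: Iterable[Optional[bytes]],
    frame_ms: int = FRAME_MS,
    silence_ms: int = SILENCE_MS,
) -> Iterator[List[bytes]]:
    """Group decoded PCM frames into speech chunks split on silence.

    Each item is either PCM bytes (someone is speaking) or None
    (a silent frame slot). Hypothetical helper for illustration.
    """
    current: List[bytes] = []
    silent_for = 0
    for frame in frames:
        if frame is not None:
            current.append(frame)
            silent_for = 0
        elif current:
            silent_for += frame_ms
            if silent_for >= silence_ms:
                yield current
                current = []
                silent_for = 0
    if current:
        # Flush whatever was recorded when the stream ends.
        yield current


def pcm_to_wav_bytes(
    frames: List[bytes],
    sample_rate: int = 48000,  # assumed sample rate
    channels: int = 2,         # assumed stereo
    sample_width: int = 2,     # 16-bit samples
) -> bytes:
    """Wrap raw PCM frames in a WAV container using the standard library."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(frames))
    return buf.getvalue()
```

Each list yielded by `segment_frames` corresponds to one speech chunk; passing it to `pcm_to_wav_bytes` gives the kind of temporary WAV payload that would then be handed to transcription.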

A couple of practical details:

  • If someone starts talking while the bot is speaking, it stops playback to avoid talking over people.
  • The voice channel is mapped to a Discord “channel peer”, so typed messages in that channel share context with the voice session.

Control commands: /vc join, /vc leave, /vc status.

Docs: https://docs.openclaw.ai/channels/discord#voice-channels

2) Voice messages (attachments) are separate

That’s when the bot sends a voice note (with a waveform preview). It uses a different pipeline that requires local audio file conversion and typically needs ffmpeg/ffprobe on the gateway host.

Docs: https://docs.openclaw.ai/channels/discord#voice-messages

Setup requirements (voice channels)

  • Enable native slash commands: channels.discord.commands.native
  • Configure: channels.discord.voice
  • Bot permissions in the voice channel: Connect + Speak

If you tell me your openclaw --version and what you want (auto-join? listen-only? which TTS voice?), I can point at the exact config knobs for your setup.

small ravine
#

I don't have any audio capability configured

tawny patrolBOT
#

Discord voice chat isn’t using Discord’s built-in STT — OpenClaw records what it hears and then runs the normal “audio media understanding” transcription pipeline. That pipeline is configured under:

  • tools.media.audio (models / provider / CLI fallbacks)

So if you truly have no audio transcription available (no provider keys + no local whisper CLI), then:

  • /vc join will still join the channel,
  • but it won’t produce transcripts → the agent never gets a prompt → it won’t reply (and thus won’t speak).

What to do

You’ve got two easy paths:

A) Rely on auto-detection (minimal config)

  • Make sure you did not set tools.media.audio.enabled: false
  • Provide at least one supported provider API key or install a supported local CLI (Whisper-ish)
  • Then voice transcription should “just start working” without specifying tools.media.audio.models.

B) Explicitly configure audio transcription (recommended)
Add something like this (model id depends on the provider you choose):

{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "<provider>", model: "<transcription-model-id>" }
          // optionally add a CLI fallback here
        ],
      },
    },
  },
}

Docs (these are the canonical references):

Quick sanity check

Run with verbose logs and watch for lines like:

  • discord voice: opus decoder … (decode OK)
  • discord voice: transcription ok … (STT OK)

If you paste your current tools.media.audio block (and whether you prefer “local CLI” vs “API provider”), I can point at the smallest working config for your setup.

small ravine
#

I have no idea what to use. I don't have audio api keys

tawny patrolBOT
#

If you don’t have any audio/STT API keys, you’ve basically got 3 options for Discord voice chat:

Option A (recommended, zero keys): install a local Whisper CLI

OpenClaw’s Discord voice feature needs transcription (STT). Without STT it can join, but it can’t “understand” anything you say.

The easiest no-key setup is to install a CLI that OpenClaw auto-detects, especially:

  • whisper-cli from whisper.cpp / whisper-cpp
    • Once whisper-cli is on PATH, OpenClaw will auto-detect it and use it for tools.media.audio automatically.
    • On macOS via Homebrew, OpenClaw even looks for a default tiny model at:
      /opt/homebrew/share/whisper-cpp/for-tests-ggml-tiny.bin
    • If you want better quality, you can set WHISPER_CPP_MODEL to a larger model file path.

Docs: https://docs.openclaw.ai/nodes/audio (see “Auto-detection”)

Option B (also zero keys, but heavier): Python Whisper

Install a whisper CLI (Python). OpenClaw will also auto-detect a whisper binary if it’s on PATH.

This is usually more annoying (Python deps) and can be slower, but it works.

Option C (no “API key”, but still cloud): Gemini CLI

If you’re okay logging into a Google account, OpenClaw can use the Gemini CLI as a transcription backend (it probes for a gemini binary).
This avoids “API keys”, but it’s still a hosted service + has the usual “use a non-critical account” caveats.

FAQ: https://docs.openclaw.ai/help/faq#how-do-i-set-up-gemini-cli-oauth
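
To see which of the three options is already available on the gateway host, a quick check along these lines may help. It is only a sketch: the helper names (`find_stt_backends`, `whisper_cpp_model`) are hypothetical, and it simply looks for the binary names mentioned above on PATH. It does not reproduce OpenClaw's own auto-detection logic or its probing order.

```python
import os
import shutil
from typing import Dict, Mapping, Optional

# Binary names taken from the options above.
CANDIDATE_BINARIES = ("whisper-cli", "whisper", "gemini")


def find_stt_backends(path: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Map each candidate binary to its resolved location, or None if absent.

    `path` overrides the PATH search string (useful for testing).
    """
    return {name: shutil.which(name, path=path) for name in CANDIDATE_BINARIES}


def whisper_cpp_model(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return WHISPER_CPP_MODEL if it is set and points at an existing file."""
    model = env.get("WHISPER_CPP_MODEL")
    return model if model and os.path.isfile(model) else None


if __name__ == "__main__":
    for name, location in find_stt_backends().items():
        print(f"{name}: {location or 'not found on PATH'}")
    print(f"WHISPER_CPP_MODEL: {whisper_cpp_model() or 'unset or missing file'}")
```

Run it as the same user that runs the gateway, since PATH and environment variables can differ between your shell and a service.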


What I’d do

If your gateway is on macOS: install whisper-cpp (you’ll likely be up and running fast).
If your gateway is on a Linux VPS: either compile/install whisper.cpp, or use Gemini CLI if you don’t want local CPU burn.

If you tell me what OS your gateway host is (macOS / Linux / Windows+WSL), I’ll point you at the simplest “do this, then restart gateway, then /vc join works” path for that OS.

small ravine
#

I think I already have a whisper cli installed, what about TTS

tawny patrolBOT
#

If Whisper CLI is already working for you, that covers the STT side (turning voice → text).

For Discord voice channels, the bot “speaks back” using the normal OpenClaw TTS system:

  • Base config: messages.tts
  • Discord voice-channel override (optional): channels.discord.voice.tts (this one only affects voice-channel playback)

No API keys? Use Edge TTS (default, keyless)

Edge TTS requires no API key. As long as you didn’t disable it, it should work out of the box.

Minimal “make it explicit” config:

{
  messages: {
    tts: {
      provider: "edge",
      edge: {
        enabled: true
        // voice: "..." // optional
      }
    }
  },
  channels: {
    discord: {
      voice: {
        tts: { provider: "edge" } // optional override, but nice to be explicit
      }
    }
  }
}

How to test quickly (in Discord)

Docs note: Discord has a built-in /tts, so OpenClaw registers /voice as the native command there — but typing /tts ... also works.

Try one of these:

  • /voice audio Hello from OpenClaw (native)
  • /tts audio Hello from OpenClaw (text command)

If that produces an audio reply, then voice-channel playback should work too.

Docs: https://docs.openclaw.ai/tts and https://docs.openclaw.ai/channels/discord#voice-channels

tawny patrolBOT
#

tools.exec.ask controls when OpenClaw will stop and ask you for approval before running an exec command on a real machine (gateway host or a paired node).

It’s separate from tools.exec.security:

  • security = what is allowed in principle (deny | allowlist | full)
  • ask = when to prompt a human (off | on-miss | always)

Values

  • off: never prompt.
    • If security=allowlist, anything not allowlisted is just blocked (no prompt).
    • If security=full, it’ll just run (no prompt).
  • on-miss (default): prompt only when the allowlist doesn’t match.
    • Allowlisted commands run without bothering you.
    • Non-allowlisted commands trigger an approval flow (“approve once / always / deny”).
  • always: prompt on every command, even if allowlisted.
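
The way the two settings combine can be sketched as a small decision function. This is a model of the rules listed above, not OpenClaw's code: the `decide` function is hypothetical, and two cases are assumptions because they are not spelled out above (security=deny blocks everything before ask is consulted, and on-miss under security=full still prompts for non-allowlisted commands).

```python
from typing import Literal

Security = Literal["deny", "allowlist", "full"]
Ask = Literal["off", "on-miss", "always"]
Decision = Literal["run", "prompt", "block"]


def decide(security: Security, ask: Ask, allowlisted: bool) -> Decision:
    """Return what happens to an exec command under the given settings."""
    if security == "deny":
        # Assumption: deny blocks everything, regardless of `ask`.
        return "block"
    if ask == "always":
        # Prompt on every command, even if allowlisted.
        return "prompt"
    if allowlisted:
        # Allowlisted commands run without bothering you.
        return "run"
    if ask == "on-miss":
        # The allowlist did not match: start the approval flow.
        # Assumption: this also applies when security is "full".
        return "prompt"
    # ask == "off": never prompt.
    return "run" if security == "full" else "block"
```

For example, `decide("allowlist", "off", False)` gives `"block"`, while `decide("allowlist", "on-miss", False)` gives `"prompt"`.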

Where it applies