#Discord Voice STT failing on silent audio chunks — Audio transcription response missing transcript

amber temple
#

OpenClaw v2026.4.12

Issue:
Discord voice STT is broken. OpenClaw processes every tiny audio chunk (including silence) individually via Deepgram. Silent chunks return an empty transcript, which triggers requireTranscriptionText to throw "Audio transcription response missing transcript", which kills the whole voice session.
Evidence:

Deepgram nova-3 works correctly when tested manually against the WAV files in /tmp/openclaw/discord-voice-*/
Short silent chunks (0.1–0.6s) return empty transcript → error
Longer speech chunks (2MB+) return correct transcription when tested manually
OpenClaw never processes the longer chunks because it fails on the silent ones first

Config:
"tools": {
  "media": {
    "audio": {
      "enabled": true,
      "models": [{"provider": "deepgram", "model": "nova-3"}]
    }
  }
}
Request:
Can OpenClaw add silence filtering / minimum duration threshold before sending audio chunks to STT? Or a config option to skip empty transcript results rather than throwing an error?
Also: ElevenLabs TTS shows no provider registered despite being in plugins.allow and plugins.entries with enabled: true. Only loads when added via plugins.entries directly. Is this expected?

fickle forgeBOT
#

I checked the v2026.4.12 tag specifically.

Short version:

  • Your Deepgram STT report looks real.
  • There is already a minimum duration / size guard, but not silence filtering.
  • There is no config I could find for “skip empty transcript” or Discord voice min-duration tuning.
  • For ElevenLabs, plugins.allow alone is not enough. Explicit enablement is expected today.

What I found:

  • In extensions/discord/src/voice/manager.ts, Discord voice already drops very short captures with:
    • MIN_SEGMENT_SECONDS = 0.35
    • aborted streams use 0.2s
  • In generic audio transcription, there is also:
    • MIN_AUDIO_FILE_BYTES = 1024
    • enforced in src/media-understanding/runner.entries.ts

But that still does not do silence/VAD filtering. A short silent WAV can easily be bigger than 1024 bytes, so it still gets sent to STT.
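To make the gap concrete, here is a minimal sketch of the kind of energy-based silence check that is missing. None of this is OpenClaw code: `rmsLevel`, `isSilent`, `pcmBytesFor`, and the 0.01 threshold are hypothetical, and the 48 kHz stereo 16-bit PCM format is an assumption about the decoded Discord audio.

```typescript
// Hypothetical helpers, not part of OpenClaw. Assumes decoded 16-bit PCM.

/** Root-mean-square level of 16-bit PCM samples, normalized to the range 0..1. */
function rmsLevel(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  const sumSquares = samples.reduce((acc, s) => {
    const n = s / 32768; // scale to -1..1
    return acc + n * n;
  }, 0);
  return Math.sqrt(sumSquares / samples.length);
}

/** True when the segment's RMS level is below the (assumed) threshold. */
function isSilent(samples: Int16Array, threshold = 0.01): boolean {
  return rmsLevel(samples) < threshold;
}

/** Raw PCM payload size for a segment of the given duration (assumed format). */
function pcmBytesFor(
  seconds: number,
  sampleRate = 48000,
  channels = 2,
  bytesPerSample = 2,
): number {
  return Math.round(seconds * sampleRate * channels * bytesPerSample);
}

// A 0.35 s segment is 67,200 bytes of PCM under these assumptions, far above
// a 1024-byte floor, so a size guard alone cannot reject silent audio.
const silentSegment = new Int16Array(16800 * 2); // 0.35 s of stereo zeros
console.log(pcmBytesFor(0.35), isSilent(silentSegment));
```

A check like this would have to run before the STT call. A fixed RMS threshold is the simplest option; a real VAD would cope better with background noise.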

Then in extensions/deepgram/audio.ts, empty Deepgram output is treated as a hard error:

  • requireTranscriptionText(..., "Audio transcription response missing transcript")

So in v2026.4.12, I do not see a “blank transcript = skip” path for Deepgram. The Discord voice manager has an if (!transcript) return; branch, but Deepgram throws before that branch can be reached.
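As a sketch of what a "blank transcript = skip" path could look like, the wrapper below turns that one specific error into an empty string so a caller's `if (!transcript) return;` check can run. Only the error message is taken from the thread. The `Transcribe` type, `isMissingTranscriptError`, and `tolerateEmptyTranscript` are hypothetical names, and matching on the message text is an assumption about how the error surfaces.

```typescript
// Hypothetical sketch, not OpenClaw's actual code.
const MISSING_TRANSCRIPT = "Audio transcription response missing transcript";

type Transcribe = (audio: Uint8Array) => Promise<string>;

/** True only for the "missing transcript" error quoted in the thread. */
function isMissingTranscriptError(err: unknown): boolean {
  return err instanceof Error && err.message.includes(MISSING_TRANSCRIPT);
}

/** Wraps a transcriber so an empty STT result yields "" instead of throwing. */
function tolerateEmptyTranscript(transcribe: Transcribe): Transcribe {
  return async (audio) => {
    try {
      return await transcribe(audio);
    } catch (err) {
      if (isMissingTranscriptError(err)) {
        return ""; // lets the caller's empty-transcript check skip the segment
      }
      throw err; // every other failure still propagates
    }
  };
}
```

Matching on a message string is brittle. A dedicated error class or an explicit "allow empty" option in the provider would be cleaner, but that would need a change upstream.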

One nuance: from the source, this looks like a segment-processing failure, not a literal voice disconnect. The queue catches the error and logs discord voice: processing failed: ..., so it should continue processing later segments. Still, the empty-transcript behavior is clearly bad and noisy.

For ElevenLabs:

  • plugins.allow is an allowlist, not an enable switch.
  • extensions/elevenlabs/openclaw.plugin.json does not set enabledByDefault: true.
  • I also do not see auto-enable logic for messages.tts.providers.elevenlabs / talk.providers.elevenlabs.

So yes, needing plugins.entries.elevenlabs.enabled: true (or openclaw plugins enable elevenlabs) is expected today. plugins.allow: ["elevenlabs"] by itself only makes it eligible to load.

amber temple
#

Thanks King @fickle forge 🦐

I will look through all of this now. What would you suggest as the best setup for a conversational agent on the go? I had no issues setting up Groq + EL on Slack, but chose to move to Discord due to future cost implications.

fickle forgeBOT
#

Glad that helped 🦐

If I were setting up an on-the-go conversational agent today, I’d choose between these:

1. Best overall, most reliable: Discord text + ElevenLabs replies

  • Use Discord as the chat surface
  • Keep messages.tts.auto on inbound or tagged, so TTS only fires when useful
  • Do not make Discord live voice your primary path on v2026.4.12 yet
  • Explicitly enable ElevenLabs with plugins.entries.elevenlabs.enabled=true

2. Best voice UX, if you already have it: Talk mode + ElevenLabs

3. Cheapest experimental voice path: Discord voice + Deepgram + ElevenLabs

  • Works in principle, but I’d treat it as beta until the empty-transcript handling gets fixed
  • Right now silent chunks can still poison the STT flow

If you want, I can sketch a minimal “best today” Discord + ElevenLabs config next.

fickle forgeBOT
#

Yep, here’s the minimal “best today” setup I’d use for Discord text + ElevenLabs replies, not Discord live voice:

{
  plugins: {
    allow: ["discord", "elevenlabs"],
    entries: {
      elevenlabs: {
        enabled: true,
      },
    },
  },

  channels: {
    discord: {
      enabled: true,
      voice: {
        enabled: false, // I’d keep live Discord voice off for now on v2026.4.12
      },
    },
  },

  messages: {
    tts: {
      auto: "inbound", // or "tagged" if you want tighter cost control
      provider: "elevenlabs",
      providers: {
        elevenlabs: {
          voiceId: "YOUR_ELEVENLABS_VOICE_ID",
        },
      },
    },
  },
}

A couple notes:

  • Set ELEVENLABS_API_KEY (or XI_API_KEY) in the gateway environment
  • plugins.allow makes ElevenLabs eligible, but plugins.entries.elevenlabs.enabled=true is the bit that actually turns it on
  • auto: "inbound" is a nice middle ground for mobile use
  • If you want the cheapest behavior, use auto: "tagged"

If you want to experiment with Discord voice later, keep your existing Deepgram block, then flip:

amber temple
#

All good, thank you. I will implement this clean approach now.