#Gemini TTS generates .wav files that fail to be delivered by the agent

small lantern
#

Describe what you're trying to do:
I'm using the native Gemini TTS feature in OpenClaw 2026.4.15. My goal is to have the agent automatically reply with voice messages using the google provider (Gemini 3.1 Flash TTS).

What's happening instead:
When the agent generates a response, it correctly triggers the TTS. A .wav file is successfully generated in the media/outbound/ directory. However, the agent fails to deliver the audio message to the channel (WhatsApp).

Interestingly:

  1. Manual audio delivery works perfectly when the file is in .mp3 format.
  2. The agent seems to ignore or fail to process the .wav file for delivery.

Setup:

  • Hosting: Local Macbook Pro (macOS 15.x)
  • OpenClaw Version: 2026.4.15
  • Node Version: 24.11.1
  • Provider: Google Gemini (native plugin)
  • Channel: WhatsApp

Relevant Config:

{
  "messages": {
    "tts": {
      "auto": "always",
      "provider": "google",
      "providers": {
        "google": {
          "model": "gemini-3.1-flash-tts-preview",
          "voiceName": "Gacrux"
        }
      }
    }
  }
}

Logs/Error Messages:
The files are generated locally:
~/openclaw-local/jedi-labs/media/outbound/tts-hera-XXXX.wav exists on disk.
But the agent log shows no attempt to upload/send the .wav file, unlike when .mp3 files are present.

Is there a known limitation with .wav delivery in the current version or a missing ffmpeg conversion step for the native Gemini TTS plugin?

errant flowerBOT
#

Yep, this looks like a real limitation in 2026.4.15, not just a missing local ffmpeg.

What I found:

  • The docs explicitly say Google Gemini TTS returns raw PCM, which OpenClaw wraps as WAV for attachments. They also say the native Opus voice-note format is not supported by this path:
    https://docs.openclaw.ai/tools/tts
  • In source, the Google provider returns:
    • outputFormat: "wav"
    • fileExtension: ".wav"
    • voiceCompatible: false
  • The TTS reply path only marks audio as a native voice note when voiceCompatible === true.
  • WhatsApp’s voice-note/PTT path expects Opus/OGG style audio, not Gemini’s WAV output.
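
For context, "wrapping raw PCM as WAV" only means prepending a 44-byte RIFF header. A minimal sketch, not OpenClaw's actual code: the function name is hypothetical, and the defaults (24 kHz, mono, 16-bit) are an assumption about what Gemini TTS returns.

```typescript
// Wrap raw little-endian PCM samples in a minimal 44-byte RIFF/WAVE header.
// ASSUMPTION: 24 kHz, mono, 16-bit defaults; verify against the actual provider output.
function wrapPcmAsWav(
  pcm: Uint8Array,
  sampleRate = 24000,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const ascii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) out[offset + i] = text.charCodeAt(i);
  };

  ascii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // size of everything after this field
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for plain PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  ascii(36, "data");
  view.setUint32(40, pcm.length, true); // number of PCM bytes that follow
  out.set(pcm, 44);
  return out;
}
```

The result is a valid audio attachment, but it is still uncompressed PCM, which is why it does not qualify as a voice note.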

So the short version is:

  1. Gemini native TTS currently produces WAV
  2. WhatsApp auto voice replies want Opus-compatible voice-note audio
  3. OpenClaw does not currently transcode Gemini WAV -> Opus in that path

That’s why your .wav gets generated locally, but the WhatsApp voice-note flow you want does not complete.
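
To make the gating rule concrete, here is a rough sketch. Only the three field names come from the provider result quoted above; the interface and function names are hypothetical and this is not the real reply-path code.

```typescript
// Hypothetical result shape; the three fields mirror the ones quoted from the Google provider.
interface SynthesisResult {
  outputFormat: string;
  fileExtension: string;
  voiceCompatible?: boolean;
}

// Only an explicit `true` qualifies; `false` and `undefined` both fall through.
function shouldSendAsVoiceNote(result: SynthesisResult): boolean {
  return result.voiceCompatible === true;
}

const geminiResult: SynthesisResult = {
  outputFormat: "wav",
  fileExtension: ".wav",
  voiceCompatible: false,
};
const opusResult: SynthesisResult = {
  outputFormat: "ogg",
  fileExtension: ".ogg",
  voiceCompatible: true,
};

console.log(shouldSendAsVoiceNote(geminiResult)); // false
console.log(shouldSendAsVoiceNote(opusResult)); // true
```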

Practical workaround right now:

  • Best: use OpenAI or ElevenLabs for WhatsApp auto voice replies, since those have the Opus voice-message path documented for WhatsApp/Telegram/Matrix/Feishu.
  • If you want to stay on Gemini, you’d need a conversion step to OGG/Opus before send, which does not appear to be built into this path today.
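
If you go the manual conversion route, something like the sketch below could run over the generated file before you send it. It assumes ffmpeg with libopus is on your PATH; the function names and the 32k bitrate are my own choices, not anything OpenClaw ships.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Build ffmpeg arguments for a mono, 48 kHz OGG/Opus file.
// ASSUMPTION: 32k is an arbitrary speech-friendly bitrate, tune as needed.
function buildOpusArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y", // overwrite the output file if it exists
    "-i", inputPath,
    "-vn", // ignore any video/cover-art stream
    "-c:a", "libopus",
    "-b:a", "32k",
    "-ar", "48000",
    "-ac", "1",
    outputPath,
  ];
}

// Convert e.g. tts-hera-XXXX.wav to tts-hera-XXXX.ogg next to the original.
async function convertWavToOggOpus(inputPath: string): Promise<string> {
  const outputPath = inputPath.replace(/\.[^./\\]+$/, "") + ".ogg";
  await execFileAsync("ffmpeg", buildOpusArgs(inputPath, outputPath));
  return outputPath;
}
```

This only produces the .ogg file. Sending it is still a manual step, the same way your .mp3 test was.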

So, answer to your questions:

  • Known limitation? Yes, looks like it.
  • Missing ffmpeg step? Effectively yes, but more precisely: there is no built-in Gemini WAV -> Opus transcode step in this delivery flow.
  • Why no delivery? Because Gemini TTS is treated as not voice-note compatible for WhatsApp.

small lantern
#

Is it possible to change the Gemini native TTS behavior through config keys?

errant flowerBOT
#

Not for the part you care about.

With the built-in Google Gemini TTS provider, the configurable knobs are basically:

  • apiKey
  • baseUrl
  • model
  • voiceName

And via TTS globals you can also change things like:

  • auto
  • provider
  • maxLength
  • summarize
  • summaryModel
  • modelOverrides

But there is no Google TTS config key for:

  • output format
  • codec
  • container
  • sample rate
  • voiceCompatible
  • WhatsApp voice-note / Opus mode

So in current OpenClaw, you cannot flip Gemini from WAV to Opus through config alone.

That behavior is hardcoded in the Google provider path, which returns WAV and marks it as not voice-note compatible. If you want Gemini on WhatsApp voice notes, it would need a code change, basically:

  1. synthesize Gemini PCM
  2. transcode to ogg/opus
  3. return it as voice-compatible
  4. send through the WhatsApp voice-note path

If you want, I can help you with either:

  1. a working provider switch for WhatsApp voice replies now, or
  2. a code-level patch plan for adding Gemini WAV -> Opus support.
small lantern
#

Help me with the code-level patch plan for adding Gemini WAV -> Opus support.

errant flowerBOT
#

Yep. I’d patch this in speech-core, not in the WhatsApp sender.

Recommended approach

Goal: when a provider is asked for target: "voice-note" but returns non-voice-compatible audio (.wav, .mp3, etc.), automatically transcode it to .ogg/Opus before it gets written to disk and sent.

That keeps the fix centralized for WhatsApp, Telegram, Discord, Matrix, Feishu, instead of making Google special.

Files to touch

1. Extract/share the existing Opus helper

There’s already a good implementation here:

  • /home/exedev/openclaw-ref/extensions/discord/src/voice-message.ts
    • ensureOggOpus(...)

I’d move the reusable part into something like:

  • /home/exedev/openclaw-ref/src/media/voice-note.ts

and export it via:

  • /home/exedev/openclaw-ref/src/plugin-sdk/media-runtime.ts

Reuse these existing helpers:

  • /home/exedev/openclaw-ref/src/media/ffmpeg-exec.ts
  • /home/exedev/openclaw-ref/src/media/ffmpeg-limits.ts
  • /home/exedev/openclaw-ref/src/media/temp-files.ts

2. Apply the transcode in TTS synthesis

Patch:

  • /home/exedev/openclaw-ref/extensions/speech-core/src/tts.ts

Right after:

  • const synthesis = await resolvedProvider.provider.synthesize(...)

add logic like:

const needsVoiceNoteTranscode =
  target === "voice-note" && synthesis.voiceCompatible !== true;

const finalSynthesis = needsVoiceNoteTranscode
  ? await maybeTranscodeSpeechToOggOpus(synthesis)
  : synthesis;

Then return finalSynthesis instead of synthesis.
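
If you want to try that branch in isolation first, here is a self-contained version with stubbed types. Everything except the `target` / `voiceCompatible` check is a stand-in: the `Synthesis` shape, `finalizeSynthesis`, and the injected transcoder are hypothetical, not the real speech-core types.

```typescript
type SpeechTarget = "audio-file" | "voice-note";

// Hypothetical stand-in for whatever provider.synthesize(...) resolves to.
interface Synthesis {
  audio: Uint8Array;
  outputFormat: string;
  fileExtension: string;
  voiceCompatible: boolean;
}

// Injected so the branch can be exercised without ffmpeg installed.
type Transcoder = (input: Synthesis) => Promise<Synthesis>;

async function finalizeSynthesis(
  target: SpeechTarget,
  synthesis: Synthesis,
  transcodeToOggOpus: Transcoder,
): Promise<Synthesis> {
  // Same condition as the snippet above: transcode only for voice notes
  // that the provider did not already mark as compatible.
  const needsVoiceNoteTranscode =
    target === "voice-note" && synthesis.voiceCompatible !== true;
  return needsVoiceNoteTranscode ? transcodeToOggOpus(synthesis) : synthesis;
}
```

With a fake transcoder that just flips the extension, `finalizeSynthesis("audio-file", wav, fake)` hands back the original WAV untouched, while the `"voice-note"` target goes through the transcoder.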

What the helper should do

Add a shared helper, something like:

export async function transcodeAudioBufferToOggOpus(params: {
  buffer: Buffer;
  inputExtension: string;
}): Promise<{ buffer: Buffer; fileExtension: ".ogg"; outputFormat: "ogg"; cleanup?: boolean }>

Implementation:

  1. create temp dir
  2. write input buffer to input.wav (or whatever extension came from provider)
  3. call extracted ensureOggOpus(inputPath)
  4. read output .ogg
  5. cleanup temp files
  6. return { buffer, fileExtension: ".ogg", outputFormat: "ogg" }
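
Those steps could look roughly like this. I have not checked the real signature of ensureOggOpus, so the sketch takes it as an injected function and assumes it resolves to the path of the .ogg file it wrote. Treat that parameter, and the temp-dir prefix, as hypothetical.

```typescript
import { mkdtemp, readFile, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// ASSUMPTION: stand-in for the extracted helper; resolves to the output .ogg path.
type EnsureOggOpus = (inputPath: string) => Promise<string>;

async function transcodeAudioBufferToOggOpus(params: {
  buffer: Buffer;
  inputExtension: string;
  ensureOggOpus: EnsureOggOpus;
}): Promise<{ buffer: Buffer; fileExtension: ".ogg"; outputFormat: "ogg" }> {
  // 1. create temp dir
  const dir = await mkdtemp(join(tmpdir(), "tts-voice-note-"));
  try {
    // 2. write the provider audio using its original extension
    const inputPath = join(dir, `input${params.inputExtension}`);
    await writeFile(inputPath, params.buffer);
    // 3. transcode (or pass through if already Opus)
    const outputPath = await params.ensureOggOpus(inputPath);
    // 4. read the result back into memory
    const buffer = await readFile(outputPath);
    // 6. return the voice-note payload
    return { buffer, fileExtension: ".ogg", outputFormat: "ogg" };
  } finally {
    // 5. cleanup runs on success and on failure
    await rm(dir, { recursive: true, force: true });
  }
}
```

Doing the cleanup in `finally` means a failed transcode does not leave temp files behind. If the real helper writes its output outside the temp dir, that file would need its own cleanup.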

Behavior I’d recommend

If target === "voice-note" and transcode succeeds:

  • fileExtension: ".ogg"
  • outputFormat: "ogg"
  • voiceCompatible: true

If transcode fails:

  • Recommended: throw a clear provider error so fallback providers can take over
  • example: Google TTS voice-note transcoding failed: ffmpeg not found...

That is better than silently returning WAV again.
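
To show why throwing helps, here is a toy fallback loop. This is not OpenClaw's actual fallback logic, which I have not traced. The types and the function name are made up for illustration.

```typescript
// Hypothetical provider shape for illustration only.
interface NamedProvider {
  name: string;
  synthesize: (text: string) => Promise<{ fileExtension: string }>;
}

// Try providers in order; a thrown error moves on to the next one.
async function synthesizeWithFallback(
  text: string,
  providers: NamedProvider[],
): Promise<{ provider: string; fileExtension: string }> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const result = await provider.synthesize(text);
      return { provider: provider.name, fileExtension: result.fileExtension };
    } catch (err) {
      // Keep the message so the final error explains every failure.
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${provider.name}: ${message}`);
    }
  }
  throw new Error(`All TTS providers failed: ${errors.join("; ")}`);
}
```

If the transcode step silently returned WAV instead of throwing, a loop like this would never reach the next provider.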

Why not patch WhatsApp?

Because WhatsApp already knows how to send OGG/Opus voice audio:

  • /home/exedev/openclaw-ref/extensions/whatsapp/src/send.ts
  • /home/exedev/openclaw-ref/extensions/whatsapp/src/inbound/send-api.ts

It already maps audio/ogg to:

  • audio/ogg; codecs=opus

So the real missing piece is producing Opus upstream.
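
For reference, the mapping amounts to something like this. It is a hypothetical re-creation for illustration; the real code in the WhatsApp extension may be structured differently.

```typescript
// Hypothetical helper: normalize a MIME type for WhatsApp voice audio.
function toWhatsAppAudioMime(mime: string): string {
  // Compare only the base type, ignoring parameters and case.
  const base = (mime.split(";")[0] ?? "").trim().toLowerCase();
  return base === "audio/ogg" ? "audio/ogg; codecs=opus" : mime;
}

console.log(toWhatsAppAudioMime("audio/ogg")); // audio/ogg; codecs=opus
console.log(toWhatsAppAudioMime("audio/wav")); // audio/wav
```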

Tests to add

Shared helper tests

New file, e.g.:

  • /home/exedev/openclaw-ref/src/media/voice-note.test.ts

Port the existing Discord tests:

  • keeps .ogg when already Opus/48kHz
  • re-encodes .ogg when wrong sample rate
  • re-encodes non-ogg input
  • rejects URL/protocol input
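
Those four cases boil down to one small decision, which is handy to test on its own. A sketch under assumptions: the names are hypothetical, the codec and sample rate would come from an ffprobe-style probe, and this is not the Discord helper's actual code.

```typescript
// Hypothetical probe result for a local audio file.
interface ProbedAudio {
  path: string; // local file path
  codec: string; // e.g. "opus", "pcm_s16le"
  sampleRate: number; // Hz
}

function decideVoiceNoteAction(audio: ProbedAudio): "keep" | "reencode" {
  // Reject URL/protocol input such as "https://..." or "file:...".
  // Two or more scheme characters are required, so "C:\..." still counts as local.
  if (/^[a-z][a-z0-9+.-]+:/i.test(audio.path)) {
    throw new Error(`Refusing non-local input: ${audio.path}`);
  }
  const isOgg = /\.ogg$/i.test(audio.path);
  const isOpus48k = audio.codec === "opus" && audio.sampleRate === 48000;
  return isOgg && isOpus48k ? "keep" : "reencode";
}
```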

Google regression test

Patch:

  • /home/exedev/openclaw-ref/extensions/google/speech-provider.test.ts

Add:

  1. target: "audio-file" still returns:
    • .wav
    • voiceCompatible: false
  2. target: "voice-note" returns:
    • .ogg
    • voiceCompatible: true

Speech-core regression test

Patch:

  • /home/exedev/openclaw-ref/extensions/speech-core/src/tts.test.ts

Add a mock provider that returns .wav + voiceCompatible: false, then verify for channel: "whatsapp":

  • synthesized file becomes .ogg
  • audioAsVoice === true

Smallest possible patch

If you want the fastest fix, do it Google-only in:

  • /home/exedev/openclaw-ref/extensions/google/speech-provider.ts

When req.target === "voice-note":

  1. wrap PCM as WAV
  2. transcode WAV -> OGG/Opus
  3. return voiceCompatible: true

That’s less code, but I think the speech-core fix is cleaner.

Acceptance check

After patching, this should happen on WhatsApp auto-TTS: