OpenClaw audio pipeline question: does OpenRouter Voxtral actually work as a native STT backend? | Friends of the Crustacean 🦞🤝 | Page 1

I am testing Telegram voice-note transcription in OpenClaw and I have narrowed the issue down pretty far.

Setup:

OpenClaw 2026.4.2
Telegram direct chat
tools.media.audio.enabled = true
echoTranscript = true
audio model chain:
provider: "openrouter", model: "mistralai/voxtral-small-24b-2507"
local whisper CLI fallback

What happens:

Telegram voice messages arrive correctly.
The local Whisper fallback works end-to-end.
The transcript echo I see in successful runs has Whisper-style timestamps like:
"[00:00.000 --> 00:03.000] Antworten nur mit dem Wort Banane."
So in practice it looks like OpenClaw is skipping the OpenRouter/Voxtral step and falling through to Whisper.

Important detail:

I tested the same OGG file directly against OpenRouter with Voxtral and it works.
So Voxtral itself is fine.
But inside OpenClaw, I cannot prove that provider: "openrouter" is actually being used for audio transcription.

I also tried adding Voxtral to models.providers.openrouter.models with input: ["text", "audio"], but config validation rejects "audio" there and only accepts "text" / "image".

Question:
Is provider: "openrouter" officially supported for audio transcription inside tools.media.audio.models?

If yes, what is the correct config pattern?

If no, is the intended supported path for Voxtral audio currently only provider: "mistral" rather than provider: "openrouter"?

#OpenClaw audio pipeline question: does OpenRouter Voxtral actually work as a native STT backend?