I am testing Telegram voice-note transcription in OpenClaw and I have narrowed the issue down pretty far.
Setup:
OpenClaw 2026.4.2
Telegram direct chat
tools.media.audio.enabled = true
echoTranscript = true
audio model chain:
provider: "openrouter", model: "mistralai/voxtral-small-24b-2507"
local whisper CLI fallback
What happens:
Telegram voice messages arrive correctly.
The local Whisper fallback works end-to-end.
The transcript echo I see in successful runs has Whisper-style timestamps like:
"[00:00.000 --> 00:03.000] Antworten nur mit dem Wort Banane."
So in practice it looks like OpenClaw is skipping the OpenRouter/Voxtral step and falling through to Whisper.
Important detail:
I tested the same OGG file directly against OpenRouter with Voxtral and it works.
So Voxtral itself is fine.
But inside OpenClaw, I cannot prove that provider: "openrouter" is actually being used for audio transcription.
I also tried adding Voxtral to models.providers.openrouter.models with input: ["text", "audio"], but config validation rejects "audio" there and only accepts "text" / "image".
Question:
Is provider: "openrouter" officially supported for audio transcription inside tools.media.audio.models?
If yes, what is the correct config pattern?
If no, is the intended supported path for Voxtral audio currently only provider: "mistral" rather than provider: "openrouter"?