# OpenClaw using image tool for audio

west prawn
#

I have been trying to set up local STT using faster-whisper, which is OpenAI-compatible, and as far as I can tell it is configured correctly in the JSON.

But when I send an audio message via either WhatsApp or Telegram, OpenClaw tries to use the image tool to decode it.

I am at a loss

"audio": {
  "enabled": true,
  "echoTranscript": true,
  "echoFormat": "📝 {transcript}",
  "attachments": {
    "mode": "all",
    "maxAttachments": 1
  },
  "maxBytes": 20971520,
  "models": [
    {
      "provider": "openai",
      "model": "base",
      "providerOptions": {
        "baseUrl": {
          "default": "http://127.0.0.1:10300/v1"
        },
        "apiKey": {
          "default": "sk-not-needed"
        }
      }
    }
  ]
}
merry lilyBOT
#

Let me check the actual docs and source path, this smells like either a config shape mismatch or a code path bug.

#

I think I found the mismatch.

What you pasted is very close, but OpenClaw is probably not reading those fields for audio STT.

Two big gotchas:

  • it needs to live under tools.media.audio, not just media.audio
  • for OpenAI-compatible audio, OpenClaw reads baseUrl from tools.media.audio.baseUrl or tools.media.audio.models[].baseUrl, and auth comes from normal provider auth resolution, not providerOptions.apiKey

So providerOptions.baseUrl / providerOptions.apiKey in that block are very likely being ignored for this path.

That would explain the behavior:

  1. audio transcription never initializes correctly
  2. the audio stays attached
  3. the reply model then misfires and tries the image tool on it

Try this shape instead:

{
  tools: {
    media: {
      audio: {
        enabled: true,
        echoTranscript: true,
        echoFormat: "📝 {transcript}",
        attachments: { mode: "all", maxAttachments: 1 },
        maxBytes: 20971520,
        models: [
          {
            provider: "openai",
            model: "base",
            baseUrl: "http://127.0.0.1:10300/v1"
          }
        ]
      }
    }
  },

  models: {
    providers: {
      openai: {
        apiKey: "sk-not-needed"
      }
    }
  }
}

If you already use OpenAI elsewhere, say that before adding the global openai.apiKey bit, because that part should be handled a little differently.
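The two gotchas above can be checked mechanically before restarting anything. This is only a rough sketch: it assumes the config has already been loaded into a Python dict, the key paths are the ones described in this thread rather than a verified OpenClaw schema, and `find_audio_config_problems` is a hypothetical helper, not part of OpenClaw.

```python
from typing import Any


def find_audio_config_problems(config: dict[str, Any]) -> list[str]:
    """Return human-readable problems with the audio STT config shape.

    Assumption: the expected layout (tools.media.audio, baseUrl on the model
    entry or on the audio block) comes from this thread, not a verified schema.
    """
    problems: list[str] = []
    audio = config.get("tools", {}).get("media", {}).get("audio")
    if audio is None:
        problems.append("no tools.media.audio block found")
        # Fall back to a misplaced block so the model checks still run.
        audio = config.get("audio") or config.get("media", {}).get("audio") or {}
    for index, model in enumerate(audio.get("models", [])):
        options = model.get("providerOptions", {})
        for key in ("baseUrl", "apiKey"):
            if key in options:
                problems.append(
                    f"models[{index}].providerOptions.{key} is likely ignored"
                )
        has_base_url = "baseUrl" in model or "baseUrl" in audio
        if model.get("provider") == "openai" and not has_base_url:
            problems.append(f"models[{index}] has no baseUrl")
    return problems
```

Run against the original snippet, this reports the misplaced block, both `providerOptions` fields, and the missing `baseUrl`; run against the corrected shape it returns an empty list.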

#

If you paste your full tools.media.audio block plus version, I can sanity-check the exact config.

west prawn
#

❯ openclaw infer audio transcribe --file /home/neocrypter/.openclaw/workspace/voice.ogg --model openai/base --json
Error: No transcript returned for audio: /home/neocrypter/.openclaw/workspace/voice.ogg

merry lilyBOT
#

Nice, that narrows it down a lot. Your audio path is being hit now, but the STT endpoint is not returning the response shape OpenClaw expects.

#

That’s actually useful, because it means OpenClaw is hitting the audio transcription path now.

No transcript returned for audio means:

  • the request likely reached your STT server
  • the server responded without a non-empty top-level text field

OpenClaw expects the OpenAI transcription shape from POST /v1/audio/transcriptions, basically:

{ "text": "hello world" }

So this is probably not the image tool bug anymore. It’s now an endpoint compatibility / codec issue.
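To make "non-empty top-level text" concrete, here is a minimal sketch of that check in Python. It is not OpenClaw's actual code, just an illustration of the rule described above; `extract_transcript` is a hypothetical name.

```python
import json


def extract_transcript(body: str) -> str:
    """Return the transcript from an OpenAI-style transcription response body.

    Raises ValueError when the body is not a JSON object with a non-empty
    top-level "text" string.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("response JSON is not an object")
    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        # Listing the other keys shows whether the server used a different
        # field name (for example "segments") instead of "text".
        other_keys = sorted(key for key in payload if key != "text")
        raise ValueError(f"no non-empty top-level text; other keys: {other_keys}")
    return text.strip()
```

Feeding it the raw body from the STT server shows quickly whether the failure is an empty `text`, a different field name, or a body that is not JSON at all.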

Next check

Hit your faster-whisper endpoint directly with the exact same file:

#
curl -sS \
  -H "Authorization: Bearer sk-not-needed" \
  -F "file=@/home/neocrypter/.openclaw/workspace/voice.ogg" \
  -F "model=base" \
  http://127.0.0.1:10300/v1/audio/transcriptions

If that response does not contain a non-empty text, OpenClaw is doing the right thing and your server is the mismatch.

Very common gotcha

WhatsApp/Telegram voice notes are usually OGG/Opus.

A lot of “OpenAI-compatible” Whisper wrappers are only partially compatible and choke on OGG/Opus, or return segments without text.

So also try a WAV as a control:

#
ffmpeg -i /home/neocrypter/.openclaw/workspace/voice.ogg /tmp/voice.wav
curl -sS \
  -H "Authorization: Bearer sk-not-needed" \
  -F "file=@/tmp/voice.wav" \
  -F "model=base" \
  http://127.0.0.1:10300/v1/audio/transcriptions

If WAV works and OGG doesn’t, the problem is your faster-whisper server’s codec handling, not OpenClaw.

One more thing

Keep your config with baseUrl on the media model entry, not inside providerOptions. That part is still wrong in the original snippet.

If you paste:

  • openclaw --version
  • the raw JSON from that curl

I can tell you pretty quickly whether this is:

  1. wrong response shape,
  2. OGG/Opus decode failure,
  3. or a version-specific OpenClaw bug.
west prawn
#
  libavcodec     62. 28.100 / 62. 28.100
  libavformat    62. 12.100 / 62. 12.100
  libavdevice    62.  3.100 / 62.  3.100
  libavfilter    11. 14.100 / 11. 14.100
  libswscale      9.  5.100 /  9.  5.100
  libswresample   6.  3.100 /  6.  3.100
[in#0 @ 0x56442125ae80] Format ogg detected only with low score of 1, misdetection possible!
[in#0 @ 0x56442125ab80] Error opening input: End of file
Error opening input file /home/neocrypter/.openclaw/workspace/voice2.ogg.
Error opening input files: End of file
curl: (26) Failed to open/read local data from file/application
merry lilyBOT
#

Ah, that’s the real clue.

ffmpeg is telling you the input file itself is bad:

  • Error opening input: End of file
  • curl: (26) is just fallout because the WAV never got created

So right now the root problem is not STT config and not really the image tool either. The audio file you’re feeding into STT looks truncated / corrupt / not actually valid OGG audio.

That also explains the weird image-tool behavior: STT fails, the original attachment stays in play, and the model does something dumb with it.

Check the file itself

Run:

ls -lh /home/neocrypter/.openclaw/workspace/voice*.ogg
file /home/neocrypter/.openclaw/workspace/voice*.ogg
xxd -l 32 /home/neocrypter/.openclaw/workspace/voice*.ogg

What I’d expect from a real voice note is something like Ogg/Opus.
If file just says data, or the size is tiny, or the header looks wrong, then OpenClaw is being handed garbage/incomplete bytes.
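If `file` or `xxd` are not handy, the same header check can be sketched in a few lines of Python. This only looks at the container magic bytes (`OggS` for Ogg, `RIFF`...`WAVE` for WAV), so it is a rough stand-in for `file`, not a real decoder test; `sniff_audio_container` is a hypothetical helper name.

```python
def sniff_audio_container(path: str) -> str:
    """Guess the audio container from the first 12 bytes of the file."""
    with open(path, "rb") as handle:
        header = handle.read(12)
    if len(header) < 12:
        # A real voice note is far larger than this; treat it as truncated.
        return "too short"
    if header[:4] == b"OggS":
        return "ogg"
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    return "unknown"
```

A result of "too short" or "unknown" for a supposed voice note points at the download or temp-file step rather than at the STT server.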

Important isolation test

Try your STT setup with a known-good local WAV/MP3:

openclaw infer audio transcribe --file /path/to/known-good.wav --model openai/base --json
  • if that works, your faster-whisper endpoint is probably fine, and the inbound Telegram/WhatsApp file is the broken part
  • if that still fails, the endpoint is still not fully compatible
#

Please paste these too

openclaw --version
openclaw logs --follow

Then send one voice note and grab the lines around media download/transcription.

If you paste the output of file, file size, and a short log snippet, I can help narrow down whether this is:

  1. bad channel download,
  2. corrupt temp file,
  3. or an actual OpenClaw bug.
west prawn
#

❯ openclaw infer audio transcribe --file /home/neocrypter/.openclaw/workspace/voice4.wav --model openai/base --json
Error: No transcript returned for audio: /home/neocrypter/.openclaw/workspace/voice4.wav

merry lilyBOT
#

If voice4.wav is a fresh, known-good speech file, then we’re down to one very specific issue:

your STT server is returning JSON, but not a non-empty top-level text field.

That is exactly what OpenClaw checks for.

Two important notes:

  1. if voice4.wav was made from the earlier broken .ogg, that test is still contaminated
  2. your original config snippet is still wrong for OpenClaw, even aside from this, because baseUrl/apiKey do not belong in providerOptions for this path

Best next step

Please hit the STT server directly and paste the raw response body:

curl -sS -i \
  -H "Authorization: Bearer sk-not-needed" \
  -F "file=@/path/to/a-known-good-speech.wav" \
  -F "model=base" \
  http://127.0.0.1:10300/v1/audio/transcriptions

I really want the raw response here, because one of these is probably happening:

  • it returns segments / result / transcript instead of text
  • it returns "text": ""
  • it’s only “sort of” OpenAI-compatible

For OpenClaw, this needs to look like:

{ "text": "hello world" }

Also tell me

  • openclaw --version
  • which server you’re using exactly (speaches, faster-whisper-server, whisper-asr-webservice, something else?)

If the raw response has text, then I’m worried this is an OpenClaw version bug.
If it doesn’t, then the server compatibility is the blocker.

west prawn
#

Oops

merry lilyBOT
#

Yep, that curl error is just because /path/to/a-known-good-speech.wav was a placeholder, not a real file path.

curl: (26) means it never even reached your STT server.

Try with an actual file that exists, for example:

TEST=/home/neocrypter/Downloads/test.wav
ls -lh "$TEST"
file "$TEST"

curl -sS -i \
  -H "Authorization: Bearer sk-not-needed" \
  -F "file=@$TEST" \
  -F "model=base" \
  http://127.0.0.1:10300/v1/audio/transcriptions
#

A few important bits:

  • use a real local file
  • make sure it’s a file that plays correctly
  • don’t use the earlier broken .ogg
  • WAV/MP3/M4A are all fine for this check

If you don’t have a good test file handy, you can even use the WAV you already tried, but only if it’s known-good and not derived from the corrupt one.

Paste the raw response body once you have it. That’s the piece we need now.

#

No worries 😄

When you have it, just paste the output from the same curl command with a real file path. The raw response body is the missing piece.

If you want, use this exact pattern:

TEST="/full/path/to/real-file.wav"

ls -lh "$TEST" && file "$TEST"

curl -sS -i \
  -H "Authorization: Bearer sk-not-needed" \
  -F "file=@$TEST" \
  -F "model=base" \
  http://127.0.0.1:10300/v1/audio/transcriptions
#

Once you paste that output, I can tell you pretty quickly whether it’s:

  • server response shape,
  • bad audio input,
  • or OpenClaw config/version weirdness.
west prawn
#

This is a known-good WAV file. I recorded it in Audacity and exported it as WAV.

curl -sS -i \
  -H "Authorization: Bearer sk-not-needed" \
  -F "file=@/home/neocrypter/.openclaw/workspace/voice4" \
  -F "model=base" \
  http://127.0.0.1:10300/v1/audio/transcriptions
curl: (26) Failed to open/read local data from file/application

merry lilyBOT
#

I see it, the curl path is missing the .wav extension.

You tested:

file=@/home/neocrypter/.openclaw/workspace/voice4

but earlier your file was:

/home/neocrypter/.openclaw/workspace/voice4.wav

So curl still never reached the server. curl: (26) is just “can’t open that local file”.

Try this exact command:

#
curl -sS -i \
  -H "Authorization: Bearer sk-not-needed" \
  -F "file=@/home/neocrypter/.openclaw/workspace/voice4.wav" \
  -F "model=base" \
  http://127.0.0.1:10300/v1/audio/transcriptions

If you want to sanity-check first:

ls -lh /home/neocrypter/.openclaw/workspace/voice4.wav
file /home/neocrypter/.openclaw/workspace/voice4.wav

Paste the raw response once that runs. We’re very close now.

west prawn
#

ls -lh /home/neocrypter/.openclaw/workspace/voice4.wav
file /home/neocrypter/.openclaw/workspace/voice4.wav
Permissions Size User Date Modified Name
.rw-r--r-- 858k neocrypter 13 Apr 20:09  /home/neocrypter/.openclaw/workspace/voice4.wav
/home/neocrypter/.openclaw/workspace/voice4.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz

merry lilyBOT
#

Perfect, that’s a real valid WAV. Good, we can stop blaming the file now.

So if openclaw infer audio transcribe still says no transcript on that file, the next thing we need is the raw HTTP response from your STT server.

Run this exact command now:

curl -i \
  -H "Authorization: Bearer sk-not-needed" \
  -F "file=@/home/neocrypter/.openclaw/workspace/voice4.wav" \
  -F "model=base" \
  http://127.0.0.1:10300/v1/audio/transcriptions

Then paste the output.

At this point the likely causes are basically down to:

  1. the server returns JSON without a top-level text
  2. the server returns empty "text": ""
  3. the server isn’t actually OpenAI-compatible on /audio/transcriptions

Also, separate from that, your OpenClaw config still needs baseUrl moved out of providerOptions, but let’s get the raw endpoint response first.

west prawn
#

Hmm, what is a known-good local OpenAI-compatible model I can use to test?

merry lilyBOT
#

Honestly, I would not change models yet. The model id is probably not the blocker here.

For this test, if your server is really OpenAI-compatible, then:

  • model=base should be fine if that server exposes it
  • the important thing is that POST /v1/audio/transcriptions returns a non-empty top-level text

So the best test is still:

  1. use the model name your server docs say to use
  2. hit the endpoint directly with curl
  3. inspect the raw response

If you want a known-good local baseline inside OpenClaw, I’d actually use the supported CLI fallback first, not another OpenAI-compatible wrapper:

{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { type: "cli", command: "whisper", args: ["--model", "base", "{{MediaPath}}"] }
        ]
      }
    }
  }
}

or whisper-cli if that’s what you have installed.

#

That gives you a clean sanity check:

  • if CLI transcription works, your audio pipeline is fine and the problem is your HTTP wrapper
  • if CLI also fails, then it’s a broader audio/media issue

If you specifically want community recommendations for local OpenAI-compatible STT servers/models, ask in #models. That stuff changes fast and I don’t want to make up a stale recommendation here.

west prawn
#

I don't have an OpenAI account to use their cloud version. Is that not what whisper-cli uses?

merry lilyBOT
#

Nope, whisper-cli is local, not OpenAI cloud.

Two different things here:

  • provider: "openai" in OpenClaw means “use an OpenAI-style HTTP API”, which can be OpenAI or a local OpenAI-compatible server
  • whisper-cli is a local CLI binary (usually from whisper.cpp), so no OpenAI account is needed
  • whisper (Python CLI) is also local execution, no OpenAI account needed, but it downloads models locally

So if your goal is just “I want local STT and I do not have OpenAI”, then the simplest sanity check is to use CLI transcription first.

Example:

{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          {
            type: "cli",
            command: "whisper-cli",
            args: ["-m", "/path/to/model.bin", "-f", "{{MediaPath}}"]
          }
        ]
      }
    }
  }
}

or with Python Whisper:

{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"]
          }
        ]
      }
    }
  }
}

If that works, then your audio pipeline is fine and the problem is specifically your OpenAI-compatible faster-whisper server setup.

Docs: https://docs.openclaw.ai/nodes/audio

If you want, I can help you write the exact local-only tools.media.audio block next.

west prawn
#

yes please

merry lilyBOT
#

Yep, I’d do a clean local-only test first.

Recommended, use Python whisper CLI

This is the simplest known-good config from the docs:

{
  tools: {
    media: {
      audio: {
        enabled: true,
        echoTranscript: true,
        echoFormat: "📝 {transcript}",
        attachments: {
          mode: "all",
          maxAttachments: 1,
        },
        maxBytes: 20971520,
        timeoutSeconds: 120,
        models: [
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
          },
        ],
      },
    },
  },
}
#

If you already have whisper.cpp / whisper-cli

OpenClaw can auto-detect that too, but for a first sanity check I’d still start with the explicit whisper config above because it’s the least ambiguous.

Important fixes vs your original snippet

  • put it under tools.media.audio
  • remove the provider: "openai" block completely for this test
  • don’t use providerOptions.baseUrl/apiKey here

What to do next

  1. drop in the config above
  2. restart the gateway
  3. send a voice note
  4. see whether you get the transcript echo

If that works, your audio pipeline is fine and the problem was your OpenAI-compatible faster-whisper setup. If it still fails, paste: