'm using OpenClaw with Ollama + Gemma4 (multimodal, confirmed working via direct API). When I send a photo via Telegram, OpenClaw always falls back to Google Search instead of passing the image directly to Gemma4. I have input: ["text","image"] and imageModel.primary: ollama/gemma4 configured. Is there a known fix for Telegram → Ollama vision routing?
#When I send a photo via Telegram, OpenClaw always falls back to Google Search
1 messages · Page 1 of 1 (latest)
Alexander, first thing: can you share openclaw --version? This routing has changed recently enough that version matters.
From current docs/source, the intended behavior is:
- Telegram media is downloaded/staged as inbound media.
- If the active reply model is marked vision-capable, OpenClaw skips the image-summary step and passes the original image into the model context.
- If the active reply model is not vision-capable, OpenClaw uses
agents.defaults.imageModelfor image understanding. - For Ollama, vision is only trusted when the model catalog/provider entry says
input: ["text", "image"], usually discovered from Ollama/api/showreportingvision.
Most likely checks:
openclaw --version
openclaw models list --provider ollama
openclaw infer image describe --file ./photo.jpg --model ollama/gemma4 --json
openclaw logs --follow
If you manually configured models.providers.ollama, auto-discovery is skipped, so make sure the exact model entry has:
{
id: "gemma4",
name: "gemma4",
input: ["text", "image"],
contextWindow: 128000,
maxTokens: 8192,
}
And the image model config should be under:
{
agents: {
defaults: {
imageModel: {
primary: "ollama/gemma4",
},
},
},
}
Docs I checked:
https://docs.openclaw.ai/providers/ollama
https://docs.openclaw.ai/nodes/media-understanding
https://docs.openclaw.ai/concepts/models
If infer image describe works but Telegram still triggers web/Google search, please paste the version plus the relevant log lines around the Telegram message/media-understanding decision. That sounds issue-worthy if the image attachment is being staged but not injected.
OpenClaw 2026.5.12 (f066dd2)
Thanks. 2026.5.12 is the stable tag, so next I’d narrow this to “catalog says Gemma4 is vision-capable?” vs “Telegram media is not being staged/injected.”
Please run:
openclaw models list --provider ollama
openclaw infer image describe --file ./photo.jpg --model ollama/gemma4 --json
openclaw logs --follow
What I’m looking for:
- In
models list,ollama/gemma4needs to show image/vision input. If it only shows text, OpenClaw will not inject the Telegram photo into that model. - If you have a manual
models.providers.ollamablock, auto-discovery is skipped, so the explicitgemma4model entry must includeinput: ["text", "image"]. - If
infer image describeworks but Telegram still searches, paste the media-understanding / Telegram log lines around that inbound photo. That would point to the Telegram attachment path not reaching the agent turn.
Also check that your active reply model is actually the Ollama vision model if you want direct native injection:
{
agents: {
defaults: {
model: { primary: "ollama/gemma4" },
imageModel: { primary: "ollama/gemma4" },
},
},
}
Docs for the exact behavior:
https://docs.openclaw.ai/providers/ollama
https://docs.openclaw.ai/nodes/media-understanding
Model Input Ctx Local Auth Tags
ollama/gemma4 text+image 125k no yes default,image,configured
Good, that rules out the common catalog/config problem: OpenClaw sees ollama/gemma4 as text+image and it’s tagged as both default and image.
Next useful split:
openclaw infer image describe --file ./photo.jpg --model ollama/gemma4 --json
If that fails, it’s Ollama/provider-side despite the catalog saying vision.
If that succeeds, then the remaining suspect is the Telegram inbound path. Please grab the log lines immediately after sending one Telegram photo:
openclaw logs --follow
Look for entries mentioning telegram, MediaPath / MediaPaths, media-understanding, or Skipping image understanding: primary model supports vision natively.
Expected path for your config is: Telegram downloads/stages the image, OpenClaw sees the active model supports vision, skips pre-describing it, then injects the original image into the model turn. If you instead see Google Search/web search, either the image never reached the turn as media, or Gemma is deciding to call search after seeing only text/context.
Found it. Log shows:
Inbound message telegram:... (direct, image/jpeg, 13 chars)
followed immediately by:
Native image: dropped 1 image(s) after sanitization (prompt:images)
Image reaches the gateway but is dropped before the model turn. Is this a known issue with Ollama vision models in 2026.5.12?
I don’t think that log points to an Ollama-vision-specific failure. In current source, Native image: dropped ... after sanitization (prompt:images) is emitted after OpenClaw has already loaded the image into native image blocks, then sanitizeImageBlocks() filters it out before the model call.
That drop usually means one of these happened:
- empty image payload
- invalid/non-canonical base64
- image resize/metadata processing failed, or it could not be reduced under the image limits
For Ollama specifically, OpenClaw only marks the model as vision-capable if /api/show reports capabilities: ["vision", ...]; then the native Ollama transport sends base64 via the images field to /api/chat. So if it got as far as prompt:images, I’d look at the sanitizer/media path first, not model fallback.
Useful checks:
openclaw --version
openclaw models status
ollama show <model-name> --json | jq '.capabilities'
openclaw logs --follow
Then reproduce once and look for a nearby line from agents/tool-images, especially omitted image payload: ..., Image resize failed to fit limits, or invalid base64. If that line says invalid/resize failure, it’s likely worth filing with the minimal log excerpt + image type/size here: https://github.com/openclaw/openclaw/issues
Docs for the model/image fallback bits: https://docs.openclaw.ai/concepts/models
Found the exact log sequence:
Inbound message telegram:... (direct, image/jpeg, 13 chars)
Native image: dropped 1 image(s) after sanitization (prompt:images)
The image reaches the gateway as image/jpeg but is dropped during sanitization before the model call. No lines about omitted image payload, resize failed, or invalid base64 visible. Version 2026.5.12, Ollama gemma4 with vision capability confirmed. Is this a known sanitization issue with Telegram images?"
Yes, there’s a known/very similar report: https://github.com/openclaw/openclaw/issues/79750
That one was Telegram + native image drop with the exact Native image: dropped 1 image(s) after sanitization (prompt:images) line, though it was reported against Gemini on 2026.5.7. So I’d treat your Ollama/Gemma case as likely the same class of regression unless the logs prove otherwise.
One nuance: image/jpeg, 13 chars is probably the inbound text/caption length, not the image byte size. The important part is the later sanitizer drop.
Also, the lack of omitted image payload... lines makes sense from the current code: sanitizeImageBlocks() converts failures into text blocks internally, then sanitizeImageBlocks() filters those text blocks out and only logs the high-level dropped count. So the useful failure reason can be swallowed by that path.
Best next move: add your 2026.5.12 + Ollama/Gemma details to that issue, or open a new bug if maintainers prefer separate providers. Include:
openclaw --versionopenclaw models statusollama show <model> --json | jq '.capabilities'- the exact three log lines around inbound media + sanitizer drop
- whether sending the same image as a Telegram document/file behaves differently than as a photo
If you can briefly run with debug logs, that may also help: OPENCLAW_LOG_LEVEL=debug openclaw logs --follow or set logging.level: "debug". Logging docs: https://docs.openclaw.ai/logging
Thanks, that issue has the right shape. One small thing I’d add there if you can: the actual model id as Ollama reports it (ollama list / ollama show ...) and whether ollama show <model> --json includes "vision" in capabilities.
The maintainer-useful distinction is: OpenClaw config says input: ["text", "image"], but does the Ollama /api/show capability probe also report vision for that exact model id? That narrows whether this is capability detection vs image sanitization/drop after load.
Better?
Better, yes. I’d still fill two gaps so it’s easier to action:
- Logs, screenshots, and evidence: paste the exact log block there, even if it’s only 2-3 lines.
- Additional provider/model setup: add the exact Ollama model id plus
ollama show <model> --json | jq '.capabilities'output.
Right now the issue says Gemma4 (8B Q4_K_M) and that config has input: ["text", "image"], but the most useful proof is the raw Ollama capability result for that exact model. I’d also add the install method if you know it.
Better?
I’m still seeing the old version on GitHub from here: install method empty, provider/model setup empty, and no log block under evidence. Could be GitHub/readability cache, or the edit didn’t save.
If it did save on your side, the target version is basically:
- install method filled in
- exact log block pasted under evidence
- exact Ollama model id
ollama show <model> --json | jq '.capabilities'output
Once those are in, yes, that’s a solid bug report.
I posted it into comments