#How to add vision/image analysis and YouTube support to OpenClaw?

1 messages · Page 1 of 1 (latest)

delicate token
#

Hi, I'm running OpenClaw on WSL2 with local Ollama models.

Two things I want to add:

  1. Image/screenshot analysis - my agent can't currently analyze images
  2. YouTube/video content analysis

I have qwen3-vl:4b-instruct running locally which should support vision.

What's the recommended way to enable these capabilities? Any skills or config changes needed?

deft flameBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

sick thunderBOT
violet lintelBOT
#

Let me check the current docs and source so I don’t make up config keys.

#

Short version, image analysis yes, YouTube/video is a separate thing.

  1. Image / screenshot analysis
    • You do not need a skill for this.
    • Recommended path is to use the native Ollama provider, not Ollama’s /v1 OpenAI-compatible URL.
    • Then either:
      • let OpenClaw auto-detect the vision model, or
      • set it explicitly as the image model:
openclaw models list --provider ollama
openclaw models set-image ollama/qwen3-vl:4b-instruct
openclaw models status

Important gotcha:

  • If you configured models.providers.ollama manually, auto-discovery is skipped.
#
  • In that case your model entry must be marked as image-capable, otherwise OpenClaw will treat it as text-only.
  • Also use baseUrl: "http://...:11434" with api: "ollama", not /v1.
  1. Video / YouTube analysis
    • Uploaded video clips: built-in media understanding can handle these, but current provider support is for Google, Qwen (DashScope), and Moonshot. This is not the same as local Ollama Qwen3-VL.
    • YouTube URLs specifically: there is no turnkey built-in “paste YouTube link and summarize” flow in core OpenClaw right now.

So the practical recommendation is:

  • For screenshots/images: use your local Ollama qwen3-vl:4b-instruct via agents.defaults.imageModel.
  • For video files: use a supported remote video-understanding provider.
  • For YouTube links: add a custom skill/plugin that fetches captions or downloads/transcribes the video, then feeds that text/video into the agent.

Docs:

#

If you want, paste:

openclaw --version
openclaw models status

and I can tell you exactly what to change in your config.

delicate token
#

My primary model is myclaw/minimax-m2.7. Does it support image/vision input? If not, how do I configure a separate image model for vision tasks?

violet lintelBOT
#

Probably not by default, at least not in the way OpenClaw routes vision.

Two cases:

  1. If myclaw/minimax-m2.7 is the bundled MiniMax provider

    • M2.7 is generally treated as the text chat model.
    • MiniMax image understanding is exposed separately via MiniMax-VL-01.
    • So I would not rely on M2.7 as your vision model.
  2. If myclaw/minimax-m2.7 is your own custom provider/model entry

    • OpenClaw only treats it as vision-capable if that model is declared with:
    input: ["text", "image"]
    
    • Otherwise it is treated as text-only.

Best setup for your case

#

Keep your main text model as-is, and set a separate image model:

openclaw models set-image ollama/qwen3-vl:4b-instruct

Or in config:

{
  agents: {
    defaults: {
      model: {
        primary: "myclaw/minimax-m2.7",
      },
      imageModel: {
#
        primary: "ollama/qwen3-vl:4b-instruct",
      },
    },
  },
}

That is exactly what imageModel is for. OpenClaw uses it when the active/default model cannot accept image input.

Important Ollama note

For the Ollama side, use the native Ollama API, not /v1:

{