#Desktop automation options and discussion about the accessibility tree


gray harbor
#

Does the claw support UI use through the accessibility tree? I want to give him more freedom beyond terminal use, without using vision. It's a lot cheaper. I wanted to know whether I can teach my claw to do that, or whether there are compatibility issues and he will get confused.

gray harbor
#

To clarify, Claw is running on WSL, with interop enabled.

tacit orchidBOT
#

For browser/web UI automation, mostly yes.

OpenClaw’s browser tool can work without vision:

  • browser snapshot --format aria returns the accessibility tree
  • action flows use snapshot refs plus browser act
  • under the hood, refs resolve via Playwright aria-ref / getByRole(...)

A couple important limits:

  • Pure ARIA snapshot is inspection-only right now, no action refs.
  • For actual clicking/typing without vision, the better path is role/interactive snapshots (snapshot --interactive / role refs), which are still non-vision and cheaper than screenshot-based control.
  • I do not see a documented first-class tool for arbitrary native desktop app automation via the OS accessibility tree yet. The documented support is clearly strongest for the browser tool.

Will it get confused?

  • Less than vision on well-structured sites, usually.
  • More trouble on badly-labeled UIs, canvas-heavy apps, or weird custom controls.
  • Re-snapshot often; refs are not stable across navigations/page changes.
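
To make the "re-snapshot often" point concrete, here is a minimal sketch of the pattern: look the control up by role and name in a fresh snapshot right before every action, instead of reusing a ref from an older snapshot. `FakePage`, its ref format, and `click_by_role` are hypothetical stand-ins invented for illustration; they are not OpenClaw or Playwright APIs.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Stand-in for a browser page; NOT an OpenClaw or Playwright API."""
    controls: list[tuple[str, str]]          # (role, accessible name)
    generation: int = 0                      # bumps on every page change
    clicked: list[str] = field(default_factory=list)

    def snapshot(self) -> dict[str, tuple[str, str]]:
        # Refs embed the generation, so refs from older snapshots go stale.
        return {f"e{self.generation}-{i}": c for i, c in enumerate(self.controls)}

    def click(self, ref: str) -> None:
        snap = self.snapshot()
        if ref not in snap:
            raise KeyError(f"stale ref: {ref}")
        self.clicked.append(snap[ref][1])
        self.generation += 1                 # assume any click may change the page


def click_by_role(page: FakePage, role: str, name: str) -> None:
    """Take a fresh snapshot, find the ref by role and name, then act on it."""
    for ref, (r, n) in page.snapshot().items():
        if (r, n) == (role, name):
            page.click(ref)
            return
    raise LookupError(f"no {role} named {name!r}")
```

Calling `click_by_role(page, "button", "Next")` twice in a row works because each call resolves a new ref, while a ref saved before the first click would raise `KeyError` in this toy model.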
#

So my short take: yes for browser UI, no clear documented “general desktop accessibility tree automation” path yet.

If you want, I can sketch the best non-vision setup for giving your claw more browser freedom.

#

Ah, yeah, that changes it a bit.

With OpenClaw in WSL2, the supported non-vision path is basically:

  • OpenClaw/Gateway in WSL2
  • browser running on Windows
  • control it via remote CDP

So for browser UI, yes, this is still very viable and cheap.
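
A quick way to sanity-check the remote CDP leg from inside WSL2 is to ask the Windows browser for its DevTools endpoint. The sketch below assumes Chrome was started on Windows with `--remote-debugging-port=9222` and that the port is reachable from WSL2, which depends on your networking mode and firewall. The host address and port are assumptions you would supply yourself, and how OpenClaw is pointed at the resulting URL is not shown here.

```python
import json
from urllib.request import urlopen


def ws_url_from_version(payload: str) -> str:
    """Extract the browser-level WebSocket URL from a /json/version response."""
    info = json.loads(payload)
    return info["webSocketDebuggerUrl"]


def fetch_ws_url(host: str, port: int = 9222, timeout: float = 3.0) -> str:
    """Ask a Chrome started with --remote-debugging-port for its CDP endpoint."""
    # host is an assumption: whatever address the Windows side has from WSL2.
    with urlopen(f"http://{host}:{port}/json/version", timeout=timeout) as resp:
        return ws_url_from_version(resp.read().decode("utf-8"))
```

If `fetch_ws_url(...)` times out or is refused, the problem is in the WSL2-to-Windows network path rather than in the claw itself.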

Important gotchas:

  1. Use raw remote CDP, not user / existing-session
    The docs are explicit that host-local Chrome attach is not a WSL2-to-Windows bridge.

  2. snapshot --format aria is inspection-only
    Good for reading the accessibility tree, but not for action refs.

#

For actual non-vision clicking/typing, use role/interactive snapshots. That is still accessibility-style control, not pixel vision.

  3. If you mean native Windows desktop apps through the OS accessibility tree, I do not see a documented first-class OpenClaw path for that yet.
    WSL interop by itself would not make that magically work.

So the short answer is:

  • Browser UI from WSL2: yes
  • General Windows accessibility-tree desktop automation from WSL2: not clearly supported/documented yet

If you want, I can give you the exact setup shape I’d use for a WSL2 claw that avoids vision but can still drive Windows Chrome well.

gray harbor
#

There are two GitHub repos for tools that accomplish what I want in two different ways: CursorTouch/Windows-use and doucej/uia-x. "Windows-use" transforms the Windows UIA tree into text that an LLM can interpret, and "uia-x" does a similar thing, but it's implemented as an MCP that transforms the tree into an addressable "API".

tacit orchidBOT
#

Yeah, between those two, uia-x sounds like the better fit for OpenClaw.

My take:

  • windows-use style, UIA tree as text

    • fast to prototype
    • cheaper than vision
    • but still burns context/tokens
    • easier for the model to get lost in giant noisy trees
  • uia-x style, UIA exposed as an MCP/API

    • much cleaner for agent use
    • more deterministic
    • smaller context footprint
    • easier to make actions addressable instead of “read blob, guess click target”

So if your goal is “give claw more freedom without vision,” I’d strongly lean:

#
  1. run the UIA bridge on Windows
  2. expose it as MCP/tool calls
  3. let OpenClaw use that as a structured tool surface

That matches OpenClaw’s design way better than dumping huge UI trees into prompt text.

A really good pattern would be:

  • tool: list windows
  • tool: inspect focused element / subtree
  • tool: get actionable controls
  • tool: click / invoke / set value on a known element id
  • optional: text summary only when the model needs extra context

That hybrid is usually the sweet spot. Keep the actions structured, keep the tree reads targeted.
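
As a rough illustration of that tool shape, here is a sketch over an in-memory tree. It makes no real UIA calls, and every name in it (`Element`, `UiaBridge`, `actionable_controls`, `set_value`) is hypothetical; it is not the API of uia-x or Windows-use. The point is only that reads are scoped to one subtree and actions target a known element id.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    id: str
    role: str
    name: str
    value: str = ""
    actionable: bool = False
    children: list["Element"] = field(default_factory=list)


class UiaBridge:
    """Hypothetical tool surface over an in-memory tree (no real UIA calls)."""

    def __init__(self, windows: list[Element]) -> None:
        self.windows = windows
        self.index: dict[str, Element] = {}
        for w in windows:
            self._index(w)

    def _index(self, el: Element) -> None:
        # Every element becomes addressable by id.
        self.index[el.id] = el
        for child in el.children:
            self._index(child)

    def list_windows(self) -> list[dict]:
        return [{"id": w.id, "name": w.name} for w in self.windows]

    def actionable_controls(self, root_id: str) -> list[dict]:
        """Targeted read: only controls the model can act on, under one root."""
        out, stack = [], [self.index[root_id]]
        while stack:
            el = stack.pop()
            if el.actionable:
                out.append({"id": el.id, "role": el.role, "name": el.name})
            stack.extend(reversed(el.children))  # keep document order
        return out

    def set_value(self, element_id: str, value: str) -> None:
        el = self.index[element_id]
        if not el.actionable:
            raise ValueError(f"{element_id} is not actionable")
        el.value = value
```

The model only ever sees short lists of ids, roles, and names, and every action names an id it was just given, which is what keeps the context small and the clicks deterministic.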

Big caveat: with your setup, since OpenClaw is in WSL, I would not try to make Linux directly “own” Windows UIA. I’d treat Windows as the UIA host and WSL OpenClaw as the client.
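
On the client side, that split just means the WSL process sends structured requests and the Windows process does the UIA work. A minimal sketch of what one request could look like, assuming the bridge speaks MCP (which frames tool invocations as JSON-RPC 2.0 `tools/call` messages). The tool name `invoke_element` and the `element_id` argument are made up for illustration, and the transport between WSL and Windows is left out.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids


def tool_call_request(tool: str, arguments: dict) -> str:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        # tool and argument names are hypothetical; use whatever the bridge exposes
        "params": {"name": tool, "arguments": arguments},
    })
```

For example, `tool_call_request("invoke_element", {"element_id": "e3"})` produces one small JSON message, instead of a full tree dump going into the prompt.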