#Desktop automation options and discussion about the accessibility tree
To clarify, Claw is running on WSL, with interop enabled.
Yes, for browser/web UI automation, mostly yes.
OpenClaw’s browser tool can work without vision:
- browser snapshot --format aria returns the accessibility tree
- action flows use snapshot refs plus browser act
- under the hood, refs resolve via Playwright aria-ref / getByRole(...)
A couple important limits:
- Pure ARIA snapshot is inspection-only right now, no action refs.
- For actual clicking/typing without vision, the better path is role/interactive snapshots (snapshot --interactive / role refs), which are still non-vision and cheaper than screenshot-based control.
- I do not see a documented first-class tool for arbitrary native desktop app automation via the OS accessibility tree yet. The documented support is clearly strongest for the browser tool.
Will it get confused?
- Less than vision on well-structured sites, usually.
- More trouble on badly-labeled UIs, canvas-heavy apps, or weird custom controls.
- Re-snapshot often, refs are not stable across navigations/page changes.
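To make the "re-snapshot often" point concrete, here is a minimal Python sketch of why a ref from an old snapshot should not be reused. Everything in it (FakePage, click_by_label, the ref naming) is a hypothetical stand-in, not OpenClaw's or Playwright's API. It only illustrates the pattern: take a fresh snapshot, look the target up by its label, then act.

```python
from dataclasses import dataclass, field


class StaleRefError(Exception):
    """Raised when a ref from an outdated snapshot is used."""


@dataclass
class FakePage:
    """Hypothetical stand-in for a browser page (not a real OpenClaw/Playwright class)."""
    generation: int = 0
    clicked: list = field(default_factory=list)

    def snapshot(self) -> dict:
        # Refs are only meaningful for the page generation they were taken in.
        ref = f"g{self.generation}e1"
        return {"generation": self.generation, "refs": {ref: "button: Submit"}}

    def navigate(self) -> None:
        # A navigation or page change invalidates all earlier refs.
        self.generation += 1

    def act(self, snap: dict, ref: str) -> None:
        if snap["generation"] != self.generation or ref not in snap["refs"]:
            raise StaleRefError(f"{ref} came from an outdated snapshot")
        self.clicked.append(ref)


def click_by_label(page: FakePage, label: str) -> str:
    """Re-snapshot right before acting and resolve the ref by label."""
    snap = page.snapshot()
    for ref, description in snap["refs"].items():
        if description == label:
            page.act(snap, ref)
            return ref
    raise LookupError(f"no element labelled {label!r} in the current snapshot")
```

The design choice is to store the label you care about rather than the ref, and resolve it again after anything that might have changed the page.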
So my short take: yes for browser UI, no clear documented “general desktop accessibility tree automation” path yet.
If you want, I can sketch the best non-vision setup for giving your claw more browser freedom.
Ah, yeah, that changes it a bit.
With OpenClaw in WSL2, the supported non-vision path is basically:
- OpenClaw/Gateway in WSL2
- browser running on Windows
- control it via remote CDP
So for browser UI, yes, this is still very viable and cheap.
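As a rough sanity check for that layout, this Python sketch probes a Chrome DevTools endpoint from the WSL2 side. It relies on Chrome's standard /json/version HTTP endpoint, which returns a webSocketDebuggerUrl field when remote debugging is enabled. The host address and port 9222 are assumptions; use whatever address your Windows host is reachable on from WSL2 and whatever port you started Chrome with. It does not show how OpenClaw itself is configured.

```python
import json
from urllib.request import urlopen


def cdp_version_url(host: str, port: int = 9222) -> str:
    """Build the URL of Chrome's DevTools version endpoint (port is an assumption)."""
    return f"http://{host}:{port}/json/version"


def parse_ws_endpoint(payload: str) -> str:
    """Extract the browser-level WebSocket URL from a /json/version response body."""
    info = json.loads(payload)
    return info["webSocketDebuggerUrl"]


def probe(host: str, port: int = 9222, timeout: float = 3.0) -> str:
    """Fetch /json/version from the Windows-side browser and return its WebSocket URL.

    Raises an OSError subclass (for example URLError) if the endpoint is unreachable.
    """
    with urlopen(cdp_version_url(host, port), timeout=timeout) as resp:
        return parse_ws_endpoint(resp.read().decode("utf-8"))
```

If probe() fails from WSL2 but the same URL works in a Windows browser, the problem is most likely in how the debugging port is exposed to WSL2, not in the browser tool itself.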
Important gotchas:
- Use raw remote CDP, not user / existing-session
  The docs are explicit that host-local Chrome attach is not a WSL2-to-Windows bridge.
- snapshot --format aria is inspection-only
  Good for reading the accessibility tree, but not for action refs.
  For actual non-vision clicking/typing, use role/interactive snapshots. That is still accessibility-style control, not pixel vision.
- If you mean native Windows desktop apps through the OS accessibility tree, I do not see a documented first-class OpenClaw path for that yet.
WSL interop by itself would not make that magically work.
Docs:
- Browser tool: https://docs.openclaw.ai/tools/browser
- WSL2 + Windows browser control: https://docs.openclaw.ai/tools/browser-wsl2-windows-remote-cdp-troubleshooting
- Windows / WSL2 platform notes: https://docs.openclaw.ai/platforms/windows
So the short answer is:
- Browser UI from WSL2: yes
- General Windows accessibility-tree desktop automation from WSL2: not clearly supported/documented yet
If you want, I can give you the exact setup shape I’d use for a WSL2 claw that avoids vision but can still drive Windows Chrome well.
There are two GitHub repos for tools that accomplish what I want to do in two different ways: CursorTouch/Windows-use and doucej/uia-x. "Windows-use" transforms the Windows UIA tree into text that an LLM can interpret. "uia-x" does a similar thing, but it's implemented as an MCP that transforms the tree into an addressable "API".
Yeah, between those two, uia-x sounds like the better fit for OpenClaw.
My take:
- windows-use style, UIA tree as text
  - fast to prototype
  - cheaper than vision
  - but still burns context/tokens
  - easier for the model to get lost in giant noisy trees
- uia-x style, UIA exposed as an MCP/API
  - much cleaner for agent use
  - more deterministic
  - smaller context footprint
  - easier to make actions addressable instead of “read blob, guess click target”
So if your goal is “give claw more freedom without vision,” I’d strongly lean:
- run the UIA bridge on Windows
- expose it as MCP/tool calls
- let OpenClaw use that as a structured tool surface
That matches OpenClaw’s design way better than dumping huge UI trees into prompt text.
A really good pattern would be:
- tool: list windows
- tool: inspect focused element / subtree
- tool: get actionable controls
- tool: click / invoke / set value on a known element id
- optional: text summary only when the model needs extra context
That hybrid is usually the sweet spot. Keep the actions structured, keep the tree reads targeted.
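Here is a minimal Python sketch of that tool shape, run against an in-memory tree so it is easy to reason about. All names (Element, UiaBridge, the method names, the element ids) are hypothetical and are not taken from uia-x or Windows-use. A real bridge would back these methods with Windows UI Automation calls instead of a Python object graph.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical simplified UI element; real UIA elements carry far more properties."""
    id: str
    role: str
    name: str
    value: str = ""
    actionable: bool = False
    children: list = field(default_factory=list)


class UiaBridge:
    """Hypothetical structured tool surface over an in-memory UI tree."""

    def __init__(self, windows):
        self.windows = windows
        self.index = {}   # element id -> Element, so actions address known ids
        self.log = []     # record of performed actions
        for window in windows:
            self._index(window)

    def _index(self, el):
        self.index[el.id] = el
        for child in el.children:
            self._index(child)

    def list_windows(self):
        return [{"id": w.id, "name": w.name} for w in self.windows]

    def inspect(self, element_id, depth=1):
        """Return a targeted subtree, limited in depth to keep context small."""
        el = self.index[element_id]
        node = {"id": el.id, "role": el.role, "name": el.name}
        if depth > 0:
            node["children"] = [self.inspect(c.id, depth - 1) for c in el.children]
        return node

    def actionable_controls(self, element_id):
        """List only the controls the model can act on, in document order."""
        found, stack = [], [self.index[element_id]]
        while stack:
            el = stack.pop()
            if el.actionable:
                found.append({"id": el.id, "role": el.role, "name": el.name})
            stack.extend(reversed(el.children))
        return found

    def invoke(self, element_id):
        el = self.index[element_id]
        if not el.actionable:
            raise ValueError(f"{element_id} is not actionable")
        self.log.append(("invoke", element_id))

    def set_value(self, element_id, value):
        el = self.index[element_id]
        if not el.actionable:
            raise ValueError(f"{element_id} is not actionable")
        el.value = value
        self.log.append(("set_value", element_id))
```

The point of the id index is that the model never has to guess a click target from a text dump: it asks for actionable controls, then acts on an id it was given.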
Big caveat: with your setup, since OpenClaw is in WSL, I would not try to make Linux directly “own” Windows UIA. I’d treat Windows as the UIA host and WSL OpenClaw as the client.
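On the client side, that split could look roughly like this Python sketch, which only builds the request a WSL-side client would send to a Windows-side MCP server. MCP messages are JSON-RPC 2.0 and tools are called with the tools/call method. The tool name invoke_element and the element_id argument are made up for illustration, so check the actual tool names your bridge exposes. Transport (stdio, HTTP, and so on) is deliberately left out.

```python
import itertools
import json

# Monotonic request ids, as JSON-RPC expects each request to carry its own id.
_request_ids = itertools.count(1)


def tool_call_request(tool_name: str, arguments: dict) -> dict:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def encode(request: dict) -> str:
    """Serialize the request compactly for sending over whatever transport is in use."""
    return json.dumps(request, separators=(",", ":"))
```

For example, tool_call_request("invoke_element", {"element_id": "b1"}) produces a message the Windows host can execute against UIA, while the WSL side never touches UIA directly.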