#Couple of questions regarding tool capability and models.


south terrace
#

To be honest, I wasn't aware that Claude had lifted their ban on OpenClaw using the coding plans. This was news to me. I've been really enjoying GLM 5.1 on their Max plan at 80 bucks a month. It doesn't take much to get it to use tools correctly. Super solid so far. But I'm not the biggest fan of paying that much, on top of just giving China all of my data. What's everyone moving to now that Claude is out?

Additionally, I've been really trying to crack a capability gap that OpenClaw has, and it's an understandable one. I'd really, really like OpenClaw to be able to actually look at a screen and move the mouse and keyboard intelligently to accomplish a task. For example, job applications. They're extremely tedious, and instead of relying on DOM snapshots (these sites notoriously change element positions all the time) we could actually look at a screen and send HID events. Am I barking up an impossible tree here, or is this realistic?

terse summitBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

shell knollBOT
#

You’re not barking up an impossible tree — it’s realistic, just not “free.”

On the model side: this moves too fast to give a trustworthy static answer in-thread. Best place for real-world “what people are actually using this week” is #models, plus the docs notes:

For the capability gap: today OpenClaw’s first-class path is browser automation via snapshot refs (stable-ish, deterministic), not raw desktop HID control.

  • Browser tool: snapshot → ref IDs → click/type/drag/select
    https://docs.openclaw.ai/tools/browser
  • That means it’s great for web flows, but it is not yet “see arbitrary screen + move OS mouse/keyboard like a human” out of the box.

So: feasible? Yes. Current core feature? Not quite.
If you want job-app automation now, best practical setup is hybrid:

  1. DOM/ref actions as primary
  2. visual screenshot checks before critical steps
  3. human approval before submit/final actions

That gets most of the value without the brittleness of full pixel+HID control on every site.
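A minimal sketch of that hybrid flow, in case it helps to see the gating logic laid out. Everything here is hypothetical and not part of OpenClaw: `Step`, `run_flow`, `looks_right` (standing in for a screenshot check) and `approve` (standing in for a human confirmation prompt) are illustrative names only, and the DOM/ref action is just a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One unit of work in the flow (hypothetical structure)."""
    description: str
    act: Callable[[], None]   # the DOM/ref action, e.g. a click or type
    critical: bool = False    # run a visual check before acting
    final: bool = False       # require human approval before acting


def run_flow(
    steps: List[Step],
    looks_right: Callable[[Step], bool],  # assumed screenshot-based check
    approve: Callable[[Step], bool],      # assumed human approval prompt
) -> List[str]:
    """Run steps in order and return the descriptions of completed steps.

    Stops at the first step whose visual check or approval fails,
    so nothing after a failed gate is executed.
    """
    done: List[str] = []
    for step in steps:
        if step.critical and not looks_right(step):
            break  # the screen does not match expectations
        if step.final and not approve(step):
            break  # the human declined the submit/final action
        step.act()
        done.append(step.description)
    return done
```

The point of the design is that the cheap DOM/ref path does the actual work, while the expensive checks (vision, human) only run on steps flagged `critical` or `final`.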

south terrace
#

See, I tried exactly that flow, but I end up with these massive token-burning wastes of conversations where it's very stubborn: it will take the screenshot, then still try to "click" on what it found in the DOM tree. The refs are randomized and update constantly on certain sites, so by the time reasoning is done, that ref has disappeared.

shell knollBOT
#

Yep — that is exactly the painful failure mode. You’re not imagining it.

What helps most (right now) is forcing a tight perceive→act loop so refs can’t go stale:

  1. Use role snapshots, not big AI snapshots
     • snapshot --interactive --compact (or --mode efficient)
     • Smaller payload, faster reasoning, less token burn.
  2. One snapshot = one action
     • Snapshot → click/type once → snapshot again.
     • Never reuse refs after any action/navigation/rerender.
  3. Wait before snapshot
     • wait --url ... / wait --load networkidle / wait "#main"
     • Then snapshot immediately after the page is stable.
  4. Turn down reasoning for this flow