#Architecture questions

vernal fable
#

The screenshot tool takes the screenshot. Which tool actually calls the vision LLM?

gritty laurelBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

tiny ventureBOT
#

In OpenClaw, the thing that actually invokes a vision-capable model is the image tool.

  • browser.screenshot / canvas.snapshot / nodes.camera_snap etc. only capture pixels (PNG/JPEG) and return an image (or a path/buffer). They don’t do any interpretation.
  • Then you either:
    1. Call image with that screenshot + a prompt (explicit “vision step”), or
    2. If your main chat model is multimodal and the runtime wires it that way, the screenshot can be passed back into the model as an input (no separate “vision tool” call). But in terms of “which tool calls vision LLM?” → it’s image.

So the typical explicit flow is:

  1. browser.screenshot → get …png
  2. image({ image: <that png>, prompt: "what’s on this screen?" }) → vision text output
vernal fable
#

How can I set up a Discord channel from my OpenClaw?

tiny ventureBOT
#

To hook OpenClaw up to Discord you basically: create a Discord bot → put the bot token in OpenClaw config → start the gateway → do the first DM pairing → (optionally) allow a specific server/channel.

First though: what version are you on?
Run openclaw --version (version mismatches cause a lot of “why isn’t Discord starting?” pain).

1) Create the Discord bot + invite it

Follow the official guide (it’s up to date and has the exact intents/perms):
https://docs.openclaw.ai/channels/discord

Key bits from that page:

  • Enable Message Content Intent (required)
  • Invite with bot + applications.commands
  • Give it perms like View Channels / Send Messages / Read History / etc.

2) Set the bot token in OpenClaw (don’t paste tokens into chat)

On the machine running the gateway:

openclaw config set channels.discord.token '"YOUR_BOT_TOKEN"' --json
openclaw config set channels.discord.enabled true --json
openclaw gateway restart   # or: openclaw gateway (if you run it foreground)

Docs: https://docs.openclaw.ai/channels/discord
Config reference: https://docs.openclaw.ai/gateway/configuration-reference

(Env fallback exists too: DISCORD_BOT_TOKEN=..., but config token wins.)
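The precedence described above (config token first, environment variable as fallback) can be pictured with a small sketch. This is not OpenClaw's actual code: `resolve_discord_token` and the nested-dict config shape are assumptions made for illustration; only the `channels.discord.token` key and the `DISCORD_BOT_TOKEN` variable come from the steps above.

```python
import os
from typing import Mapping, Optional


def resolve_discord_token(
    config: Mapping, env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Return the config token if set, otherwise fall back to the env var."""
    # Hypothetical config shape: {"channels": {"discord": {"token": "..."}}}
    token = config.get("channels", {}).get("discord", {}).get("token")
    return token or env.get("DISCORD_BOT_TOKEN")
```

If both are set, the config value is returned; if neither is set, the result is `None`.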

3) Pair via DM (recommended first step)

  • DM your bot in Discord → it replies with a pairing code
  • Approve it:
openclaw pairing list discord
openclaw pairing approve discord <CODE>

After this, Discord DMs should work.

4) Let it talk in a server/channel (guild)

By default it’s conservative. To allow a specific server, add it to the allowlist and decide whether it must be @mentioned:

{
  channels: {
    discord: {
      groupPolicy: "allowlist",
      guilds: {
        YOUR_SERVER_ID: {
          requireMention: true,   // set false only for private servers
          users: ["YOUR_USER_ID"],
        },
      },
    },
  },
}

If you tell me whether you want DM-only or respond in a specific server channel, and paste:

  • your openclaw --version
  • whether you run the gateway as a service (openclaw gateway status)

…I can point you at the exact minimal config for your case.

vernal fable
#

How does OpenClaw handle user interaction with the agent? Is there a tool call for prompting the user?

tiny ventureBOT
#

OpenClaw doesn’t treat “ask the user a question” as a tool call.

The basic interaction model

  • User input arrives from the channel (Discord/Telegram/…) as an inbound message event.
  • The agent replies with normal text (and optionally attachments). That’s just the agent’s output, not a tool.
  • Tools are for doing things (web, files, devices, etc.), not for “prompting” the user.

So if you want interactivity, you usually do it one of these ways:

1) Plain chat (most common)

The agent asks: “Which option: A or B?”
User replies with text. No special tool involved.

2) Discord slash commands

Users run /model, /focus, etc. These are interactive entry points on Discord and route into OpenClaw’s command handling.
Docs: https://docs.openclaw.ai/tools/slash-commands

3) Discord UI components (buttons/selects/modals)

If you want structured user input (click buttons, choose from a dropdown, fill a modal), that’s done by sending a Discord message with components. In OpenClaw, that’s supported via the message tool with a components payload; when a user clicks/submits, OpenClaw routes that interaction back to the agent as a normal inbound event (so the agent can continue the flow).

Docs section: https://docs.openclaw.ai/channels/discord (see “Interactive components” / modals)

What’s not it

  • browser.dialog is only for handling JavaScript alert/prompt dialogs in browser automation, not for prompting the human user in chat.

If you tell me which surface you care about (Discord only? also Telegram?), I can show a minimal “button → click → agent continues” pattern that matches it.

vernal fable
#

Say there is a form that displays its fields one at a time. I want the agent to read me each form field, and I'll supply the input for it.

tiny ventureBOT
#

Do it as a wizard: the agent keeps a little bit of state (“which field am I on + answers so far”), asks one question, then treats your next message as the answer for that field.

There’s no special “prompt user” tool — the “prompt” is just the agent sending a message, and your reply is the input event.

Option A (simplest): pure chat, one field per message

Flow:

  1. Agent: “Field 1/5: What’s your full name?”
  2. You: “Jane Doe”
  3. Agent stores it, validates if needed, then: “Field 2/5: What’s your email?”
  4. …

At the end, agent summarizes:

  • name: …
  • email: …
  • …

…and asks: “Confirm submit? (yes/no)”.

This works great in Discord because OpenClaw sessions are already per-channel/thread, so the state naturally lives in that conversation history.

Make it robust (recommended)

Have the agent support:

  • back (go to previous field)
  • skip (leave blank)
  • restart (clear answers)
  • validation + re-ask (e.g., email format)

Option B (more “UI”): Discord components / modals

If you want clickable interaction (or to avoid free-form replies), use Discord components:

  • Send a message with a button like “Enter value”
  • Button opens a modal with one text input
  • User submits → OpenClaw routes the interaction back to the agent → agent advances to next field

Docs (Discord components/modals support):
https://docs.openclaw.ai/channels/discord (see “Interactive components”)

Notes:

  • Discord modals can contain multiple inputs, but you can intentionally send one-field modals repeatedly to get “one at a time” behavior.
  • This approach is nicer when you need strict formatting, but it’s more moving parts than plain chat.

If you literally want it “read aloud”

That’s a separate layer: you’d run the interaction in a Discord voice context + TTS (OpenClaw supports Discord voice/TTS), but the form logic is still the same wizard pattern.

If you tell me:

  1. are you doing this in DMs or a guild channel/thread, and
  2. do you want pure text chat or button+modal,

…I can sketch the exact minimal state machine / prompt pattern for your agent.

vernal fable
#

Explain more on wizard.

tiny ventureBOT
#

A “wizard” is just a multi-turn form fill implemented as a tiny state machine inside the conversation:

  • at any moment you’re on exactly one step (field)
  • the agent asks the question for that field
  • your next message is treated as the answer
  • agent validates/stores it, then moves to the next field
  • at the end it shows a summary + asks for confirmation

No special OpenClaw “prompt user” tool is required — it’s all normal chat messages + keeping state.

1) What state the wizard tracks

Minimum viable state (mentally or explicitly in text):

  • active: true/false
  • ownerUserId: who is filling it out (important in guild channels)
  • stepIndex: which field we’re on
  • answers: map of { fieldKey: value }
  • status: "collecting" | "confirming" | "submitted"

And a fixed “schema” for the wizard:

fields: [
  { key: "fullName", prompt: "Full name?", required: true },
  { key: "email", prompt: "Email address?", required: true, validate: "email" },
  { key: "company", prompt: "Company?", required: false },
]
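If you'd rather hold that schema in code than in prose, here is a minimal Python sketch. The `Field` dataclass, `check` helper, and the loose email regex are illustrative assumptions, not anything OpenClaw provides; the field keys and prompts are the ones listed above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Deliberately loose check: something@something.tld, no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass(frozen=True)
class Field:
    key: str
    prompt: str
    required: bool = True
    validate: Optional[str] = None  # name of a validator, e.g. "email"


FIELDS = [
    Field("fullName", "Full name?"),
    Field("email", "Email address?", validate="email"),
    Field("company", "Company?", required=False),
]


def check(field: Field, raw: str) -> Optional[str]:
    """Return the trimmed value, or None if the answer is invalid."""
    value = raw.strip()
    if not value:
        return None
    if field.validate == "email" and not EMAIL_RE.match(value):
        return None
    return value
```

Keeping validation keyed by name (`"email"`) means the schema stays plain data and new validators can be added without touching the field list.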

2) The loop (the whole wizard in 6 lines)

On every inbound message from the user:

  1. If wizard not active → start it, set stepIndex=0, ask first prompt
  2. If message is a control command (back, skip, restart, cancel) → handle that
  3. Else treat message as answer for fields[stepIndex]
  4. Validate / normalize (e.g. trim, parse number/date, check email format)
  5. If invalid → re-ask same field with a helpful hint
  6. If valid → store in answers, increment stepIndex, ask next field (or confirm)
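The loop above can be sketched as a single function that takes the current state plus one inbound message and returns the agent's reply. Everything here (`Wizard`, `handle`, `is_valid`, the two-field list) is a hypothetical illustration; in a real agent the "state" may simply live in the conversation history. Only `cancel` is shown as a control command to keep the sketch short.

```python
from dataclasses import dataclass, field

FIELDS = [("fullName", "Full name?"), ("email", "Email address?")]


@dataclass
class Wizard:
    active: bool = False
    step: int = 0
    answers: dict = field(default_factory=dict)


def is_valid(key: str, value: str) -> bool:
    # Placeholder validation: non-empty, and text on both sides of "@" for email.
    if key == "email":
        local, _, domain = value.partition("@")
        return bool(local and domain)
    return bool(value)


def handle(w: Wizard, text: str) -> str:
    """Process one inbound message and return the agent's reply."""
    text = text.strip()
    if not w.active:                      # not active: start and ask first prompt
        w.active, w.step, w.answers = True, 0, {}
        return FIELDS[0][1]
    if text.lower() == "cancel":          # control command (subset)
        w.active = False
        return "Cancelled."
    key, prompt = FIELDS[w.step]          # otherwise treat the message as the answer
    if not is_valid(key, text):           # invalid: re-ask the same field
        return f"That doesn't look right. {prompt}"
    w.answers[key] = text                 # valid: store and advance
    w.step += 1
    if w.step == len(FIELDS):
        w.active = False
        return f"Done: {w.answers}"
    return FIELDS[w.step][1]
```

Note that the message which starts the wizard is consumed by the start step and is not treated as an answer.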

3) Control commands (you want these, trust me)

In chat wizards, users will make mistakes. Add these:

  • back → go to previous field (don’t lose answers unless overwritten)
  • skip → set value to null/empty (only if field not required)
  • restart → clear answers + start at step 0
  • cancel → exit wizard
  • status → show “you’re on step X/Y, current answers: …”
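Here is one way those five commands could be handled, as a rough sketch. The dict-based state, `new_state`, and `handle_command` are made-up names for illustration; the behaviour follows the list above (back keeps answers, skip is refused for required fields).

```python
from typing import Optional

FIELDS = [
    {"key": "fullName", "required": True},
    {"key": "email", "required": True},
    {"key": "company", "required": False},
]


def new_state() -> dict:
    return {"active": True, "step": 0, "answers": {}}


def handle_command(state: dict, text: str) -> Optional[str]:
    """Apply a control command in place. Return a reply, or None if not a command."""
    cmd = text.strip().lower()
    if cmd == "back":
        state["step"] = max(0, state["step"] - 1)  # answers are kept
        return f"Back to {FIELDS[state['step']]['key']}."
    if cmd == "skip":
        if state["step"] >= len(FIELDS):
            return "Nothing to skip."
        current = FIELDS[state["step"]]
        if current["required"]:
            return f"{current['key']} is required and cannot be skipped."
        state["answers"][current["key"]] = None
        state["step"] += 1
        return f"Skipped {current['key']}."
    if cmd == "restart":
        state.update(new_state())
        return "Restarted at step 1."
    if cmd == "cancel":
        state["active"] = False
        return "Cancelled."
    if cmd == "status":
        shown = min(state["step"] + 1, len(FIELDS))
        return f"Step {shown}/{len(FIELDS)}, answers: {state['answers']}"
    return None  # not a command: the caller treats the text as an answer
```

Returning `None` for anything that is not a command lets the main loop fall through to normal answer handling.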

4) Confirmation step (prevents bad submissions)

When all fields are filled:

  1. Agent prints a compact summary
  2. Asks: “Type confirm to submit or back to edit”
  3. Only on confirm does it run the “submit” action (API call, file write, etc.)

This is where you avoid “oops I typed the wrong email and it already did the thing”.
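A minimal sketch of that gate, assuming the side effect is passed in as a callable. `summarize`, `on_confirm_step`, and the `submit` parameter are hypothetical names; the point is only that nothing runs until the user types confirm.

```python
from typing import Callable


def summarize(answers: dict) -> str:
    """Build the compact summary shown before submission."""
    lines = [f"- {k}: {v if v is not None else '(blank)'}" for k, v in answers.items()]
    return "Summary:\n" + "\n".join(lines) + "\nType confirm to submit or back to edit."


def on_confirm_step(text: str, answers: dict, submit: Callable[[dict], None]) -> str:
    """Run the side effect only on an explicit 'confirm'."""
    cmd = text.strip().lower()
    if cmd == "confirm":
        submit(answers)  # API call, file write, etc. (supplied by the caller)
        return "Submitted."
    if cmd == "back":
        return "OK, which field do you want to edit?"
    return "Please type confirm to submit or back to edit."
```

Anything other than an exact confirm (including "yes" or "ok") re-asks instead of submitting, which is the conservative choice for irreversible actions.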

5) How to make it reliable in Discord channels

Two practical gotchas in guild channels:

A) Multiple people can talk

So the wizard should lock to the initiating user:

  • “I’m running this wizard for <user>; others can still chat but I’ll ignore inputs that look like form answers.”
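The lock itself is a one-line check. A small sketch, assuming the inbound event exposes the author's user ID as a string (`WizardState`, `start`, and `accepts_input` are illustrative names, not OpenClaw APIs):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WizardState:
    active: bool = False
    owner_user_id: Optional[str] = None


def start(state: WizardState, author_id: str) -> None:
    """Begin the wizard and lock it to whoever started it."""
    state.active = True
    state.owner_user_id = author_id


def accepts_input(state: WizardState, author_id: str) -> bool:
    """Only the user who started the wizard can answer its prompts."""
    return state.active and state.owner_user_id == author_id
```

Messages from other users can still be answered as ordinary chat; they just never advance the wizard.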

B) Keep the state explicit and small

LLMs are better at being consistent if you keep a short “state recap” in each prompt, e.g.:

Step 2/5 (email) — reply with your email, or type back / cancel.

That reduces drift and makes the flow resilient even if the conversation gets noisy.
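If you generate the recap from state rather than leaving it to the model, it stays consistent on every turn. A tiny sketch (`render_prompt` and its parameters are hypothetical, and the exact punctuation of the recap is a free choice):

```python
def render_prompt(
    step_index: int, total: int, key: str, ask: str, can_go_back: bool = True
) -> str:
    """Render a one-line state recap for the current field."""
    controls = "back / cancel" if can_go_back else "cancel"
    return f"Step {step_index + 1}/{total} ({key}): {ask}, or type {controls}."


print(render_prompt(1, 5, "email", "reply with your email"))
# prints: Step 2/5 (email): reply with your email, or type back / cancel.
```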

6) Example transcript (plain chat wizard)

Agent: Starting signup wizard. Step 1/3: Full name? (or cancel)
User: Jane Doe
Agent: Step 2/3: Email? (back to edit name)
User: jane@
Agent: That doesn’t look like a full email. Please enter something like jane@example.com.
User: jane@example.com
Agent: Step 3/3: Company? (skip to leave blank)
User: skip
Agent: Summary:
- fullName: Jane Doe
- email: jane@example.com
- company: (blank)
Type confirm to submit or back to edit.
User: confirm
Agent: Submitted. Want to start another? (restart)

If you tell me what you’re submitting to at the end (API? create a ticket? write a JSON file?), I can show the cleanest “collect → validate → confirm → execute” pattern for that.

vernal fable
#

I'm talking about a case where the form fields are not known in advance.

tiny ventureBOT
#

If the fields aren’t known up front, the wizard still works — you just need a field discovery step each turn. An LLM can’t “guess” fields reliably; it has to get them from some source at runtime.

There are basically 3 ways to do that:

1) Human-in-the-loop discovery (most robust)

You make the wizard generic:

Loop:

  1. Agent: “What’s the next field label / question shown on the form? (paste it)”
  2. User pastes: “Billing ZIP code (required)”
  3. Agent: “OK — Billing ZIP code: what should I enter?”
  4. User answers
  5. Repeat until user says done

State is just an ever-growing list like:

  • { label, value, notes/constraints, required? }

This is the simplest architecture when you truly don’t have programmatic access to the form definition.
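As a rough sketch, the generic version alternates between two kinds of turn: "user pastes a label" and "user gives the value". The `Discovery` class and `step` function are illustrative names; parsing notes such as "(required)" out of the label is left out here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Discovery:
    entries: list = field(default_factory=list)  # [{"label": ..., "value": ...}]
    pending_label: Optional[str] = None          # label waiting for its value
    done: bool = False


def step(d: Discovery, text: str) -> str:
    """Consume one user message: either a field label, a value, or 'done'."""
    text = text.strip()
    if d.pending_label is None:
        if text.lower() == "done":
            d.done = True
            return f"Collected {len(d.entries)} field(s)."
        d.pending_label = text
        return f"OK, {text}: what should I enter?"
    d.entries.append({"label": d.pending_label, "value": text})
    d.pending_label = None
    return "What's the next field label shown on the form? (paste it, or say done)"
```

Because `done` is only checked when no label is pending, a field whose value is literally "done" is still stored as an answer.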

2) Agent reads the form each step (UI-driven discovery)

If the form is actually visible somewhere (web page / screenshot), the wizard can do:

Loop:

  1. Take a snapshot of the UI (DOM or screenshot)
  2. Extract “current field label + type + required + any error message”
  3. Ask you for that field’s value
  4. Apply it (type/select) and continue

In OpenClaw terms that usually looks like:

  • DOM-based (best if you control the browser): browser.snapshot (or similar) to see inputs/labels
  • Pixel-based: take screenshot (browser.screenshot / canvas.snapshot / etc.) then run image to interpret it (“what field is being asked right now?”)

This gives you “unknown fields” but discoverable from the UI.
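Structurally, that loop looks like the sketch below. The three callables are placeholders you would back with whatever actually reads the UI, asks you, and types into the form; none of them are real OpenClaw functions, and the `{"label": ...}` field shape is an assumption.

```python
from typing import Callable, Optional


def run_ui_wizard(
    discover: Callable[[], Optional[dict]],   # current field info, or None at Review/Submit
    ask_user: Callable[[dict], str],          # asks the human for this field's value
    apply_value: Callable[[dict, str], None], # types/selects the value in the UI
    max_steps: int = 50,
) -> dict:
    """Discover, ask, apply, repeat until no field is left."""
    answers = {}
    for _ in range(max_steps):  # hard cap so a stuck UI cannot loop forever
        current = discover()
        if current is None:     # stopping condition: nothing left to fill
            break
        value = ask_user(current)
        apply_value(current, value)
        answers[current["label"]] = value
    return answers
```

Injecting the three steps keeps the loop testable with a fake form, and lets you swap DOM-based discovery for pixel-based discovery without touching the loop.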

3) Fetch the schema from the source system (data-driven discovery)

If the form comes from a backend that can describe itself (common in enterprise apps):

  • call an API to get the form definition (JSON Schema, OpenAPI, custom “form config”)
  • then the wizard becomes deterministic: iterate over returned fields

This is the cleanest architecture if you can get it, because you can validate types, required fields, allowed values, etc.
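For example, if the backend hands you a JSON Schema object, turning it into a wizard field list is a few lines. This sketch only reads the standard `properties`, `required`, `title`, `type`, and `enum` keywords; the output dict shape and the function name are assumptions for illustration.

```python
def fields_from_json_schema(schema: dict) -> list:
    """Flatten a JSON Schema object definition into wizard fields."""
    required = set(schema.get("required", []))
    fields = []
    for key, spec in schema.get("properties", {}).items():
        fields.append({
            "key": key,
            "prompt": spec.get("title", key),     # fall back to the key as the label
            "type": spec.get("type", "string"),
            "required": key in required,
            "choices": spec.get("enum"),          # None when any value is allowed
        })
    return fields
```

Nested objects and arrays would need more handling; this covers the flat case only.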


The key design change vs. “known fields”

With unknown fields, each turn is:

(discover next field) → (ask user) → (validate/store) → (advance)

So you need:

  • a stopping condition: user says done, or UI shows “Review/Submit”, or schema has no more required fields
  • idempotent editing: “back”, “change <field>”, “show current answers”
  • owner locking in a guild channel (only accept answers from the user who started the wizard)

If you tell me where the form lives (web page you can automate? external system with an API? or purely human-described), I can recommend which of the 3 discovery strategies fits best.