#Architecture questions
In OpenClaw, the thing that actually invokes a vision-capable model is the image tool.
- browser.screenshot / canvas.snapshot / nodes.camera_snap etc. only capture pixels (PNG/JPEG) and return an image (or a path/buffer). They don’t do any interpretation.
- Then you either:
  - Call image with that screenshot + a prompt (explicit “vision step”), or
  - If your main chat model is multimodal and the runtime wires it that way, the screenshot can be passed back into the model as an input (no separate “vision tool” call).
But in terms of “which tool calls vision LLM?” → it’s image.
So the typical explicit flow is:
- browser.screenshot → get …png
- image({ image: <that png>, prompt: "what’s on this screen?" }) → vision text output
How can I set up a Discord channel from my OpenClaw?
To hook OpenClaw up to Discord you basically: create a Discord bot → put the bot token in OpenClaw config → start the gateway → do the first DM pairing → (optionally) allow a specific server/channel.
First though: what version are you on?
Run openclaw --version (version mismatches cause a lot of “why isn’t Discord starting?” pain).
1) Create the Discord bot + invite it
Follow the official guide (it’s up to date and has the exact intents/perms):
https://docs.openclaw.ai/channels/discord
Key bits from that page:
- Enable Message Content Intent (required)
- Invite with bot + applications.commands
- Give it perms like View Channels / Send Messages / Read History / etc.
2) Set the bot token in OpenClaw (don’t paste tokens into chat)
On the machine running the gateway:
openclaw config set channels.discord.token '"YOUR_BOT_TOKEN"' --json
openclaw config set channels.discord.enabled true --json
openclaw gateway restart # or: openclaw gateway (if you run it foreground)
Docs: https://docs.openclaw.ai/channels/discord
Config reference: https://docs.openclaw.ai/gateway/configuration-reference
(Env fallback exists too: DISCORD_BOT_TOKEN=..., but config token wins.)
3) Pair via DM (recommended first step)
- DM your bot in Discord → it replies with a pairing code
- Approve it:
openclaw pairing list discord
openclaw pairing approve discord <CODE>
After this, Discord DMs should work.
4) Let it talk in a server/channel (guild)
By default it’s conservative. To allow a specific server, add it to the allowlist and decide whether it must be @mentioned:
{
  channels: {
    discord: {
      groupPolicy: "allowlist",
      guilds: {
        YOUR_SERVER_ID: {
          requireMention: true, // set false only for private servers
          users: ["YOUR_USER_ID"],
        },
      },
    },
  },
}
If you tell me whether you want DM-only or respond in a specific server channel, and paste:
- your openclaw --version
- whether you run the gateway as a service (openclaw gateway status)
…I can point you at the exact minimal config for your case.
How does OpenClaw handle user interaction with the agent? Is there a tool call for prompting the user?
OpenClaw doesn’t treat “ask the user a question” as a tool call.
The basic interaction model
- User input arrives from the channel (Discord/Telegram/…) as an inbound message event.
- The agent replies with normal text (and optionally attachments). That’s just the agent’s output, not a tool.
- Tools are for doing things (web, files, devices, etc.), not for “prompting” the user.
So if you want interactivity, you usually do it one of these ways:
1) Plain chat (most common)
The agent asks: “Which option: A or B?”
User replies with text. No special tool involved.
2) Discord slash commands
Users run /model, /focus, etc. These are interactive entry points on Discord and route into OpenClaw’s command handling.
Docs: https://docs.openclaw.ai/tools/slash-commands
3) Discord UI components (buttons/selects/modals)
If you want structured user input (click buttons, choose from a dropdown, fill a modal), that’s done by sending a Discord message with components. In OpenClaw, that’s supported via the message tool with a components payload; when a user clicks/submits, OpenClaw routes that interaction back to the agent as a normal inbound event (so the agent can continue the flow).
Docs section: https://docs.openclaw.ai/channels/discord (see “Interactive components” / modals)
What’s not it
- browser.dialog is only for handling JavaScript alert/prompt dialogs in browser automation, not for prompting the human user in chat.
If you tell me which surface you care about (Discord only? also Telegram?), I can show a minimal “button → click → agent continues” pattern that matches it.
Say there is a form that displays its fields one at a time. I want the agent to read me each form field, and I'll supply the input for it.
Do it as a wizard: the agent keeps a little bit of state (“which field am I on + answers so far”), asks one question, then treats your next message as the answer for that field.
There’s no special “prompt user” tool — the “prompt” is just the agent sending a message, and your reply is the input event.
Option A (simplest): pure chat, one field per message
Flow:
- Agent: “Field 1/5: What’s your full name?”
- You: “Jane Doe”
- Agent stores it, validates if needed, then: “Field 2/5: What’s your email?”
- …
At the end, agent summarizes:
- name: …
- email: …
- …
…and asks: “Confirm submit? (yes/no)”.
This works great in Discord because OpenClaw sessions are already per-channel/thread, so the state naturally lives in that conversation history.
Make it robust (recommended)
Have the agent support:
- back (go to previous field)
- skip (leave blank)
- restart (clear answers)
- validation + re-ask (e.g., email format)
Option B (more “UI”): Discord components / modals
If you want clickable interaction (or to avoid free-form replies), use Discord components:
- Send a message with a button like “Enter value”
- Button opens a modal with one text input
- User submits → OpenClaw routes the interaction back to the agent → agent advances to next field
Docs (Discord components/modals support):
https://docs.openclaw.ai/channels/discord (see “Interactive components”)
Notes:
- Discord modals can contain multiple inputs, but you can intentionally send one-field modals repeatedly to get “one at a time” behavior.
- This approach is nicer when you need strict formatting, but it’s more moving parts than plain chat.
If you literally want it “read aloud”
That’s a separate layer: you’d run the interaction in a Discord voice context + TTS (OpenClaw supports Discord voice/TTS), but the form logic is still the same wizard pattern.
If you tell me:
- are you doing this in DMs or a guild channel/thread, and
- do you want pure text chat or button+modal,
…I can sketch the exact minimal state machine / prompt pattern for your agent.
Explain more on wizard.
A “wizard” is just a multi-turn form fill implemented as a tiny state machine inside the conversation:
- at any moment you’re on exactly one step (field)
- the agent asks the question for that field
- your next message is treated as the answer
- agent validates/stores it, then moves to the next field
- at the end it shows a summary + asks for confirmation
No special OpenClaw “prompt user” tool is required — it’s all normal chat messages + keeping state.
1) What state the wizard tracks
Minimum viable state (mentally or explicitly in text):
- active: true/false
- ownerUserId: who is filling it out (important in guild channels)
- stepIndex: which field we’re on
- answers: map of { fieldKey: value }
- status: "collecting" | "confirming" | "submitted"
And a fixed “schema” for the wizard:
fields: [
  { key: "fullName", prompt: "Full name?", required: true },
  { key: "email", prompt: "Email address?", required: true, validate: "email" },
  { key: "company", prompt: "Company?", required: false },
]
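If it helps to see that state and schema written down, here is a minimal Python sketch. The class names, the VALIDATORS registry, the email regex, and the "user-123" id are all assumptions made for illustration; none of this is an OpenClaw API, it's just one way to hold the same information explicitly.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical validator registry; "email" is the only validator named in the schema.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
VALIDATORS: dict[str, Callable[[str], bool]] = {
    "email": lambda s: EMAIL_RE.match(s) is not None,
}


@dataclass(frozen=True)
class Field:
    key: str
    prompt: str
    required: bool = True
    validate: Optional[str] = None  # name of an entry in VALIDATORS


@dataclass
class WizardState:
    owner_user_id: str               # who is filling the form out
    active: bool = True
    step_index: int = 0              # which field we are on
    answers: dict[str, Optional[str]] = field(default_factory=dict)
    status: str = "collecting"       # "collecting" | "confirming" | "submitted"


FIELDS = [
    Field("fullName", "Full name?"),
    Field("email", "Email address?", validate="email"),
    Field("company", "Company?", required=False),
]

state = WizardState(owner_user_id="user-123")  # placeholder user id
```

Keeping the schema immutable (frozen) and the state mutable makes it obvious which part changes from turn to turn.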
2) The loop (the whole wizard in 6 lines)
On every inbound message from the user:
- If wizard not active → start it, set stepIndex=0, ask first prompt
- If message is a control command (back, skip, restart, cancel) → handle that
- Else treat message as answer for fields[stepIndex]
- Validate / normalize (e.g. trim, parse number/date, check email format)
- If invalid → re-ask same field with a helpful hint
- If valid → store in answers, increment stepIndex, ask next field (or confirm)
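The six steps can be sketched as a single per-message handler. This is a rough Python illustration, not OpenClaw code: handle_message, ask, new_state, and the email regex are names and choices made up for the example, and control commands are only stubbed here.

```python
import re

FIELDS = [
    {"key": "fullName", "prompt": "Full name?", "required": True},
    {"key": "email", "prompt": "Email address?", "required": True, "validate": "email"},
    {"key": "company", "prompt": "Company?", "required": False},
]
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
CONTROL = {"back", "skip", "restart", "cancel"}


def new_state():
    return {"active": False, "stepIndex": 0, "answers": {}, "status": "collecting"}


def ask(state):
    """Return the prompt for the current field, or the confirmation prompt."""
    i = state["stepIndex"]
    if i >= len(FIELDS):
        state["status"] = "confirming"
        return "All fields collected. Type confirm to submit or back to edit."
    return f"Step {i + 1}/{len(FIELDS)}: {FIELDS[i]['prompt']}"


def handle_message(state, text):
    """One wizard turn: take the user's message, return the agent's reply."""
    text = text.strip()
    if not state["active"]:                      # 1. not active: start
        state.update(active=True, stepIndex=0, answers={}, status="collecting")
        return ask(state)
    if text.lower() in CONTROL:                  # 2. control command (stub)
        return f"(control command '{text.lower()}' would be handled here)"
    if state["status"] != "collecting":          # confirmation is a separate step
        return "Waiting for confirm or back."
    fld = FIELDS[state["stepIndex"]]             # 3. treat message as the answer
    ok = bool(text) or not fld["required"]       # 4. validate
    if text and fld.get("validate") == "email":
        ok = EMAIL_RE.match(text) is not None
    if not ok:                                   # 5. invalid: re-ask same field
        return f"That doesn't look valid. {fld['prompt']}"
    state["answers"][fld["key"]] = text          # 6. valid: store and advance
    state["stepIndex"] += 1
    return ask(state)
```

Note that the message that starts the wizard is consumed as the trigger rather than as an answer, so the first real answer is the second message.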
3) Control commands (you want these, trust me)
In chat wizards, users will make mistakes. Add these:
- back → go to previous field (don’t lose answers unless overwritten)
- skip → set value to null/empty (only if field not required)
- restart → clear answers + start at step 0
- cancel → exit wizard
- status → show “you’re on step X/Y, current answers: …”
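Here is one hedged way those commands could be applied to the wizard state. The function name apply_command and the dict shapes for state and fields are assumptions for this sketch, not anything OpenClaw defines.

```python
def apply_command(state, fields, cmd):
    """Apply one control command in place and return a short reply.

    `state` is a dict with "active", "stepIndex" and "answers"; `fields` is a
    list of dicts with "key" and "required". Both shapes are assumptions.
    Returns None when `cmd` is not a control command.
    """
    i, total = state["stepIndex"], len(fields)
    if cmd == "back":
        state["stepIndex"] = max(0, i - 1)        # earlier answers are kept
        return f"Back to step {state['stepIndex'] + 1}/{total}."
    if cmd == "skip":
        if i >= total:
            return "Nothing to skip."
        if fields[i]["required"]:
            return "This field is required and can't be skipped."
        state["answers"][fields[i]["key"]] = None  # explicit blank
        state["stepIndex"] = i + 1
        return "Skipped."
    if cmd == "restart":
        state["answers"].clear()
        state["stepIndex"] = 0
        return "Restarted at step 1."
    if cmd == "cancel":
        state["active"] = False
        return "Wizard cancelled."
    if cmd == "status":
        step = min(i + 1, total)
        return f"You're on step {step}/{total}, current answers: {state['answers']}"
    return None  # caller treats the text as a normal answer
```

Returning None for anything that is not a command lets the caller fall through to the regular "treat as answer" branch.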
4) Confirmation step (prevents bad submissions)
When all fields are filled:
- Agent prints a compact summary
- Asks: “Type confirm to submit or back to edit”
- Only on confirm does it run the “submit” action (API call, file write, etc.)
This is where you avoid “oops I typed the wrong email and it already did the thing”.
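A small sketch of that gate, assuming the same dict-shaped state as before. summarize and handle_confirmation are hypothetical names, and submit stands in for whatever your real action is (API call, file write, etc.).

```python
def summarize(answers):
    """Build the compact summary shown before submission."""
    lines = [f"- {k}: {v if v not in (None, '') else '(blank)'}" for k, v in answers.items()]
    return "Summary:\n" + "\n".join(lines) + "\nType confirm to submit or back to edit."


def handle_confirmation(state, text, submit):
    """Run `submit` only on an explicit 'confirm'.

    `submit` is a caller-supplied callable that receives a copy of the answers.
    """
    word = text.strip().lower()
    if state["status"] != "confirming":
        return "Nothing to confirm yet."
    if word == "confirm":
        submit(dict(state["answers"]))      # the side effect happens only here
        state["status"] = "submitted"
        return "Submitted."
    if word == "back":
        state["status"] = "collecting"
        state["stepIndex"] = max(0, state["stepIndex"] - 1)
        return "OK, back to editing."
    return summarize(state["answers"])      # anything else: show the summary again
```

Because the status flips to "submitted" right after the call, a second confirm cannot trigger the action twice.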
5) How to make it reliable in Discord channels
Two practical gotchas in guild channels:
A) Multiple people can talk
So the wizard should lock to the initiating user:
- “I’m running this wizard for <user>; others can still chat but I’ll ignore inputs that look like form answers.”
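The lock itself can be a one-line check in front of the wizard handler. This is only a sketch: should_handle and route are hypothetical names, and how the author id reaches your agent depends on the channel integration.

```python
def should_handle(state, author_id):
    """True only if a wizard is active and the message comes from its owner."""
    return bool(state.get("active")) and state.get("ownerUserId") == author_id


def route(state, author_id, text, handle_answer):
    """Pass the message to the wizard only when it comes from the owner."""
    if not should_handle(state, author_id):
        return None                      # not a wizard input; normal chat continues
    return handle_answer(state, text)
```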
B) Keep the state explicit and small
LLMs are better at being consistent if you keep a short “state recap” in each prompt, e.g.:
Step 2/5 (email): reply with your email, or type back / cancel.
That reduces drift and makes the flow resilient even if the conversation gets noisy.
6) Example transcript (plain chat wizard)
Agent: Starting signup wizard. Step 1/3: Full name? (or cancel)
User: Jane Doe
Agent: Step 2/3: Email? (back to edit name)
User: jane@
Agent: That doesn’t look like a full email. Please enter something like jane@example.com.
User: jane@example.com
Agent: Step 3/3: Company? (skip to leave blank)
User: skip
Agent: Summary:
- fullName: Jane Doe
- email: jane@example.com
- company: (blank)
Type confirm to submit or back to edit.
User: confirm
Agent: Submitted. Want to start another? (restart)
If you tell me what you’re submitting to at the end (API? create a ticket? write a JSON file?), I can show the cleanest “collect → validate → confirm → execute” pattern for that.
I'm talking about a case where the form fields are not known in advance.
If the fields aren’t known up front, the wizard still works — you just need a field discovery step each turn. An LLM can’t “guess” fields reliably; it has to get them from some source at runtime.
There are basically 3 ways to do that:
1) Human-in-the-loop discovery (most robust)
You make the wizard generic:
Loop:
- Agent: “What’s the next field label / question shown on the form? (paste it)”
- User pastes: “Billing ZIP code (required)”
- Agent: “OK — Billing ZIP code: what should I enter?”
- User answers
- Repeat until user says done
State is just an ever-growing list like:
{ label, value, notes/constraints, required? }
This is the simplest architecture when you truly don’t have programmatic access to the form definition.
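As a rough sketch, that loop is a two-phase alternation: wait for a label, then wait for its value. The names new_session and step, and the naive "(required)" check on the pasted label, are assumptions made for this example.

```python
def new_session():
    return {"phase": "await_label", "pending_label": None, "entries": [], "done": False}


def step(session, text):
    """Advance the label/value alternation by one user message; return the reply."""
    text = text.strip()
    if session["phase"] == "await_label":
        if text.lower() == "done":                 # stopping condition
            session["done"] = True
            return f"Collected {len(session['entries'])} field(s)."
        session["pending_label"] = text
        session["phase"] = "await_value"
        return f"OK, {text}: what should I enter?"
    # phase == "await_value": pair the answer with the label pasted just before
    label = session["pending_label"]
    session["entries"].append({
        "label": label,
        "value": text,
        "required": "(required)" in label.lower(),  # naive hint from the pasted label
    })
    session["pending_label"] = None
    session["phase"] = "await_label"
    return "Saved. What's the next field label shown on the form? (or done)"
```

The word done is only treated as a stop signal while waiting for a label, so a field whose value happens to be "done" is still stored.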
2) Agent reads the form each step (UI-driven discovery)
If the form is actually visible somewhere (web page / screenshot), the wizard can do:
Loop:
- Take a snapshot of the UI (DOM or screenshot)
- Extract “current field label + type + required + any error message”
- Ask you for that field’s value
- Apply it (type/select) and continue
In OpenClaw terms that usually looks like:
- DOM-based (best if you control the browser):
browser.snapshot(or similar) to see inputs/labels - Pixel-based: take screenshot (
browser.screenshot/canvas.snapshot/ etc.) then runimageto interpret it (“what field is being asked right now?”)
This gives you “unknown fields” but discoverable from the UI.
3) Fetch the schema from the source system (data-driven discovery)
If the form comes from a backend that can describe itself (common in enterprise apps):
- call an API to get the form definition (JSON Schema, OpenAPI, custom “form config”)
- then the wizard becomes deterministic: iterate over returned fields
This is the cleanest architecture if you can get it, because you can validate types, required fields, allowed values, etc.
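For the JSON Schema case, turning the definition into wizard fields can be very little code. The sketch below assumes a flat object schema and reads only the standard properties, required, title, description, type, and enum keywords; the function name and the output field shape are made up for illustration.

```python
def fields_from_json_schema(schema):
    """Flatten a simple object-type JSON Schema into wizard fields.

    Only top-level "properties" and "required" are read; nested objects,
    $ref, and combinators (oneOf/anyOf) are out of scope for this sketch.
    """
    required = set(schema.get("required", []))
    fields = []
    for key, spec in schema.get("properties", {}).items():
        fields.append({
            "key": key,
            "prompt": spec.get("title") or spec.get("description") or key,
            "required": key in required,
            "type": spec.get("type", "string"),
            "choices": spec.get("enum"),   # None when any value is allowed
        })
    return fields


# Example schema, invented for the demo
schema = {
    "type": "object",
    "required": ["zip"],
    "properties": {
        "zip": {"type": "string", "title": "Billing ZIP code"},
        "plan": {"enum": ["free", "pro"]},
    },
}
fields = fields_from_json_schema(schema)
```

The resulting list has the same shape as a hand-written schema, so the rest of the wizard does not need to know where the fields came from.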
The key design change vs. “known fields”
With unknown fields, each turn is:
(discover next field) → (ask user) → (validate/store) → (advance)
So you need:
- a stopping condition: user says done, or UI shows “Review/Submit”, or schema has no more required fields
- idempotent editing: “back”, “change <field>”, “show current answers”
- owner locking in a guild channel (only accept answers from the user who started the wizard)
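Put together, the per-turn cycle can be sketched with the discovery step as a pluggable callable, so the same loop works for human-described, UI-driven, or schema-driven forms. Everything here is hypothetical (run_wizard, discover_next, ask_user), and it is written as a blocking loop for readability; in a real chat integration each ask_user call spans two turns.

```python
def run_wizard(discover_next, ask_user, validate=lambda field, value: True, max_retries=3):
    """discover -> ask -> validate/store -> advance, until discovery returns None.

    discover_next(answers) -> dict with "key" and "prompt", or None when finished
    ask_user(prompt)       -> the user's reply as a string
    Both callables are placeholders for whatever the real integration provides.
    """
    answers = {}
    while True:
        field = discover_next(answers)
        if field is None:                      # stopping condition
            return answers
        for _ in range(max_retries):
            value = ask_user(field["prompt"]).strip()
            if validate(field, value):
                answers[field["key"]] = value  # store, then discover the next field
                break
        else:
            # every retry failed validation
            raise ValueError(f"no valid value for {field['key']}")
```

Passing the answers collected so far into discover_next lets the discovery step decide what comes next, which is what conditional forms need.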
If you tell me where the form lives (web page you can automate? external system with an API? or purely human-described), I can recommend which of the 3 discovery strategies fits best.