#Multi-VPS orchestration
Your split (research / dev / orchestrator) is a sane separation-of-concerns model. The main traps aren’t the “agent roles” themselves; they are where state lives, how you move untrusted text around, and what becomes a single point of failure.
The obvious failure points in your design
- Orchestrator as SPOF
- If it’s truly “central control”, then making it “never go down” means you’re doing HA, backups, and recovery drills — not just prompt-hardening.
- If the orchestrator can also “modify agent configs/files” and “provision new VPS”, you’ve concentrated the blast radius there.
- “Shared research knowledge base” turns into distributed systems work
- Consistency (who overwrote what?), provenance (what source?), and latency (sync frequency) get painful fast if it’s file-based.
- Recommendation: treat it as a real service (Postgres + object storage; optionally vector index) with append-only writes + source citations, rather than “syncing agent folders”.
- Prompt injection is mostly a data flow problem
- The key is: external content must not become “instructions” for the orchestrator.
- Concretely: orchestrator should only accept structured outputs (schemas), and do non-LLM validation where possible (type checks, allowlisted actions, policy engine).
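The "structured outputs + non-LLM validation" idea can be sketched roughly as below. This is a minimal illustration in plain Python, not an OpenClaw API: the field names (`action`, `task_id`, `summary`), the action names in `ALLOWED_ACTIONS`, and the length cap are all assumptions made up for the example.

```python
import json
from dataclasses import dataclass

# Hypothetical action names; the real allowlist depends on your deployment.
ALLOWED_ACTIONS = {"store_finding", "request_build", "notify_user"}
# Arbitrary cap chosen for the example.
MAX_SUMMARY_CHARS = 2000
EXPECTED_FIELDS = {"action", "task_id", "summary"}


@dataclass(frozen=True)
class WorkerResult:
    action: str
    task_id: str
    summary: str


def parse_worker_result(raw: str) -> WorkerResult:
    """Validate untrusted worker output with plain code, no LLM involved."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    # Reject both missing and extra fields so nothing rides along unnoticed.
    if set(data) != EXPECTED_FIELDS:
        raise ValueError(f"unexpected fields: {sorted(set(data) ^ EXPECTED_FIELDS)}")
    for key in EXPECTED_FIELDS:
        if not isinstance(data[key], str):
            raise ValueError(f"{key} must be a string")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowlisted: {data['action']!r}")
    if len(data["summary"]) > MAX_SUMMARY_CHARS:
        raise ValueError("summary too long")
    return WorkerResult(**data)
```

The point of the design is that the orchestrator dispatches on `action` from a fixed set; the free-text `summary` is stored or displayed, never interpreted as instructions.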
What OpenClaw already gives you (relevant to this)
If you’re building this on OpenClaw, you can get a lot of the isolation you want without multiple VPS gateways:
1) Multiple isolated agents in one gateway
Each agent can have its own workspace + auth + sessions, and you route inbound messages to them via bindings.
2) Per-agent tool restrictions + per-agent sandboxing (Docker)
This is the “don’t let the orchestrator get owned” lever: make the orchestrator agent messaging/scheduling-only (no browser, no exec, no write), and sandbox the risky ones.
3) “Workers on other VPS” via Nodes (remote execution)
Instead of “one VPS per agent”, a common pattern is:
- 1 gateway (control plane)
- many node hosts (execution plane) on other VPS machines
So your dev agent can run exec on a remote build node, etc.
- Docs: https://docs.openclaw.ai/nodes
This doesn’t remove the risk that “a misconfiguration in the gateway stops all agents”: you still have one gateway, you still have to run multiple agents safely inside it, and you don’t get clustered gateways. What it does give you is compute location decoupled from the control plane.
Docker-per-agent vs VPS-per-agent
- Docker-per-agent (on one machine) is good for security boundaries and “this agent’s dependencies broke”, but it’s still one host + one gateway process unless you run multiple gateways.
- VPS-per-agent is stronger fault isolation, but you pay in ops overhead and you still need a robust comms layer (queue + durable storage) to avoid brittle “agent-to-agent chat”.
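To make "queue + durable storage" less abstract, here is a minimal sketch of a durable task queue with leases. It uses SQLite from the Python standard library purely as a stand-in so the example runs anywhere; the table layout, function names, and lease length are assumptions for illustration, and a multi-VPS setup would use a networked store instead.

```python
import sqlite3
import time


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               lease_expires REAL NOT NULL DEFAULT 0
           )"""
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO tasks (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim(conn: sqlite3.Connection, lease_seconds: float = 60.0, now=None):
    """Claim the oldest task that is pending or whose lease has expired.

    Single-process sketch only: with several workers on Postgres you would
    typically claim rows with SELECT ... FOR UPDATE SKIP LOCKED instead.
    """
    now = time.time() if now is None else now
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM tasks "
            "WHERE status = 'pending' "
            "   OR (status = 'claimed' AND lease_expires < ?) "
            "ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET status = 'claimed', lease_expires = ? WHERE id = ?",
            (now + lease_seconds, row[0]),
        )
    return row


def complete(conn: sqlite3.Connection, task_id: int) -> None:
    with conn:
        conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
```

The lease is what replaces brittle "agent-to-agent chat": if a worker VPS dies mid-task, the task becomes claimable again once the lease runs out, and nothing depends on the two agents being up at the same time.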
A pragmatic architecture that usually works
Control plane (small, hardened):
- Orchestrator agent: no web browsing, no exec, no write. Only coordinates + validates.
- Durable DB/queue: Postgres + Redis/NATS/SQS etc.
Worker plane (expendable):
- Research worker(s): browse + extract + write to KB service (append-only).
- Dev worker(s): exec/build/test on isolated nodes/containers; produce artifacts + logs; never directly edit orchestrator config.
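As a rough sketch of what "append-only writes + source citations" could look like for the KB service: the example below again uses SQLite as a local stand-in for Postgres, and the `findings` table, its columns, and the helper names are hypothetical. Triggers reject any UPDATE or DELETE, so a worker can only add rows, and every row must carry a source URL.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS findings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent TEXT NOT NULL,
    source_url TEXT NOT NULL,
    content TEXT NOT NULL,
    created_at REAL NOT NULL
);
CREATE TRIGGER IF NOT EXISTS findings_no_update
BEFORE UPDATE ON findings
BEGIN SELECT RAISE(ABORT, 'findings is append-only'); END;
CREATE TRIGGER IF NOT EXISTS findings_no_delete
BEFORE DELETE ON findings
BEGIN SELECT RAISE(ABORT, 'findings is append-only'); END;
"""


def open_kb(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_finding(conn: sqlite3.Connection, agent: str, source_url: str, content: str) -> int:
    """Append one finding; provenance (agent + source URL) is mandatory."""
    if not source_url.startswith(("http://", "https://")):
        raise ValueError("every finding needs an http(s) source URL")
    with conn:
        cur = conn.execute(
            "INSERT INTO findings (agent, source_url, content, created_at) "
            "VALUES (?, ?, ?, ?)",
            (agent, source_url, content, time.time()),
        )
    return cur.lastrowid
```

Corrections are then new rows rather than overwrites, which answers "who overwrote what?" by construction. In a real Postgres deployment you would more likely enforce this with role permissions (INSERT and SELECT only) than with triggers.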
If you want “auto-provision new VPS”, I’d strongly separate that into a non-LLM provisioner service that only accepts a tiny allowlisted job spec (instance type, image, tags), ideally signed.
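A minimal sketch of that provisioner-side check, assuming a shared-secret HMAC as the signature scheme (one option among several; asymmetric signatures would also work). The allowlist values and function names are invented for the example, and the actual cloud API call is deliberately left out.

```python
import hashlib
import hmac
import json

# Hypothetical allowlists; substitute whatever your provider and policy permit.
ALLOWED_INSTANCE_TYPES = {"small", "medium"}
ALLOWED_IMAGES = {"worker-base-v1"}
ALLOWED_TAGS = {"research", "dev"}
SPEC_FIELDS = {"instance_type", "image", "tags"}


def canonical(spec: dict) -> bytes:
    """Serialize deterministically so signer and verifier hash the same bytes."""
    return json.dumps(spec, sort_keys=True, separators=(",", ":")).encode()


def sign_job(spec: dict, key: bytes) -> str:
    return hmac.new(key, canonical(spec), hashlib.sha256).hexdigest()


def verify_job(spec: dict, signature: str, key: bytes) -> dict:
    """Return the spec only if it is correctly signed and fully allowlisted."""
    expected = sign_job(spec, key)
    # Constant-time comparison to avoid leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        raise PermissionError("bad signature")
    if set(spec) != SPEC_FIELDS:
        raise ValueError("job spec must contain exactly instance_type, image, tags")
    if not isinstance(spec["instance_type"], str) or spec["instance_type"] not in ALLOWED_INSTANCE_TYPES:
        raise ValueError("instance_type not allowlisted")
    if not isinstance(spec["image"], str) or spec["image"] not in ALLOWED_IMAGES:
        raise ValueError("image not allowlisted")
    tags = spec["tags"]
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("tags must be a list of strings")
    if not set(tags) <= ALLOWED_TAGS:
        raise ValueError("tag not allowlisted")
    return spec
```

The signature only proves the request came from a holder of the key; the allowlist is what limits the damage if the signer itself is compromised, so both checks are kept.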
Quick questions (to tailor advice)
- Are your agents exposed to untrusted users (public Discord/Telegram), or is this internal-only?