#Multi-VPS orchestration


wide charmBOT

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

tawdry girderBOT

Your split (research / dev / orchestrator) is a sane separation-of-concerns model. The main traps aren’t the “agent roles” themselves; they are where state lives, how you move untrusted text around, and what becomes a single point of failure.

The obvious failure points in your design

  1. Orchestrator as SPOF
  • If it’s truly “central control”, then making it “never go down” means you’re doing HA, backups, and recovery drills — not just prompt-hardening.
  • If the orchestrator can also “modify agent configs/files” and “provision new VPS”, you’ve concentrated the blast radius there.
  2. “Shared research knowledge base” turns into distributed systems work
  • Consistency (who overwrote what?), provenance (what source?), and latency (sync frequency) get painful fast if it’s file-based.
  • Recommendation: treat it as a real service (Postgres + object storage; optionally vector index) with append-only writes + source citations, rather than “syncing agent folders”.
  3. Prompt injection is mostly a data-flow problem
  • The key is: external content must not become “instructions” for the orchestrator.
  • Concretely: orchestrator should only accept structured outputs (schemas), and do non-LLM validation where possible (type checks, allowlisted actions, policy engine).
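
To make the "structured outputs + non-LLM validation" point concrete, here is a minimal sketch of what the orchestrator-side check could look like. The field names (`action`, `task_id`, `summary`), the action names, and the length limit are all hypothetical and not part of OpenClaw; the point is only that the worker's text gets parsed and checked by plain code before anything acts on it.

```python
import json
from dataclasses import dataclass

# Hypothetical allowlist; these action names are illustrative only.
ALLOWED_ACTIONS = {"store_finding", "request_build", "report_status"}
EXPECTED_FIELDS = {"action", "task_id", "summary"}
MAX_SUMMARY_CHARS = 2000  # assumed limit


@dataclass(frozen=True)
class WorkerMessage:
    action: str
    task_id: str
    summary: str


def parse_worker_message(raw: str) -> WorkerMessage:
    """Validate a worker's output with plain code (no LLM involved).

    Raises ValueError for anything outside the expected schema.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    if set(data) != EXPECTED_FIELDS:
        odd = sorted(set(data) ^ EXPECTED_FIELDS)
        raise ValueError(f"unexpected or missing fields: {odd}")
    for key in EXPECTED_FIELDS:
        if not isinstance(data[key], str):
            raise ValueError(f"{key} must be a string")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowlisted: {data['action']!r}")
    if len(data["summary"]) > MAX_SUMMARY_CHARS:
        raise ValueError("summary too long")
    return WorkerMessage(**data)
```

The `summary` field is still untrusted text after this check. It should be stored or displayed as data, not fed back to the orchestrator as instructions.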

What OpenClaw already gives you (relevant to this)

If you’re building this on OpenClaw, you can get a lot of the isolation you want without multiple VPS gateways:

1) Multiple isolated agents in one gateway

Each agent can have its own workspace, auth, and sessions, and you route inbound messages to them via bindings.

2) Per-agent tool restrictions + per-agent sandboxing (Docker)

This is the “don’t let the orchestrator get owned” lever: make the orchestrator agent messaging/scheduling-only (no browser, no exec, no write), and sandbox the risky ones.

3) “Workers on other VPS” via Nodes (remote execution)

Instead of “one VPS per agent”, a common pattern is:

  • 1 gateway (control plane)
  • many node hosts (execution plane) on other VPS machines
    So your dev agent can run exec on a remote build node, etc.
  • Docs: https://docs.openclaw.ai/nodes

This only reduces the risk of “a misconfiguration in the gateway stops all agents” if you still run the agents in that gateway safely. It doesn’t give you clustered gateways, but it does decouple compute location from the control plane.

Docker-per-agent vs VPS-per-agent

  • Docker-per-agent (on one machine) is good for security boundaries and “this agent’s dependencies broke”, but it’s still one host + one gateway process unless you run multiple gateways.
  • VPS-per-agent gives stronger fault isolation, but you pay in ops overhead and you still need a robust comms layer (queue + durable storage) to avoid brittle “agent-to-agent chat”.
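
As a rough illustration of "queue + durable storage" instead of agent-to-agent chat, here is a small job queue with leases. It uses SQLite purely as a stand-in so the sketch runs anywhere; the table layout and the function names (`enqueue`, `claim`, `ack`) are assumptions for illustration. With several workers on a real Postgres you would claim rows with something like `SELECT ... FOR UPDATE SKIP LOCKED` rather than this single-connection version.

```python
import sqlite3
import time


def open_queue(path=":memory:"):
    """Open (or create) the job table. Pass a file path for durability."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               lease_expires REAL NOT NULL DEFAULT 0
           )"""
    )
    conn.commit()
    return conn


def enqueue(conn, payload):
    """Add a job and return its id."""
    with conn:
        cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim(conn, lease_seconds=60.0, now=None):
    """Claim the oldest job that is pending or whose lease has expired.

    Returns (id, payload) or None when nothing is available.
    """
    now = time.time() if now is None else now
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM jobs "
            "WHERE status = 'pending' "
            "   OR (status = 'claimed' AND lease_expires < ?) "
            "ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE jobs SET status = 'claimed', lease_expires = ? WHERE id = ?",
            (now + lease_seconds, row[0]),
        )
    return row


def ack(conn, job_id):
    """Mark a job as finished so it is never handed out again."""
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
```

The lease is what makes workers expendable: if a worker VPS dies mid-job, the job becomes claimable again once the lease runs out, with no agent having to notice or "tell" anyone.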

A pragmatic architecture that usually works

Control plane (small, hardened):

  • Orchestrator agent: no web browsing, no exec, no write. Only coordinates + validates.
  • Durable DB/queue: Postgres + Redis/NATS/SQS etc.

Worker plane (expendable):

  • Research worker(s): browse + extract + write to KB service (append-only).
  • Dev worker(s): exec/build/test on isolated nodes/containers; produce artifacts + logs; never directly edit orchestrator config.
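
For the "write to KB service (append-only)" part, a minimal sketch of what append-only with provenance could mean at the storage level. SQLite stands in for Postgres here so it is runnable as-is; the `findings` table, its columns, and the function names are hypothetical. Triggers reject UPDATE and DELETE, and every row must carry a source URL.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per finding, never modified after insert.
SCHEMA = """
CREATE TABLE IF NOT EXISTS findings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent TEXT NOT NULL,
    source_url TEXT NOT NULL,
    content TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TRIGGER IF NOT EXISTS findings_no_update
BEFORE UPDATE ON findings
BEGIN
    SELECT RAISE(ABORT, 'findings is append-only');
END;
CREATE TRIGGER IF NOT EXISTS findings_no_delete
BEFORE DELETE ON findings
BEGIN
    SELECT RAISE(ABORT, 'findings is append-only');
END;
"""


def open_kb(path=":memory:"):
    """Open the knowledge base and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def append_finding(conn, agent, source_url, content):
    """Insert one finding with its provenance; returns the new row id."""
    if not source_url.startswith(("http://", "https://")):
        raise ValueError("every finding needs a source URL")
    created_at = datetime.now(timezone.utc).isoformat()
    with conn:
        cur = conn.execute(
            "INSERT INTO findings (agent, source_url, content, created_at) "
            "VALUES (?, ?, ?, ?)",
            (agent, source_url, content, created_at),
        )
    return cur.lastrowid
```

Corrections then become new rows that reference older ones, which answers "who overwrote what?" by construction: nobody did.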

If you want “auto-provision new VPS”, I’d strongly separate that into a non-LLM provisioner service that only accepts a tiny allowlisted job spec (instance type, image, tags), ideally signed.
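
A sketch of what that provisioner's front door might look like, assuming an HMAC-SHA256 signature over a canonical JSON encoding of the spec. The allowlisted values, the tag limit, and the function names are made up for illustration; the actual cloud API call is deliberately left out.

```python
import hashlib
import hmac
import json

# Hypothetical allowlists; replace with whatever your provider offers.
ALLOWED_INSTANCE_TYPES = {"small", "medium"}
ALLOWED_IMAGES = {"worker-base-v1"}
ALLOWED_KEYS = {"instance_type", "image", "tags"}
MAX_TAGS = 5  # assumed limit


def canonical(spec):
    """Stable byte encoding so signer and verifier hash the same thing."""
    return json.dumps(spec, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_spec(spec, key):
    """Return the hex HMAC-SHA256 of the spec under a shared secret key."""
    return hmac.new(key, canonical(spec), hashlib.sha256).hexdigest()


def verify_and_validate(spec, signature, key):
    """Check the signature first, then the allowlist. Raises ValueError."""
    if not isinstance(spec, dict) or not isinstance(signature, str):
        raise ValueError("malformed request")
    expected = sign_spec(spec, key)
    # Constant-time comparison to avoid leaking signature prefixes.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        raise ValueError("bad signature")
    if set(spec) != ALLOWED_KEYS:
        raise ValueError("spec must contain exactly: instance_type, image, tags")
    if spec["instance_type"] not in ALLOWED_INSTANCE_TYPES:
        raise ValueError("instance type not allowlisted")
    if spec["image"] not in ALLOWED_IMAGES:
        raise ValueError("image not allowlisted")
    tags = spec["tags"]
    if (
        not isinstance(tags, list)
        or len(tags) > MAX_TAGS
        or not all(isinstance(t, str) for t in tags)
    ):
        raise ValueError("tags must be a short list of strings")
    return spec
```

The signing key lives with whatever non-LLM component is allowed to approve provisioning, not with any agent. That way a prompt-injected agent can ask for a VPS but cannot mint a valid request on its own.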

Quick questions (to tailor advice)

  1. Are your agents exposed to untrusted users (public Discord/Telegram), or is this internal-only?