I need planning help for arch. I have dedicated gateway (ubuntu) and local LLM on separate ubuntu host. I have been using vllm and llama.cpp as-needed to run individual models directly, but not automatically. How can the gateway orchestrate loading appropriate models for the sub-agents as needed? My concern is if something like unsloth/DeepSeek is loaded in mem, then that will eat up some resources while not in use, but keep the model hot to reduce bootstrap time. Can gateway orchestrate spinning up/down models as needed for rarely used models, or when thinking is appropriate? What's the general approach here?
#Local runner arch
1 messages · Page 1 of 1 (latest)
Local runner arch
OpenClaw can select between models/providers, but it currently doesn’t orchestrate the lifecycle of your local inference processes (vLLM / llama.cpp) — i.e. it won’t start/stop them, load/unload weights, or manage VRAM. From the Gateway’s POV, a “model” is just an API target.
What it can do today (supported patterns)
- Route work to different agents with different default models (multi-agent routing). Each agent can have its own
agents.list[].model. Docs: https://docs.openclaw.ai/concepts/multi-agent - Spawn sub-agents / offload heavy work and set different defaults for them. (FAQ calls this out as a core pattern.) https://docs.openclaw.ai/help/faq (see “How can I use different models for different tasks?”)
- On-demand switch in a session via
/model(no restart). https://docs.openclaw.ai/concepts/models - Fallbacks if a provider/model is down (
agents.defaults.model.fallbacks). https://docs.openclaw.ai/concepts/models + https://docs.openclaw.ai/concepts/model-failover
How to wire your “LLM host” into the Gateway
- Run an OpenAI-compatible server on the LLM box (vLLM OpenAI server, llama.cpp server, LM Studio, LiteLLM, etc.).
- Point a custom provider at it via
models.providers.<name>.baseUrl, then reference models asprovider/model.
Docs + examples for local proxies: https://docs.openclaw.ai/concepts/model-providers
The “hot vs cold model” problem (what to do in practice)
- If you want models to load/unload based on usage, that needs to be handled by the inference runtime (some have idle-unload/keep-warm knobs) or an external orchestrator (systemd/docker-compose/k8s) plus optionally a small proxy that provides a stable
/v1endpoint and spins containers up/down behind it. - A very common setup is:
- keep one “fast default” model always hot (general chat / routing / light tool use)
- have a “deep/thinking” agent (or manual
/modelswitch) that hits a heavier model that may cold-start - optionally configure a cloud fallback so you still get replies if the heavy local model isn’t running
If you want “Gateway-managed model spin-up/down” as a first-class feature, that’s not something I’ve seen in OpenClaw yet — worth a GitHub feature request with your target behavior (idle timeout, max concurrent models, warm pool size, etc.): https://github.com/openclaw/openclaw/issues
If you tell me what your LLM host exposes today (vLLM OpenAI server? llama.cpp --server? ports/model IDs), I can suggest a clean models.providers layout + an agent split (fast vs deep) that matches your usage.
For a “local runner arch” with a dedicated Gateway + separate Ubuntu LLM box, the key constraint is:
OpenClaw doesn’t manage vLLM/llama.cpp lifecycle (load/unload/VRAM) for you — it only routes requests to whatever /v1 API endpoint you expose. The local-models doc even explicitly recommends “keep the model loaded; cold-load adds startup latency”, which is basically admitting “we’re not orchestrating that part”: https://docs.openclaw.ai/gateway/local-models
A solid architecture looks like this:
1) Make the LLM host look like providers (stable HTTP APIs)
- Run one OpenAI-compatible endpoint per model (e.g. vLLM per model on different ports), or
- Put LiteLLM (or similar) in front as a “model router” and have it route
model=to the right backend.
Then register those endpoints in OpenClaw via models.providers (custom provider baseUrl) and list the model IDs you want it to call:
https://docs.openclaw.ai/concepts/model-providers
(see “Local proxies (LM Studio, vLLM, LiteLLM, etc.)”)
2) Use OpenClaw for routing policy (fast vs deep), not process control
Supported patterns:
- Two agents: “fast chat” agent → small/cheap local model; “deep work” agent → big model. Route via bindings.
Docs: https://docs.openclaw.ai/concepts/multi-agent
- Manual upshift with
/modelwhen you want thinking. - Fallbacks so if the heavy local endpoint is down/cold, it can fall back to another model/provider:
https://docs.openclaw.ai/concepts/models + https://docs.openclaw.ai/concepts/model-failover
3) If you truly need hot/cold models: do it outside OpenClaw
Pick one:
- inference-runtime features (if your stack supports idle unload / keep-warm / max loaded models)
- systemd/docker-compose/K8s orchestration on the LLM host (start/stop workers, one worker per model)
- optional “keep-warm” pings (cron outside OpenClaw, or a tiny healthcheck loop) if you decide certain models should stay hot
If you tell me:
- GPU(s) + VRAM on the LLM host
- which models you want “always hot” vs “rarely used”
- whether you’re okay with a LiteLLM layer
…I can suggest the cleanest split (ports/providers + agent model defaults + fallback ordering) for your setup.
the gpus/vram is unimportant, and conf is not an issue. Ok, i get it, multi-vllm etc. Is there a lightweight runner that is well suited for running multiple models simultaneously (e.g. expose available models via endpoint call) instead of separate ports (or proxy) for multiple llms? open-webui was ok, but overkill. My ideal is to spin-up vllm-like runner with multiple models defined and for the runner to do light orchestration based on conf and usage patterns
Yeah — if your goal is “one endpoint, multiple base models, /v1/models lists them, and it auto-loads/unloads based on use”, there’s an annoying truth:
vLLM itself can’t do multi-base-model serving on a single OpenAI port (it explicitly says “not supported; run multiple instances and put a router in front”). Source (vLLM FAQ): https://docs.vllm.ai/en/v0.6.0/serving/faq.html
So you basically have 3 realistic “lightweight” paths:
Option A: Ollama (simplest “one endpoint, many models, basic hot/cold”)
- One daemon, many models, has a model registry + “keep alive” style behavior (so you can keep hot models hot, let rare ones unload).
- Exposes model listing (and nowadays also OpenAI-ish endpoints in many setups).
- Tradeoff: not vLLM throughput/features; but for “GPU/VRAM unimportant” + convenience, it’s the cleanest.
Option B: LocalAI (single OpenAI-compatible server, multiple models defined in a config)
- Purpose-built for “define multiple models, serve them under one OpenAI-compatible API, list them”.
- Much closer to your “vLLM-like runner with multiple models defined” idea than Open WebUI is.
- Still: the “smart orchestration based on usage patterns” is usually basic (load/unload/limits), not magical.
Option C (most common in practice): LiteLLM as the thin “model router” in front
- I know you said “instead of proxy”, but conceptually the thing you’re asking for is a proxy/router (it’s just a local one).
- LiteLLM is way lighter than Open WebUI, and gives you a single
/v1+ a clean model list, while you keep best-of-breed runners behind it (multiple vLLM instances, llama.cpp servers, etc.).
How this fits OpenClaw: OpenClaw is happy as long as there’s one OpenAI-compatible baseUrl to point at (and you can declare multiple models for that provider in models.providers....models). Docs/examples: https://docs.openclaw.ai/concepts/model-providers
If you tell me which direction you prefer (Ollama vs LocalAI vs “fine, LiteLLM”), I can sketch the cleanest “single provider in OpenClaw → multiple model IDs → fast/deep agents” layout.
ok. perf on ollama is too big of a tradeoff. the other solutions can't orchestrate, so I guess that leaves vllm/llama.cpp as-is, and I'll just need to diy vram as managed resource.
That’s basically where most people land today if they care about vLLM/llama.cpp perf: treat VRAM like a schedulable resource and run 1 server process per base model.
A couple pragmatic patterns that keep the DIY part pretty small:
-
Hard-partition the GPU (best if you can): if you’re on NVIDIA A/H-class, MIG turns “VRAM as managed resource” into “VRAM as fixed slices”, so multiple model servers can coexist without stepping on each other. (If you can’t MIG, still consider pinning with
CUDA_VISIBLE_DEVICES+ strict per-service limits.) -
Cap vLLM’s footprint per model: run each model with a conservative
--gpu-memory-utilization(and/or tighter--max-model-len, etc.) so one model can’t eat the whole card just because its KV cache grows. -
systemd (or docker-compose) per model + a tiny “manager”:
-
deepseek.service,qwen.service, etc. with fixed ports -
a small script that decides “is there enough free VRAM? start/stop X”
-
if you want auto spin-up on first request, you usually end up with a thin router anyway (even a 50-line Python reverse proxy) because something has to accept the initial request while the backend boots.
-
OpenClaw side stays simple: just register each backend as its own provider/baseUrl (different ports), then use multi-agent routing (fast agent vs deep agent) and/or
/modelto pick which one to hit. OpenClaw won’t care that they’re separate processes as long as they look like OpenAI-style/v1.
If your “multiple models” are actually one base + LoRA adapters, that’s the one case where you can keep a single vLLM instance and switch adapters (much nicer than juggling multiple base models). Otherwise, yeah: separate processes + VRAM scheduler is the clean path.
If you want, paste the rough set of models you want to rotate between (just names + sizes/quant), and whether you prefer systemd or containers, and I’ll suggest an orchestration layout that won’t fight you later.
that's ok, slicing defeats the purpose, better to queue the resource and let the agents wait in my case. Is there some sort of webhook arch tied to context (over agent lifecycle) I can use? Spinning down a model on the llm host side based on idle timeout misses the case where the user is still in active chat w/ the gateway, but not requiring the specialized llm agent atm (e.g. working out strategy of a bugfix via conversation), but will soon.
OpenClaw has hooks, but (today) they won’t quite give you the signal you actually want (“user is actively chatting in this session, so keep model X warm even if we’re not calling it yet”).
What exists:
1) Internal hooks (run inside the Gateway)
Docs: https://docs.openclaw.ai/automation/hooks
They can run TypeScript on events like:
command:*(e.g./new,/reset,/stop, and generally “slash commands”)agent:bootstrapgateway:startup
What’s missing for your use-case: the docs explicitly list message:received / message:sent and session:start / session:end as future events. So there isn’t a first-class “session is active” event stream you can use to drive a VRAM lease.
2) Webhooks (HTTP endpoint to trigger OpenClaw)
This is inbound (external system → OpenClaw), not “OpenClaw calls out when a session is active”.
What I’d do for “queue GPU + keep warm because we’re mid-conversation”
You need a lease concept on your LLM-host-side orchestrator.
Model manager API (your side):
POST /lease { sessionKey, model, ttlSeconds }→ “keep this model loaded until lease expires”POST /release { sessionKey, model }- Internally you can implement “1 GPU slot”, queue, and explicit warm/cold decisions.
How to drive the lease (OpenClaw side):
- Use an internal hook to renew/release on the events you do have:
- when the user explicitly indicates intent (e.g.
/model deepseek, “spawn deep agent”, or whatever trigger you choose) →lease(model, ttl=20m) - on
/newor/resetor/stop→release(...) - If you want the “I’m still chatting but haven’t invoked the deep agent yet” behavior, you unfortunately need an explicit “arm/warm” signal (a command/phrase) because OpenClaw currently doesn’t emit per-message lifecycle events to hooks.
If you want it fully implicit, the workaround is a sidecar poller that watches OpenClaw session activity (file mtimes / session store) and renews leases based on “session had traffic in last N minutes”. That’s DIY, but it matches your “tie to context over lifecycle” goal better than idle-timeout on the LLM server alone.
If you tell me what “intent signal” you’d tolerate (a slash command, switching to the specialized agent, or a keyword like “warm deepseek”), I can outline the cleanest hook-driven lease flow.