#Integrating OpenClaw with a Locally Hosted vLLM Model Server on a Target Device


steep shore

I am setting up an LLM inference architecture where the model server and the OpenClaw client application run on the same machine, and I want them to communicate seamlessly.

At a high level, my goal is to use OpenClaw (running on this server machine) as the frontend/orchestrator, while using a vLLM-based model server on the same machine (the target device), hosted on port 8000, as the backend inference engine.

System Components

  1. vLLM Server (Backend)
    Running on a target device (likely Linux-based with QAIC acceleration).
    Hosts a large language model such as:
    meta-llama/Llama-3.3-70B-Instruct

    Exposes an OpenAI-compatible REST API:
    http://0.0.0.0:8000/v1
    Handles:
      • Prompt processing
      • Token generation
      • Tool/function calling (enabled via flags)

  2. OpenClaw (Frontend / Client)
    Running on the same machine.
    Acts as:
      • A user-facing interface or agent framework
      • A tool-calling orchestrator
    Sends requests in OpenAI API format:
    /v1/chat/completions
    Expects:
      • Fast responses
      • Streaming or structured outputs
      • Tool-calling compatibility

Help me with the setup so I can get a response from the model. I already tried connecting to the model, but got no response in the OpenClaw output.
The vLLM server log shows this:

(APIServer pid=75263) INFO: 127.0.0.1:48594 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

supple echoBOT

That 400 Bad Request is actually a good clue: OpenClaw is reaching vLLM. This is probably a payload compatibility issue, not a networking issue.

For vLLM, I’d start with the built-in vllm provider and the openai-completions adapter, not openai-responses.

1) Use a conservative config first

{
  agents: {
    defaults: {
      model: { primary: "vllm/meta-llama/Llama-3.3-70B-Instruct" },
      localModelMode: "lean",
    },
  },
  models: {
    mode: "merge",
    providers: {
      vllm: {
        baseUrl: "http://127.0.0.1:8000/v1",
        apiKey: "${VLLM_API_KEY}",
        api: "openai-completions",
        models: [
          {
            id: "meta-llama/Llama-3.3-70B-Instruct",
            name: "Llama 3.3 70B via vLLM",
            reasoning: false,
            input: ["text"],
            contextWindow: 128000,
            maxTokens: 8192,
            compat: {
              requiresStringContent: true,
              supportsTools: false
            }
          }
        ]
      }
    }
  }
}

Then set:

export VLLM_API_KEY="vllm-local"

Even if your vLLM server has no auth, OpenClaw wants a non-empty value for vLLM setup/discovery.

2) Smoke test the backend directly

curl -sS http://127.0.0.1:8000/v1/models

Then:

curl -sS http://127.0.0.1:8000/v1/chat/completions \
  -H 'content-type: application/json' \
  -H 'authorization: Bearer vllm-local' \
  -d '{
    "model":"meta-llama/Llama-3.3-70B-Instruct",
    "messages":[{"role":"user","content":"Reply with exactly: smoke-ok"}],
    "stream": false
  }'
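
The same smoke test can be scripted. Here is a minimal Python sketch using only the standard library, assuming the server from the curl example at http://127.0.0.1:8000/v1. The helper names build_chat_request and send are illustrative, not part of vLLM or OpenClaw.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1"  # same endpoint as the curl example


def build_chat_request(model: str, prompt: str, api_key: str = "vllm-local") -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat completion request."""
    body = {
        "model": model,
        # Plain string content, matching compat.requiresStringContent: true
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 300.0) -> str:
    """Send the request and return the assistant text; needs a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

Calling send(build_chat_request("meta-llama/Llama-3.3-70B-Instruct", "Reply with exactly: smoke-ok")) should return the model's reply when the server is up. On an HTTP 400, urlopen raises urllib.error.HTTPError, and calling .read() on that exception gives the JSON error body that the access log line omits.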

3) Smoke test through OpenClaw

openclaw --version
openclaw models status
openclaw infer model run --model vllm/meta-llama/Llama-3.3-70B-Instruct --prompt "Reply with exactly: smoke-ok" --json

Why this usually fixes it

The common 400 causes for local /v1/chat/completions backends are:

  1. messages[].content must be a string, not content-part arrays
    → compat.requiresStringContent: true

  2. Tool schema payload is too much / unsupported
    → compat.supportsTools: false first, and localModelMode: "lean"

  3. Wrong adapter
    → for vLLM, use api: "openai-completions"
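
To make cause 1 concrete, here is a small Python sketch of the difference between content-part arrays and plain string content. It only illustrates the payload shapes; it is not how OpenClaw implements requiresStringContent, and both function names are hypothetical.

```python
from typing import Any


def to_string_content(content: Any) -> str:
    """Collapse OpenAI-style content parts into one plain string.

    Accepts a string (returned unchanged), None (treated as empty), or a list
    of parts such as {"type": "text", "text": "..."}; non-text parts are skipped.
    """
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    texts = [
        part["text"]
        for part in content
        if isinstance(part, dict) and part.get("type") == "text"
    ]
    return "\n".join(texts)


def normalize_messages(messages: list[dict]) -> list[dict]:
    """Return a copy of messages where every content field is a string."""
    return [{**m, "content": to_string_content(m.get("content"))} for m in messages]


parts_style = [{"role": "user", "content": [{"type": "text", "text": "hi"}]}]
print(normalize_messages(parts_style))  # [{'role': 'user', 'content': 'hi'}]
```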

Important

Once plain text works, you can try re-enabling tools by removing:

compat: { supportsTools: false }

If it breaks only after that, the issue is vLLM/model tool-calling compatibility, not OpenClaw transport.

Please paste these next

  1. openclaw --version
  2. Output of:
    openclaw infer model run --model vllm/meta-llama/Llama-3.3-70B-Instruct --prompt "hi" --json
    
  3. The actual JSON error body from the failing vLLM request, not just the access log line


If you paste your current models.providers.vllm block, I can sanity-check it quickly.

steep shore

"models": {
"mode": "merge",
"providers": {
"vllm": {
"baseUrl": "http://127.0.0.1:8000/v1",
"api": "openai-completions",
"apiKey": "VLLM_API_KEY",
"models": [
{
"id": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"name": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"reasoning": false,
"input": [
"text"
],
"cost": {
"input": 0,
"output": 0,
"cacheRead": 0,
"cacheWrite": 0
},
"contextWindow": 128000,
"maxTokens": 8192,
"compat": {
"requiresStringContent": true
}
}
]
}
}
}
}

qcom@qcom:~/openclaw_workspace/openclaw$ openclaw infer model run --model vllm/meta-llama/Llama-4-Scout-17B-16E-Instruct --prompt "hi" --json
17:20:50 [agent/embedded] embedded run agent end: runId=3f5e4cb6-1dec-4271-afec-bbcdf0d790c8 isError=true model=meta-llama/Llama-4-Scout-17B-16E-Instruct provider=vllm error=Context overflow: prompt too large for the model. Try /reset (or /new) to start a fresh session, or use a larger-context model. rawError=400 'max_tokens' or 'max_completion_tokens' is too large: 8192. This model's maximum context length is 16384 tokens and your request has 13612 input tokens (8192 > 16384 - 13612). None
17:20:50 [compaction-safeguard] Compaction safeguard: no real conversation messages to summarize; writing compaction boundary to suppress re-trigger loop.
{
"ok": true,
"capability": "model.run",
"transport": "local",
"provider": "vllm",
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"attempts": [],
"outputs": [
{
"text": "smoke-ok",
"mediaUrl": null
}
]
}
qcom@qcom:~/openclaw_workspace/openclaw$

supple echoBOT

Nice, this narrows it down a lot.

You are getting a response from the model. The important clue is this part:

rawError=400 'max_tokens' ... too large: 8192.
This model's maximum context length is 16384 tokens and your request has 13612 input tokens

So the issue is now context sizing, not connectivity.

What’s happening

Your OpenClaw config says:

  • contextWindow: 128000
  • maxTokens: 8192

But vLLM says the served model only has 16384 total context.

With 13612 input tokens already in the request, asking for 8192 output tokens is impossible.
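
The arithmetic behind that error, as a short Python sketch. The numbers come straight from the vLLM error message; the function name is only illustrative.

```python
def max_output_budget(context_length: int, input_tokens: int) -> int:
    """Largest max_tokens value the server can accept for this request."""
    return max(context_length - input_tokens, 0)


context_length = 16384  # reported by vLLM in the error
input_tokens = 13612    # reported by vLLM in the error
requested = 8192        # maxTokens from the OpenClaw model entry

budget = max_output_budget(context_length, input_tokens)
print(budget)               # 2772
print(requested <= budget)  # False: 8192 does not fit, so vLLM returns 400
```

For this particular request, any max_tokens value up to 2772 would have been accepted.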

Fix your model entry

Use values closer to what vLLM is actually reporting:

{
  "models": {
    "mode": "merge",
    "providers": {
      "vllm": {
        "baseUrl": "http://127.0.0.1:8000/v1",
        "api": "openai-completions",
        "apiKey": "${VLLM_API_KEY}",
        "models": [
          {
            "id": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
            "name": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
            "reasoning": false,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 16384,
            "maxTokens": 2048,
            "compat": {
              "requiresStringContent": true,
              "supportsTools": false
            }
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "localModelMode": "lean"
    }
  }
}

Two extra notes

  1. apiKey: "VLLM_API_KEY" is a literal string
    If your server actually checks auth, use:

    "apiKey": "${VLLM_API_KEY}"
    
  2. Your thread prompt is big
    That long starter message + session context is already eating ~13.6k tokens. On a 16k model, that leaves very little room.
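
To illustrate note 1, here is a Python sketch of ${NAME}-style substitution. It models the semantics described above and is not OpenClaw's actual config loader; expand_env is a hypothetical helper.

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env=None) -> str:
    """Replace ${NAME} with env[NAME]; text without ${...} is returned unchanged."""
    env = os.environ if env is None else env

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(substitute, value)


env = {"VLLM_API_KEY": "vllm-local"}
print(expand_env("VLLM_API_KEY", env))     # VLLM_API_KEY (sent as-is)
print(expand_env("${VLLM_API_KEY}", env))  # vllm-local
```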

What I’d do next

After updating config:

  1. Start a fresh session or run /reset
  2. Retry:
    openclaw infer model run --model vllm/meta-llama/Llama-4-Scout-17B-16E-Instruct --prompt "hi" --json
  3. If that works, test normal chat again

If you need tool calling later

Keep this off first:

"supportsTools": false

Get plain text chat stable first. Then remove it and retest. If it breaks only after that, the remaining issue is tool-schema compatibility in vLLM/model handling.


steep shore

What is this config?

localModelMode: "lean"

supple echoBOT

It means:

agents.defaults.localModelMode: "lean"

OpenClaw has 2 modes here:

  • "default": normal tool surface
  • "lean": smaller prompt/tool profile for local or weaker backends

In lean mode, OpenClaw drops some heavier default tools, specifically:

  • browser
  • cron
  • message

That helps when a local OpenAI-compatible backend is failing because the full agent prompt is too big or the tool schema is too much.

In your case, I suggested it because your vLLM model is only exposing 16k context, and your request was already near that limit.
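
As a rough picture of what "lean" changes, here is a Python sketch that filters a tool list. Only the three dropped names come from the description above; the full tool list and the function name are hypothetical, chosen for illustration.

```python
LEAN_DROPPED = {"browser", "cron", "message"}  # tools named above


def lean_tool_surface(tools: list[str]) -> list[str]:
    """Return the tool names that remain once the heavier tools are dropped."""
    return [name for name in tools if name not in LEAN_DROPPED]


# Hypothetical full tool surface, for illustration only.
full = ["read", "write", "edit", "exec", "browser", "cron", "message"]
print(lean_tool_surface(full))  # ['read', 'write', 'edit', 'exec']
```

Fewer tools means fewer tool schemas in the request, which is what frees up prompt tokens on a small-context model.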

Example

{
  agents: {
    defaults: {
      localModelMode: "lean"
    }
  }
}

When to use it


Use "lean" if:

  • small direct /v1/chat/completions calls work
  • but normal OpenClaw agent runs fail
  • especially with local models, vLLM, LM Studio, Ollama, etc.

If lean still isn’t enough, the next step is usually:

compat: {
  supportsTools: false
}


If you want, I can show you a full minimal working vLLM config for your exact model.

steep shore

How do I add a timeout for the model server in openclaw-config.json?

Let's say the model server suddenly goes down while it is processing a prompt from OpenClaw.

We have a mechanism that automatically restarts the model server. In that case, how does OpenClaw wait for the response? How do I add a timeout attribute so that it waits for that period until the server recovers?

supple echoBOT

I checked the current config/schema, and there is no per-provider timeout field under models.providers.vllm.

So, not this:

models.providers.vllm.timeout: ...

That setting does not exist.

What you can use instead

In openclaw-config.json5, use:

{
  agents: {
    defaults: {
      timeoutSeconds: 600, // hard cap for the whole run
      llm: {
        idleTimeoutSeconds: 300 // wait up to 5 min with no model chunks
      }
    }
  }
}

What these mean

  • agents.defaults.timeoutSeconds
    • total max time for the entire agent run
  • agents.defaults.llm.idleTimeoutSeconds
    • how long OpenClaw waits if the LLM request goes quiet and no chunks arrive

Important behavior

If your model server is just slow, increasing llm.idleTimeoutSeconds helps.

If your model server crashes / connection drops / process dies mid-request, a longer timeout does not let OpenClaw “resume” that same request after restart. That HTTP request is already gone.

So in your exact scenario:

  • slow but still connected → increase llm.idleTimeoutSeconds
  • server died and restarted → OpenClaw cannot keep the old POST alive through restart

If you want better recovery

You have 3 practical options:

  1. Set a higher idle timeout for slow recovery cases
  2. Add model fallbacks so OpenClaw can switch if vLLM is unavailable
  3. Put vLLM behind a stable proxy/service layer that handles restart/retry externally
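
For option 3, the retry logic in such a layer could look roughly like this Python sketch. It is an assumption about how one might build the wrapper, not an OpenClaw or vLLM feature, and call_with_retry is a hypothetical name. ConnectionError stands in for whatever exception your HTTP client raises when the connection drops (with urllib, that is urllib.error.URLError).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    send: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(); on a connection failure, wait and issue a fresh request.

    Each retry is a new request: the one that was in flight when the server
    died cannot be resumed.
    """
    for attempt in range(attempts):
        try:
            return send()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the failure
            sleep(base_delay * (2 ** attempt))  # 2 s, 4 s, 8 s, ...
    raise ValueError("attempts must be at least 1")
```

The delays should add up to at least the time your auto-restart mechanism needs to bring the model server back.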

Good starting config

{
  agents: {
    defaults: {
      timeoutSeconds: 600,
      llm: {
        idleTimeoutSeconds: 300
      },
      localModelMode: "lean"
    }
  }
}

steep shore

To put it simply: the model on the server has a TTFT (time to first token) of more than 2 minutes, but OpenClaw does not wait that long, so the request fails.

What config should I set to make OpenClaw wait?

supple echoBOT

Yep, for TTFT > 2 minutes, the setting you want is:

agents.defaults.llm.idleTimeoutSeconds

OpenClaw’s default idle wait is effectively 120s if you don’t set it, so if your first token arrives after that, the run gets aborted.

Set this

{
  agents: {
    defaults: {
      timeoutSeconds: 900,
      llm: {
        idleTimeoutSeconds: 300
      }
    }
  }
}

What this does

  • llm.idleTimeoutSeconds: 300
    • wait up to 5 minutes for the model to produce the first chunk/token
  • timeoutSeconds: 900
    • allow the whole run to live up to 15 minutes
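
How the two timers interact can be sketched in Python. This is a simplified model of the behavior described above, not OpenClaw's implementation; run_outcome is a hypothetical function, and the 120 s figure is the default quoted earlier.

```python
def run_outcome(chunk_times, idle_timeout, total_timeout):
    """Classify a run from chunk arrival times (seconds since the request started).

    Simplified model: the idle timer restarts on every chunk, the total timer
    runs from the start, and the last chunk is taken as the end of the run.
    """
    last_activity = 0.0
    for t in sorted(chunk_times):
        idle_deadline = last_activity + idle_timeout
        if t > min(idle_deadline, total_timeout):
            # Whichever deadline passed first is the one that aborts the run.
            return "idle-timeout" if idle_deadline < total_timeout else "total-timeout"
        last_activity = t
    return "completed"


slow_start = [150, 151, 152]  # first token after 150 s, then steady output
print(run_outcome(slow_start, idle_timeout=120, total_timeout=900))  # idle-timeout
print(run_outcome(slow_start, idle_timeout=300, total_timeout=900))  # completed
print(run_outcome(range(1, 1001), idle_timeout=300, total_timeout=900))  # total-timeout
```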

If your TTFT is even worse


Use something higher, for example:

llm: {
  idleTimeoutSeconds: 600
}

Important

This helps when the model is slow but still alive.

If the model server dies / resets the connection, this setting will not recover the same request. It only tells OpenClaw how long to wait before calling the request idle.

So for your case, start with:

{
  agents: {
    defaults: {
      timeoutSeconds: 900,
      llm: {
        idleTimeoutSeconds: 300
      },
      localModelMode: "lean"
    }
  }
}

Docs: https://docs.openclaw.ai/concepts/agent-loop


steep shore

In this same setup, on the server where the gateway is running, I need to create a node. This node host must be where all the exec calls run.

It should also be able to run system commands and file operations such as edit, read, and write. What is the correct way to set this up?

supple echoBOT

If the gateway and the command-execution machine are the same server, the simplest and correct setup is:

Option A, same server, no node needed

Just run exec on the gateway host:

openclaw config set tools.exec.host gateway

If you want tighter control:

openclaw config set tools.exec.security allowlist
openclaw config set tools.exec.ask on-miss

This is the right setup when commands should run on the same box as the gateway.


Option B, use a node host anyway

Only do this if you explicitly want exec to go through a node host instead of the gateway host.

1) Start a headless node host on that machine

export OPENCLAW_GATEWAY_TOKEN="<gateway-token>"
openclaw node install --host 127.0.0.1 --port 18789 --display-name "server-node"
openclaw node restart

2) Approve the node


On the gateway side:

openclaw devices list
openclaw devices approve <requestId>
openclaw nodes status

3) Point exec to that node

openclaw config set tools.exec.host node
openclaw config set tools.exec.node "server-node"
openclaw config set tools.exec.security allowlist
openclaw config set tools.exec.ask on-miss

Important: host=auto does not automatically pick a node. You must set tools.exec.host=node or use /exec host=node.


What a node host can do

A headless node host exposes:

  • system.run
  • system.which

So yes, it can run shell/system commands on that machine.

But this part matters:

read / write / edit are not node commands


Those are OpenClaw’s workspace file tools, not node-host RPC commands.

So:

  • if you mean shell-level file ops like cat, sed, tee, python, etc, then yes, use exec host=node
  • if you mean the assistant’s built-in structured file tools (read, write, edit), those operate on the agent workspace, not on the node host as a separate remote filesystem API

So for remote-ish file changes through a node host, the path is:

  • exec host=node for shell commands
  • not nodes invoke system.run
  • not node-native read/write/edit

My recommendation

  1. same-machine node-host exec.

steep shore

I want to edit files that are on the gateway server. How do I do that?

supple echoBOT

If the files are on the gateway server, the clean way is:

1) Point the agent workspace at that directory

OpenClaw’s file tools work against the agent workspace.

openclaw config set agents.defaults.workspace "/path/to/your/project"

Example:

openclaw config set agents.defaults.workspace "/home/qcom/my-project"

Then start a new session and ask OpenClaw to edit files there.


2) Use the normal file tools, not a node

For files on the gateway host, you usually do not need a node.

The built-in file tools (read, write, edit) work in the workspace on the gateway host.

So if your project is on the gateway server, set:

  • agents.defaults.workspace → your project path

and OpenClaw can edit files in that directory.


3) If the file is outside the workspace


Then you have 2 choices:

Safer

Move/change the workspace to that directory.

Broader / more manual

Use gateway-host exec:

openclaw config set tools.exec.host gateway

That lets exec run shell commands on the gateway server, but that is more powerful and less structured than the normal file tools.


4) One important caveat


If sandboxing is enabled with read-only or no workspace access, writes won’t land on the host workspace.

So for normal editing, make sure you are not in a read-only sandbox setup.


Recommendation

For your case, I’d do this:

openclaw config set agents.defaults.workspace "/home/qcom/my-project"
openclaw config set tools.exec.host gateway

Then use a new chat/session and ask OpenClaw:

steep shore

So, when is node exec really needed?

supple echoBOT

Short version, node exec is needed when the commands should run on a different machine than the gateway.

Use host=node when

  • your gateway runs on machine A, but you want shell commands on machine B
  • you want to use a separate build box / GPU box / workstation
  • you need access to that machine’s local OS resources
    • browser on that machine
    • macOS app context / TCC-protected stuff
    • camera / screen / canvas / local binaries on that host
  • you want a separate exec approval policy on another host

You do not need node exec when

  • the files are on the gateway server
  • the commands should run on the same machine as the gateway
  • you just want OpenClaw to read/write/edit files in the gateway workspace
  • your model server and gateway are on the same box, and you only want local orchestration

In your setup, based on what you described so far, you probably do not need node exec.

You likely just want:

  • agents.defaults.workspace → your project folder on the gateway server
  • tools.exec.host: "gateway" if you want shell commands there

Good mental model

  • host=gateway → run commands on the gateway machine
  • host=node → run commands on a paired remote node machine

So the rule is:

Same machine as gateway, use gateway exec.

steep shore

Can you explain the model wait time, agent wait time, and node wait time?

That is, all the timeouts involved in executing a query or prompt in OpenClaw.

supple echoBOT

Yep, there are a few different timeout layers, and they do different jobs. The confusing part is that “wait time” is not just one thing.

1) agent.wait

Default: 30s
What it is: how long the caller waits for a run result
Important: this is wait-only, it does not kill the agent run

So if agent.wait times out, the run may still continue in the background.
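
The "wait-only" idea is the same one you see with futures in most languages. Here is a Python analogy using concurrent.futures; it is not OpenClaw code, and agent_run is a stand-in for a real run.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as WaitTimeout

release = threading.Event()  # stands in for "the model finally answered"


def agent_run() -> str:
    release.wait()  # the run is still working
    return "run finished"


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(agent_run)
    try:
        future.result(timeout=0.1)  # caller-side wait, like agent.wait
        waited_out = False
    except WaitTimeout:
        waited_out = True  # the caller stopped waiting...
    still_running = not future.done()  # ...but the run was not cancelled
    release.set()
    final = future.result()  # collecting the result later still works

print(waited_out, still_running, final)  # True True run finished
```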


2) agents.defaults.timeoutSeconds

Default: 172800s (48 hours)
What it is: the hard cap for the whole agent run

If this expires, the run is aborted.


Example:

{
  agents: {
    defaults: {
      timeoutSeconds: 900
    }
  }
}

3) agents.defaults.llm.idleTimeoutSeconds

Default: 120s if not otherwise set
What it is: how long OpenClaw waits when the LLM is silent and no chunks/tokens arrive


This is the one that matters most for your slow TTFT vLLM case.

If your first token takes 2+ minutes, set this higher:

{
  agents: {
    defaults: {
      timeoutSeconds: 900,
      llm: {
        idleTimeoutSeconds: 300
      }
    }
  }
}

Important nuance

  • this is not total generation time
  • this is the idle window with no output from the model

So:

  • slow TTFT → increase llm.idleTimeoutSeconds
  • long whole run → increase timeoutSeconds

4) agents.defaults.compaction.timeoutSeconds

Default: 900s
What it is: timeout for compaction/summarization if OpenClaw has to compact a long session before retrying

This only matters when the session is large enough to trigger compaction.


5) exec.timeout

Default: 1800s
What it is: timeout for an exec tool call

This matters only if the agent uses shell commands.

It applies whether exec runs on:

  • host=gateway
  • host=node
  • host=sandbox

6) Node invoke timeout


Default: 15000ms
What it is: timeout for direct openclaw nodes invoke ...

This is mostly a node CLI / RPC timeout, not the main LLM wait time.

So if you do:

openclaw nodes invoke --node ... --command canvas.eval ...

then --invoke-timeout matters.


7) Node system.run --command-timeout

What it is: timeout for the actual node-side system command


When relevant: only for node system commands / shell execution paths

This is more about node command execution than model waiting.


What matters for a normal prompt to vLLM

For your setup, the important ones are:

  1. agents.defaults.llm.idleTimeoutSeconds
    controls how long OpenClaw waits for first token / silence from vLLM

  2. agents.defaults.timeoutSeconds
    controls total max runtime of the prompt

  3. agent.wait
    only affects how long the caller waits, not whether the run keeps going


Recommended config for your slow local model

{
  agents: {
    defaults: {
      timeoutSeconds: 900,
      llm: {
        idleTimeoutSeconds: 300
      },
      localModelMode: "lean"
    }
  }
}

If TTFT is worse than 5 minutes, raise idleTimeoutSeconds again.


Quick mental model

  • Model wait time = llm.idleTimeoutSeconds
  • Agent total time = timeoutSeconds
  • Caller wait time = agent.wait
  • Node wait time = only matters if node tools / node exec are involved


steep shore

{"gateway":{"auth":{"mode":"token","token":""},"mode":"local","port":18789,"bind":"loopback","tailscale":{"mode":"off","resetOnExit":false}},"meta":{"lastTouchedVersion":"2026.4.22","lastTouchedAt":"2026-04-24T10:28:17.046Z"},"agents":{"defaults":{"imageGenerationModel":{"primary":"google/gemini-2.5-flash-preview-05-20"},"compaction":{"mode":"safeguard"},"model":{"primary":"vllm/meta-llama/Llama-3.3-70B-Instruct"},"models":{"vllm/meta-llama/Llama-3.3-70B-Instruct":{},"google/gemini-3.1-flash-lite-preview":{},"google/gemini-2.5-flash-preview-05-20":{},"google/gemini-2.5-flash-preview-04-17":{}},"workspace":"C:\Users\mcw\.openclaw\workspace"}},"browser":{"enabled":true,"defaultProfile":"openclaw"},"nodeHost":{"browserProxy":{"enabled":true}},"wizard":{"lastRunAt":"2026-04-24T10:28:16.736Z","lastRunVersion":"2026.4.22","lastRunCommand":"configure","lastRunMode":"local"},"auth":{"profiles":{"google:default":{"provider":"google","mode":"api_key"},"vllm:default":{"provider":"vllm","mode":"api_key"}}},"plugins":{"entries":{"google":{"enabled":true,"config":{"webSearch":{"apiKey":""}}},"browser":{"enabled":true},"vllm":{"enabled":true}}},"models":{"mode":"merge","providers":{"vllm":{"baseUrl":"http://192.168.75.81:8000/v1","api":"openai-completions","apiKey":"VLLM_API_KEY","models":[{"id":"meta-llama/Llama-3.3-70B-Instruct","name":"meta-llama/Llama-3.3-70B-Instruct","reasoning":false,"input":["text"],"cost":{"input":0,"output":0,"cacheRead":0,"cacheWrite":0},"contextWindow":32768,"maxTokens":10,"compat":{"requiresStringContent":true,"supportsTools":false}}]}}},"tools":{"web":{"search":{"provider":"gemini","enabled":true,"openaiCodex":{}},"fetch":{"enabled":true}}},"skills":{"install":{"nodeManager":"npm"}}}

Why does this model give gibberish output with the above config?
