# Local model responds in ms via llama.cpp CLI but hangs for 4 minutes in OpenClaw TUI


fair arrow

Here's a concise prompt for Discord:


OpenClaw TUI + llama-server streaming issue - 4min response delay

Setup:

  • OpenClaw 2026.2.19-2 on Ubuntu 24.04
  • Local llama-server (llama.cpp) at http://127.0.0.1:11434/v1
  • Model: Qwen3-Coder-Next-Q4_K_M (70B, CPU inference)
  • Provider: vLLM in OpenClaw config

Problem:
OpenClaw TUI shows "running" for 4+ minutes per response. The model actually generates responses in ~20-30 seconds (confirmed via direct curl). Responses only appear in the TUI after ~4 minutes OR when I exit and restart the TUI (then all past responses appear instantly).

What I've tried:

  • Set agents.defaults.models["vllm/qwen3-coder"].streaming = false - no effect
  • Verified llama-server responds quickly via curl
  • Gateway logs show "embedded run agent start" but no API call errors
  • Model responds correctly, just streaming/delivery is broken

Config excerpt:

"models": {
  "providers": {
    "vllm": {
      "baseUrl": "http://127.0.0.1:11434/v1",
      "apiKey": "dummy",
      "api": "openai-completions"
    }
  }
}

Question: How do I fix the streaming delay or force non-streaming responses in OpenClaw TUI with a local llama-server backend?

honest foxBOT

That 4-minute delay lines up suspiciously well with one failure mode: OpenClaw is waiting for an OpenAI-style streaming response to finish, but your backend never properly terminates the stream (or OpenClaw never sees the [DONE] marker). The run then only “completes” when a timeout or reconnect happens, and the TUI shows everything at once when it reloads history.
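For reference, an OpenAI-compatible streaming reply is a sequence of server-sent events, each a `data: {...}` line, and OpenAI's format ends it with a literal `data: [DONE]` line before the connection closes. The helper below is a hypothetical sketch (`sse_has_done` is not an OpenClaw or llama.cpp command) that reads a captured stream on stdin and reports whether that terminator ever arrived.

```shell
# sse_has_done: hypothetical helper, not part of OpenClaw or llama.cpp.
# Reads an OpenAI-style SSE stream on stdin. Prints "terminated" if a
# "data: [DONE]" line appears; otherwise prints "no [DONE] seen" and returns 1.
sse_has_done() {
  # Strip carriage returns so CRLF-terminated events still match.
  if tr -d '\r' | grep -q '^data: \[DONE\]$'; then
    echo "terminated"
  else
    echo "no [DONE] seen"
    return 1
  fi
}

# Usage, e.g. on a stream saved to a file:
#   sse_has_done < /tmp/stream.txt
```

Because `grep -q` stops at the first match, piping a live `curl -sN` into it prints "terminated" as soon as `[DONE]` arrives; if it never arrives and the server keeps the connection open, the pipe waits as well.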

Two practical fixes/workarounds:

1) Disable upstream OpenAI streaming (the config key you tried is in the wrong place)

For OpenAI-compatible providers, the docs call out disabling streaming via model params, not a top-level streaming key.

Example (adjust the model id key to match yours):

{
  agents: {
    defaults: {
      models: {
        "vllm/qwen3-coder": {
          params: {
            streaming: false
          }
        }
      }
    }
  }
}

This is explicitly mentioned in the Ollama doc for OpenAI-compatible mode (“disable streaming with params: { streaming: false }”).
Docs: https://docs.openclaw.ai/providers/ollama (see “Legacy OpenAI-Compatible Mode”)

2) If you’re actually on port 11434 (Ollama default), prefer the native Ollama provider

If that endpoint is Ollama (11434 strongly suggests it), don’t go through api: "openai-completions" unless you have to. Use api: "ollama" + the native /api/chat integration, which supports streaming/tooling more reliably.
Docs: https://docs.openclaw.ai/providers/ollama
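If you are not sure which server is actually listening on 11434, a quick probe can tell the two apart. This sketch relies on two behaviors worth confirming on your install: Ollama answers `GET /` with the plain text `Ollama is running`, and llama.cpp's `llama-server` answers `GET /health` with HTTP 200 once the model is loaded (an error status while it is still loading). `detect_backend` is a hypothetical helper name.

```shell
# detect_backend: hypothetical helper. Takes a base URL WITHOUT the /v1
# suffix (e.g. http://127.0.0.1:11434) and guesses which server answers there.
detect_backend() {
  local base="$1"
  # Ollama replies to GET / with the text "Ollama is running".
  if curl -s --max-time 5 "$base/" | grep -q 'Ollama is running'; then
    echo "ollama"
  # llama-server replies to GET /health with HTTP 200 once the model is loaded;
  # -f makes curl fail on 4xx/5xx responses (e.g. while the model is loading).
  elif curl -sf --max-time 5 "$base/health" > /dev/null; then
    echo "llama-server"
  else
    echo "unknown"
  fi
}

# Usage:
#   detect_backend http://127.0.0.1:11434
```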


Quick diagnostic (to confirm it’s a “stream never ends” problem)

Try a streaming curl and see if it ever cleanly finishes with [DONE]:

curl -N http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"YOUR_MODEL","messages":[{"role":"user","content":"hi"}],"stream":true}'

If it prints tokens but doesn’t properly terminate, OpenClaw will sit in “running” until timeout/reconnect.
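To turn “does it terminate?” into a yes/no answer without watching the terminal, the same curl can be wrapped in GNU `timeout` and judged by its exit status: `timeout` exits with 124 when it had to stop the command, and curl exits with 7 when it cannot connect. `describe_curl_exit` below is a hypothetical helper, and the 60-second limit is arbitrary; pick something comfortably above your normal generation time.

```shell
# describe_curl_exit: hypothetical helper that interprets the exit status of
#   timeout 60 curl -sN ...
describe_curl_exit() {
  case "$1" in
    0)   echo "stream closed on its own" ;;
    124) echo "still open when the time limit hit (stream never terminated)" ;;
    7)   echo "could not connect to the server" ;;
    *)   echo "curl failed with exit code $1" ;;
  esac
}

# Usage (replace YOUR_MODEL as in the curl above):
#   timeout 60 curl -sN http://127.0.0.1:11434/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model":"YOUR_MODEL","messages":[{"role":"user","content":"hi"}],"stream":true}' \
#     > /tmp/stream.txt
#   describe_curl_exit $?
```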


If params.streaming=false doesn’t change behavior, paste:

  • openclaw models status (just the provider/model section is fine)
  • the relevant chunk of openclaw logs --follow covering one stuck run

…because then it may be a TUI event/update bug (similar “only appears after restart” symptoms show up in a couple of TUI issues), and the logs will tell us whether the gateway run really took ~240 s or the TUI just didn’t update.

fair arrow

you need to give me full commands. I don't know the parameters you're talking about

honest foxBOT

Use these exact commands (copy/paste). The only thing you need to substitute is the exact model ref (the provider/model-id string) that OpenClaw is using.

0) Confirm your version (sanity check)

openclaw --version

1) Find the exact model ref OpenClaw is using

Run:

openclaw models status

Look for something like:

  • Primary: vllm/<something> (that full vllm/... is what you’ll paste below)

If it’s not obvious, also run:

openclaw config get agents.defaults.model
openclaw config get agents.defaults.model.primary
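If you would rather not eyeball the output, the model ref can be pulled out of `openclaw config get agents.defaults.model` with `sed`. `primary_model` is a hypothetical helper that assumes the output contains a `"primary": "..."` line, as in the output pasted later in this thread; the banner line the CLI prints first is ignored.

```shell
# primary_model: hypothetical helper. Reads `openclaw config get
# agents.defaults.model` output on stdin and prints the "primary" model ref.
primary_model() {
  sed -n 's/.*"primary"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage:
#   openclaw config get agents.defaults.model | primary_model
```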

2) Disable OpenAI-style streaming the right way (it’s params.streaming, not streaming)

Replace vllm/YOUR_MODEL_ID_HERE with your real one from step 1:

openclaw config set 'agents.defaults.models["vllm/YOUR_MODEL_ID_HERE"].params.streaming' false --strict-json

Example (if your model ref is vllm/qwen3-coder-next-q4_k_m):

openclaw config set 'agents.defaults.models["vllm/qwen3-coder-next-q4_k_m"].params.streaming' false --strict-json

Verify it stuck:

openclaw config get 'agents.defaults.models["vllm/YOUR_MODEL_ID_HERE"].params'

3) Restart the gateway (required after config edits)

openclaw gateway restart

4) Test in the TUI again

openclaw tui


If it still “hangs”: confirm whether your llama.cpp server’s streaming ever terminates

Run this (replace model id):

curl -N http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"YOUR_MODEL_ID_HERE","messages":[{"role":"user","content":"hi"}],"stream":true}'

If it prints output but never ends cleanly (no final data: [DONE] line, or the connection just stays open), that matches the “TUI shows running for minutes, then everything appears later” behavior exactly.
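A useful comparison is the same request with streaming turned off. If a `"stream": false` request returns in your usual ~20-30 seconds, generation itself is fine and the delay is in how the streamed response ends. `chat_body` below is a hypothetical helper (not part of OpenClaw or llama.cpp) that only prints the request JSON; `YOUR_MODEL_ID_HERE` is the same placeholder as above.

```shell
# chat_body: hypothetical helper that prints an OpenAI-style chat request.
#   $1 = model name as the server knows it, $2 = true or false for "stream"
chat_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"hi"}],"stream":%s}\n' "$1" "$2"
}

# Usage: time a non-streaming request; curl's -w prints the total seconds.
#   curl -s -o /dev/null -w '%{time_total}s\n' \
#     http://127.0.0.1:11434/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d "$(chat_body YOUR_MODEL_ID_HERE false)"
```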

Send me:

  • the output of openclaw models status
  • whether the curl -N ... "stream":true command terminates or hangs

and I’ll tell you the next most likely fix.

fair arrow


root@Ubuntu-2404-noble-amd64-base ~ # openclaw config set 'agents.defaults.models["vllm/qwen3-coder-next-q4_k_m"].params.streaming' false --strict-json
error: unknown option '--strict-json'
root@Ubuntu-2404-noble-amd64-base ~ # openclaw config get agents.defaults.model
openclaw config get agents.defaults.model.primary

🦞 OpenClaw 2026.2.19-2 (45d9b20) — Because the right answer is usually a script.

{
"primary": "vllm/qwen3-coder",
"fallbacks": []
}

🦞 OpenClaw 2026.2.19-2 (45d9b20)
I speak fluent bash, mild sarcasm, and aggressive tab-completion energy.

vllm/qwen3-coder
root@Ubuntu-2404-noble-amd64-base ~

honest foxBOT

Cool — your model ref is exactly vllm/qwen3-coder, so here are the full copy/paste commands.

1) Disable OpenAI streaming for that model (correct key: params.streaming)

openclaw config set 'agents.defaults.models["vllm/qwen3-coder"].params.streaming' false --strict-json

2) Restart the gateway (required)

openclaw gateway restart

3) Verify the setting is applied

openclaw config get 'agents.defaults.models["vllm/qwen3-coder"].params'

You should see something like:

{ "streaming": false }
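To check that from a script instead of by eye, a loose pattern match on the `config get` output is enough. `streaming_disabled` is a hypothetical helper; it accepts both JSON (`"streaming": false`) and unquoted JSON5-style keys, since the exact output formatting may differ between builds.

```shell
# streaming_disabled: hypothetical helper. Reads `openclaw config get ...params`
# output on stdin; exits 0 if it contains streaming set to false.
streaming_disabled() {
  grep -Eq '"?streaming"?[[:space:]]*:[[:space:]]*false'
}

# Usage:
#   openclaw config get 'agents.defaults.models["vllm/qwen3-coder"].params' \
#     | streaming_disabled && echo "streaming is off" || echo "not set"
```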

4) Test in TUI

openclaw tui

Send a message and see if it now returns in ~20–30s instead of ~4 min.


If it STILL takes ~4 minutes

Run this streaming test against your backend (replace the "model" value if your llama.cpp model name isn’t literally qwen3-coder on the server side):

curl -N http://127.0.0.1:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"qwen3-coder","messages":[{"role":"user","content":"hi"}],"stream":true}'

Then tell me: does that command terminate by itself quickly, or does it keep the connection open/hang? That answer determines the next fix.
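To put numbers on “quickly” versus “hangs”, curl can report when the first response byte arrived (`time_starttransfer`) and when the transfer finished (`time_total`). A first byte after a few seconds with a total near 240 s would point at the stream not being closed rather than slow generation. `stream_timing` is a hypothetical wrapper; the URL and model name are the ones used above.

```shell
# stream_timing: hypothetical wrapper. Sends one streaming request and prints
# when the first response byte arrived and when the connection closed.
#   $1 = full chat completions URL, $2 = model name as the server knows it
stream_timing() {
  # --max-time caps the wait at 10 minutes so a stuck stream cannot hang forever.
  curl -sN -o /dev/null --max-time 600 \
    -w 'first byte after %{time_starttransfer}s, closed after %{time_total}s\n' \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$2\",\"messages\":[{\"role\":\"user\",\"content\":\"hi\"}],\"stream\":true}" \
    "$1"
}

# Usage:
#   stream_timing http://127.0.0.1:11434/v1/chat/completions qwen3-coder
```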

fair arrow

root@Ubuntu-2404-noble-amd64-base ~ # openclaw config set 'agents.defaults.models["vllm/qwen3-coder"].params.streaming' false --strict-json
error: unknown option '--strict-json'
root@Ubuntu-2404-noble-amd64-base ~ # openclaw config get 'agents.defaults.models["vllm/qwen3-coder"].params'

🦞 OpenClaw 2026.2.19-2 (45d9b20) — Chat automation for people who peaked at IRC.

Config path not found: agents.defaults.models["vllm/qwen3-coder"].params
root@Ubuntu-2404-noble-amd64-base ~

honest foxBOT

Yep — your build just doesn’t have --strict-json yet. On 2026.2.19-2, use --json (older flag) or no flag at all.

Here are the exact commands to run for your model vllm/qwen3-coder.

A) Set it (recommended: set the whole model entry)

openclaw config set 'agents.defaults.models["vllm/qwen3-coder"]' '{"params":{"streaming":false}}' --json

B) Verify it stuck

openclaw config get 'agents.defaults.models["vllm/qwen3-coder"]'

You want to see something like:

{"params":{"streaming":false}}

C) Restart the gateway (required)

openclaw gateway restart

D) Test in TUI

openclaw tui

If --json is also “unknown option”


Then your build may already parse the value as JSON without a flag. Run the same command without it:

openclaw config set 'agents.defaults.models["vllm/qwen3-coder"]' '{"params":{"streaming":false}}'
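Since the accepted flag apparently differs between builds (`--strict-json` was rejected on 2026.2.19-2), a small wrapper can try each variant and stop at the first one the CLI accepts. This is a hypothetical sketch: it assumes a rejected flag prints `unknown option` (as in the error above) and changes nothing, and it deliberately does not fall back to the flagless form automatically, because whether a bare value is parsed as JSON or stored as a string may depend on the build.

```shell
# oc_set_json: hypothetical wrapper around `openclaw config set`.
# Tries --strict-json, then --json, and stops at the first flag the CLI accepts.
#   $1 = config path, $2 = JSON value
oc_set_json() {
  local flag err
  for flag in --strict-json --json; do
    # Capture only stderr (stdout is discarded) to spot "unknown option".
    if err=$(openclaw config set "$1" "$2" "$flag" 2>&1 >/dev/null); then
      echo "set with $flag"
      return 0
    fi
    case "$err" in
      *"unknown option"*) continue ;;               # flag not supported, try next
      *) printf '%s\n' "$err" >&2; return 1 ;;      # a real error: report and stop
    esac
  done
  echo "no JSON flag accepted; set the value manually and verify with config get" >&2
  return 1
}

# Usage:
#   oc_set_json 'agents.defaults.models["vllm/qwen3-coder"]' '{"params":{"streaming":false}}'
```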

Heads-up (important)

The moment you create agents.defaults.models, it becomes a model allowlist. Right now you’re only adding vllm/qwen3-coder, so that’s fine — but if you later want other models available via /model, you’ll need to add them too.


If the TUI still “runs” for ~4 minutes after this

Run this and tell me whether it finishes on its own or hangs: