#Call an OpenResponses endpoint for LLM
Yep, that should work.
For an upstream LLM endpoint, configure it as a custom provider and set:
{
models: {
mode: "merge",
providers: {
"my-provider": {
baseUrl: "https://your-provider.example/v1",
apiKey: "${MY_PROVIDER_API_KEY}",
api: "openai-responses",
models: [
{
id: "my-model",
name: "My Model",
reasoning: false,
input: ["text"],
cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
contextWindow: 128000,
maxTokens: 8192,
},
],
},
},
},
}
A couple of gotchas:
- api must be openai-responses
- baseUrl should point at the provider's /v1 root, not /v1/responses
- if the provider only supports Chat Completions, use openai-completions instead
- custom /v1 proxies are treated as OpenAI-compatible, not native OpenAI, so some OpenAI-only extras do not apply
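As a rough way to catch the first two gotchas before restarting anything, here is a minimal sketch of a config check. `check_provider` and `GENERIC_APIS` are hypothetical names for illustration, not part of OpenClaw, and the check only covers the two generic api values mentioned here.

```python
from urllib.parse import urlparse

# Assumption: only the two generic api values discussed in this thread.
GENERIC_APIS = {"openai-responses", "openai-completions"}


def check_provider(provider: dict) -> list[str]:
    """Return a list of problems found; an empty list means the entry looks sane."""
    problems = []
    api = provider.get("api")
    if api not in GENERIC_APIS:
        problems.append(f"unexpected api value: {api!r}")
    # baseUrl should stop at the /v1 root, not at a concrete endpoint path.
    path = urlparse(provider.get("baseUrl", "")).path.rstrip("/")
    if path.endswith(("/responses", "/chat/completions")):
        problems.append("baseUrl should point at the /v1 root, not an endpoint path")
    return problems


good = {"baseUrl": "https://your-provider.example/v1", "api": "openai-responses"}
bad = {"baseUrl": "https://your-provider.example/v1/responses", "api": "openai-responses"}
print(check_provider(good))
print(check_provider(bad))
```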
Docs:
- https://docs.openclaw.ai/gateway/configuration-reference
- https://docs.openclaw.ai/gateway/local-models
If you want, paste the provider you're targeting and I can sketch the exact config for it.
"omniroute": {
"baseUrl": "https://omniroute.example.com/v1",
"apiKey": "sk-xxxxxxxxxxxx",
"api": "openai-responses",
"models": []
}
That worked perfectly
Nice, glad that worked.
Yeah, that's the right shape for a generic upstream /v1/responses provider.
How does streaming work with discord? I seem to get the internal thought progress with "progress"
Yeah, that tracks.
On Discord, streaming: "progress" is not a separate mode. It just maps to partial. So OpenClaw sends a temporary preview message and keeps editing it as text arrives.
A few important bits:
- Discord does not have true token streaming, it's send + edit preview
- progress on Discord == partial
- if your upstream Responses provider streams "thinking" or intermediate text in the normal output, Discord preview can show it before the final answer settles
That's especially likely with a generic custom openai-responses proxy, because OpenClaw treats it as an OpenAI-compatible proxy, not native OpenAI, so some native OpenAI reasoning cleanup/shaping does not apply.
If you want to stop that, I'd try:
{
channels: {
discord: {
streaming: "off"
}
}
}
Also make sure these are off in that session:
/verbose off
/trace off
/reasoning off
So short version, progress is not "status only" on Discord. It's basically live preview, and preview shows whatever text the provider emits.
Docs:
- https://docs.openclaw.ai/concepts/streaming
- https://docs.openclaw.ai/channels/discord
- https://docs.openclaw.ai/tools/thinking
If you want, I can help you pick the least-annoying Discord streaming mode for that Omniroute setup.
I think "/verbose off" was the culprit.
back on topic about openai-responses API mode.
OmniRoute seems to be doing better now at interacting with my Gemini provider. Before, around half the requests were getting 400 responses, and the only difference was sometimes just a tool call that OpenClaw sent saying it was in progress.
Is OpenClaw's openai-responses API implementation fully stateful, or what will happen when OmniRoute chooses another upstream provider if Gemini is down?
Can my local llama also do openai-responses mode?
Yep, that makes sense. /verbose off would stop a lot of that noisy tool-progress stuff.
On the openai-responses part, the short version is:
1) Is it "fully stateful" across arbitrary upstream switches?
Not in the "safe to hop between different vendors mid-thread" sense.
What OpenClaw does:
- keeps the session transcript statefully on its own side
- on Responses transports, it can also chain turns with previous_response_id
- for native OpenAI it additionally uses server-side features like store: true and optional context_management compaction
What it does not do for a generic custom proxy like OmniRoute:
- it does not treat it like native OpenAI
- OpenAI-only extras such as store, prompt-cache hints, and OpenAI reasoning-compat shaping do not apply on generic proxy URLs
So if OmniRoute reroutes a conversation from Gemini to some other upstream mid-session, continuity depends on OmniRoute, not OpenClaw.
If OmniRoute preserves the response chain and tool-call continuity, it may work fine.
If it does not, the next follow-up turn can blow up with 400s, especially around:
- previous_response_id
- tool-call / tool-result matching
- provider-specific turn-shape quirks
So for agentic runs, I would strongly prefer sticky routing per conversation.
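To make "sticky routing per conversation" concrete, here is a minimal sketch of the idea from the proxy's side. Everything in it is an assumption for illustration: `StickyRouter`, the upstream names, and the conversation id are hypothetical, and this is not how OmniRoute is actually implemented.

```python
import hashlib


class StickyRouter:
    """Pin each conversation to one upstream; move it only if that upstream is down."""

    def __init__(self, upstreams):
        self.upstreams = list(upstreams)
        self.pinned = {}  # conversation_id -> upstream name

    def route(self, conversation_id, healthy):
        """Return (upstream, switched). switched=True means continuity is at risk."""
        current = self.pinned.get(conversation_id)
        if current is not None and current in healthy:
            return current, False
        candidates = [u for u in self.upstreams if u in healthy]
        if not candidates:
            raise RuntimeError("no healthy upstream available")
        # Stable hash so the initial choice does not depend on process state.
        digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
        chosen = candidates[int.from_bytes(digest[:8], "big") % len(candidates)]
        self.pinned[conversation_id] = chosen
        return chosen, current is not None


router = StickyRouter(["gemini", "llama-cpp"])
first, _ = router.route("thread-42", {"gemini", "llama-cpp"})
again, switched = router.route("thread-42", {"gemini", "llama-cpp"})
print(first == again, switched)
```

The `switched` flag is the interesting part: a proxy that knows it just moved a conversation could drop or rewrite previous_response_id instead of forwarding a chain the new upstream has never seen.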
2) Why might OmniRoute be helping?
Probably because it is normalizing the upstream better than the direct Gemini path you had before.
Gemini behind OpenAI-ish proxies can be picky about tool / replay turn shape, so a smarter proxy can absolutely reduce those 400s.
3) Can local Llama do openai-responses mode?
Yes, if it is behind a server that actually exposes a compatible /v1 Responses endpoint.
Good candidates from the docs:
- LM Studio
- vLLM
- LiteLLM
- other custom OpenAI-compatible /v1 servers
That is the recommended local Responses-style path.
4) Important exception, Ollama
If your "local llama" is via Ollama, do not use its /v1 OpenAI-compatible path for this.
Docs explicitly say to use:
api: "ollama"
baseUrl: "http://host:11434"
and not http://host:11434/v1, because tool calling is not reliable there.
Docs:
- https://docs.openclaw.ai/providers/openai
- https://docs.openclaw.ai/gateway/local-models
- https://docs.openclaw.ai/providers/ollama
If you want, I can sketch the safest setup for either:
- OmniRoute with sticky upstreams, or
- local llama via LM Studio / vLLM.
I'm using llama.cpp as ollama lacks the kv-cache quants to reduce mem-usage for long context
Yep, that's a fair reason.
In that case I'd treat llama.cpp as a generic custom provider and be a bit conservative:
- if your front end exposes a solid /v1/responses surface, use api: "openai-responses"
- if its Responses support is partial/flaky, use api: "openai-completions" instead
- if you keep OmniRoute in front, that may actually be the nicer setup because it can normalize some provider quirks
For direct llama.cpp, I would not assume portable provider-side state. If the backend restarts, reroutes, or loses its response chain, follow-up tool turns can break more easily than with native OpenAI.
One extra llama.cpp gotcha, if you start seeing weird 500s / parse errors, it is often a chat template issue. Setting an explicit template is usually safer than trusting the GGUF's embedded one.
So my practical take:
- best stability: OmniRoute in front of llama.cpp, sticky routing per conversation
- best simplicity: direct llama.cpp only if its /v1/responses tool flow is already proven stable
- fallback: use openai-completions if Responses mode gets brittle
Docs:
- https://docs.openclaw.ai/gateway/local-models
- https://docs.openclaw.ai/gateway/configuration-reference
If you want, paste the exact llama.cpp server you're running, llama-server, LiteLLM in front of it, something else, and I can tell you which API mode I'd pick.
It's a docker-container
llama:
image: ghcr.io/ggml-org/llama.cpp:server-vulkan
container_name: llama
ports:
- "10965:8080"
volumes:
- ./ollama/models:/models
devices:
- /dev/dri
- /dev/kfd
command:
- --model
- /models/gemma-4-26B-A4B-it-UD-IQ2_M.gguf
- --ctx-size
- "131072"
- --n-gpu-layers
- "99"
- --cache-type-k
- q4_0
- --cache-type-v
- q4_0
- --host
- 0.0.0.0
- --port
- "8080"
- --jinja
- --log-verbose
restart: unless-stopped
Got it. For that exact llama.cpp server, I'd be a bit cautious.
My practical recommendation:
What I'd do
- Best bet: keep OmniRoute in front of it, let OmniRoute expose the cleaner /v1/responses surface
- If going direct from OpenClaw to llama.cpp: I'd start with openai-completions, not openai-responses, unless you've already proven multi-turn tools are stable
- If you really want direct openai-responses: test tool-call continuations hard, because that is where these servers usually get weird
The part I'm suspicious of
--jinja
That setting is often where llama.cpp starts getting brittle with OpenAI-style agent/tool payloads. If you see odd 400/500 parse failures, first things I'd try are:
- remove --jinja, or
- keep it, but set an explicit --chat-template ... / --chat-template-file ...
I've seen llama.cpp setups fail because the embedded GGUF chat template was not actually compatible with the request shape.
So for your setup
- OmniRoute -> llama.cpp: yes, I like this
- OpenClaw -> OmniRoute with api: "openai-responses": also good
- OpenClaw -> raw llama.cpp: doable, but I would treat it as experimental for agentic tool use
If you want the shortest answer, mine is:
for this container, I'd trust OmniRoute + sticky routing more than raw llama.cpp openai-responses.
If you want, I can sketch the exact models.providers block I'd use for either:
- direct llama.cpp, or
- OmniRoute in front of llama.cpp.
I have disabled streaming in Discord and set verbose off and reasoning off, but I still seem to get internal thinking. I'm on the Gemini gemma-4-31b-it
I think there are two different failure modes here:
1) If it shows up as a separate message starting with Reasoning:
That is still an OpenClaw setting/override somewhere.
In the same thread, send:
/verbose off
/trace off
/reasoning off
Then check the session in Control UI and set overrides back to inherit. The FAQ calls that out because session overrides can stick even after you think you turned them off.
2) If it is plain "thinking out loud" inside the normal reply
That is usually the model/proxy itself, not Discord streaming.
And your setup makes that more likely:
- Gemma 4 supports thinking
- OpenClaw only does the Google-specific "thinking off" mapping when you use the bundled google provider
- through OmniRoute as a generic openai-responses provider, OpenClaw does not get that provider-specific behavior
So if OmniRoute or the upstream Gemma surface is still letting the model emit internal thoughts as ordinary text, OpenClaw cannot always distinguish that from the final answer, unless it comes in proper reasoning fields or inside <think>...</think> tags.
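To show why the <think>...</think> form is recoverable while plain concatenated text is not, here is a minimal sketch of tag-based splitting. `split_think_tags` is a hypothetical helper written for this example; it is not OpenClaw's actual parser.

```python
import re

# Non-greedy match so several <think> blocks in one reply are handled separately.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think_tags(text: str) -> tuple[str, str]:
    """Return (reasoning, visible) for a reply that may contain <think> blocks."""
    reasoning = "\n".join(block.strip() for block in THINK_BLOCK.findall(text))
    visible = THINK_BLOCK.sub("", text).strip()
    return reasoning, visible


print(split_think_tags("<think>1+1 is 2</think>2"))
print(split_think_tags("Long reasoning text..Final response"))
```

The second call has no tags to key on, so all of the text comes back as visible output, which is the situation described above.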
What I'd try next
- turn thinking/reasoning off at OmniRoute / upstream model config, not just in OpenClaw
- if OmniRoute has a per-route "disable thinking" or equivalent, use that
- if it does not, try a route/model that cleanly separates reasoning from final text
- also send /trace off too, just in case
So, short version:
- if it is Reasoning:, it is an OpenClaw override
- if it is plain internal monologue in the answer, it is probably your Gemma proxy path
If you paste one redacted example of the "thinking" text, I can usually tell in one glance which of the two it is.
Yeah, the issue seems to be in OmniRoute. This is the raw response from Gemini
{
"_streamed": true,
"_format": "sse-json",
"_stage": "provider_response",
"_eventCount": 28,
"summary": {
"candidates": [
{
"index": 0,
"content": {
"role": "model",
"parts": [
{
"text": "Long reasoning text..",
"thought": true
},
{
"text": "Final response"
}
]
},
"finishReason": "STOP"
}
],
"usageMetadata": {
"promptTokenCount": 34743,
"totalTokenCount": 35343,
"thoughtsTokenCount": 577,
"candidatesTokenCount": 23,
"cachedContentTokenCount": 32536
},
"modelVersion": "gemma-4-31b-it"
}
}
But the response returned back to OpenClaw is this:
{
"_streamed": true,
"_format": "sse-json",
"_stage": "client_response",
"_eventCount": 41,
"summary": {
"choices": [
{
"message": {
"role": "assistant",
"content": "The parts above concatenated"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 34743,
"completion_tokens": 600,
"total_tokens": 35343
},
"_streamed": true
}
}
Yep, that's the bug.
OmniRoute is taking:
- Gemini part 1, text + thought: true
- Gemini part 2, final answer text
and flattening both into one plain assistant content string.
So from OpenClaw's point of view, there is no reasoning metadata left, just normal assistant text. That is why:
- /reasoning off does nothing
- streaming settings do nothing
- verbose settings do nothing
OpenClaw never got "reasoning", it only got concatenated final text.
The token math even confirms it:
- thoughtsTokenCount: 577
- candidatesTokenCount: 23
- returned completion_tokens: 600
So OmniRoute is basically merging thought tokens + final tokens into one completion.
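Spelled out as a quick check, using only the numbers from the two payloads pasted above (the variable names are just for this example):

```python
# Usage numbers copied from the provider_response and client_response pastes.
gemini_usage = {
    "promptTokenCount": 34743,
    "totalTokenCount": 35343,
    "thoughtsTokenCount": 577,
    "candidatesTokenCount": 23,
}
proxy_usage = {"prompt_tokens": 34743, "completion_tokens": 600, "total_tokens": 35343}

# Thought tokens plus visible answer tokens on the Gemini side.
merged_output = gemini_usage["thoughtsTokenCount"] + gemini_usage["candidatesTokenCount"]

print(merged_output)  # 600
print(merged_output == proxy_usage["completion_tokens"])  # True
print(gemini_usage["promptTokenCount"] + merged_output == gemini_usage["totalTokenCount"])  # True
```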
What OmniRoute should do instead
For OpenClaw compatibility, the proxy should either:
- drop parts[].thought === true from assistant content entirely, and only return the final visible text, or
- map thought parts into a separate reasoning field/channel, but not concatenate them into message.content
In pseudo-logic:
visible = parts.filter(p => !p.thought).map(p => p.text).join("")
hidden = parts.filter(p => p.thought).map(p => p.text).join("")
and only visible should become the assistant reply text.
So yeah, I'm pretty confident OpenClaw is behaving correctly here. OmniRoute is erasing the boundary between thought and answer.
If OmniRoute has an issue tracker, I'd file it as:
Gemini parts[].thought === true is concatenated into OpenAI-compatible assistant content instead of being dropped or mapped separately.
That's a solid bug report.
If I would like to have the thought in the response, how should OmniRoute handle the response so OpenClaw sees it correctly?
If you want OpenClaw to handle it correctly, OmniRoute should return the thought as a separate reasoning item, not concatenate it into assistant text.
Best shape, if OmniRoute exposes openai-responses
Return something like:
{
"id": "resp_123",
"object": "response",
"output": [
{
"id": "rs_1",
"type": "reasoning",
"summary": "Long reasoning text.."
},
{
"id": "msg_1",
"type": "message",
"role": "assistant",
"content": [
{
"type": "output_text",
"text": "Final response"
}
]
}
],
"usage": {
"input_tokens": 34743,
"output_tokens": 600,
"total_tokens": 35343
}
}
summary can also be content, OpenClaw accepts either for reasoning items.
Why this works
Then OpenClaw can tell:
- this part is reasoning
- this part is the visible final answer
So:
- /reasoning off keeps the thought hidden
- /reasoning on can show it as a separate Reasoning: message
- normal replies only show Final response
What OmniRoute should not do
Do not do this:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "Long reasoning text..Final response"
}
}
]
}
At that point OpenClaw has no way to know where reasoning ends and answer begins.
If OmniRoute only returns Chat Completions
Then honestly, there is no clean standard place for reasoning. In that case the safest choice is:
- omit thought text entirely from message.content, or
- switch OmniRoute's client-facing surface to Responses API, not Chat Completions
So the short answer is:
reasoning must be a separate output item, assistant answer must be a separate message item
If you want, I can mock the exact SSE event sequence OmniRoute should emit for streamed openai-responses.
Alright, so now gemini responds with summary.choices, is that also in the spec? I'm quite lost as you say output as the attribute for the response
Yeah, I think the confusing part is that summary is probably OmniRoute's debug wrapper, not the actual API spec.
Short version
- Gemini native uses candidates[].content.parts[]
- OpenAI Chat Completions uses choices[].message.content
- OpenAI Responses uses output[]
So:
- summary.choices is not the Responses spec
- it is either:
  - a Chat Completions-shaped summary, or
  - just OmniRoute's own debug/inspection format
The key question
What endpoint is OmniRoute exposing to OpenClaw?
If OmniRoute exposes Chat Completions
Then this is normal:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "..."
}
}
]
}
and OpenClaw should use:
api: "openai-completions"
If OmniRoute exposes Responses API
Then the response should be Responses-shaped, meaning separate output items, not choices:
{
"output": [
{
"type": "reasoning",
"summary": "..."
},
{
"type": "message",
"role": "assistant",
"content": [
{ "type": "output_text", "text": "Final response" }
]
}
]
}
and OpenClaw should use:
api: "openai-responses"
So I'd say this plainly
If OmniRoute is returning choices, then it is acting like Chat Completions, not Responses.
If you want OpenClaw to see reasoning separately, OmniRoute needs to expose a real Responses-shaped payload, not a Chat Completions one.
Also, the summary.* object you pasted is very likely not the raw wire payload OpenClaw sees. It looks like a diagnostic summary.
If you want, paste the actual raw HTTP body or SSE events that OmniRoute sends back on the endpoint OpenClaw is calling, and I can tell you in one shot whether it is Chat Completions or Responses.
OmniRoute uses Gemini endpoint like this:
Ah, that paste helps a lot.
What you linked is native Gemini SSE, specifically :streamGenerateContent?alt=sse.
So the shapes are:
- Gemini native -> candidates[].content.parts[]
- OpenAI Chat Completions -> choices[]
- OpenAI Responses -> output[]
So no, the Gemini payload itself is not using either choices or output. OmniRoute has to translate it.
If OmniRoute wants to expose Chat Completions
It should translate Gemini into:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "Final response"
},
"finish_reason": "stop"
}
]
}
But then the Gemini thought text should be dropped or kept in proxy-only metadata, because Chat Completions does not give OpenClaw a clean standard reasoning slot here.
If OmniRoute wants to expose Responses API
Then it should translate Gemini into:
{
"id": "resp_123",
"object": "response",
"created_at": 1713210000,
"status": "completed",
"model": "gemma-4-31b-it",
"output": [
{
"id": "rs_1",
"type": "reasoning",
"summary": "Long reasoning text.."
},
{
"id": "msg_1",
"type": "message",
"role": "assistant",
"content": [
{ "type": "output_text", "text": "Final response" }
]
}
],
"usage": {
"input_tokens": 34743,
"output_tokens": 600,
"total_tokens": 35343
}
}
Mapping from Gemini to Responses
Roughly:
- Gemini parts[].thought === true -> Responses output[].type = "reasoning"
- Gemini normal text parts -> Responses output[].type = "message" with output_text
- Gemini promptTokenCount -> usage.input_tokens
- Gemini output side tokens -> usage.output_tokens
- Gemini totalTokenCount -> usage.total_tokens
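The mapping above can be sketched as one small function for the non-streaming case. This is an illustration under assumptions, not OmniRoute code: `gemini_to_responses` and the fixed ids are hypothetical, the reasoning summary is kept as a plain string to match the shape shown in this thread, and "output side tokens" is taken to mean thoughtsTokenCount + candidatesTokenCount.

```python
def gemini_to_responses(gemini: dict, response_id: str = "resp_123") -> dict:
    """Translate one non-streamed Gemini payload into a Responses-shaped dict."""
    parts = gemini["candidates"][0]["content"]["parts"]
    # Keep thought parts and visible parts apart instead of concatenating them.
    reasoning = "".join(p.get("text", "") for p in parts if p.get("thought"))
    visible = "".join(p.get("text", "") for p in parts if not p.get("thought"))

    output = []
    if reasoning:
        output.append({"id": "rs_1", "type": "reasoning", "summary": reasoning})
    output.append({
        "id": "msg_1",
        "type": "message",
        "role": "assistant",
        "content": [{"type": "output_text", "text": visible}],
    })

    usage = gemini.get("usageMetadata", {})
    return {
        "id": response_id,
        "object": "response",
        "status": "completed",
        "model": gemini.get("modelVersion"),
        "output": output,
        "usage": {
            "input_tokens": usage.get("promptTokenCount", 0),
            "output_tokens": usage.get("thoughtsTokenCount", 0)
            + usage.get("candidatesTokenCount", 0),
            "total_tokens": usage.get("totalTokenCount", 0),
        },
    }


example = {
    "candidates": [{"content": {"role": "model", "parts": [
        {"text": "Long reasoning text..", "thought": True},
        {"text": "Final response"},
    ]}}],
    "usageMetadata": {"promptTokenCount": 34743, "totalTokenCount": 35343,
                      "thoughtsTokenCount": 577, "candidatesTokenCount": 23},
    "modelVersion": "gemma-4-31b-it",
}
print(gemini_to_responses(example)["output"])
```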
Important practical note
If you want OmniRoute streaming to be easy and correct, I'd do this:
- buffer Gemini thought: true chunks separately
- stream only the final visible answer text
- attach the reasoning as a separate reasoning item in the final completed response
That is much safer than trying to shove Gemini thoughts into Chat Completions choices.
So the main thing is, Gemini native is neither choices nor output. Those are translation targets OmniRoute must choose.
If you want, I can sketch the exact translation function from Gemini SSE chunks to OpenAI Responses JSON.
Give me a curl example matching my example for openai-responses that I can probe OmniRoute to see how it handles it
Yep, try these.
Streaming probe, OpenAI Responses style
curl -N https://YOUR-OMNIROUTE/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OMNIROUTE_API_KEY" \
-d '{
"model": "gemma-4-31b-it",
"stream": true,
"input": [
{
"type": "message",
"role": "user",
"content": [
{ "type": "input_text", "text": "Ansver 1+1 only." }
]
}
]
}'
Non-streaming probe
curl https://YOUR-OMNIROUTE/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OMNIROUTE_API_KEY" \
-d '{
"model": "gemma-4-31b-it",
"stream": false,
"input": [
{
"type": "message",
"role": "user",
"content": [
{ "type": "input_text", "text": "Ansver 1+1 only." }
]
}
]
}' | jq
What you want to see
If OmniRoute is doing Responses correctly, the final payload should look roughly like:
{
"id": "resp_...",
"object": "response",
"status": "completed",
"output": [
{
"type": "reasoning",
"summary": "..."
},
{
"type": "message",
"role": "assistant",
"content": [
{ "type": "output_text", "text": "2" }
]
}
]
}
What means it is still Chat Completions-ish / broken for reasoning
If you see this shape instead:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "long thought text ... 2"
}
}
]
}
then OmniRoute is not giving you proper Responses output separation.
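If you would rather not eyeball the JSON, here is a small sketch that classifies a non-streamed body by the two shapes above. `classify_payload` and its return labels are made up for this example.

```python
import json


def classify_payload(body: str) -> str:
    """Label a non-streamed response body by its top-level shape."""
    data = json.loads(body)
    if data.get("object") == "response" or "output" in data:
        types = [item.get("type") for item in data.get("output", [])]
        # A reasoning item means thought and answer arrived separately.
        return "responses+reasoning" if "reasoning" in types else "responses"
    if data.get("object") == "chat.completion" or "choices" in data:
        return "chat-completions"
    return "unknown"


separated = '{"object": "response", "output": [{"type": "reasoning", "summary": "..."}, {"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": "2"}]}]}'
merged = '{"choices": [{"message": {"role": "assistant", "content": "long thought text ... 2"}}]}'
print(classify_payload(separated))
print(classify_payload(merged))
```

Saving the curl output to a file and passing its contents in works the same way.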
If you want, paste the raw /v1/responses result and I'll sanity-check it.
Yep, this is the smoking gun.
What your paste shows
Non-streaming
This returns:
{
"object": "chat.completion",
"choices": [...]
}
So if that came from /v1/responses, it is already spec-wrong.
That is Chat Completions shape, not Responses shape.
Streaming
This is closer to Responses, because it emits:
- response.created
- response.output_text.delta
- response.completed
But it is still semantically wrong, because OmniRoute is putting Gemini's thought: true text into:
"delta": "..."
for the assistant output text.
That means OpenClaw sees it as normal answer text, not reasoning.
So the bug is now very clear
OmniRoute is doing:
- non-stream: Chat Completions object
- stream: Responses event names, but with thoughts merged into assistant text
That is exactly why OpenClaw leaks the internal thinking on the streamed path.
What OmniRoute should do
For /v1/responses non-stream
Return a real Responses object:
{
"id": "resp_...",
"object": "response",
"status": "completed",
"output": [
{ "type": "reasoning", "summary": "..." },
{
"type": "message",
"role": "assistant",
"content": [{ "type": "output_text", "text": "2" }]
}
]
}
For /v1/responses stream
Do not send Gemini thought chunks as response.output_text.delta.
Instead:
- buffer thought: true chunks separately
- stream only final visible answer via response.output_text.delta
- optionally include reasoning as a separate reasoning item by the end
Very likely explanation for your OpenClaw behavior
OpenClaw is probably hitting the streaming upstream path, so it gets the broken stream translation, not the clean non-stream 2.
Short verdict
- non-stream path: cleaner output, but wrong schema for /v1/responses
- stream path: right event family, wrong content mapping
Honestly, you've narrowed it down perfectly. This is an OmniRoute translation bug now.
If you want, I can write the exact minimal correct streamed event sequence OmniRoute should emit for this Gemini example.
This is the response I get from Gemini on this https://generativelanguage.googleapis.com/v1beta/models/gemma-4-31b-it:streamGenerateContent?alt=sse for streaming and without sse for non-streaming
https://pastebin.com/vrJi8xZL
@safe pumice
Yep, I'm here.
Do you want the minimal correct streamed openai-responses event sequence for OmniRoute next?
yeah, and is it possible to disable the streaming from OpenClaw so I can avoid the thinking bloat until Omniroute is fixed
Yep, there's a decent workaround.
Temporary workaround: disable upstream model streaming in OpenClaw
I checked the config schema, and there is a per-model flag:
agents: {
defaults: {
models: {
"omniroute/gemma-4-31b-it": {
streaming: false
}
}
}
}
If you also keep it as your primary model, a fuller example is:
{
agents: {
defaults: {
model: { primary: "omniroute/gemma-4-31b-it" },
models: {
"omniroute/gemma-4-31b-it": {
alias: "Gemma 4 via OmniRoute",
streaming: false
}
}
}
}
}
That should make OpenClaw use the non-streaming upstream path, which is exactly what you want until OmniRoute fixes its broken streamed Responses translation.
Important, this is separate from Discord preview streaming.
You already turned channel streaming off, but this one is about LLM transport streaming upstream.
If it does not pick up immediately, I'd restart the gateway after changing config.
Minimal correct streamed openai-responses event sequence for OmniRoute
Since OpenClaw's Responses schema has reasoning items but not a special reasoning-delta event, the safest stream shape is:
- response.created
- response.in_progress
- buffer Gemini thought: true chunks internally
- response.output_item.added for the assistant message
- response.content_part.added
- response.output_text.delta only for the final visible answer, for example "2"
- response.output_text.done
- response.content_part.done
- response.output_item.done for the assistant message
- response.completed, with final response.output containing:
  - one reasoning item
  - one assistant message item
So the final response.completed should look roughly like:
{
"type": "response.completed",
"response": {
"id": "resp_123",
"object": "response",
"created_at": 1776291308,
"status": "completed",
"model": "gemma-4-31b-it",
"output": [
{
"id": "rs_1",
"type": "reasoning",
"summary": "* Input: \"Ansver 1+1 only.\" ..."
},
{
"id": "msg_1",
"type": "message",
"role": "assistant",
"content": [
{ "type": "output_text", "text": "2" }
]
}
],
"usage": {
"input_tokens": 9,
"output_tokens": 112,
"total_tokens": 121
}
}
}
The key bit is:
never send Gemini thought: true chunks as response.output_text.delta
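Here is a minimal sketch of that buffering logic as a generator, following the event order listed above. It is an illustration only: `translate_stream` is a hypothetical name, the event payloads are heavily simplified (real events carry ids, indexes, and more fields), and the input is assumed to be already-parsed Gemini SSE chunks.

```python
from typing import Iterable, Iterator

# Events that open the assistant message; emitted once, on the first visible text.
OPEN_EVENTS = [
    {"type": "response.output_item.added", "item": {"type": "message", "role": "assistant"}},
    {"type": "response.content_part.added"},
]


def translate_stream(gemini_chunks: Iterable[dict]) -> Iterator[dict]:
    """Yield simplified Responses events; thought text is buffered, never streamed."""
    reasoning, visible, opened = [], [], False
    yield {"type": "response.created"}
    yield {"type": "response.in_progress"}
    for chunk in gemini_chunks:
        for candidate in chunk.get("candidates", []):
            for part in candidate.get("content", {}).get("parts", []):
                text = part.get("text", "")
                if not text:
                    continue
                if part.get("thought"):
                    reasoning.append(text)  # held back until the final response
                    continue
                if not opened:
                    yield from OPEN_EVENTS
                    opened = True
                visible.append(text)
                yield {"type": "response.output_text.delta", "delta": text}
    if not opened:
        yield from OPEN_EVENTS
    answer = "".join(visible)
    yield {"type": "response.output_text.done", "text": answer}
    yield {"type": "response.content_part.done"}
    yield {"type": "response.output_item.done"}
    output = []
    if reasoning:
        output.append({"type": "reasoning", "summary": "".join(reasoning)})
    output.append({"type": "message", "role": "assistant",
                   "content": [{"type": "output_text", "text": answer}]})
    yield {"type": "response.completed",
           "response": {"object": "response", "status": "completed", "output": output}}


chunks = [
    {"candidates": [{"content": {"parts": [{"text": "* Input: ...", "thought": True}]}}]},
    {"candidates": [{"content": {"parts": [{"text": "2"}]}}]},
]
for event in translate_stream(chunks):
    print(event["type"])
```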
If you want, I can also give you a ready-to-paste OmniRoute issue report text.
What would completions streaming look like? Give me a simple curl for the same query I had
Yep, for Chat Completions streaming the curl would be:
curl -N https://YOUR-OMNIROUTE/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OMNIROUTE_API_KEY" \
-d '{
"model": "gemma-4-31b-it",
"stream": true,
"messages": [
{
"role": "user",
"content": "Ansver 1+1 only."
}
]
}'
Expected shape
You should see SSE chunks roughly like:
data: {"choices":[{"delta":{"role":"assistant","content":"2"},"index":0}]}
data: {"choices":[{"delta":{},"finish_reason":"stop","index":0}]}
data: [DONE]
What would be wrong
If OmniRoute streams something like:
data: {"choices":[{"delta":{"content":"* Input: \"Ansver 1+1 only.\" ..."}}]}
then it is still leaking Gemini thought text into the visible assistant stream.
If you want, I can also give you the non-streaming Chat Completions curl right below it so you can compare both side by side.
Yep, same problem there too.
That paste shows OmniRoute is leaking the Gemini thoughts directly into Chat Completions delta.content:
"delta": { "content": "* Input: \"Ansver 1+1 only.\"\n" }
and only at the very end it reports:
"completion_tokens_details": { "reasoning_tokens": 101 }
So the proxy clearly knows those were reasoning tokens, but it still streamed them as visible assistant text. That is the bug.
What this means
- Responses streaming: broken
- Chat Completions streaming: also broken
- Non-streaming: seems much cleaner in your earlier probe
So your best temporary workaround is still:
agents.defaults.models["omniroute/gemma-4-31b-it"].streaming = false
That should force OpenClaw onto the non-stream upstream path and avoid the thought bloat until OmniRoute fixes its streaming translation.
Short verdict for OmniRoute
For streamed Chat Completions, it should not put Gemini thought text into choices[].delta.content.
If it wants to stay Chat Completions-compatible, it should stream only the visible answer text, for your test basically just:
data: {"choices":[{"delta":{"role":"assistant"}}]}
data: {"choices":[{"delta":{"content":"2"}}]}
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]
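A minimal sketch of producing exactly that stream from Gemini chunks, assuming the chunks are already parsed into dicts. `chat_completion_sse` is a hypothetical name, and real chunks would also carry id, model, and created fields.

```python
import json
from typing import Iterable, Iterator


def sse(payload: dict) -> str:
    """Format one JSON payload as a server-sent event line."""
    return "data: " + json.dumps(payload, separators=(",", ":")) + "\n\n"


def chat_completion_sse(gemini_chunks: Iterable[dict]) -> Iterator[str]:
    """Yield Chat Completions SSE lines containing visible answer text only."""
    yield sse({"choices": [{"delta": {"role": "assistant"}}]})
    for chunk in gemini_chunks:
        for candidate in chunk.get("candidates", []):
            for part in candidate.get("content", {}).get("parts", []):
                text = part.get("text", "")
                # Thought parts are skipped: there is no standard slot for them here.
                if part.get("thought") or not text:
                    continue
                yield sse({"choices": [{"delta": {"content": text}}]})
    yield sse({"choices": [{"delta": {}, "finish_reason": "stop"}]})
    yield "data: [DONE]\n\n"


sample = [
    {"candidates": [{"content": {"parts": [{"text": "* Input: ...", "thought": True}]}}]},
    {"candidates": [{"content": {"parts": [{"text": "2"}]}}]},
]
print("".join(chat_completion_sse(sample)), end="")
```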
So yeah, you've confirmed it now, this is not an OpenClaw issue, it's OmniRoute's streamed translation on both API surfaces.
Yeah, give me an issue report and how to implement the reasoning properly with streaming
Yep. Here's a ready-to-paste one.
OmniRoute issue report
Title
Gemini streamed reasoning is leaked into OpenAI-compatible visible output on both /v1/responses and /v1/chat/completions
Body
I found a translation bug when OmniRoute fronts Gemini / Gemma models.
What Gemini returns natively
Gemini streaming returns content parts like:
{
"candidates": [
{
"content": {
"parts": [
{ "text": "internal reasoning...", "thought": true },
{ "text": "final answer" }
]
}
}
]
}
So Gemini clearly distinguishes:
- hidden reasoning: part.thought === true
- visible answer: normal text part
What OmniRoute currently does
1) Chat Completions streaming
OmniRoute streams Gemini reasoning directly into visible assistant deltas:
data: {
"choices": [
{
"delta": {
"content": "* Input: \"Ansver 1+1 only.\" ..."
}
}
]
}
At the end it also reports reasoning token metadata:
"completion_tokens_details": {
"reasoning_tokens": 101
}
So the proxy knows these are reasoning tokens, but still emits them as visible content.
2) Responses streaming
OmniRoute emits response.output_text.delta containing Gemini reasoning text, instead of separating reasoning from final answer.
3) Responses non-streaming
The /v1/responses non-stream path appears to return Chat Completions shape:
{
"object": "chat.completion",
"choices": [...]
}
instead of Responses shape:
{
"object": "response",
"output": [...]
}
Expected behavior
For /v1/chat/completions
- part.thought === true should not be emitted via choices[].delta.content
- only visible final answer text should be streamed as content
- reasoning can be dropped or kept in proxy-only metadata
For /v1/responses
- Gemini thought parts should become a separate reasoning output item
- visible answer should become a separate assistant message output item
- response.output_text.delta should only stream visible answer text
Example final Responses shape:
{
"object": "response",
"output": [
{
"type": "reasoning",
"summary": "internal reasoning..."
},
{
"type": "message",
"role": "assistant",
"content": [
{ "type": "output_text", "text": "2" }
]
}
]
}
Why this matters
Clients like OpenClaw correctly hide or show reasoning only when it is delivered separately.
If OmniRoute concatenates or streams Gemini thought text as visible assistant content, clients cannot distinguish reasoning from final answer.
How to implement it properly
Shared rule
When consuming Gemini parts:
if (part.thought === true) {
reasoningBuffer += part.text ?? "";
} else {
visibleBuffer += part.text ?? "";
}
For Chat Completions streaming
There is no clean standard reasoning slot, so:
- do not stream thought text into delta.content
- stream only visible text
So for your test prompt, the stream should be basically:
data: {"choices":[{"delta":{"role":"assistant"}}]}
data: {"choices":[{"delta":{"content":"2"}}]}
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]
For Responses streaming
Best approach:
- buffer Gemini reasoning chunks separately
- stream only visible answer as response.output_text.delta
- on completion, include a separate reasoning item in the final response.completed
Minimal good event flow
event: response.created
event: response.in_progress
event: response.output_item.added # assistant message item
event: response.content_part.added
event: response.output_text.delta # only visible text, e.g. "2"
event: response.output_text.done
event: response.content_part.done
event: response.output_item.done
event: response.completed # final response includes:
# - reasoning item
# - assistant message item
data: [DONE]
Final response.completed should include
{
"response": {
"object": "response",
"output": [
{
"id": "rs_1",
"type": "reasoning",
"summary": "internal reasoning..."
},
{
"id": "msg_1",
"type": "message",
"role": "assistant",
"content": [
{ "type": "output_text", "text": "2" }
]
}
]
}
}
Temporary OpenClaw workaround
Until OmniRoute is fixed, I'm using:
agents.defaults.models["omniroute/gemma-4-31b-it"].streaming = false
That avoids the broken streamed translation path.
If you want, I can also turn that into a shorter maintainer-friendly version.
alright, but completions API cannot support the reasoning?
Not cleanly or portably, no.
For OpenAI-compatible Chat Completions, the standard visible output slot is basically just:
choices[].message.content
or streamed:
choices[].delta.content
So there is no standard separate reasoning channel like Responses has with output[].type = "reasoning".
Practically
- Standard Chat Completions: final visible text only
- Responses API: can separate reasoning from answer properly
- Vendor extensions: possible, but non-standard
So OmniRoute could invent something like:
choices[0].message.reasoning
or
choices[0].delta.reasoning_content
but that is proxy-specific, not normal Chat Completions, and OpenClaw's generic openai-completions path should not be expected to understand it.
Short answer
- Can Chat Completions carry reasoning at all? Yes, with vendor-specific hacks.
- Can it carry reasoning in a standard way that OpenClaw can reliably treat as separate reasoning? Not really.
If you want reasoning to round-trip correctly, Responses API is the right surface.
Is there no standard for streaming the reasoning while events are being emitted? Or is it only available after the request is done?
Short answer, not in a clean portable standard that I'd trust across proxies.
For the surface you and I are talking about:
- visible answer text has a clear streaming shape, like output_text.delta
- reasoning does not have the same well-settled generic streaming shape in the compatibility layer OmniRoute is trying to provide
So the safest interoperable pattern is:
- stream only final visible text live
- keep reasoning buffered
- emit reasoning as a separate reasoning item at output_item.done / response.completed
In practice
Good portable behavior
- live stream: answer text only
- end of request: attach reasoning item
Vendor-specific behavior
Some native providers do have their own live thought/reasoning events, but those are usually provider-specific, not something generic OpenAI-compatible clients can rely on uniformly.
So for OmniRoute -> OpenClaw, I would treat reasoning as:
available reliably at the end, not as a live generic delta stream
If they really want live reasoning, they'd need a proxy-specific extension, and then OpenClaw would also need explicit support for that extension.