#Call an OpenResponses endpoint for LLM


strange palm
safe pumiceBOT
#

Yep, that should work.

For an upstream LLM endpoint, configure it as a custom provider and set:

{
  models: {
    mode: "merge",
    providers: {
      "my-provider": {
        baseUrl: "https://your-provider.example/v1",
        apiKey: "${MY_PROVIDER_API_KEY}",
        api: "openai-responses",
        models: [
          {
            id: "my-model",
            name: "My Model",
            reasoning: false,
            input: ["text"],
            cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
            contextWindow: 128000,
            maxTokens: 8192,
          },
        ],
      },
    },
  },
}

A couple of gotchas:

strange palm
safe pumiceBOT
#

Nice, glad that worked 😌

Yeah, that’s the right shape for a generic upstream /v1/responses provider.

If this fixed it, please right-click the solution message → Apps → ✅ Mark Solution so Answer Overflow can index it for the next person.

strange palm
#

How does streaming work with discord? I seem to get the internal thought progress with "progress"

safe pumiceBOT
#

Yeah, that tracks.

On Discord, streaming: "progress" is not a separate mode. It just maps to partial. So OpenClaw sends a temporary preview message and keeps editing it as text arrives.

A few important bits:

  • Discord does not have true token streaming, it’s send + edit preview
  • progress on Discord == partial
  • if your upstream Responses provider streams “thinking” or intermediate text in the normal output, Discord preview can show it before the final answer settles

That’s especially likely with a generic custom openai-responses proxy, because OpenClaw treats it as an OpenAI-compatible proxy, not native OpenAI, so some native OpenAI reasoning cleanup/shaping does not apply.

If you want to stop that, I’d try:

{
  channels: {
    discord: {
      streaming: "off"
    }
  }
}

Also make sure these are off in that session:

/verbose off
/trace off
/reasoning off
strange palm
#

I think "/verbose off" was the culprit.

Back on topic about the openai-responses API mode.

OmniRoute seems to be doing better now at interacting with my Gemini provider. Before, around half of the requests were getting 400 responses, and the only difference was sometimes just a tool call that OpenClaw sent to say it's in progress.

Is OpenClaw's openai-responses API implementation fully stateful, or what will happen when OmniRoute chooses another upstream provider if Gemini is down?

Can my local llama also do openai-responses mode?

safe pumiceBOT
#

Yep, that makes sense. /verbose off would stop a lot of that noisy tool-progress stuff.

On the openai-responses part, the short version is:

1) Is it “fully stateful” across arbitrary upstream switches?

Not in the “safe to hop between different vendors mid-thread” sense.

What OpenClaw does:

  • keeps the session transcript statefully on its own side
  • on Responses transports, it can also chain turns with previous_response_id
  • for native OpenAI it additionally uses server-side features like store: true and optional context_management compaction

What it does not do for a generic custom proxy like OmniRoute:

  • it does not treat it like native OpenAI
  • OpenAI-only extras such as store, prompt-cache hints, and OpenAI reasoning-compat shaping do not apply on generic proxy URLs

So if OmniRoute reroutes a conversation from Gemini to some other upstream mid-session, continuity depends on OmniRoute, not OpenClaw.

If OmniRoute preserves the response chain and tool-call continuity, it may work fine.
If it does not, the next follow-up turn can blow up with 400s, especially around:

  • previous_response_id
  • tool-call / tool-result matching
  • provider-specific turn-shape quirks

So for agentic runs, I would strongly prefer sticky routing per conversation.
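
To illustrate what sticky routing per conversation could mean in practice, here is a minimal sketch. It is not OmniRoute's actual routing code: `pick_upstream` and the upstream names are hypothetical, and the only idea shown is that the same conversation key always maps to the same upstream.

```python
import hashlib


def pick_upstream(conversation_id: str, upstreams: list[str]) -> str:
    """Deterministically map a conversation to one upstream (hypothetical helper)."""
    if not upstreams:
        raise ValueError("no upstreams configured")
    # Hash the conversation key so the choice is stable across requests.
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(upstreams)
    return upstreams[index]


upstreams = ["gemini", "llama-cpp"]  # illustrative names only
first = pick_upstream("discord-thread-123", upstreams)
second = pick_upstream("discord-thread-123", upstreams)
```

Note that a plain hash-modulo choice changes when the upstream list changes, for example when Gemini is removed because it is down. A real proxy would also need to decide what happens to previous_response_id and pending tool calls at that point.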

2) Why might OmniRoute be helping?

Probably because it is normalizing the upstream better than the direct Gemini path you had before.

#

Gemini behind OpenAI-ish proxies can be picky about tool / replay turn shape, so a smarter proxy can absolutely reduce those 400s.

3) Can local Llama do openai-responses mode?

Yes, if it is behind a server that actually exposes a compatible /v1 Responses endpoint.

Good candidates from the docs:

  • LM Studio
  • vLLM
  • LiteLLM
  • other custom OpenAI-compatible /v1 servers

That is the recommended local Responses-style path.

4) Important exception, Ollama

#
  1. OmniRoute with sticky upstreams, or
  2. local llama via LM Studio / vLLM.
strange palm
#

I'm using llama.cpp as ollama lacks the kv-cache quants to reduce mem-usage for long context

safe pumiceBOT
#

Yep, that’s a fair reason.

In that case I’d treat llama.cpp as a generic custom provider and be a bit conservative:

  • if your front end exposes a solid /v1/responses surface, use api: "openai-responses"
  • if its Responses support is partial/flaky, use api: "openai-completions" instead
  • if you keep OmniRoute in front, that may actually be the nicer setup because it can normalize some provider quirks

For direct llama.cpp, I would not assume portable provider-side state. If the backend restarts, reroutes, or loses its response chain, follow-up tool turns can break more easily than with native OpenAI.

One extra llama.cpp gotcha, if you start seeing weird 500s / parse errors, it is often a chat template issue. Setting an explicit template is usually safer than trusting the GGUF’s embedded one.

So my practical take:

  • best stability: OmniRoute in front of llama.cpp, sticky routing per conversation
  • best simplicity: direct llama.cpp only if its /v1/responses tool flow is already proven stable
  • fallback: use openai-completions if Responses mode gets brittle
strange palm
#

It's a docker-container
llama:
  image: ghcr.io/ggml-org/llama.cpp:server-vulkan
  container_name: llama
  ports:
    - "10965:8080"
  volumes:
    - ./ollama/models:/models
  devices:
    - /dev/dri
    - /dev/kfd
  command:
    - --model
    - /models/gemma-4-26B-A4B-it-UD-IQ2_M.gguf
    - --ctx-size
    - "131072"
    - --n-gpu-layers
    - "99"
    - --cache-type-k
    - q4_0
    - --cache-type-v
    - q4_0
    - --host
    - 0.0.0.0
    - --port
    - "8080"
    - --jinja
    - --log-verbose
  restart: unless-stopped

safe pumiceBOT
#

Got it. For that exact llama.cpp server, I’d be a bit cautious.

My practical recommendation:

What I’d do

  • Best bet: keep OmniRoute in front of it, let OmniRoute expose the cleaner /v1/responses surface
  • If going direct from OpenClaw to llama.cpp: I’d start with openai-completions, not openai-responses, unless you’ve already proven multi-turn tools are stable
  • If you really want direct openai-responses: test tool-call continuations hard, because that is where these servers usually get weird

The part I’m suspicious of

--jinja

That setting is often where llama.cpp starts getting brittle with OpenAI-style agent/tool payloads. If you see odd 400/500 parse failures, first things I’d try are:

  • remove --jinja, or
  • keep it, but set an explicit --chat-template ... / --chat-template-file ...
#

I’ve seen llama.cpp setups fail because the embedded GGUF chat template was not actually compatible with the request shape.

So for your setup

  • OmniRoute -> llama.cpp: yes, I like this
  • OpenClaw -> OmniRoute with api: "openai-responses": also good
  • OpenClaw -> raw llama.cpp: doable, but I would treat it as experimental for agentic tool use

If you want the shortest answer, mine is:

for this container, I’d trust OmniRoute + sticky routing more than raw llama.cpp openai-responses.

If you want, I can sketch the exact models.providers block I’d use for either:

  1. direct llama.cpp, or
  2. OmniRoute in front of llama.cpp.
strange palm
#

I have disabled streaming in Discord and set verbose off and reasoning off, but I still seem to get internal thinking. I'm on the Gemini gemma-4-31b-it.

safe pumiceBOT
#

I think there are two different failure modes here:

1) If it shows up as a separate message starting with Reasoning:

That is still an OpenClaw setting/override somewhere.

In the same thread, send:

/verbose off
/trace off
/reasoning off

Then check the session in Control UI and set overrides back to inherit. The FAQ calls that out because session overrides can stick even after you think you turned them off.

2) If it is plain “thinking out loud” inside the normal reply

That is usually the model/proxy itself, not Discord streaming.

#

And your setup makes that more likely:

  • Gemma 4 supports thinking
  • OpenClaw only does the Google-specific “thinking off” mapping when you use the bundled google provider
  • through OmniRoute as a generic openai-responses provider, OpenClaw does not get that provider-specific behavior

So if OmniRoute or the upstream Gemma surface is still letting the model emit internal thoughts as ordinary text, OpenClaw cannot always distinguish that from the final answer, unless it comes in proper reasoning fields or inside <think>...</think> tags.
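
As a rough illustration of why tagged reasoning is separable and untagged reasoning is not, here is a small sketch that splits <think>...</think> blocks out of a reply. It is an assumption-laden example, not OpenClaw's actual parser; `split_think_tags` is a hypothetical name.

```python
import re

# Non-greedy match so several <think> blocks in one reply are handled separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think_tags(text: str) -> tuple[str, str]:
    """Return (reasoning, visible) for a reply that may contain <think> blocks."""
    reasoning = "".join(THINK_RE.findall(text))
    visible = THINK_RE.sub("", text).strip()
    return reasoning, visible


tagged = split_think_tags("<think>1+1 is 2</think>2")
untagged = split_think_tags("Long reasoning text..Final response")
```

With tags, the reasoning comes out cleanly. Without them, the whole string is treated as visible text, which is the situation described above.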

What I’d try next

  • turn thinking/reasoning off at OmniRoute / upstream model config, not just in OpenClaw
  • if OmniRoute has a per-route “disable thinking” or equivalent, use that
  • if it does not, try a route/model that cleanly separates reasoning from final text
  • also send /trace off, just in case

So, short version:

if it is Reasoning:, it is an OpenClaw override

strange palm
#

Yeah, the issue seems to be in OmniRoute. This is the raw response from Gemini

{
  "_streamed": true,
  "_format": "sse-json",
  "_stage": "provider_response",
  "_eventCount": 28,
  "summary": {
    "candidates": [
      {
        "index": 0,
        "content": {
          "role": "model",
          "parts": [
            {
              "text": "Long reasoning text..",
              "thought": true
            },
            {
              "text": "Final response"
            }
          ]
        },
        "finishReason": "STOP"
      }
    ],
    "usageMetadata": {
      "promptTokenCount": 34743,
      "totalTokenCount": 35343,
      "thoughtsTokenCount": 577,
      "candidatesTokenCount": 23,
      "cachedContentTokenCount": 32536
    },
    "modelVersion": "gemma-4-31b-it"
  }
}

But the response returned back to OpenClaw is this:

{
  "_streamed": true,
  "_format": "sse-json",
  "_stage": "client_response",
  "_eventCount": 41,
  "summary": {
    "choices": [
      {
        "message": {
          "role": "assistant",
          "content": "The parts above concatenated"
        },
        "finish_reason": "stop"
      }
    ],
    "usage": {
      "prompt_tokens": 34743,
      "completion_tokens": 600,
      "total_tokens": 35343
    },
    "_streamed": true
  }
}

safe pumiceBOT
#

Yep, that’s the bug.

OmniRoute is taking:

  • Gemini part 1, text + thought: true
  • Gemini part 2, final answer text

and flattening both into one plain assistant content string.

So from OpenClaw’s point of view, there is no reasoning metadata left, just normal assistant text. That is why:

  • /reasoning off does nothing
  • streaming settings do nothing
  • verbose settings do nothing

OpenClaw never got “reasoning”, it only got concatenated final text.

#

The token math even confirms it:

  • thoughtsTokenCount: 577
  • candidatesTokenCount: 23
  • returned completion_tokens: 600

So OmniRoute is basically merging thought tokens + final tokens into one completion.
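
The arithmetic can be checked directly against the two pastes. This is just a sanity check on the numbers quoted in the thread; the variable names are illustrative.

```python
# Values copied from the provider_response and client_response pastes above.
usage_metadata = {
    "promptTokenCount": 34743,
    "totalTokenCount": 35343,
    "thoughtsTokenCount": 577,
    "candidatesTokenCount": 23,
}
completion_tokens = 600  # what OmniRoute reported back to the client

# Thought tokens plus visible tokens should equal the reported completion size
# if the proxy merged both into one completion.
thought_plus_visible = (
    usage_metadata["thoughtsTokenCount"] + usage_metadata["candidatesTokenCount"]
)
merged = thought_plus_visible == completion_tokens

# The totals should also line up: prompt + completion == total.
total_ok = (
    usage_metadata["promptTokenCount"] + completion_tokens
    == usage_metadata["totalTokenCount"]
)
```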

What OmniRoute should do instead

For OpenClaw compatibility, the proxy should either:

  1. drop parts[].thought === true from assistant content entirely, and only return the final visible text, or
  2. map thought parts into a separate reasoning field/channel, but not concatenate them into message.content

In pseudo-logic:

#
visible = parts.filter(p => !p.thought).map(p => p.text).join("")
hidden = parts.filter(p => p.thought).map(p => p.text).join("")

and only visible should become the assistant reply text.

So yeah, I’m pretty confident OpenClaw is behaving correctly here. OmniRoute is erasing the boundary between thought and answer.

If OmniRoute has an issue tracker, I’d file it as:

Gemini parts[].thought === true is concatenated into OpenAI-compatible assistant content instead of being dropped or mapped separately.

That’s a solid bug report.

strange palm
#

If I would like to have the thought in the response, how should OmniRoute handle the response so OpenClaw sees it correctly?

safe pumiceBOT
#

If you want OpenClaw to handle it correctly, OmniRoute should return the thought as a separate reasoning item, not concatenate it into assistant text.

Best shape, if OmniRoute exposes openai-responses

Return something like:

{
  "id": "resp_123",
  "object": "response",
  "output": [
    {
      "id": "rs_1",
      "type": "reasoning",
      "summary": "Long reasoning text.."
    },
    {
      "id": "msg_1",
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "Final response"
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 34743,
    "output_tokens": 600,
    "total_tokens": 35343
  }
}

summary can also be content, OpenClaw accepts either for reasoning items.

Why this works

Then OpenClaw can tell:

  • this part is reasoning
  • this part is the visible final answer

So:

#
  • /reasoning off keeps the thought hidden
  • /reasoning on can show it as a separate Reasoning: message
  • normal replies only show Final response

What OmniRoute should not do

Do not do this:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Long reasoning text..Final response"
      }
    }
  ]
}

At that point OpenClaw has no way to know where reasoning ends and answer begins.

If OmniRoute only returns Chat Completions

Then honestly, there is no clean standard place for reasoning. In that case the safest choice is:

  • omit thought text entirely from message.content, or
  • switch OmniRoute’s client-facing surface to Responses API, not Chat Completions

So the short answer is:

#

reasoning must be a separate output item, assistant answer must be a separate message item

If you want, I can mock the exact SSE event sequence OmniRoute should emit for streamed openai-responses.

strange palm
#

Alright, so now Gemini responds with summary.choices. Is that also in the spec? I'm quite lost, since you say output is the attribute for the response.

safe pumiceBOT
#

Yeah, I think the confusing part is that summary is probably OmniRoute’s debug wrapper, not the actual API spec.

Short version

  • Gemini native uses candidates[].content.parts[]
  • OpenAI Chat Completions uses choices[].message.content
  • OpenAI Responses uses output[]

So:

  • summary.choices is not the Responses spec
  • it is either:
    • a Chat Completions-shaped summary, or
    • just OmniRoute’s own debug/inspection format
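
A quick way to tell the three shapes apart is to look at the top-level key. The sketch below is a hypothetical helper, `classify_payload`, that only inspects which key is present. It does not validate the payload against any spec.

```python
def classify_payload(body: dict) -> str:
    """Guess which API family a decoded JSON body belongs to (hypothetical helper)."""
    if "candidates" in body:
        return "gemini-native"      # candidates[].content.parts[]
    if "choices" in body:
        return "chat-completions"   # choices[].message.content
    if "output" in body:
        return "responses"          # output[] items
    return "unknown"


gemini_kind = classify_payload({"candidates": [{"content": {"parts": []}}]})
chat_kind = classify_payload({"choices": [{"message": {"content": "..."}}]})
responses_kind = classify_payload({"output": [{"type": "message"}]})
```

For a debug dump like the one pasted earlier, the wrapper would need to be unwrapped first, for example by passing the value of its summary key, assuming that key really holds the inner payload.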

The key question

#

What endpoint is OmniRoute exposing to OpenClaw?

If OmniRoute exposes Chat Completions

Then this is normal:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "..."
      }
    }
  ]
}
#

and OpenClaw should use:

api: "openai-completions"

If OmniRoute exposes Responses API

Then the response should be Responses-shaped, meaning separate output items, not choices:

{
  "output": [
    {
      "type": "reasoning",
      "summary": "..."
    },
    {
      "type": "message",
      "role": "assistant",
      "content": [
        { "type": "output_text", "text": "Final response" }
      ]
    }
  ]
}

and OpenClaw should use:

api: "openai-responses"
#

So I’d say this plainly

If OmniRoute is returning choices, then it is acting like Chat Completions, not Responses.

If you want OpenClaw to see reasoning separately, OmniRoute needs to expose a real Responses-shaped payload, not a Chat Completions one.

Also, the summary.* object you pasted is very likely not the raw wire payload OpenClaw sees. It looks like a diagnostic summary.

If you want, paste the actual raw HTTP body or SSE events that OmniRoute sends back on the endpoint OpenClaw is calling, and I can tell you in one shot whether it is Chat Completions or Responses.

strange palm
safe pumiceBOT
#

Ah, that paste helps a lot.

What you linked is native Gemini SSE, specifically :streamGenerateContent?alt=sse.

So the shapes are:

  • Gemini native → candidates[].content.parts[]
  • OpenAI Chat Completions → choices[]
  • OpenAI Responses → output[]

So no, the Gemini payload itself is not using either choices or output. OmniRoute has to translate it.

If OmniRoute wants to expose Chat Completions

It should translate Gemini into:

#
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Final response"
      },
      "finish_reason": "stop"
    }
  ]
}

But then the Gemini thought text should be dropped or kept in proxy-only metadata, because Chat Completions does not give OpenClaw a clean standard reasoning slot here.

If OmniRoute wants to expose Responses API

#

Then it should translate Gemini into:

{
  "id": "resp_123",
  "object": "response",
  "created_at": 1713210000,
  "status": "completed",
  "model": "gemma-4-31b-it",
  "output": [
    {
      "id": "rs_1",
      "type": "reasoning",
      "summary": "Long reasoning text.."
    },
    {
      "id": "msg_1",
      "type": "message",
      "role": "assistant",
      "content": [
        { "type": "output_text", "text": "Final response" }
      ]
    }
  ],
  "usage": {
    "input_tokens": 34743,
    "output_tokens": 600,
    "total_tokens": 35343
  }
}
#

Mapping from Gemini to Responses

Roughly:

  • Gemini parts[].thought === true → Responses output[].type = "reasoning"
  • Gemini normal text parts → Responses output[].type = "message" with output_text
  • Gemini promptTokenCount → usage.input_tokens
  • Gemini output side tokens → usage.output_tokens
  • Gemini totalTokenCount → usage.total_tokens
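
The mapping above can be sketched as a small translation function. This is an illustration only, not OmniRoute code: `gemini_to_responses` is a hypothetical name, the ids are placeholders, the reasoning summary is kept as a plain string to mirror the examples in this thread, and counting output tokens as thoughts plus candidates is an assumption based on the 577 + 23 = 600 figures seen earlier.

```python
def gemini_to_responses(gemini: dict, response_id: str = "resp_123") -> dict:
    """Translate one non-streamed Gemini result into a Responses-shaped dict."""
    parts = gemini["candidates"][0]["content"]["parts"]
    # Split thought parts from visible parts instead of concatenating them.
    reasoning = "".join(p.get("text", "") for p in parts if p.get("thought"))
    visible = "".join(p.get("text", "") for p in parts if not p.get("thought"))

    output = []
    if reasoning:
        output.append({"id": "rs_1", "type": "reasoning", "summary": reasoning})
    output.append({
        "id": "msg_1",
        "type": "message",
        "role": "assistant",
        "content": [{"type": "output_text", "text": visible}],
    })

    usage = gemini.get("usageMetadata", {})
    # Assumption: the output side is thought tokens plus visible candidate tokens.
    output_tokens = usage.get("thoughtsTokenCount", 0) + usage.get("candidatesTokenCount", 0)
    return {
        "id": response_id,
        "object": "response",
        "status": "completed",
        "model": gemini.get("modelVersion"),
        "output": output,
        "usage": {
            "input_tokens": usage.get("promptTokenCount", 0),
            "output_tokens": output_tokens,
            "total_tokens": usage.get("totalTokenCount", 0),
        },
    }


sample = {
    "candidates": [{"content": {"role": "model", "parts": [
        {"text": "Long reasoning text..", "thought": True},
        {"text": "Final response"},
    ]}}],
    "usageMetadata": {"promptTokenCount": 34743, "totalTokenCount": 35343,
                      "thoughtsTokenCount": 577, "candidatesTokenCount": 23},
    "modelVersion": "gemma-4-31b-it",
}
translated = gemini_to_responses(sample)
```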

Important practical note

If you want OmniRoute streaming to be easy and correct, I’d do this:

  • buffer Gemini thought: true chunks separately
  • stream only the final visible answer text
  • attach the reasoning as a separate reasoning item in the final completed response
#

That is much safer than trying to shove Gemini thoughts into Chat Completions choices.

So the main thing is, Gemini native is neither choices nor output. Those are translation targets OmniRoute must choose.

If you want, I can sketch the exact translation function from Gemini SSE chunks to OpenAI Responses JSON.

strange palm
#

Give me a curl example matching my example for openai-responses that I can probe OmniRoute to see how it handles it

safe pumiceBOT
#

Yep, try these.

Streaming probe, OpenAI Responses style

curl -N https://YOUR-OMNIROUTE/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OMNIROUTE_API_KEY" \
  -d '{
    "model": "gemma-4-31b-it",
    "stream": true,
    "input": [
      {
        "type": "message",
        "role": "user",
        "content": [
          { "type": "input_text", "text": "Ansver 1+1 only." }
        ]
      }
    ]
  }'

Non-streaming probe

curl https://YOUR-OMNIROUTE/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OMNIROUTE_API_KEY" \
  -d '{
    "model": "gemma-4-31b-it",
    "stream": false,
    "input": [
      {
        "type": "message",
        "role": "user",
        "content": [
          { "type": "input_text", "text": "Ansver 1+1 only." }
        ]
      }
    ]
  }' | jq

What you want to see

If OmniRoute is doing Responses correctly, the final payload should look roughly like:

#
{
  "id": "resp_...",
  "object": "response",
  "status": "completed",
  "output": [
    {
      "type": "reasoning",
      "summary": "..."
    },
    {
      "type": "message",
      "role": "assistant",
      "content": [
        { "type": "output_text", "text": "2" }
      ]
    }
  ]
}

What means it is still Chat Completions-ish / broken for reasoning

If you see this shape instead:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "long thought text ... 2"
      }
    }
  ]
}

then OmniRoute is not giving you proper Responses output separation.

If you want, paste the raw /v1/responses result and I’ll sanity-check it.

strange palm
safe pumiceBOT
#

Yep, this is the smoking gun.

What your paste shows

Non-streaming

This returns:

{
  "object": "chat.completion",
  "choices": [...]
}

So if that came from /v1/responses, it is already spec-wrong.
That is Chat Completions shape, not Responses shape.

#

Streaming

This is closer to Responses, because it emits:

  • response.created
  • response.output_text.delta
  • response.completed

But it is still semantically wrong, because OmniRoute is putting Gemini’s thought: true text into:

"delta": "..."

for the assistant output text.

That means OpenClaw sees it as normal answer text, not reasoning.

#

So the bug is now very clear

OmniRoute is doing:

  • non-stream: Chat Completions object
  • stream: Responses event names, but with thoughts merged into assistant text

That is exactly why OpenClaw leaks the internal thinking on the streamed path.

What OmniRoute should do

For /v1/responses non-stream

Return a real Responses object:

{
  "id": "resp_...",
  "object": "response",
  "status": "completed",
  "output": [
    { "type": "reasoning", "summary": "..." },
    {
      "type": "message",
      "role": "assistant",
      "content": [{ "type": "output_text", "text": "2" }]
    }
  ]
}

For /v1/responses stream

Do not send Gemini thought chunks as response.output_text.delta.

#

Instead:

  • buffer thought: true chunks separately
  • stream only final visible answer via response.output_text.delta
  • optionally include reasoning as a separate reasoning item by the end

Very likely explanation for your OpenClaw behavior

OpenClaw is probably hitting the streaming upstream path, so it gets the broken stream translation, not the clean non-stream 2.

Short verdict

  • non-stream path: cleaner output, but wrong schema for /v1/responses
  • stream path: right event family, wrong content mapping

Honestly, you’ve narrowed it down perfectly. This is an OmniRoute translation bug now.

#

If you want, I can write the exact minimal correct streamed event sequence OmniRoute should emit for this Gemini example.

strange palm
strange palm
#

@safe pumice

safe pumiceBOT
#

Yep, I’m here 🦐

Do you want the minimal correct streamed openai-responses event sequence for OmniRoute next?

strange palm
#

yeah, and is it possible to disable the streaming from OpenClaw so I can avoid the thinking bloat until OmniRoute is fixed?

safe pumiceBOT
#

Yep, there’s a decent workaround.

Temporary workaround: disable upstream model streaming in OpenClaw

I checked the config schema, and there is a per-model flag:

agents: {
  defaults: {
    models: {
      "omniroute/gemma-4-31b-it": {
        streaming: false
      }
    }
  }
}
#

If you also keep it as your primary model, a fuller example is:

{
  agents: {
    defaults: {
      model: { primary: "omniroute/gemma-4-31b-it" },
      models: {
        "omniroute/gemma-4-31b-it": {
          alias: "Gemma 4 via OmniRoute",
          streaming: false
        }
      }
    }
  }
}
#

That should make OpenClaw use the non-streaming upstream path, which is exactly what you want until OmniRoute fixes its broken streamed Responses translation.

Important, this is separate from Discord preview streaming.
You already turned channel streaming off, but this one is about LLM transport streaming upstream.

If it does not pick up immediately, I’d restart the gateway after changing config.


Minimal correct streamed openai-responses event sequence for OmniRoute

Since OpenClaw’s Responses schema has reasoning items but not a special reasoning-delta event, the safest stream shape is:

  1. response.created
  2. response.in_progress
  3. buffer Gemini thought: true chunks internally
  4. response.output_item.added for the assistant message
  5. response.content_part.added
  6. response.output_text.delta only for the final visible answer, for example "2"
  7. response.output_text.done
  8. response.content_part.done
  9. response.output_item.done for the assistant message
  10. response.completed, with final response.output containing:
    • one reasoning item
    • one assistant message item
So the final response.completed should look roughly like:

{
  "type": "response.completed",
  "response": {
    "id": "resp_123",
    "object": "response",
    "created_at": 1776291308,
    "status": "completed",
    "model": "gemma-4-31b-it",
    "output": [
      {
        "id": "rs_1",
        "type": "reasoning",
        "summary": "* Input: \"Ansver 1+1 only.\" ..."
      },
      {
        "id": "msg_1",
        "type": "message",
        "role": "assistant",
        "content": [
          { "type": "output_text", "text": "2" }
        ]
      }
    ],
    "usage": {
      "input_tokens": 9,
      "output_tokens": 112,
      "total_tokens": 121
    }
  }
}

The key bit is:

never send Gemini thought: true chunks as response.output_text.delta
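
The event sequence described in this message can be sketched as a generator. This is a simplified illustration under several assumptions: `responses_events` is a hypothetical name, the events omit ids, indexes, and sequence numbers that a real stream would carry, and the input is taken to be already-decoded Gemini chunks.

```python
from typing import Iterable, Iterator


def responses_events(chunks: Iterable[dict]) -> Iterator[dict]:
    """Yield simplified Responses-style events from decoded Gemini stream chunks."""
    reasoning: list[str] = []
    visible: list[str] = []
    opened = False  # whether the assistant message item has been announced

    yield {"type": "response.created"}
    yield {"type": "response.in_progress"}
    for chunk in chunks:
        for candidate in chunk.get("candidates", []):
            for part in candidate.get("content", {}).get("parts", []):
                text = part.get("text", "")
                if part.get("thought"):
                    reasoning.append(text)  # buffer, never stream as output text
                    continue
                if not text:
                    continue
                if not opened:
                    yield {"type": "response.output_item.added"}
                    yield {"type": "response.content_part.added"}
                    opened = True
                visible.append(text)
                yield {"type": "response.output_text.delta", "delta": text}
    if opened:
        yield {"type": "response.output_text.done", "text": "".join(visible)}
        yield {"type": "response.content_part.done"}
        yield {"type": "response.output_item.done"}

    output = []
    if reasoning:
        output.append({"type": "reasoning", "summary": "".join(reasoning)})
    output.append({"type": "message", "role": "assistant",
                   "content": [{"type": "output_text", "text": "".join(visible)}]})
    yield {"type": "response.completed",
           "response": {"object": "response", "status": "completed", "output": output}}


chunks = [
    {"candidates": [{"content": {"parts": [{"text": "* Input: ...", "thought": True}]}}]},
    {"candidates": [{"content": {"parts": [{"text": "2"}]}}]},
]
events = list(responses_events(chunks))
deltas = [e["delta"] for e in events if e["type"] == "response.output_text.delta"]
```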

#

If you want, I can also give you a ready-to-paste OmniRoute issue report text.

strange palm
#

What would completions streaming look like? Give me a simple curl for the same query I had.

safe pumiceBOT
#

Yep, for Chat Completions streaming the curl would be:

curl -N https://YOUR-OMNIROUTE/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OMNIROUTE_API_KEY" \
  -d '{
    "model": "gemma-4-31b-it",
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Ansver 1+1 only."
      }
    ]
  }'
#

Expected shape

You should see SSE chunks roughly like:

data: {"choices":[{"delta":{"role":"assistant","content":"2"},"index":0}]}

data: {"choices":[{"delta":{},"finish_reason":"stop","index":0}]}

data: [DONE]

What would be wrong

If OmniRoute streams something like:

#
data: {"choices":[{"delta":{"content":"* Input: \"Ansver 1+1 only.\" ..."}}]}

then it is still leaking Gemini thought text into the visible assistant stream.
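
To compare a captured stream against the expected shape, the visible text can be reassembled from the SSE lines. The following is a minimal sketch, assuming the curl output was saved as text; `visible_text_from_chat_sse` is a hypothetical helper, not part of OpenClaw or OmniRoute.

```python
import json


def visible_text_from_chat_sse(raw: str) -> str:
    """Concatenate choices[].delta.content across Chat Completions SSE chunks."""
    pieces = []
    for line in raw.splitlines():
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                pieces.append(content)
    return "".join(pieces)


sample = "\n".join([
    'data: {"choices":[{"delta":{"role":"assistant","content":"2"},"index":0}]}',
    "",
    'data: {"choices":[{"delta":{},"finish_reason":"stop","index":0}]}',
    "",
    "data: [DONE]",
])
text = visible_text_from_chat_sse(sample)
```

If the reassembled text contains anything beyond the short answer for this prompt, the thought text is leaking into the visible stream.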

If you want, I can also give you the non-streaming Chat Completions curl right below it so you can compare both side by side.

strange palm
safe pumiceBOT
#

Yep, same problem there too.

That paste shows OmniRoute is leaking the Gemini thoughts directly into Chat Completions delta.content:

"delta": { "content": "* Input: \"Ansver 1+1 only.\"\n" }

and only at the very end it reports:

"completion_tokens_details": { "reasoning_tokens": 101 }

So the proxy clearly knows those were reasoning tokens, but it still streamed them as visible assistant text. That is the bug.

What this means

#
  • Responses streaming: broken
  • Chat Completions streaming: also broken
  • Non-streaming: seems much cleaner in your earlier probe

So your best temporary workaround is still:

agents.defaults.models["omniroute/gemma-4-31b-it"].streaming = false

That should force OpenClaw onto the non-stream upstream path and avoid the thought bloat until OmniRoute fixes its streaming translation.

Short verdict for OmniRoute

For streamed Chat Completions, it should not put Gemini thought text into choices[].delta.content.

If it wants to stay Chat Completions-compatible, it should stream only the visible answer text, for your test basically just:

#
data: {"choices":[{"delta":{"role":"assistant"}}]}
data: {"choices":[{"delta":{"content":"2"}}]}
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]

So yeah, you’ve confirmed it now, this is not an OpenClaw issue, it’s OmniRoute’s streamed translation on both API surfaces.

strange palm
safe pumiceBOT
#

Yep. Here’s a ready-to-paste one.

OmniRoute issue report

Title
Gemini streamed reasoning is leaked into OpenAI-compatible visible output on both /v1/responses and /v1/chat/completions

Body

I found a translation bug when OmniRoute fronts Gemini / Gemma models.

What Gemini returns natively

Gemini streaming returns content parts like:

#
{
  "candidates": [
    {
      "content": {
        "parts": [
          { "text": "internal reasoning...", "thought": true },
          { "text": "final answer" }
        ]
      }
    }
  ]
}

So Gemini clearly distinguishes:

#
  • hidden reasoning: part.thought === true
  • visible answer: normal text part

What OmniRoute currently does

1) Chat Completions streaming

OmniRoute streams Gemini reasoning directly into visible assistant deltas:

data: {
  "choices": [
    {
      "delta": {
        "content": "* Input: \"Ansver 1+1 only.\" ..."
      }
    }
  ]
}

At the end it also reports reasoning token metadata:

"completion_tokens_details": {
  "reasoning_tokens": 101
}

So the proxy knows these are reasoning tokens, but still emits them as visible content.

2) Responses streaming

OmniRoute emits response.output_text.delta containing Gemini reasoning text, instead of separating reasoning from final answer.

#

3) Responses non-streaming

The /v1/responses non-stream path appears to return Chat Completions shape:

{
  "object": "chat.completion",
  "choices": [...]
}

instead of Responses shape:

{
  "object": "response",
  "output": [...]
}

Expected behavior

For /v1/chat/completions

  • part.thought === true should not be emitted via choices[].delta.content
  • only visible final answer text should be streamed as content
  • reasoning can be dropped or kept in proxy-only metadata

For /v1/responses

  • Gemini thought parts should become a separate reasoning output item
  • visible answer should become a separate assistant message output item
  • response.output_text.delta should only stream visible answer text

Example final Responses shape:

#
{
  "object": "response",
  "output": [
    {
      "type": "reasoning",
      "summary": "internal reasoning..."
    },
    {
      "type": "message",
      "role": "assistant",
      "content": [
        { "type": "output_text", "text": "2" }
      ]
    }
  ]
}

Why this matters

Clients like OpenClaw correctly hide or show reasoning only when it is delivered separately.
If OmniRoute concatenates or streams Gemini thought text as visible assistant content, clients cannot distinguish reasoning from final answer.


How to implement it properly

Shared rule

When consuming Gemini parts:

#
if (part.thought === true) {
  reasoningBuffer += part.text ?? "";
} else {
  visibleBuffer += part.text ?? "";
}

For Chat Completions streaming

There is no clean standard reasoning slot, so:

  • do not stream thought text into delta.content
  • stream only visible text

So for your test prompt, the stream should be basically:

#
data: {"choices":[{"delta":{"role":"assistant"}}]}
data: {"choices":[{"delta":{"content":"2"}}]}
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]

For Responses streaming

Best approach:

  1. buffer Gemini reasoning chunks separately
  2. stream only visible answer as response.output_text.delta
  3. on completion, include a separate reasoning item in the final response.completed

Minimal good event flow

#
event: response.created
event: response.in_progress

event: response.output_item.added         # assistant message item
event: response.content_part.added
event: response.output_text.delta         # only visible text, e.g. "2"
event: response.output_text.done
event: response.content_part.done
event: response.output_item.done

event: response.completed                 # final response includes:
                                          # - reasoning item
                                          # - assistant message item
data: [DONE]
#

Final response.completed should include

{
  "response": {
    "object": "response",
    "output": [
      {
        "id": "rs_1",
        "type": "reasoning",
        "summary": "internal reasoning..."
      },
      {
        "id": "msg_1",
        "type": "message",
        "role": "assistant",
        "content": [
          { "type": "output_text", "text": "2" }
        ]
      }
    ]
  }
}

Temporary OpenClaw workaround

Until OmniRoute is fixed, I’m using:

agents.defaults.models["omniroute/gemma-4-31b-it"].streaming = false
#

That avoids the broken streamed translation path.

If you want, I can also turn that into a shorter maintainer-friendly version.

strange palm
#

alright, but completions API cannot support the reasoning?

safe pumiceBOT
#

Not cleanly or portably, no.

For OpenAI-compatible Chat Completions, the standard visible output slot is basically just:

choices[].message.content

or streamed:

choices[].delta.content

So there is no standard separate reasoning channel like Responses has with output[].type = "reasoning".

Practically

#
  • Standard Chat Completions: final visible text only
  • Responses API: can separate reasoning from answer properly
  • Vendor extensions: possible, but non-standard

So OmniRoute could invent something like:

choices[0].message.reasoning

or

choices[0].delta.reasoning_content

but that is proxy-specific, not normal Chat Completions, and OpenClaw’s generic openai-completions path should not be expected to understand it.
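
For illustration, a client that opted in to such an extension might read the extra field like this. This is a hypothetical sketch only: `split_delta` is an invented name, the field names are the ones floated above, and nothing here reflects what OpenClaw or OmniRoute actually implement.

```python
def split_delta(delta: dict) -> tuple[str, str]:
    """Return (reasoning, visible) from one streamed delta (hypothetical extension)."""
    # Non-standard fields: only present if the proxy chooses to emit them.
    reasoning = delta.get("reasoning_content") or delta.get("reasoning") or ""
    visible = delta.get("content") or ""
    return reasoning, visible


thought_only = split_delta({"reasoning_content": "* Input: ..."})
answer_only = split_delta({"content": "2"})
```

A client without this opt-in simply ignores the unknown field, which is why the extension cannot be relied on across generic OpenAI-compatible clients.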

#

Short answer

  • Can Chat Completions carry reasoning at all?
    Yes, with vendor-specific hacks.

  • Can it carry reasoning in a standard way that OpenClaw can reliably treat as separate reasoning?
    Not really.

If you want reasoning to round-trip correctly, Responses API is the right surface.

strange palm
#

Is there no standard for streaming the reasoning while events are being emitted? Or is it only available after the request is done?

safe pumiceBOT
#

Short answer, not in a clean portable standard that I’d trust across proxies.

For the surface you and I are talking about:

  • visible answer text has a clear streaming shape, like output_text.delta
  • reasoning does not have the same well-settled generic streaming shape in the compatibility layer OmniRoute is trying to provide

So the safest interoperable pattern is:

  • stream only final visible text live
  • keep reasoning buffered
  • emit reasoning as a separate reasoning item at output_item.done / response.completed

In practice

Good portable behavior

  • live stream: answer text only
#
  • end of request: attach reasoning item

Vendor-specific behavior

Some native providers do have their own live thought/reasoning events, but those are usually provider-specific, not something generic OpenAI-compatible clients can rely on uniformly.

So for OmniRoute -> OpenClaw, I would treat reasoning as:

available reliably at the end, not as a live generic delta stream

If they really want live reasoning, they’d need a proxy-specific extension, and then OpenClaw would also need explicit support for that extension.

strange palm