#sessions.list perf


dense bison
#

@subtle gull @tropic fox I'm trying to repro this with 10k sessions, if you folks have a minimal repro I could test that would be great

#

Could very well be machine dependent, I'm testing on a mac so it probably has a quick SSD

subtle gull
#

im on something that has datacenter NVME SSDs

#

let me get the details

#

i just got things stabilized so i can use my main chat app, no longer bound to the confines of the tui

#

my chat app was abusing the problem big time, wide scope calls against sessions a lot of times frequently, worked fine until uh, 4.27?

#

now its scoped down much more, plus the cache and it seems to be happy

dense bison
#

hmm on main I made 10k sessions about 256kb each and it completes in 1s cold and 300ms hot

#

Without any cache

tropic fox
#

Mine is a 2.5GHz quad-core Intel MBP with 16GB memory running Sequoia 15.7.3. Only other things "running" were Tailscale, Clash proxy, Teamviewer, Obsidian and Things3 -- all in background. My numbers show only direct CPU usage excluding other processes.

dense bison
#

Got it, lemme push a quick benchmark script

#

Try this please:

Run on the affected machine:

git fetch https://github.com/obviyus/clawdbot.git bench/sessions-list-main
git checkout FETCH_HEAD
pnpm install
pnpm test:sessions:list:bench -- --sessions 10000 --cold-runs 3 --runs 5 --warmup 1 --include-derived-titles --include-last-message

If you have ~6GB free disk, also run:

pnpm test:sessions:list:bench -- --sessions 10000 --inflate-transcript-kib 256 --cold-runs 1 --runs 3 --warmup 1 --include-derived-titles --include-last-message

Reply with:

  • full command output
  • OpenClaw version/commit
  • machine + storage type
  • number of agents
  • whether Control UI/TUI was open while running it
tropic fox
#

imma get Claude to do that. Whilst OC is running. ok?

dense bison
#

Make sure to test on latest main please!

tropic fox
#

ah.. ok. that will put my claw out of action so I'll just have to spend a little longer preparing for that first... brb.

dense bison
#

Thank you, really appreciate it

subtle gull
#

let me get it to answer the right question the right way, heres a few of the details from me getting the hotfix in and getting my chat app, called clawdash, working with smarter bounded calls

Why prod hurts and test does not:

  • Prod has 30 agent dirs, 1,876 files, about 1.7GB under session dirs, and heavy active churn.
  • Forge alone has 367 session-dir files and about 375MB.
  • The expensive part is not just returning rows, it is repeatedly discovering, loading, merging, filtering, and ranking a much larger all-agent session
    universe.
  • Test has the same architecture but lower cardinality and less churn, so the same O(all sessions / all agents) shape stays below the pain threshold.
#

but even with clawdash stopped and using the TUI, the openclaw node process goes to 100% cpu util for that core

#

and it grinds to a halt

#

this is what we just finished implementing, and its making things pretty usable

  1. ClawDash hotfix
    • Caches channel computation for ~60-75s.
    • Uses single-flight, so concurrent callers share one request.
    • During active ClawCanvas sends, serves stale channel metadata for up to 5m.
    • Calls gateway with bounded params:
      - limit: 80-160
      - activeMinutes: 1440
      - includeDerivedTitles: false
      - includeLastMessage: false
  2. Gateway hotfix
    • Adds a sessions.list response cache, 15s TTL.
    • Coalesces identical in-flight requests.
    • Still recomputes on cache miss or invalidation.
  3. Non-hotfixed gateway path
    • Discovers all agent session stores if agentId is absent.
    • Loads and merges every agent’s sessions.json.
    • Iterates the full combined store.
    • Applies filters after store merge.
    • Applies limit only after scanning candidates.
    • Builds rows, then yields every 10 rows, but much of the expensive prework is synchronous.
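
A minimal sketch of the cache plus single-flight pattern described in the hotfix list above. The class name, method names, and the 15s TTL in the usage note are illustrative assumptions, not OpenClaw or ClawDash APIs.

```typescript
// Hypothetical helper: a TTL cache that also coalesces concurrent loads.
type Loader<T> = () => Promise<T>;

class CoalescingCache<T> {
  private readonly ttlMs: number;
  private readonly now: () => number;
  private readonly inFlight = new Map<string, Promise<T>>();
  private readonly cached = new Map<string, { value: T; expiresAt: number }>();

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string, load: Loader<T>): Promise<T> {
    // Serve a fresh cached value without touching the loader.
    const hit = this.cached.get(key);
    if (hit && hit.expiresAt > this.now()) return Promise.resolve(hit.value);

    // Single-flight: concurrent callers for the same key share one promise.
    const pending = this.inFlight.get(key);
    if (pending) return pending;

    const p = load()
      .then((value) => {
        this.cached.set(key, { value, expiresAt: this.now() + this.ttlMs });
        return value;
      })
      .finally(() => {
        this.inFlight.delete(key);
      });
    this.inFlight.set(key, p);
    return p;
  }

  // Drop a cached value so the next call recomputes.
  invalidate(key: string): void {
    this.cached.delete(key);
  }
}
```

Something like `new CoalescingCache<Row[]>(15_000)` would correspond to the 15s response cache mentioned above. A cache miss or an invalidation still pays the full recompute cost, which matches the "still recomputes" caveat.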
tropic fox
#

cogitating ...

subtle gull
#

Gateway logs showed repeated slow sessions.list responses.
sessions.list 8123ms
sessions.list 8559ms
sessions.list 8337ms
sessions.list 8392ms
sessions.list 8578ms
sessions.list 8696ms
sessions.list 11470ms

count=55
min=115ms
p50=1811ms
p95=8628ms
max=11470ms
avg=4296ms

Interpretation:

• Fast path exists: some calls completed in ~500-700ms
• Bad path is very repeatable: many calls clustered around 8.1-8.7s
• Worst observed: 11.47s
• Tool-level call timed out at 10s
So this is probably not random network jitter. It smells like conditional hydration work: some requests or states hit a slow metadata path.
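
For anyone wanting to reproduce a summary like the one above from their own gateway logs, here is a rough sketch. It assumes lines shaped like `sessions.list 8123ms`, and the nearest-rank percentile method is an assumption about how the stats were computed, not something stated in the thread.

```typescript
// Extract durations from lines such as "sessions.list 8123ms".
function parseDurations(log: string): number[] {
  const out: number[] = [];
  const re = /sessions\.list\s+(\d+(?:\.\d+)?)ms/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(log)) !== null) out.push(Number(m[1]));
  return out;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) throw new Error("no samples");
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarize(samples: number[]) {
  const sorted = [...samples].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, v) => acc + v, 0);
  return {
    count: sorted.length,
    min: sorted[0],
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    max: sorted[sorted.length - 1],
    avg: sum / sorted.length,
  };
}
```

A tight cluster between p50 and p95 would point at a repeatable slow path, while a wide gap like the one above suggests two distinct paths.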

User-visible impact
The direct UI symptom:

• Control UI / ClawDash session list became slow or stale
• Status updates looked broken because the list did not refresh promptly
• It created enough lag that the operator thought the assistant might be broken
Operational concern:

If sessions.list is taking 8-11s, it competes with ordinary chat operations on the same gateway. Even if chat still works, the UI becomes a pressure source. That is future outage bait: an observability/control surface should not degrade the thing it monitors.

#

god he loves talking about outages

tropic fox
#

main as in beta4 right?

dense bison
subtle gull
#
  1. Likely expensive areas
    The suspected expensive work is per-row hydration across many sessions.

Likely hot spots:

• title generation or derived title hydration
• last-message preview hydration
• transcript head/tail reads per session
• model/runtime metadata projection
• token/usage fallback calculation
• plugin/setup metadata projection
• repeated UI polling every ~30s
The key risk is if sessions.list performs anything close to a transcript audit for each row.

Expected behavior should be:

• cheap index read
• cached metadata
• bounded preview data
• opt-in expensive hydration
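
The expected behavior listed above could look roughly like this. All type, field, and option names here are hypothetical and only loosely modeled on the `includeLastMessage` flag mentioned earlier; this is a sketch, not the gateway's implementation.

```typescript
// Hypothetical shapes for a cheap index and the rows built from it.
interface IndexEntry { key: string; updatedAt: number; title?: string }
interface ListOptions { limit: number; includeLastMessage?: boolean }
interface Row { key: string; updatedAt: number; title?: string; lastMessage?: string }

function listFromIndex(
  index: IndexEntry[],
  opts: ListOptions,
  readLastMessage: (key: string) => string, // expensive: stands in for a transcript read
): Row[] {
  // Sort and bound using only cheap index fields first.
  const page = [...index]
    .sort((a, b) => b.updatedAt - a.updatedAt)
    .slice(0, opts.limit);

  return page.map((e) => {
    const row: Row = { key: e.key, updatedAt: e.updatedAt, title: e.title };
    // Expensive hydration runs only when requested, and only for returned rows.
    if (opts.includeLastMessage) row.lastMessage = readLastMessage(e.key);
    return row;
  });
}
```

The point of the ordering is that the number of expensive reads is bounded by `limit`, not by the size of the store.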

#

i think that more or less summarizes what i asked him to consolidate for you. im on npm branches, i could spin up another test server and migrate my install to that and convert it to a github main install

tropic fox
#

building... might take a while.

dense bison
subtle gull
#

im spinning up a system to test everything with

dense bison
#

Thank you both 🫡

subtle gull
#

we can go hog wild

dense bison
#

Let's get @digital frost in here too

digital frost
subtle gull
#

ive got a template and everything, sonnet will take care of this quickly

digital frost
#

am I QA now? what we testing

calm pebble
#

i tried to run it for another datapoint. it failed horribly and I ain't breaking anything for it

dense bison
tropic fox
#

baddies all up in this joint

digital frost
tropic fox
#

now where's that "whipping claude" GIF...

digital frost
#

I think they let me remove roles... one sec...

tropic fox
#

Brave plugin causing issues... removing it to save time.

digital frost
#

$ pnpm test:sessions:list:bench -- --sessions 10000 --cold-runs 3 --runs 5 --warmup 1 --include-derived-titles --include-last-message

openclaw@2026.5.3 test:sessions:list:bench <repo-root>
node --import tsx scripts/bench-sessions-list.ts -- --sessions 10000 --cold-runs 3 --runs 5 --warmup 1 --include-derived-titles --include-last-message

[sessions-list-bench] cold 1/3: rows=10000 wall=682.2ms eventLoopDelayMax=1.7ms
[sessions-list-bench] cold 2/3: rows=10000 wall=712.7ms eventLoopDelayMax=1.9ms
[sessions-list-bench] cold 3/3: rows=10000 wall=760.3ms eventLoopDelayMax=1.6ms
[sessions-list-bench] cloned 10000 sessions across 1 agents from 2/2 transcript-backed source rows across 1 stores (2420 MiB written)
[sessions-list-bench] warmup 1/1: rows=10000 wall=625.7ms eventLoopDelayMax=2.3ms
[sessions-list-bench] run 1/5: rows=10000 wall=111.0ms eventLoopDelayMax=2.0ms
[sessions-list-bench] run 2/5: rows=10000 wall=122.2ms eventLoopDelayMax=1.7ms
[sessions-list-bench] run 3/5: rows=10000 wall=115.6ms eventLoopDelayMax=2.4ms
[sessions-list-bench] run 4/5: rows=10000 wall=111.1ms eventLoopDelayMax=2.0ms
[sessions-list-bench] run 5/5: rows=10000 wall=117.0ms eventLoopDelayMax=1.7ms
[sessions-list-bench] cold summary: sessions=10000 min=682.2ms p50=712.7ms p95=760.3ms max=760.3ms avg=718.4ms
[sessions-list-bench] summary: sessions=10000 rows=10000 min=111.0ms p50=115.6ms p95=122.2ms max=122.2ms avg=115.4ms

#

$ pnpm test:sessions:list:bench -- --sessions 10000 --inflate-transcript-kib 256 --cold-runs 1 --runs 3 --warmup 1 --include-derived-titles --include-last-message

openclaw@2026.5.3 test:sessions:list:bench <repo-root>
node --import tsx scripts/bench-sessions-list.ts -- --sessions 10000 --inflate-transcript-kib 256 --cold-runs 1 --runs 3 --warmup 1 --include-derived-titles --include-last-message

[sessions-list-bench] cold 1/1: rows=10000 wall=685.6ms eventLoopDelayMax=1.7ms
[sessions-list-bench] cloned 10000 sessions across 1 agents from 2/2 transcript-backed source rows across 1 stores (3624 MiB written)
[sessions-list-bench] warmup 1/1: rows=10000 wall=785.2ms eventLoopDelayMax=2.1ms
[sessions-list-bench] run 1/3: rows=10000 wall=107.0ms eventLoopDelayMax=1.6ms
[sessions-list-bench] run 2/3: rows=10000 wall=112.0ms eventLoopDelayMax=1.4ms
[sessions-list-bench] run 3/3: rows=10000 wall=107.3ms eventLoopDelayMax=1.4ms
[sessions-list-bench] cold summary: sessions=10000 min=685.6ms p50=685.6ms p95=685.6ms max=685.6ms avg=685.6ms
[sessions-list-bench] summary: sessions=10000 rows=10000 min=107.0ms p50=107.3ms p95=112.0ms max=112.0ms avg=108.8ms

#

doesn't look like I can repro

subtle gull
#

how many agent directories do you have?

digital frost
subtle gull
#

hm. ask your agent to make you like 10 more agents? i dont know that they actually need to do anything

#

you could go through the tui and just say hi to all of them

#

actually scratch that, my test server has all 15 of my main agents on it

#

its all the subagents that seem to be the real difference between my prod and test server

tropic fox
#

it's the number of session files, isn't it?

subtle gull
#

you would think, but the benchmark makes it look like its not

digital frost
#

my QA paired programmer

#

reran with 10 agents

#

anything else you want me to try?

#

or I'm nuking these agents and sessions lol

dense bison
#

Thank you for testing 🫡

digital frost
subtle gull
#

wiring up auth on the test server now

tropic fox
#

claude sez...

⏺ Here are the baseline numbers on 2026.5.3 main, no sessions.list caching (pre-#62164):

  ┌──────────────────────┬──────────────┬───────────────┐
  │        Metric        │ 200 sessions │ 1000 sessions │
  ├──────────────────────┼──────────────┼───────────────┤
  │ Warm p50             │ ~1.35s       │ ~5.4s         │
  ├──────────────────────┼──────────────┼───────────────┤
  │ Warm p95             │ ~1.37s       │ ~6.1s         │
  ├──────────────────────┼──────────────┼───────────────┤
  │ Event loop delay avg │ ~307ms       │ ~355ms        │
  ├──────────────────────┼──────────────┼───────────────┤
  │ Cold min             │ ~4.8s        │ ~5.9s         │
  └──────────────────────┴──────────────┴───────────────┘

  The event loop blocking is significant (~350ms for 1k sessions), and the call scales roughly linearly. These are the "before"
  numbers for when #62164 eventually lands.

  Summary of what just happened:
  - Installed: OpenClaw 2026.5.3 from GitHub main (commit 5a6cedc1) via npm pack + npm install -g
  - Config cleaned up: Removed brave (broken npm package, no compiled output), codex, discord stale entries
  - Gateway: Running at pid 62167, reachable, version confirmed 2026.5.3
  - Benchmark run: sessions.list at 1000 sessions → warm p50 ~5.4s, event loop delay ~355ms; this is the pre-#62164 baseline
#

actually getting codex to run it as well since it has more context about the fix.

dense bison
tropic fox
#

hmmm stupid clanker. I saw it deciding between 5.4 and 5.3 somehow. and said it was going to use the several hours older version cos it was already in npm. I stopped it, that's when I asked you, and I told it explicitly "I want you to update my local install with the version on github main". and it took a good 20-30mins to rebuild using pnpm. That's the best way I can articulate what I saw it do lol

digital frost
tropic fox
#

Benchmark 1 Output

openclaw@2026.5.3 test:sessions:list:bench /private/tmp/clawdbot2
node --import tsx scripts/bench-sessions-list.ts -- --sessions 10000 --cold-runs 3 --runs 5 --warmup 1 --include-derived-titles --include-last-message

  [sessions-list-bench] cold 1/3: rows=10000 wall=47172.0ms eventLoopDelayMax=3189.8ms
  [sessions-list-bench] cold 2/3: rows=10000 wall=43038.5ms eventLoopDelayMax=206.7ms
  [sessions-list-bench] cold 3/3: rows=10000 wall=40875.8ms eventLoopDelayMax=196.3ms
  [sessions-list-bench] cloned 10000 sessions across 3 agents from 34/35 transcript-backed source rows across 6 stores (921 MiB written)
  [sessions-list-bench] warmup 1/1: rows=10000 wall=39691.1ms eventLoopDelayMax=415.5ms
  [sessions-list-bench] run 1/5: rows=10000 wall=36949.5ms eventLoopDelayMax=187.4ms
  [sessions-list-bench] run 2/5: rows=10000 wall=39557.2ms eventLoopDelayMax=335.3ms
  [sessions-list-bench] run 3/5: rows=10000 wall=42008.1ms eventLoopDelayMax=200.4ms
  [sessions-list-bench] run 4/5: rows=10000 wall=42586.7ms eventLoopDelayMax=203.0ms
  [sessions-list-bench] run 5/5: rows=10000 wall=40365.8ms eventLoopDelayMax=397.7ms
  [sessions-list-bench] cold summary: sessions=10000 min=40875.8ms p50=43038.5ms p95=47172.0ms max=47172.0ms avg=43695.4ms
  [sessions-list-bench] summary: sessions=10000 rows=10000 min=36949.5ms p50=40365.8ms p95=42586.7ms max=42586.7ms avg=40293.5ms
#

Benchmark 2 Output

openclaw@2026.5.3 test:sessions:list:bench /private/tmp/clawdbot2
node --import tsx scripts/bench-sessions-list.ts -- --sessions 10000 --inflate-transcript-kib 256 --cold-runs 1 --runs 3 --warmup 1 --include-derived-titles --include-last-message

  [sessions-list-bench] cold 1/1: rows=10000 wall=46099.3ms eventLoopDelayMax=1950.4ms
  [sessions-list-bench] cloned 10000 sessions across 3 agents from 34/35 transcript-backed source rows across 6 stores (2893 MiB written)
  [sessions-list-bench] warmup 1/1: rows=10000 wall=43491.8ms eventLoopDelayMax=214.3ms
  [sessions-list-bench] run 1/3: rows=10000 wall=39897.2ms eventLoopDelayMax=211.7ms
  [sessions-list-bench] run 2/3: rows=10000 wall=41191.1ms eventLoopDelayMax=422.1ms
  [sessions-list-bench] run 3/3: rows=10000 wall=42582.5ms eventLoopDelayMax=213.9ms
  [sessions-list-bench] cold summary: sessions=10000 min=46099.3ms p50=46099.3ms p95=46099.3ms max=46099.3ms avg=46099.3ms
  [sessions-list-bench] summary: sessions=10000 rows=10000 min=39897.2ms p50=41191.1ms p95=42582.5ms max=42582.5ms avg=41223.6ms
#

pulling main again. 5 new commits...

tropic fox
#

Quick aside on the Brave issue:

    directory, no index.js or index.mjs. OpenClaw 2026.5.3 requires compiled JS artifacts for plugin packages and refuses to start if it finds a
    TypeScript-only entry. Every time npm install -g runs the new tarball, openclaw's plugin manager re-fetches and reinstalls brave from npm,
    bringing the broken package back.
calm pebble
digital frost
#

story of my fuckin life right now

tropic fox
#

Quoth Claude:

⏺ Running on commit 3dcff3b, that's the latest main tip.
⏺ There it is. This pull added SESSIONS_LIST_DEFAULT_LIMIT = 100: when no explicit limit is passed, sessions.list now caps at 100 rows by default, specifically to prevent large session stores from monopolizing the event loop.

Here's the comparison:

  ┌────────────────┬──────────────────────────┬───────────────┬───────────────────────┐
  │                │ Codex baseline (5a6cedc) │ Now (3dcff3b) │         Delta         │
  ├────────────────┼──────────────────────────┼───────────────┼───────────────────────┤
  │ Warm p50       │ 40.4s                    │ 1.0s          │ -97.5%                │
  ├────────────────┼──────────────────────────┼───────────────┼───────────────────────┤
  │ Warm p95       │ 42.6s                    │ 1.0s          │ -97.6%                │
  ├────────────────┼──────────────────────────┼───────────────┼───────────────────────┤
  │ Cold p50       │ 43.0s                    │ 2.3s          │ -94.6%                │
  ├────────────────┼──────────────────────────┼───────────────┼───────────────────────┤
  │ ELD avg (warm) │ ~265ms                   │ ~228ms        │ -14%                  │
  ├────────────────┼──────────────────────────┼───────────────┼───────────────────────┤
  │ rows returned  │ 10,000                   │ 100           │ default limit applied │
  └────────────────┴──────────────────────────┴───────────────┴───────────────────────┘
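
A rough sketch of what a default row cap like the one quoted above might look like. Only the constant name and the value 100 come from the thread. The helper names and the handling of invalid limits are assumptions.

```typescript
// Name and value as quoted in the thread; everything else is illustrative.
const SESSIONS_LIST_DEFAULT_LIMIT = 100;

function resolveLimit(requested: number | undefined): number {
  if (requested === undefined) return SESSIONS_LIST_DEFAULT_LIMIT;
  // Assumption: non-finite or non-positive limits fall back to the default.
  if (!Number.isFinite(requested) || requested < 1) return SESSIONS_LIST_DEFAULT_LIMIT;
  return Math.floor(requested);
}

// Return the most recently updated rows, capped by the resolved limit.
function takeMostRecent<T extends { updatedAt: number }>(rows: T[], requested?: number): T[] {
  const limit = resolveLimit(requested);
  return [...rows].sort((a, b) => b.updatedAt - a.updatedAt).slice(0, limit);
}
```

Note that a benchmark passing no limit will then time 100 rows instead of 10,000, which explains the "rows returned" line in the table. Comparing before and after numbers fairly needs an explicit limit on both sides.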

Sorry for the delay. AC contractor kicked my router. lol.

calm pebble
# digital frost

it wasn't so bad. gave an excuse to version bump it, which tends to increase the downloads (it's probably bots but, hey, number goes up)

tropic fox
#

codex running the same benchmarks with 3dcff3b but sez it's unchanged for cold runs. :/ wtf. lol. waiting for final results...

#

that makes no sense. Must be some caching going on

#

Codex ran it in the wrong location and it picked up the old source tree. False alarm. The fix is real.

dense bison
#

Woo! Nice

#

So I'll close your caching PR in that case

calm pebble
#

does this land in 5.3? or next beta?

dense bison
#

In 5.3

tropic fox
#

mine is already closed. the 6000s one is someone elses, with the broader fix. Even codex says your patch is probably better than our fix because it bounds it by default. lol

#

My claw is currently awol... but I have yet to investigate why

digital frost
#

glad we figured it out

#

🐐

#

k gnight nerds

subtle gull
#

scrolling back up to try to test this out, i got tied up last night and by the time i got back to it the discord was down, im not seeing a link to running the test? i'm still hitting perf issues on 2026.5.3, and its being replicated with the test server that i built and shifted my agent & subagent sessions over

#

my agent did a check through the commit history on sessions.list and has this theory

The regression likely came from sessions.list becoming too rich after v2026.4.23: subagent graph enrichment, child-session indexing, thinking/model metadata, transcript/title preview work, and plugin/session projections all ended up on a polling path.

subtle gull
#

found it, running now, its running like crap with a "full" prod server load so i'll try a few scenarios

#

[sessions-list-bench] cold 1/3: rows=10000 wall=77389.5ms eventLoopDelayMax=1448.1ms
[sessions-list-bench] cold 2/3: rows=10000 wall=75926.4ms eventLoopDelayMax=119.4ms
[sessions-list-bench] cold 3/3: rows=10000 wall=75191.0ms eventLoopDelayMax=118.6ms
[sessions-list-bench] cloned 10000 sessions across 18 agents from 20/20 transcript-backed source rows across 18 stores (521 MiB written)

[sessions-list-bench] warmup 1/1: rows=10000 wall=75576.6ms eventLoopDelayMax=124.1ms
[sessions-list-bench] run 1/5: rows=10000 wall=75348.7ms eventLoopDelayMax=112.3ms


#

ok, this is odd. i shut down the TUI, shut down my chat app, nothing happening, then ran it again and im seeing similar figures plus the openclaw node process is not eating 100% cpu

#

pnpm test:sessions:list:bench -- --sessions 10000 --cold-runs 3 --runs 5 --warmup 1 --include-derived-titles --include-last-message

openclaw@2026.5.3 test:sessions:list:bench /tmp/clawdbot-bench
node --import tsx scripts/bench-sessions-list.ts -- --sessions 10000 --cold-runs 3 --runs 5 --warmup 1 --include-derived-titles --include-last-message

[sessions-list-bench] cold 1/3: rows=10000 wall=75461.7ms eventLoopDelayMax=662.2ms
[sessions-list-bench] cold 2/3: rows=10000 wall=75094.8ms eventLoopDelayMax=113.8ms

#

my test server that has no subagents

[sessions-list-bench] cold 1/3: rows=10000 wall=642.7ms eventLoopDelayMax=2.9ms
[sessions-list-bench] cold 2/3: rows=10000 wall=521.1ms eventLoopDelayMax=2.2ms
[sessions-list-bench] cold 3/3: rows=10000 wall=547.0ms eventLoopDelayMax=14.8ms
[sessions-list-bench] cloned 10000 sessions across 2 agents from 2/2 transcript-backed source rows across 20 stores (2717 MiB written)
[sessions-list-bench] warmup 1/1: rows=10000 wall=537.6ms eventLoopDelayMax=2.4ms
[sessions-list-bench] run 1/5: rows=10000 wall=161.7ms eventLoopDelayMax=2.2ms
[sessions-list-bench] run 2/5: rows=10000 wall=162.4ms eventLoopDelayMax=2.0ms
[sessions-list-bench] run 3/5: rows=10000 wall=157.8ms eventLoopDelayMax=2.3ms
[sessions-list-bench] run 4/5: rows=10000 wall=163.2ms eventLoopDelayMax=2.2ms
[sessions-list-bench] run 5/5: rows=10000 wall=161.5ms eventLoopDelayMax=2.1ms
[sessions-list-bench] cold summary: sessions=10000 min=521.1ms p50=547.0ms p95=642.7ms max=642.7ms avg=570.3ms
[sessions-list-bench] summary: sessions=10000 rows=10000 min=157.8ms p50=161.7ms p95=163.2ms max=163.2ms avg=161.3ms

#

my agent who is helping me with this pulled out the footgun card, and has a number of hypotheses we are testing out now, when i can get the shape of this to change with a discrete step i'll provide more detail

#

Fix direction:

• Do not build subagent read index for sessions.list unless subagent fields are requested or a spawnedBy/active-run filter requires it.
• Cache the subagent read index per request window or registry signature.
• Make descendant counts lazy and bounded.
• Avoid applying subagent graph enrichment to ordinary non-subagent rows by default.
This is probably the path that turns “10k rows” into “10k rows times subagent graph work.” Classic production footgun: metadata enrichment got promoted into the hot list path.
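
One way to read the "lazy" part of that fix direction, as a sketch. The `spawnedBy` field is named in the thread. The entry shape, the function names, and the choice to count only direct children are assumptions made for illustration.

```typescript
// Hypothetical shapes; only `spawnedBy` is taken from the thread.
interface SessionEntry { key: string; spawnedBy?: string }
interface ListedRow { key: string; childCount?: number }

// Wrap a builder so it runs at most once, and only if something asks for it.
function lazy<T>(build: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (!cached) cached = { value: build() };
    return cached.value;
  };
}

// Count direct children per parent key in a single pass.
function buildChildCounts(entries: SessionEntry[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of entries) {
    if (e.spawnedBy) counts.set(e.spawnedBy, (counts.get(e.spawnedBy) ?? 0) + 1);
  }
  return counts;
}

function listWithChildCounts(
  entries: SessionEntry[],
  includeSubagentFields: boolean,
  onIndexBuild: () => void = () => {}, // hook used here only to observe builds
): ListedRow[] {
  const childCounts = lazy(() => {
    onIndexBuild();
    return buildChildCounts(entries);
  });
  return entries.map((e) => {
    const row: ListedRow = { key: e.key };
    if (includeSubagentFields) row.childCount = childCounts().get(e.key) ?? 0;
    return row;
  });
}
```

With this shape the index costs one pass per list call when requested and nothing otherwise, instead of graph work per row.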

subtle gull
#

my agent patched the test runner to have source limit, 5 ran great, 6 fell flat

node --import tsx scripts/bench-sessions-list.ts -- --sessions 10000 --source-limit 5 --cold-runs 1 --runs 1 --warmup 0

[sessions-list-bench] cold 1/1: rows=10000 wall=395.5ms eventLoopDelayMax=2.7ms
[sessions-list-bench] cloned 10000 sessions across 5 agents from 5/20 transcript-backed source rows across 18 stores (370 MiB written)
[sessions-list-bench] run 1/1: rows=10000 wall=348.8ms eventLoopDelayMax=2.3ms
[sessions-list-bench] cold summary: sessions=10000 min=395.5ms p50=395.5ms p95=395.5ms max=395.5ms avg=395.5ms
[sessions-list-bench] summary: sessions=10000 rows=10000 min=348.8ms p50=348.8ms p95=348.8ms max=348.8ms avg=348.8ms
lumadmin@psiclaw-template:/tmp/clawdbot-bench-forge$ pnpm test:sessions:list:bench -- --sessions 10000 --source-limit 6 --cold-runs 1 --runs 1 --warmup 0

openclaw@2026.5.3 test:sessions:list:bench /tmp/clawdbot-bench-forge
node --import tsx scripts/bench-sessions-list.ts -- --sessions 10000 --source-limit 6 --cold-runs 1 --runs 1 --warmup 0

[sessions-list-bench] cold 1/1: rows=10000 wall=83061.9ms eventLoopDelayMax=129.2ms
[sessions-list-bench] cloned 10000 sessions across 6 agents from 6/20 transcript-backed source rows across 18 stores (559 MiB written)

subtle gull
#

found it

#

this is bizarre

#

Cause found: modelOverride.

It is not the session store shape generally. It is this field:

"modelOverride": "gpt-5.5"
In sessions.list, every row calls:

resolveSessionModelRef(cfg, entry, sessionAgentId)
when entry.modelOverride exists. That path goes through plugin model normalization and reloads plugin metadata/manifest machinery per row. When the benchmark clones one bad source row into thousands of rows, it becomes catastrophic.

Trace evidence from CPU profile:

listSessionsFromStoreAsync
buildGatewaySessionRow
resolveSessionModelRef
resolvePersistedSelectedModelRef
parseModelRef
normalizeModelRef
normalizeProviderModelId
normalizeStaticProviderModelId
normalizeProviderModelIdWithManifest
loadManifestModelIdNormalizationPolicies
loadPluginMetadataSnapshot
loadPluginRegistrySnapshotWithMetadata
loadInstalledPluginIndex
discoverOpenClawPlugins
Repro evidence:

Original Forge template source-store:

--sessions 10000
--source-store forge-psiclaw-template/sessions/sessions.json
--agent-count 1
Result:

timeout at 180s
no cold sample emitted
Same store, only removing modelOverride:

cold 1/1: rows=10000 wall=555.7ms
run 1/1: rows=10000 wall=524.8ms

#

i backed into this by testing the agents directory cloned over to my test server and running the 10k sessions test against more and more folders at once

#

the first 5 had no override and executed just fine. the 6th turns out it had an override

#

after getting the script to display which folder it was running i started renaming the folders to get the alpha sort different until i validated it wasnt based on the number of sessions in the agents folder or the size of the sessions

#

{
"key": "agent:forge:webchat:forge-main",
"modelOverride": "gpt-5.5",
"model": "gpt-5.5",
"modelProvider": "openai-codex"
}

#

Root issue: entry.modelOverride caused model normalization per session row. With 10k cloned rows, that repeatedly hit plugin metadata discovery. Future outage, now with a receipt.
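
Instead of bisecting agent folders by hand, a quick scan for rows carrying `modelOverride` could narrow things down. This is a sketch that assumes each store has already been parsed into a list of entries shaped like the JSON snippet above. The real on-disk layout of sessions.json is not shown in the thread, so the loading step is left out.

```typescript
// Assumed entry shape, modeled on the JSON snippet posted in the thread.
interface StoreEntry {
  key: string;
  modelOverride?: string;
  model?: string;
  modelProvider?: string;
}

interface OverrideReport { store: string; rows: number; withOverride: number }

// Report, per store name, how many rows pin a model via modelOverride.
function countOverrides(stores: Record<string, StoreEntry[]>): OverrideReport[] {
  return Object.keys(stores)
    .sort()
    .map((store) => {
      const entries = stores[store];
      const withOverride = entries.filter(
        (e) => typeof e.modelOverride === "string" && e.modelOverride.length > 0,
      ).length;
      return { store, rows: entries.length, withOverride };
    });
}
```

Any store with a non-zero `withOverride` count is a candidate for the slow path when its rows get cloned by the benchmark.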

#

he did it! the outage has a fucking receipt

subtle gull
#

Classic future outage with better marketing.

#

makes me giggle

#

prepping PR

subtle gull
#

i think you found my PR @dense bison

#

my clanker is being a little odd, its on the dev server and its memory system has been heavily dev'd

#

77650

#

its working through getting it right

subtle gull
#

@dense bison i think we got it right

#

its a pretty straightforward test, a crazy little thing not related to subagents or amount of agents but to per session model overrides, evaluated for every line

#

a bit excessive, the cache is scoped to just the sessions.list call, and seems to be a big part of the root cause people started seeing around 4.29

#

but not everyone, since not everyone does things like pin sessions to models

subtle gull
#

Yep, confirmed.

PR #77650 is now:

• mergeable: true
• mergeable_state: clean
• Checks: 86 success, 1 neutral, 7 skipped
• 0 queued / 0 in progress / 0 failing
• Head: b07c51a3f016
So it’s in final review/merge-ready state.

subtle gull
#

if i need to teach my clanker a thing or two about submitting PRs im all ears, ive got a skill in a repo im developing to get this right

#

did a little research, agent was part opus 4.6 and part gpt 5.5

dense bison
#

Most PRs just get the automated tests to pass and then I have to manually verify

subtle gull
#

i tested it extensively to find the real problem

dense bison
digital frost
#

this is why you're allowed to ping me

#

❤️

tropic fox
subtle gull
#

hell yeah and the fix made it into 2026.5.4-beta.3 just in the nick of time, and it looks like 2026.5.4 went live too

#

short beta

tropic fox
#

I'm already using it. painless and quick.

#

release is bit-copy of beta3. great release.

subtle gull
#

looks like only part of this PR got applied?

So 2026.5.4 shipped with the thinkingLevelsByModelRef cache, but not the selected-model override cache from our PR. That explains why it feels like the fix did not apply: the exact hot path we identified is still present in prod.

#

my prod server is hosed after updating 😄 reapplying my original fix

#

09:33:47+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 197ms conn=d554971d…0751 id=691b95b3…b625
09:33:47+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 176ms conn=d554971d…0751 id=a63c9cf1…24ed
09:33:48+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 187ms conn=d554971d…0751 id=31fb61d5…6f02
09:34:08+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 14102ms conn=d554971d…0751 id=1c963b38…1a76
09:34:22+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 14243ms conn=d554971d…0751 id=f3c8a1a7…6489
09:34:22+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 28354ms conn=d554971d…0751 id=3461a64e…6f00
09:34:37+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 14783ms conn=d554971d…0751 id=00fcfbb2…eb91
09:34:37+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 29041ms conn=d554971d…0751 id=940ac2c5…0781
09:34:37+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 14801ms conn=d554971d…0751 id=e9487977…b848
09:34:51+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 13652ms conn=d554971d…0751 id=7ad03cb9…e40f

#

that escalated quickly

#

oh its not in 2026.5.4 😄

#

my agent told me that it was in 2026.5.4-beta.3 but it might have been checking local source which.. surprise would contain the fix it made

#

If I were patching next, I’d start in the Control UI/ClawCanvas request layer and add a small RPC broker: method + stableJson(params) → existing in-flight request or cached recent result. That stops the bear instead of just feeding it tranquilizers.
uh oh, i called the problem a bear. goblinification warning klaxons are going off
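
The broker idea hinges on the request key being stable, so two calls with the same params in a different property order land on the same in-flight request. A minimal sketch of such a key follows. `stableJson` is the name used in the message above, but this implementation and `requestKey` are assumptions, not existing Control UI code.

```typescript
// Serialize a JSON-like value with object keys sorted, so property order
// does not change the output. Undefined object properties are dropped,
// mirroring JSON.stringify.
function stableJson(value: unknown): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value) ?? "null";
  }
  if (Array.isArray(value)) {
    return `[${value.map(stableJson).join(",")}]`;
  }
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj)
    .filter((k) => obj[k] !== undefined)
    .sort();
  const body = keys.map((k) => `${JSON.stringify(k)}:${stableJson(obj[k])}`);
  return `{${body.join(",")}}`;
}

// Hypothetical broker key: method name plus canonical params.
function requestKey(method: string, params: unknown): string {
  return `${method}:${stableJson(params)}`;
}
```

The resulting string can serve as the lookup key for whatever in-flight map or short-lived result cache the broker keeps.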

#

So the row-cache fix is not gone. The remaining failure is the request storm plus too-short cache TTL. Default sessions.list cache is only 10s, so ClawCanvas keeps hitting the expensive cold path every ~20s. Future outage, now current outage.
full circle back to outages. i think he feels very vindicated. its not a future outage, its a current outage that he predicted

#

the fixes from my PR do not improve the current situation, there is more going on

#

im still investigating, 5.5 will keep going on this, but i need to go to bed

digital frost
#

robot borken right now, but I got you bb @subtle gull

#

bout time you tism maxers finally did a PR KEKWlaugh

subtle gull
#

ive had a smooth going with openclaw mostly and i finally hit an issue that i was having and the maintainers werent

#

actually need to modify the PR, my claw found the real problem

#

The current root-cause fix targets the model identity resolver path:

• slow call is resolveSessionModelRef(...), ~550-700ms per affected row
• many rows still hit it even after our PR, especially rows missing persisted modelProvider + model
• resolveSessionModelIdentityRef(...) falls back into resolveSessionModelRef(...)
• resolveSessionDisplayModelIdentityRef(...) can also re-enter default model resolution for CLI/provider display normalization
So the real fix is:

1. Use persisted entry.modelProvider + entry.model directly for lightweight sessions.list rows.
2. Cache resolveSessionModelRef by agentId + modelOverride/providerOverride + runtime model/provider.
3. Precompute default model refs per agent once per list.
4. Avoid display-model normalization in lightweight rows unless required.
Short version: our PR cached some derived row fields; this fix removes/caches the expensive default-model resolver itself. The old PR was a bandage around the hotspot. This is cutting out the hotspot.

subtle gull
#

23:20:44+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 59ms conn=b9f8ca9b…4ee0 id=4da6c5aa…673e
23:21:01+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 52ms conn=b9f8ca9b…4ee0 id=9baa2687…c1a7
23:21:16+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 1119ms conn=b9f8ca9b…4ee0 id=5b88262c…ab42
23:21:22+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ logs.tail 68ms conn=4440f309…bc82 id=75d76078…0ad3
23:21:22+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 71ms conn=b9f8ca9b…4ee0 id=0c2a2e54…ba0f
23:21:41+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 54ms conn=b9f8ca9b…4ee0 id=6c74419a…cc14
23:22:02+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 51ms conn=b9f8ca9b…4ee0 id=a2b1f84e…5d71
23:22:17+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 1149ms conn=b9f8ca9b…4ee0 id=0bc8245d…5f8e
23:22:21+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 52ms conn=b9f8ca9b…4ee0 id=d44b741b…d008
23:22:41+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 55ms conn=b9f8ca9b…4ee0 id=88b8e8cc…f8b4

#

ok we might be cooking

#

fully loaded has 1.1s as worst case, vs uhhhh like 36s

digital frost
subtle gull
#

anything further than this

#

its all in sessions.list

#

getting the full analysis from my claw about what it found

#

we are going back to clean slate and patching just the sessions.ts that was responsible, it went in a few other weird experimental areas before we put a custom sessions.ts with extensive logging in to find out which call was the problem child

#

up until the last patch the peaks would still get up to 8000ms, which could start a backlog issue

subtle gull
#

updated PR incoming

tropic fox
digital frost
subtle gull
#

the core problem isnt an edge case, when it gets fixed it benefits a wide range of people, even instances where it doesnt blow the duration of sessions.list from ~500ms steady state to 30000ms+

#

even 6000-7000ms is enough to negatively impact the event queue and affect a lot of people. and the potential causes are numerous and common

#

this can happen with 1 agent and zero subagents provided the user has more than 1 model, hell, it probably even can happen with a single model

digital frost
subtle gull
#

what happens is that sessions.list looks at all of the sessions in the openclaw folder, and all of the lines of conversation in the sessions. if a line has static model/thinking/provider assignments, it kicks off a blocking provider/model/thinking mode resolution path for each line

tropic fox
subtle gull
#

that call could be slow, especially if multiple providers and models are present

#

if you have a session file with 1000 entries in it, or 100 session files with 10 entries, things get ugly

tropic fox
#

I could probably also repro this but I preemptively archived most of my session files to avoid this and related issues (most of which were already fixed in the last push)

subtle gull
#

77902

#

its a bigger patch than the last one but 2026.5.4 changed things, the STR are explicit with showing the default path is the bad path, and how it blows up the more it hits the worse path

dense bison
subtle gull
#

i have a few revisions from recent copilot suggestions

#

its a little narrower

#

Read: your suspicion is right. For the active fleet stores, the design assumption is backwards. sessions.list is optimized around "model data may be missing," but the common case is "model data is already persisted."

Operational conclusion: the fast path should be the default:

1. Use providerOverride + modelOverride if present.
2. Else use persisted modelProvider + model if present.
3. Only fall back to expensive default/agent/model resolution when both are missing.
That keeps correctness for old or malformed rows, but stops making the normal path wear the blast suit.
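
The three-step order above can be sketched like this. It is only an illustration under assumptions: `SessionEntryLike` is a made-up subset of the stored row using the field names from the discussion, and `resolveDefault` stands in for the expensive agent/default resolver rather than reproducing its real signature.

```typescript
type ModelRef = { provider: string; model: string };

// Hypothetical subset of a stored session row (field names from the thread).
type SessionEntryLike = {
  providerOverride?: string;
  modelOverride?: string;
  modelProvider?: string;
  model?: string;
};

function resolveRowModelRef(
  entry: SessionEntryLike,
  resolveDefault: () => ModelRef, // placeholder for the slow resolver
): ModelRef {
  // 1. An explicit override wins.
  if (entry.providerOverride && entry.modelOverride) {
    return { provider: entry.providerOverride, model: entry.modelOverride };
  }
  // 2. Otherwise trust the identity already persisted on the row.
  if (entry.modelProvider && entry.model) {
    return { provider: entry.modelProvider, model: entry.model };
  }
  // 3. Only sparse or legacy rows pay for the expensive resolution.
  return resolveDefault();
}
```

The point of the ordering is that rows which already carry their model identity never touch the slow resolver at all.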

#

only sampled 37 on 2 different servers

#

100% of them had the data, and currently sessions.list does model resolution each row if it has it, but doesnt when its missing. if it already has the data, why is it looking it up?

#

1000 messages and most of them had the data

dense bison
subtle gull
#

Latest solution: make persisted model identity the primary path for sessions.list.

Current shape:

1. Fast path first
• If row has providerOverride + modelOverride, use it.
• Else if row has persisted modelProvider + model, use it.
• Do not call default model resolution.
2. Fallback only for sparse/legacy rows
• If both persisted paths are missing, fall back to existing agent/default model resolution.
• That preserves correctness for old rows.
3. Lightweight rows stay lightweight
• Skip thinking enrichment for lightweight sessions.list rows.
• Do not build full model/thinking option metadata unless the caller needs detail-view data.
4. Optional storm guard
• Keep response caching / UI poll damping as secondary mitigation.
• But the real fix is avoiding the expensive resolver in the normal row path.
Evidence so far:

• openclaw-prod current Gateway store: 23/23 rows had persisted model identity
• local current store: 14/14 rows had persisted model identity
• broader prod sample: 372/408 artifacts complete, missing cases were older sparse session-jsonl
• prod benchmark:
• before: full list around 8s
• temp patched dist: around 640ms
• installed direct benchmark: 776-873ms
• real Gateway WS log: sessions.list 124ms
So my current recommendation is: fold the persisted-model fast path into the maintainer PR as the primary fix, then keep the cache pieces only as defensive pressure relief. The previous cache-only path was a bandage. This is the artery clamp.

#

PR update in flight, should land shortly

#

5.5 in high thinking is a little slow sometimes

dense bison
subtle gull
#

sorry, hitting long tool calls that are killing results, getting it tightened up

dense bison
#

haha no rush

subtle gull
#

this clanker is getting tired we've been at it off and on all day

#

now test is running worse than prod lol. prod was chilling doing nothing most of the day just watching openclaw logs --follow sessions.list and chat.history metrics

#

uploaded, its running through the copilot/clawsweeper comments now, might have some more minor revisions

#

One Copilot note is still worth tightening before I reply: runtime model identity should bypass the default/agent cache key entirely. I'm making that small follow-up so the responses are backed by code, not hand-waving.
He doesn't like hand waving

#

The revised head is now e92c9ad7ea and pushed. I'm posting concise replies to the ClawSweeper/Copilot threads now, then I'll try to resolve the review threads if the token has permission.

#

Comments:

• Replied to ClawSweeper
• Replied to all 5 Copilot review comments
• Resolved all 5 review threads via GraphQL

#

ok, i think hes done for now

dense bison
#

I don't like that it adds another cache so I'll see if I can keep it fast without it

subtle gull
#

yeah, ready. i tried to avoid caches but look at the results

#

made them very narrow

#

I'd describe it as request-local row-resolution memoization, not a broad cache.

Narrowness:

• Lifetime: one sessions.list call only. It dies after the response is built.
• Scope: only model identity resolution for rows in that single response.
• Keys: effective inputs that affect model identity: runtime model, fallback ref, override ref, agent/default model ref.
• No cross-request state: no TTL, no invalidation problem, no stale config risk.
• No persistence: nothing survives config changes, session writes, or process lifetime.
Why it matters: sessions.list is not resolving one model. It is resolving model identity across hundreds to thousands of session rows, and many rows share the same effective defaults. Without memoization, the same config/default resolution path gets recomputed per row. That is the bad scaling behavior.

The maintainer-preferred version, "fast without cache," is ideal if the resolver itself can become O(1) and allocation-light. But for this PR, the safer narrow fix is to avoid changing resolver semantics and just avoid repeating identical resolver work inside one response.

Suggested phrasing:

This is not a long-lived cache. It is request-local memoization inside one sessions.list response build. The intent is to preserve the existing resolver semantics, including aliases and wrapper-provider inference, while avoiding repeated identical model identity resolution across rows that share the same effective runtime/default/override inputs. Nothing survives the request, so there is no invalidation surface.
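
A rough sketch of what "request-local memoization" means here, under assumptions: `buildListModelRefs`, `RowInputs`, and the exact key fields are illustrative names, not the PR's code. The only property being demonstrated is that the map is created inside the call and thrown away with it.

```typescript
type ResolvedRef = { provider: string; model: string };

// Hypothetical per-row inputs that affect model identity.
type RowInputs = {
  agentId: string;
  providerOverride?: string;
  modelOverride?: string;
};

function buildListModelRefs(
  rows: RowInputs[],
  resolve: (row: RowInputs) => ResolvedRef, // stands in for the existing resolver
): ResolvedRef[] {
  // Created per call and garbage-collected after it: no TTL, no invalidation.
  const memo = new Map<string, ResolvedRef>();

  return rows.map((row) => {
    const key = [row.agentId, row.providerOverride ?? "", row.modelOverride ?? ""].join("\0");
    let ref = memo.get(key);
    if (!ref) {
      // First row with these effective inputs pays for resolution...
      ref = resolve(row);
      memo.set(key, ref);
    }
    // ...every later row with the same inputs reuses the result.
    return ref;
  });
}
```

With 1000 rows that share two distinct input combinations, the resolver runs twice per list call instead of 1000 times, and a second list call starts from an empty map again.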

#

it wanted to cache this and cache that, time based, dont worry it'll be good enough but i kept saying no and to drill in on the specific blocking part of the hot path

#

the benchmarks showed the results improved as the problem was more specifically addressed

dense bison
#

Good catch @subtle gull, pushing a fix

dense bison
#

Nice!!! Going to land this in a bit

#

Great work, thank you @subtle gull and @tropic fox

tropic fox
#

props to @subtle gull for this bit. My fixes didn't fix his install and I inadvertently swerved it by archiving my sessions (^^;)

digital frost
dense bison
#

PR Landed 🫡

digital frost
subtle gull
#

my claw looking at that 20s resolveProviderHookPlugin

#

thinking 'i kicked this problems ass twice now i can go for some more'

#

Different function, same future outage jacket

#

oh hes got its number

subtle gull
# dense bison PR Landed 🫡

it looks like this isnt in 2026.5.7, is it showing up soon or is the underlying process getting reworked from the findings? I'm keeping track so I can reapply the fixes each time I update

dense bison
subtle gull
#

maybe my clanker is missing something, it keeps looking for the patch signals and says its not landed yet

#

• But v2026.5.7 does not contain the PR code:
• src/gateway/session-utils.ts in tag v2026.5.7 has no modelIdentityByEntryKey
• no allowPluginNormalization
• no lightweight-row fast-path markers from the PR
• GitHub compare shows the release tag diverged from the PR merge commit, not ahead of it.
Operational read: 2026.5.7 was likely cut from a maintenance/release branch that did not include our merged sessions.list PR. The fix is merged to main, but not shipped in this release.

#

soon as i cut a commit on another project i'll check to see if im safe to update to 2026.5.7 and then update it and not patch, my symptoms are immediate and severe so it will be an easy test

calm pebble
#

I had/have a PR that stayed in unreleased too, though in that case I was asked to review if it was needed at all or needed to change shape after this release

subtle gull
# dense bison It isn't? It should've been in the next release after I landed it

Confirmed: openclaw-prod is on 2026.5.7, the package is missing the fast-path markers, and the symptom is back. I have a rebased 2026.5.7 build ready; applying it with a package backup now.

02:02:36+00:00 warn diagnostic {"subsystem":"diagnostic"} liveness warning: reasons=event_loop_delay interval=32s eventLoopDelayP99Ms=650.6 eventLoopDelayMaxMs=5838.5 eventLoopUtilization=0.687 cpuCoreRatio=0.711 active=1 waiting=0 queued=1 recentPhases=sidecars.subagent-recovery:4ms,sidecars.main-session-recovery:4ms,post-attach.update-sentinel:0ms,sidecars.model-prewarm:581ms,sidecars.session-locks:835ms,post-ready.maintenance:138ms work=[active=agent:forge:webchat:forge-main(processing,q=1,age=5s last=run:completed) queued=agent:forge:webchat:forge-main(processing,q=1,age=5s last=run:completed)]
02:02:37+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 591ms conn=da87abf7…1a00 id=e210e490…5c3f
02:02:45+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 6502ms conn=da87abf7…1a00 id=a4c41974…01b8
02:02:45+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ logs.tail 6513ms conn=8e61cb8e…97ca id=cb88e5c3…3c9b
02:02:57+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 595ms conn=da87abf7…1a00 id=f580d9d1…543e
02:03:07+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 6564ms conn=da87abf7…1a00 id=7e299566…abaf
02:03:17+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 582ms conn=da87abf7…1a00 id=77f8feeb…e7e0
02:03:31+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 9194ms conn=da87abf7…1a00 id=fffc9483…1a18
02:03:32+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 585ms conn=da87abf7…1a00 id=bacdf968…4389
02:03:52+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 6197ms conn=da87abf7…1a00 id=6ed2e67d…b38c
02:03:54+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 617ms conn=da87abf7…1a00 id=bbab0b17…c024
02:04:14+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 6417ms conn=da87abf7…1a00 id=d5e3b7c8…2c7f
02:04:14+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 7011ms conn=da87abf7…1a00 id=5b39928f…b341
02:04:36+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 6477ms conn=da87abf7…1a00 id=2323de92…ee2f
02:04:36+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ chat.history 7114ms conn=da87abf7…1a00 id=989a9b11…f76

yeah it doesnt look like its in

dense bison
subtle gull
#

after it repatched

02:10:42+00:00 warn diagnostic {"subsystem":"diagnostic"} liveness warning: reasons=event_loop_delay interval=34s eventLoopDelayP99Ms=302 eventLoopDelayMaxMs=4810.9 eventLoopUtilization=0.472 cpuCoreRatio=0.483 active=1 waiting=0 queued=1 recentPhases=sidecars.subagent-recovery:3ms,sidecars.main-session-recovery:4ms,post-attach.update-sentinel:0ms,sidecars.model-prewarm:643ms,post-ready.maintenance:178ms,sidecars.session-locks:1217ms work=[active=agent:forge:webchat:forge-main(processing,q=1,age=8s last=run:completed) queued=agent:forge:webchat:forge-main(processing,q=1,age=8s last=run:completed)]
02:10:44+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ logs.tail 842ms conn=b767019f…acba id=0d9a0228…fbfc
02:10:57+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 1368ms conn=2edb97db…4bfe id=19fcfcb5…511d
02:11:13+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 1384ms conn=2edb97db…4bfe id=58352e11…9490
02:11:30+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 1384ms conn=2edb97db…4bfe id=fcacf92b…826a
02:11:40+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 1384ms conn=2edb97db…4bfe id=0202e0b5…1bab
02:11:48+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 1261ms conn=2edb97db…4bfe id=1f0f9d5d…3bcd
02:11:50+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ logs.tail 834ms conn=7f63f694…8e09 id=2ed91323…b5b2
02:12:06+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 1384ms conn=2edb97db…4bfe id=1ae0ae1e…edd8
02:12:23+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 1369ms conn=2edb97db…4bfe id=901a65db…15c9
02:12:39+00:00 info gateway/ws {"subsystem":"gateway/ws"} ⇄ res ✓ sessions.list 1481ms conn=2edb97db…4bfe id=9053e7fa…b90a

higher than what it was when we made the PR
i asked specifically for 77902 tracing

#

For #77902: yes, GitHub reports it merged.

• PR: openclaw/openclaw#77902
• State: closed
• merged=True
• Merged at: 2026-05-06T07:33:55Z
• Merge commit: 3e0fcafb8720ba3c1bb0fb430b88013eff5c6d08
• Title: fix(gateway): fast-path qualified session-list model refs
Important caveat: it was merged, but it was not present in shipped v2026.5.7, which is why the prod hotpatch was still needed.

3. Code comparison
In v2026.5.7, SessionListRowContext only has the thinking-level cache:

type SessionListRowContext = {
subagentRuns: ReturnType;
storeChildSessionsByKey: Map;
thinkingLevelsByModelRef: Map>;
};

In origin/main, it has the missing selected-model cache:

type SessionListRowContext = {
subagentRuns: ReturnType;
storeChildSessionsByKey: Map;
selectedModelByOverrideRef: Map>;
thinkingLevelsByModelRef: Map>;
};

4. Row context initialization

v2026.5.7:

return {
subagentRuns,
storeChildSessionsByKey: buildStoreChildSessionIndex(params.store, params.now, subagentRuns),
thinkingLevelsByModelRef: new Map(),
};

origin/main:

return {
subagentRuns,
storeChildSessionsByKey: buildStoreChildSessionIndex(params.store, params.now, subagentRuns),
selectedModelByOverrideRef: new Map(),
thinkingLevelsByModelRef: new Map(),
};

5. Missing helper function in shipped tag
origin/main contains:

function resolveSessionSelectedModelRef(params: {
cfg: OpenClawConfig;
entry?: SessionEntry;
agentId: string;
rowContext?: SessionListRowContext;
}): ReturnType | null {
const override = normalizeStoredOverrideModel({
providerOverride: params.entry?.providerOverride,
modelOverride: params.entry?.modelOverride,
});
if (!override.modelOverride) {
return null;
}
if (!params.rowContext) {
return resolveSessionModelRef(params.cfg, params.entry, params.agentId);
}
const key = [
normalizeAgentId(params.agentId),
override.providerOverride ?? "",
override.modelOverride,
].join("\0");
const cached = params.rowContext.selectedModelByOverrideRef.get(key);
if (cached) {
return cached;
}
const selected = resolveSessionModelRef(params.cfg, params.entry, params.agentId);
params.rowContext.selectedModelByOverrideRef.set(key, selected);
return selected;
}

v2026.5.7 does not have this helper. It jumps from createSessionRowModelCacheKey() directly to resolveSessionRowThinkingLevels().

6. Call-site difference

v2026.5.7 still directly resolves per row:

const resolved = resolveSessionModelRef(cfg, entry, agentId);
return { provider: resolved.provider, model: resolved.model };

origin/main uses the cached helper in row construction:

const selectedModel = resolveSessionSelectedModelRef({
cfg,
entry,
agentId: sessionAgentId,
rowContext,
});

Conclusion: #77902 is merged to GitHub main, but v2026.5.7 was cut without that code. The shipped tag lacks the actual cache field, initializer, helper, and row call-site change.

#

ok that was a bit big, sorry

#

its also looking at the extra latency vs 2026.5.5 which is what i was running prior to todays patch. I think our steady state was around 400ms and peaks to 1200ms, not steady at 1200ms and peaks to 4800ms
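
To compare steady state vs peaks between versions without eyeballing, something like this could summarize a pasted chunk of gateway/ws log lines. It is a sketch that assumes the line shape shown in the logs above (`res <mark> <method> <N>ms`); `summarizeLatency` and `LatencySummary` are made-up names.

```typescript
type LatencySummary = { count: number; medianMs: number; maxMs: number };

// Pulls "<method> <N>ms" out of gateway/ws response lines and summarizes one method.
function summarizeLatency(lines: string[], method: string): LatencySummary | null {
  // Matches e.g. "res ✓ sessions.list 1368ms"; the status mark is any non-space token.
  const pattern = /\bres\s+\S+\s+([\w.]+)\s+(\d+)ms\b/;
  const samples: number[] = [];

  for (const line of lines) {
    const match = pattern.exec(line);
    if (match && match[1] === method) {
      samples.push(Number(match[2]));
    }
  }
  if (samples.length === 0) {
    return null;
  }

  samples.sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  const medianMs =
    samples.length % 2 === 1 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2;

  return { count: samples.length, medianMs, maxMs: samples[samples.length - 1] };
}
```

The median works as a rough "steady state" number and the max as the "peak", so two captures from different versions can be compared side by side.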

subtle gull
dense bison
#

Ahhh got it

subtle gull
#

and my clanker thinks the extra latency is from background maintenance tasks related to trying to recover failed cron jobs

#

so not a sessions.list issue, different topic for a different PR

dense bison