#OpenClaw `/v1/chat/completions` hangs forever on first chat after fresh boot (per-tenant multi-machi


sweet sun

Hey, hitting a chat hang on OpenClaw.

Context: Multi-tenant SaaS: one Paperclip member host plus one OpenClaw machine per company on Fly.io (1 GB, shared 1× CPU, image openclaw-template, started via gateway --allow-unconfigured --bind lan).

Symptom: First POST /v1/chat/completions returns HTTP 200 + SSE headers in ~200 ms, then 0 body bytes for 60-180 s. After client disconnect: [model-fallback/decision] reason=timeout detail=This operation was aborted. Session JSONL *.jsonl.lock is left behind with the live gateway PID; the JSONL itself is never written. Earlier we saw [session-write-lock] releasing lock held for 474 797 ms (~8 min, 32× max), which looks like a leak.
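For anyone triaging the same thing, here is a rough sketch of how gateway logs could be scanned for over-long lock holds like the one quoted above. It is not OpenClaw code: `parseLockHold`, `findLeakedHolds`, and the `maxHoldMs` parameter are hypothetical names, the pattern is based only on the single log line in this post, and the 15 000 ms maximum in the usage note is inferred from the "32× max" figure rather than confirmed.

```typescript
// Hypothetical triage helper, not part of OpenClaw. The log shape is taken
// from the one line quoted in this post; other variants may exist.
interface LockHold {
  heldMs: number;
  timesOverMax: number;
}

const HOLD_PATTERN =
  /\[session-write-lock\] releasing lock held for ([\d\s,_]+?)\s*ms/;

// Returns the hold duration found in a log line, or null if the line does
// not match. maxHoldMs is an assumed limit supplied by the caller.
function parseLockHold(line: string, maxHoldMs: number): LockHold | null {
  const match = HOLD_PATTERN.exec(line);
  if (match === null) return null;
  // Strip digit-group separators such as the space in "474 797".
  const digits = match[1].replace(/\D/g, "");
  if (digits === "" || maxHoldMs <= 0) return null;
  const heldMs = Number(digits);
  return { heldMs, timesOverMax: heldMs / maxHoldMs };
}

// Keeps only the holds that exceeded the assumed maximum.
function findLeakedHolds(lines: string[], maxHoldMs: number): LockHold[] {
  return lines
    .map((line) => parseLockHold(line, maxHoldMs))
    .filter(
      (hold): hold is LockHold => hold !== null && hold.heldMs > maxHoldMs,
    );
}
```

Calling `findLeakedHolds(logLines, 15_000)` over a machine's log lines would return one entry per release that ran past the assumed limit, which makes it easier to see whether the leak happens on every first chat or only sometimes.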

What works: [gateway] ready + acpx runtime backend ready log normally (~9-18 s); GET /healthz 200 OK in 200 ms; auth (OPENCLAW_GATEWAY_TOKEN) accepted; direct curl https://api.anthropic.com/v1/messages from inside the machine returns 200 OK in ~2 s, so creds + egress + model are fine.

Tried: BOOTSTRAP.md workflow active and setupCompletedAt pre-seeded so bootstrapMode=none (identical hang both ways); rollback to two prior image builds; fly machine restart × 4; client timeouts 30 / 60 / 90 / 120 / 180 s (all hang the full budget).

Guess: Race between watchClientDisconnect (src/gateway/openai-http.ts) and acquireSessionWriteLock (src/agents/session-write-lock.ts) on the first chat: the abort fires before the first model token, and the lock isn't released cleanly.
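To make the guess concrete, this is a minimal sketch of the pattern we would expect to prevent that kind of leak: release the lock in a `finally`, and race the model call against the abort signal so a stalled call cannot pin the lock. It is purely illustrative. `SessionLock` and `withSessionLock` are hypothetical in-memory stand-ins, and the real `watchClientDisconnect` and `acquireSessionWriteLock` may be structured quite differently.

```typescript
// Hypothetical in-memory stand-in for a session write lock; the real
// acquireSessionWriteLock in OpenClaw may behave quite differently.
class SessionLock {
  private holder: string | null = null;

  acquire(owner: string): void {
    if (this.holder !== null) {
      throw new Error(`lock already held by ${this.holder}`);
    }
    this.holder = owner;
  }

  release(owner: string): void {
    if (this.holder === owner) this.holder = null;
  }

  get held(): boolean {
    return this.holder !== null;
  }
}

// Runs `work` under the lock and releases it on every exit path, including
// an abort that fires before `work` has produced anything.
async function withSessionLock<T>(
  lock: SessionLock,
  owner: string,
  signal: AbortSignal,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  if (signal.aborted) throw new Error("aborted before lock acquisition");
  lock.acquire(owner);
  // Rejects when the client disconnects; never settles otherwise.
  const aborted = new Promise<never>((_, reject) => {
    signal.addEventListener(
      "abort",
      () => reject(new Error("client disconnected")),
      { once: true },
    );
  });
  try {
    // Race the work against the abort so a stalled call cannot hold the lock.
    return await Promise.race([work(signal), aborted]);
  } finally {
    lock.release(owner);
  }
}
```

With this shape, an abort that arrives before the first token still reaches the `finally`, so the lock is released as soon as the client goes away, not when some later timeout notices it.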
Questions: Is --allow-unconfigured + per-tenant multi-machine deployment supported in production, or do we need a written agent definition + auth-profiles.json? And is the OpenAI-compat HTTP path first-class for production, or should we be on the WebSocket gateway protocol?

Full report