Hey, hitting a chat hang on OpenClaw.
Context: Multi-tenant SaaS: one Paperclip member host plus one OpenClaw machine per company on Fly.io (1 GB RAM, shared 1× CPU, image openclaw-template, started via gateway --allow-unconfigured --bind lan).
Symptom: The first POST /v1/chat/completions returns HTTP 200 + SSE headers in ~200 ms, then 0 body bytes for 60-180 s. After the client disconnects, the log shows: [model-fallback/decision] reason=timeout detail=This operation was aborted. The session's *.jsonl.lock file is left behind holding the live gateway PID; the JSONL itself is never written. Earlier we saw [session-write-lock] releasing lock held for 474 797 ms (~8 min, 32× max), which looks like a leak.
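To reproduce the timing described above, here is a minimal probe sketch that records when the headers arrive and when the first non-empty body chunk arrives. It assumes Node 18+ (global fetch), that the gateway token is sent as a Bearer header, and that the request body follows the usual OpenAI chat-completions shape. The function names, the URL, and the model ID are placeholders, not anything taken from OpenClaw.

```typescript
// Sketch only: URL, model ID, and Bearer-style auth are assumptions.

interface ProbeTiming {
  status: number;
  headersMs: number;            // ms until status line + headers arrived
  firstChunkMs: number | null;  // ms until first non-empty body chunk, or null if none was seen
}

// Reads chunks until the first non-empty one; returns elapsed ms since startedAt,
// or null if the stream ended without delivering any bytes.
async function timeToFirstChunk(
  chunks: AsyncIterable<Uint8Array>,
  startedAt: number,
): Promise<number | null> {
  for await (const chunk of chunks) {
    if (chunk.byteLength > 0) return Date.now() - startedAt;
  }
  return null;
}

// Sends one streaming chat request and gives up after budgetMs.
// If the budget expires before headers arrive, the fetch rejection propagates.
async function probe(
  url: string,
  token: string,
  model: string,
  budgetMs: number,
): Promise<ProbeTiming> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  const startedAt = Date.now();
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`, // assumed auth scheme
      },
      body: JSON.stringify({
        model,
        stream: true,
        messages: [{ role: "user", content: "ping" }],
      }),
      signal: controller.signal,
    });
    const headersMs = Date.now() - startedAt;
    let firstChunkMs: number | null = null;
    if (res.body) {
      try {
        firstChunkMs = await timeToFirstChunk(
          res.body as unknown as AsyncIterable<Uint8Array>,
          startedAt,
        );
      } catch {
        // Any read failure, including our own budget abort, counts as "no body bytes seen".
      }
    }
    return { status: res.status, headersMs, firstChunkMs };
  } finally {
    clearTimeout(timer);
    controller.abort(); // close the connection once we have our measurement
  }
}

// Usage (placeholders in angle brackets):
// probe("http://<machine-host>:<port>/v1/chat/completions", "<token>", "<model-id>", 60_000)
//   .then((t) => console.log(t));
```

With the hang described here, headersMs should come back small while firstChunkMs stays null for the whole budget.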
What works: [gateway] ready + acpx runtime backend ready log normally (~9-18 s); GET /healthz 200 OK in 200 ms; auth (OPENCLAW_GATEWAY_TOKEN) accepted; direct curl https://api.anthropic.com/v1/messages from inside the machine returns 200 OK in ~2 s, so creds + egress + model are fine.
Tried: BOOTSTRAP.md workflow active and setupCompletedAt pre-seeded so bootstrapMode=none (identical hang both ways); rollback to two prior image builds; fly machine restart × 4; client timeouts 30 / 60 / 90 / 120 / 180 s (all hang the full budget).
Guess: A race between watchClientDisconnect (src/gateway/openai-http.ts) and acquireSessionWriteLock (src/agents/session-write-lock.ts) on the first chat: the abort fires before the first model token, and the lock isn't released cleanly.
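To make the suspected failure mode concrete, here is a minimal sketch of the pattern that would avoid it: the lock is released in a finally block, so an abort that lands before the first token still frees it. This is not OpenClaw's code. heldLocks, acquireLock, withSessionLock, and fakeModelCall are hypothetical names, and an in-memory Set stands in for the on-disk *.jsonl.lock files.

```typescript
// Sketch only: an in-memory stand-in for the lock files, with hypothetical names.
const heldLocks = new Set<string>();

// Takes the lock and returns an idempotent release function.
function acquireLock(sessionKey: string): () => void {
  if (heldLocks.has(sessionKey)) {
    throw new Error(`lock already held: ${sessionKey}`);
  }
  heldLocks.add(sessionKey);
  let released = false;
  return () => {
    if (released) return;
    released = true;
    heldLocks.delete(sessionKey);
  };
}

// Runs work under the lock. The finally block releases the lock whether
// work resolves, throws, or rejects because the client disconnected.
async function withSessionLock<T>(
  sessionKey: string,
  signal: AbortSignal,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const release = acquireLock(sessionKey);
  try {
    // The abort may already have fired by the time the lock was taken.
    if (signal.aborted) throw new Error("aborted before work started");
    return await work(signal);
  } finally {
    release();
  }
}

// Simulated model call: yields a "first token" after 50 ms unless aborted first.
function fakeModelCall(signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve("first token"), 50);
    signal.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        reject(new Error("This operation was aborted"));
      },
      { once: true },
    );
  });
}
```

The finally block only runs once the awaited work settles. If the work ignores the abort signal and never resolves or rejects, the lock stays held indefinitely, which would be consistent with the leftover lock file and the long hold time in the log.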
Questions: Is --allow-unconfigured + per-tenant multi-machine deployment supported in production, or do we need a written agent definition + auth-profiles.json? And is the OpenAI-compat HTTP path first-class for production, or should we be on the WebSocket gateway protocol?