#setup observability with grafana cloud
First, can you paste openclaw --version? “latest” can mean different builds, and observability changed a bit over time.
Assuming current OpenClaw, the best setup for Grafana is:
- Run Grafana Alloy or an OpenTelemetry Collector on the Ubuntu box
- Have OpenClaw export OTLP to that local collector
- Let the collector forward to Grafana Cloud
I would not start with Prometheus scraping. OpenClaw already has OTel export for diagnostics, so OTLP is the clean path.
Minimal OpenClaw config, JSON5:
{
  plugins: {
    allow: ["diagnostics-otel"],
    entries: {
      "diagnostics-otel": { enabled: true },
    },
  },
  diagnostics: {
    enabled: true,
    otel: {
      enabled: true,
      endpoint: "http://127.0.0.1:4318",
      protocol: "http/protobuf",
      serviceName: "openclaw-gateway",
      traces: true,
      metrics: true,
      logs: false, // start false, enable later if needed
      sampleRate: 0.2,
      flushIntervalMs: 60000,
    },
  },
}
Then:
openclaw gateway restart
openclaw logs --follow
openclaw status --all
A few important bits:
- Use http/protobuf. Docs note grpc is ignored today.
- Start with metrics + traces. Add logs later, they can get noisy and expensive.
- For debugging, use targeted diagnostics.flags, not "*" permanently.
Best first Grafana dashboards:
- token usage and cost by provider/model
- run duration / p95 latency
- webhook/message error rate
- queue depth / wait time
- stuck sessions / session state changes
Docs:
- https://docs.openclaw.ai/logging#diagnostics--opentelemetry
- https://docs.openclaw.ai/gateway/configuration-reference#diagnostics
If you want, paste your version and whether you’re using Grafana Cloud or self-hosted Grafana, and I can sketch the exact Alloy config next.
OpenClaw 2026.4.14 (323493f)
Grafana cloud for now to evaluate, then perhaps self-hosted later on
Nice, 2026.4.14 is new enough for the OTel path. For what you want, I’d go with:
OpenClaw -> local Grafana Alloy on the Ubuntu box -> Grafana Cloud
That’s the best setup because:
- OpenClaw stays pointed at localhost
- Alloy handles auth, retry, buffering, and routing
- when you move to self-hosted later, you mostly just swap Alloy exporters, not OpenClaw config
1) Enable OpenClaw diagnostics export
Easiest path:
openclaw plugins enable diagnostics-otel
Then in your config, add something like this, JSON5:
{
  plugins: {
    // if you already have an allow list, append diagnostics-otel, don't replace everything
    allow: ["diagnostics-otel"],
    entries: {
      "diagnostics-otel": { enabled: true },
    },
  },
  diagnostics: {
    enabled: true,
    otel: {
      enabled: true,
      endpoint: "http://127.0.0.1:4318",
      protocol: "http/protobuf",
      serviceName: "openclaw-gateway",
      traces: true,
      metrics: true,
      logs: false,
      sampleRate: 1.0, // good for evaluation, lower later if traffic is high
      flushIntervalMs: 10000,
    },
  },
}
A couple of gotchas from the docs:
- use http/protobuf
- grpc is listed in config, but is effectively ignored today
- start with metrics + traces
- enable logs later only if you really want them, they can get noisy and OTLP log export uses the main logger output
2) Put Alloy in front of Grafana Cloud
In Grafana Cloud, use their OpenTelemetry / Alloy quickstart to generate the exporter side.
Then make Alloy listen on local OTLP HTTP :4318, and point OpenClaw at that.
That gives you the clean migration path later:
- today: Alloy -> Grafana Cloud
- later: Alloy -> Tempo / Loki / Mimir / self-hosted collector
3) Restart and verify
openclaw gateway restart
openclaw logs --follow
openclaw status --all
What you want to see:
- no exporter/auth errors in logs
- diagnostics flowing after real message traffic
- spans/metrics appearing in Grafana Cloud within a minute or two
4) Best first dashboards
These are the most useful OpenClaw signals to start with:
Metrics
- openclaw.tokens
- openclaw.cost.usd
- openclaw.run.duration_ms
- openclaw.context.tokens
- openclaw.message.processed
- openclaw.message.duration_ms
- openclaw.queue.depth
- openclaw.queue.wait_ms
- openclaw.session.state
- openclaw.session.stuck
Spans
- openclaw.model.usage
- openclaw.webhook.processed
- openclaw.message.processed
- openclaw.session.stuck
My recommended first dashboard rows:
- Usage / Cost
  - tokens by provider/model
  - USD cost by provider/model
- Latency
  - run duration p50/p95
  - message processing duration p50/p95
- Reliability
  - message processed outcome
  - webhook errors
  - stuck session count
- Backpressure
  - queue depth
  - queue wait time
5) When you turn logs on
If you later flip this on:
logs: true
also keep logging.level fairly conservative. OTLP logs can get big, and console redaction does not protect OTLP logs the same way.
OpenClaw docs:
- https://docs.openclaw.ai/logging#diagnostics--opentelemetry
- https://docs.openclaw.ai/gateway/configuration-reference#diagnostics
If you want, I can do the next step and sketch a minimal Alloy config for Grafana Cloud that matches this OpenClaw setup.
please do
Yep, here’s the minimal shape I’d use.
OpenClaw side
If you already ran:
openclaw plugins enable diagnostics-otel
then the main config you need is just:
{
  diagnostics: {
    enabled: true,
    otel: {
      enabled: true,
      endpoint: "http://127.0.0.1:4318",
      protocol: "http/protobuf",
      serviceName: "openclaw-gateway",
      traces: true,
      metrics: true,
      logs: false,
      sampleRate: 1.0,
      flushIntervalMs: 10000,
    },
  },
}
Notes:
- keep Alloy bound to 127.0.0.1, not 0.0.0.0
- start with logs: false
- if you already use plugins.allow, make sure diagnostics-otel is included there
Minimal Alloy config
Something like /etc/alloy/config.alloy:
otelcol.auth.basic "grafana_cloud" {
  username = sys.env("GRAFANA_CLOUD_OTLP_USER")
  password = sys.env("GRAFANA_CLOUD_OTLP_API_KEY")
}

otelcol.receiver.otlp "openclaw" {
  http {
    endpoint = "127.0.0.1:4318"
  }
  output {
    metrics = [otelcol.processor.batch.openclaw.input]
    traces  = [otelcol.processor.batch.openclaw.input]
  }
}

otelcol.processor.batch "openclaw" {
  output {
    metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
    traces  = [otelcol.exporter.otlphttp.grafana_cloud.input]
  }
}

otelcol.exporter.otlphttp "grafana_cloud" {
  client {
    endpoint = sys.env("GRAFANA_CLOUD_OTLP_ENDPOINT")
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}
Env vars for Alloy
Use the values from the Grafana Cloud OpenTelemetry setup page:
GRAFANA_CLOUD_OTLP_ENDPOINT=https://<your-grafana-cloud-otlp-endpoint>/otlp
GRAFANA_CLOUD_OTLP_USER=<your-grafana-cloud-otlp-username>
GRAFANA_CLOUD_OTLP_API_KEY=<your-grafana-cloud-api-key>
Important bit: the endpoint should be the base OTLP endpoint, usually ending in /otlp, because Alloy adds /v1/metrics and /v1/traces.
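If you want to sanity-check that endpoint value before starting Alloy, here is a minimal Python sketch. The helper name check_otlp_base_endpoint is made up for illustration, and the https requirement is my assumption about Grafana Cloud endpoints; the only rule taken from above is "base URL, no per-signal path".

```python
from urllib.parse import urlparse

# Per-signal paths that an OTLP/HTTP exporter appends on its own.
SIGNAL_PATHS = ("/v1/traces", "/v1/metrics", "/v1/logs")


def check_otlp_base_endpoint(endpoint: str) -> list[str]:
    """Return a list of problems with a base OTLP endpoint (empty list = looks fine)."""
    problems = []
    parsed = urlparse(endpoint)
    # Assumption: a hosted endpoint should always be reached over TLS.
    if parsed.scheme != "https":
        problems.append("expected an https:// URL")
    if not parsed.netloc:
        problems.append("missing host")
    # The exporter adds /v1/<signal> itself, so the configured value must not include it.
    if parsed.path.rstrip("/").endswith(SIGNAL_PATHS):
        problems.append("remove the per-signal path; the exporter adds it")
    return problems
```

Calling it on the value of GRAFANA_CLOUD_OTLP_ENDPOINT should return an empty list; anything else is worth fixing before you go looking for missing data in Grafana.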
If you later want logs too
- In OpenClaw:
logs: true
- In Alloy, add logs to both output blocks:
output {
  metrics = [otelcol.processor.batch.openclaw.input]
  traces  = [otelcol.processor.batch.openclaw.input]
  logs    = [otelcol.processor.batch.openclaw.input]
}
and
output {
  metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
  traces  = [otelcol.exporter.otlphttp.grafana_cloud.input]
  logs    = [otelcol.exporter.otlphttp.grafana_cloud.input]
}
I’d still leave logs off until metrics/traces are clean.
Verify it
After reloading Alloy and restarting OpenClaw:
openclaw gateway restart
openclaw logs --follow
Also watch Alloy logs, then trigger some real OpenClaw traffic. You should start seeing:
- openclaw.tokens
- openclaw.cost.usd
- openclaw.run.duration_ms
- openclaw.message.processed
- openclaw.queue.depth
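If nothing shows up, it helps to first rule out the local hop. A rough Python sketch that posts an empty OTLP/HTTP JSON payload to the Alloy receiver; build_probe_request and probe are hypothetical helper names, and I'm assuming the receiver accepts JSON-encoded OTLP on the same port, which is standard for OTLP/HTTP receivers but not something I've checked against your Alloy version.

```python
import json
import urllib.request

# OTLP export requests wrap their data in one top-level list per signal.
TOP_LEVEL_KEY = {
    "traces": "resourceSpans",
    "metrics": "resourceMetrics",
    "logs": "resourceLogs",
}


def build_probe_request(base_url: str, signal: str = "traces") -> urllib.request.Request:
    """Build a POST to <base_url>/v1/<signal> with an empty (data-free) OTLP JSON body."""
    body = json.dumps({TOP_LEVEL_KEY[signal]: []}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/{signal}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def probe(base_url: str = "http://127.0.0.1:4318", signal: str = "traces") -> int:
    """Send the probe and return the HTTP status; raises on connection errors or non-2xx replies."""
    with urllib.request.urlopen(build_probe_request(base_url, signal), timeout=5) as resp:
        return resp.status
```

Run probe() on the Ubuntu box. A connection error points at Alloy not listening on 127.0.0.1:4318; a successful status means the receiver is up and the problem is more likely on the OpenClaw side or in the Grafana Cloud exporter.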
OpenClaw docs:
- https://docs.openclaw.ai/logging#diagnostics--opentelemetry
- https://docs.openclaw.ai/gateway/configuration-reference#diagnostics
If you want, next I can sketch the first Grafana dashboard panels I’d build from those signals.
yes please
Yep. I’d start with one overview dashboard and one trace drill-down view.
Also, before building panels, add a few dashboard variables:
- provider
- model
- channel
- lane
I would not use sessionKey or chatId as top-level variables, too high-cardinality. Keep those for traces/logs.
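To put rough numbers on the cardinality point, here is a small Python sketch. The label counts are invented for illustration, not measured from any OpenClaw deployment, and series_upper_bound is a hypothetical helper.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Illustrative numbers only.
base_labels = {"provider": 3, "model": 8, "channel": 4, "lane": 3}
low = series_upper_bound(base_labels)
# Adding a per-session label multiplies the worst case by the number of sessions.
high = series_upper_bound({**base_labels, "sessionKey": 5000})
print(low, high)  # prints: 288 1440000
```

The real series count is usually well below the worst case because not every combination occurs, but the multiplication is why a per-session label belongs in traces and logs rather than in metric dimensions or dashboard variables.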
Dashboard 1, OpenClaw Overview
Row 1, KPIs
Use Stat panels for the last 1h / 24h:
- Messages processed
  - openclaw.message.processed
- Error count
  - openclaw.webhook.error
  - plus failed openclaw.message.processed by openclaw.outcome
- Tokens used
  - openclaw.tokens
- Cost (USD)
  - openclaw.cost.usd
- p95 run duration
  - openclaw.run.duration_ms
Break down by:
- openclaw.provider
- openclaw.model
- openclaw.channel
Row 2, Model performance
Use Time series panels:
- Run duration p50 / p95
  - openclaw.run.duration_ms
- Context size
  - openclaw.context.tokens
- Tokens over time
  - openclaw.tokens
- Cost over time
  - openclaw.cost.usd
Useful splits:
- by provider
- by model
This row tells you fast if one model is slower, more expensive, or blowing up context.
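For the p50 / p95 panels, it helps to know what the number means when the source is a bucketed histogram. A minimal Python sketch of the usual estimate (linear interpolation inside the bucket that contains the target rank), assuming the duration metrics arrive as histograms with cumulative bucket counts. That is an assumption on my part, and the bucket values below are invented.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets sorted by bound."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # The overflow bucket has no upper edge, so fall back to the last finite bound.
            if bound == float("inf") or count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented example: cumulative counts of run durations in milliseconds.
example_buckets = [(100.0, 50), (500.0, 90), (1000.0, 98), (float("inf"), 100)]
p95_ms = histogram_quantile(0.95, example_buckets)
```

With these invented buckets the p95 lands at about 812.5 ms, between the 500 ms and 1000 ms bounds. The estimate can only be as precise as the bucket boundaries, so treat p95 panels as approximate when buckets are wide.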
Row 3, Message pipeline health
Use Time series:
- Webhook duration
  - openclaw.webhook.duration_ms
- Message processing duration
  - openclaw.message.duration_ms
- Queue depth
  - openclaw.queue.depth
- Queue wait
  - openclaw.queue.wait_ms
Useful splits:
- openclaw.channel
- openclaw.lane
- openclaw.outcome
This is the “am I backing up?” row.
Row 4, Reliability
Use Stacked time series or Bar chart:
- Processed outcome
  - openclaw.message.processed split by openclaw.outcome
- Webhook errors
  - openclaw.webhook.error split by channel/webhook
- Session states
  - openclaw.session.state split by openclaw.state and openclaw.reason
- Stuck sessions
  - openclaw.session.stuck
  - openclaw.session.stuck_age_ms
This row catches real trouble fast.
Dashboard 2, Trace drill-down
Once traces are flowing, use Tempo / traces view for these span names:
- openclaw.model.usage
- openclaw.webhook.processed
- openclaw.message.processed
- openclaw.session.stuck
Best filters:
- service.name = openclaw-gateway
- openclaw.provider
- openclaw.model
- openclaw.channel
- openclaw.outcome
- openclaw.reason
What I’d look for first:
- slow openclaw.model.usage spans by model
- failed openclaw.message.processed spans by outcome/reason
- long queue-to-process chains before stuck sessions
Optional logs dashboard, later
Only if you enable OTLP logs.
Panels:
- Error logs over time
- Warn logs over time
- Recent error stream
Filter on:
service.name = openclaw-gateway
level = error|warn
My recommended first 6 panels
If you want the leanest useful dashboard, build these first:
- Messages processed / min
- Error rate
- Tokens by model
- Cost by model
- p95 run duration
- Queue depth + wait time
That gets you 80% of the value right away.
One note: in Grafana Cloud, OTel metrics often get normalized when exposed through Prometheus-style queries, so the exact final metric names/suffixes may differ a bit. But the logical signals above are the right ones to target.
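As a rough guide to what that normalization does, here is a simplified Python sketch. It only covers the two most common changes (non-alphanumeric characters become underscores, counters get a _total suffix). Real translation can also append unit suffixes and adds _bucket / _sum / _count for histograms, and whether a given OpenClaw metric is a counter is an assumption here.

```python
import re


def prometheus_style_name(otel_name: str, is_counter: bool = False) -> str:
    """Simplified OTel -> Prometheus-style metric name translation (not the full rules)."""
    # Characters outside [a-zA-Z0-9_:] are not valid in Prometheus metric names.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # Monotonic counters conventionally carry a _total suffix.
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name


print(prometheus_style_name("openclaw.queue.depth"))             # prints: openclaw_queue_depth
print(prometheus_style_name("openclaw.tokens", is_counter=True))  # prints: openclaw_tokens_total
```

Use this only to guess what to type into the metric browser; the names Grafana Cloud actually shows in Explore are the source of truth.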