#setup observability with grafana cloud

steep onyx

Hi Kril,

I have the latest OpenClaw installed on a remote Ubuntu server. What is the best way to set up diagnostic dashboards in Grafana?

quasi sandBOT
violet garnetBOT

First, can you paste openclaw --version? “latest” can mean different builds, and observability changed a bit over time.

Assuming current OpenClaw, the best setup for Grafana is:

  1. Run Grafana Alloy or an OpenTelemetry Collector on the Ubuntu box
  2. Have OpenClaw export OTLP to that local collector
  3. Let the collector forward to Grafana Cloud

I would not start with Prometheus scraping. OpenClaw already has OTel export for diagnostics, so OTLP is the clean path.

Minimal OpenClaw config, JSON5:

{
  plugins: {
    allow: ["diagnostics-otel"],
    entries: {
      "diagnostics-otel": { enabled: true },
    },
  },

  diagnostics: {
    enabled: true,
    otel: {
      enabled: true,
      endpoint: "http://127.0.0.1:4318",
      protocol: "http/protobuf",
      serviceName: "openclaw-gateway",
      traces: true,
      metrics: true,
      logs: false, // start false, enable later if needed
      sampleRate: 0.2,
      flushIntervalMs: 60000,
    },
  },
}

Then:

openclaw gateway restart
openclaw logs --follow
openclaw status --all

A few important bits:

  • Use http/protobuf. Docs note grpc is ignored today.
  • Start with metrics + traces. Add logs later, they can get noisy and expensive.
  • For debugging, use targeted diagnostics.flags, not "*" permanently.

Best first Grafana dashboards:

  • token usage and cost by provider/model
  • run duration / p95 latency
  • webhook/message error rate
  • queue depth / wait time
  • stuck sessions / session state changes

Docs:

If you want, paste your version and whether you’re using Grafana Cloud or self-hosted Grafana, and I can sketch the exact Alloy config next.

steep onyx

OpenClaw 2026.4.14 (323493f)

Grafana cloud for now to evaluate, then perhaps self-hosted later on

violet garnetBOT

Nice, 2026.4.14 is new enough for the OTel path. For what you want, I’d go with:

OpenClaw -> local Grafana Alloy on the Ubuntu box -> Grafana Cloud

That’s the best setup because:

  • OpenClaw stays pointed at localhost
  • Alloy handles auth, retry, buffering, and routing
  • when you move to self-hosted later, you mostly just swap Alloy exporters, not OpenClaw config

1) Enable OpenClaw diagnostics export

Easiest path:

openclaw plugins enable diagnostics-otel

Then in your config, add something like this, JSON5:

{
  plugins: {
    // if you already have an allow list, append diagnostics-otel, don't replace everything
    allow: ["diagnostics-otel"],
    entries: {
      "diagnostics-otel": { enabled: true },
    },
  },

  diagnostics: {
    enabled: true,
    otel: {
      enabled: true,
      endpoint: "http://127.0.0.1:4318",
      protocol: "http/protobuf",
      serviceName: "openclaw-gateway",
      traces: true,
      metrics: true,
      logs: false,
      sampleRate: 1.0,        // good for evaluation, lower later if traffic is high
      flushIntervalMs: 10000,
    },
  },
}

A couple of gotchas from the docs:

  • use http/protobuf
  • grpc is listed in config, but is effectively ignored today
  • start with metrics + traces
  • enable logs later only if you really want them, they can get noisy and OTLP log export uses the main logger output

2) Put Alloy in front of Grafana Cloud

In Grafana Cloud, use their OpenTelemetry / Alloy quickstart to generate the exporter side.
Then make Alloy listen on local OTLP HTTP :4318, and point OpenClaw at that.

That gives you the clean migration path later:

  • today: Alloy -> Grafana Cloud
  • later: Alloy -> Tempo / Loki / Mimir / self-hosted collector

3) Restart and verify

openclaw gateway restart
openclaw logs --follow
openclaw status --all

What you want to see:

  • no exporter/auth errors in logs
  • diagnostics flowing after real message traffic
  • spans/metrics appearing in Grafana Cloud within a minute or two
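If nothing shows up, a quick first check is whether anything is listening on the local OTLP port at all. A minimal Python sketch, assuming the collector is on 127.0.0.1:4318 as in the config above; `otlp_port_open` is just an illustrative name:

```python
import socket


def otlp_port_open(host: str = "127.0.0.1", port: int = 4318,
                   timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `otlp_port_open()` on the Ubuntu box should return True once the collector is up. This only proves the port accepts connections, not that the exporter credentials are valid, so still watch the collector logs.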

4) Best first dashboards

These are the most useful OpenClaw signals to start with:

Metrics

  • openclaw.tokens
  • openclaw.cost.usd
  • openclaw.run.duration_ms
  • openclaw.context.tokens
  • openclaw.message.processed
  • openclaw.message.duration_ms
  • openclaw.queue.depth
  • openclaw.queue.wait_ms
  • openclaw.session.state
  • openclaw.session.stuck

Spans

  • openclaw.model.usage
  • openclaw.webhook.processed
  • openclaw.message.processed
  • openclaw.session.stuck

My recommended first dashboard rows:

  1. Usage / Cost

    • tokens by provider/model
    • USD cost by provider/model
  2. Latency

    • run duration p50/p95
    • message processing duration p50/p95
  3. Reliability

    • message processed outcome
    • webhook errors
    • stuck session count
  4. Backpressure

    • queue depth

steep onyx

please do

violet garnetBOT

Yep, here’s the minimal shape I’d use.

OpenClaw side

If you already ran:

openclaw plugins enable diagnostics-otel

then the main config you need is just:

{
  diagnostics: {
    enabled: true,
    otel: {
      enabled: true,
      endpoint: "http://127.0.0.1:4318",
      protocol: "http/protobuf",
      serviceName: "openclaw-gateway",
      traces: true,
      metrics: true,
      logs: false,
      sampleRate: 1.0,
      flushIntervalMs: 10000,
    },
  },
}

Notes:

  • keep Alloy bound to 127.0.0.1, not 0.0.0.0
  • start with logs: false
  • if you already use plugins.allow, make sure diagnostics-otel is included there

Minimal Alloy config

Something like /etc/alloy/config.alloy:

otelcol.auth.basic "grafana_cloud" {
  username = sys.env("GRAFANA_CLOUD_OTLP_USER")
  password = sys.env("GRAFANA_CLOUD_OTLP_API_KEY")
}

otelcol.receiver.otlp "openclaw" {
  http {
    endpoint = "127.0.0.1:4318"
  }

  output {
    metrics = [otelcol.processor.batch.openclaw.input]
    traces  = [otelcol.processor.batch.openclaw.input]
  }
}

otelcol.processor.batch "openclaw" {
  output {
    metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
    traces  = [otelcol.exporter.otlphttp.grafana_cloud.input]
  }
}

otelcol.exporter.otlphttp "grafana_cloud" {
  client {
    endpoint = sys.env("GRAFANA_CLOUD_OTLP_ENDPOINT")
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}

Env vars for Alloy

Use the values from the Grafana Cloud OpenTelemetry setup page:

GRAFANA_CLOUD_OTLP_ENDPOINT=https://<your-grafana-cloud-otlp-endpoint>/otlp
GRAFANA_CLOUD_OTLP_USER=<your-grafana-cloud-otlp-username>
GRAFANA_CLOUD_OTLP_API_KEY=<your-grafana-cloud-api-key>

Important bit: the endpoint should be the base OTLP endpoint, usually ending in /otlp, because Alloy adds /v1/metrics and /v1/traces.
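To make the endpoint point concrete, here is a small Python sketch of how the per-signal URLs and the basic-auth header are derived from those three values. The function names are mine, and the hostname in the usage note is a placeholder; the `/v1/...` paths are the standard OTLP/HTTP ones.

```python
import base64


def signal_urls(base_endpoint: str) -> dict[str, str]:
    """Build the per-signal OTLP/HTTP URLs from the base endpoint."""
    base = base_endpoint.rstrip("/")  # tolerate a trailing slash
    return {signal: f"{base}/v1/{signal}"
            for signal in ("metrics", "traces", "logs")}


def basic_auth_header(user: str, api_key: str) -> str:
    """Build the HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{user}:{api_key}".encode()).decode()
    return f"Basic {token}"
```

For example, `signal_urls("https://example.invalid/otlp")["traces"]` gives `https://example.invalid/otlp/v1/traces`. If your env var already ends in `/v1/traces`, the exporter would end up with a doubled path, which is the usual cause of 404s here.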

If you later want logs too

  1. In OpenClaw, set:
logs: true
  2. In Alloy, add logs to both output blocks (the receiver's first, then the batch processor's):
output {
  metrics = [otelcol.processor.batch.openclaw.input]
  traces  = [otelcol.processor.batch.openclaw.input]
  logs    = [otelcol.processor.batch.openclaw.input]
}

and

output {
  metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
  traces  = [otelcol.exporter.otlphttp.grafana_cloud.input]
  logs    = [otelcol.exporter.otlphttp.grafana_cloud.input]
}

I’d still leave logs off until metrics/traces are clean.

Verify it

After reloading Alloy and restarting OpenClaw:

openclaw gateway restart
openclaw logs --follow

Also watch Alloy logs, then trigger some real OpenClaw traffic. You should start seeing:

  • openclaw.tokens
  • openclaw.cost.usd
  • openclaw.run.duration_ms
  • openclaw.message.processed

steep onyx

yes please

violet garnetBOT

Yep. I’d start with one overview dashboard and one trace drill-down view.

Also, before building panels, add a few dashboard variables:

  • provider
  • model
  • channel
  • lane

I would not use sessionKey or chatId as top-level variables, too high-cardinality. Keep those for traces/logs.
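To see why, a back-of-the-envelope Python sketch: the worst-case number of label combinations is the product of the distinct values per label. The counts in the usage note are made-up illustrations, not measurements from OpenClaw.

```python
from math import prod


def estimate_series(label_values: dict[str, int]) -> int:
    """Upper bound on label combinations: product of distinct values."""
    return prod(label_values.values())
```

With, say, 3 providers, 8 models, 4 channels and 2 lanes, `estimate_series` gives 192 combinations. Add a sessionKey with 5000 distinct values and the same bound becomes 960000, which is why those IDs belong in traces and logs.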

Dashboard 1, OpenClaw Overview

Row 1, KPIs

Use Stat panels for the last 1h / 24h:

  • Messages processed
    • openclaw.message.processed
  • Error count
    • openclaw.webhook.error
    • plus failed openclaw.message.processed by openclaw.outcome
  • Tokens used
    • openclaw.tokens
  • Cost (USD)
    • openclaw.cost.usd
  • p95 run duration
    • openclaw.run.duration_ms

Break down by:

  • openclaw.provider
  • openclaw.model
  • openclaw.channel

Row 2, Model performance

Use Time series panels:

  • Run duration p50 / p95
    • openclaw.run.duration_ms
  • Context size
    • openclaw.context.tokens
  • Tokens over time
    • openclaw.tokens
  • Cost over time
    • openclaw.cost.usd

Useful splits:

  • by provider
  • by model

This row tells you fast if one model is slower, more expensive, or blowing up context.

Row 3, Message pipeline health

Use Time series:

  • Webhook duration
    • openclaw.webhook.duration_ms
  • Message processing duration
    • openclaw.message.duration_ms
  • Queue depth
    • openclaw.queue.depth
  • Queue wait
    • openclaw.queue.wait_ms

Useful splits:

  • openclaw.channel
  • openclaw.lane
  • openclaw.outcome

This is the “am I backing up?” row.

Row 4, Reliability

Use Stacked time series or Bar chart:

  • Processed outcome
    • openclaw.message.processed split by openclaw.outcome
  • Webhook errors
    • openclaw.webhook.error split by channel/webhook
  • Session states
    • openclaw.session.state split by openclaw.state and openclaw.reason
  • Stuck sessions
    • openclaw.session.stuck
    • openclaw.session.stuck_age_ms

This row catches real trouble fast.

Dashboard 2, Trace drill-down

Once traces are flowing, use Tempo / traces view for these span names:

  • openclaw.model.usage
  • openclaw.webhook.processed
  • openclaw.message.processed
  • openclaw.session.stuck

Best filters:

  • service.name = openclaw-gateway
  • openclaw.provider
  • openclaw.model
  • openclaw.channel
  • openclaw.outcome
  • openclaw.reason

What I’d look for first:

  • slow openclaw.model.usage spans by model
  • failed openclaw.message.processed spans by outcome/reason
  • long queue-to-process chains before stuck sessions

Optional logs dashboard, later

Only if you enable OTLP logs.

Panels:

  • Error logs over time
  • Warn logs over time
  • Recent error stream

Filter on:

  • service.name = openclaw-gateway
  • level = error|warn

My recommended first 6 panels

If you want the leanest useful dashboard, build these first:

  1. Messages processed / min
  2. Error rate
  3. Tokens by model
  4. Cost by model
  5. p95 run duration
  6. Queue depth + wait time

That gets you 80% of the value right away.

One note: in Grafana Cloud, OTel metrics often get normalized when exposed through Prometheus-style queries, so the exact final metric names/suffixes may differ a bit. But the logical signals above are the right ones to target.
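As a rough guide to what that normalization tends to look like, here is a Python sketch. It is an approximation: it replaces characters that are invalid in Prometheus names with underscores and optionally appends `_total` for counters. Real backends may also append unit suffixes, and I am assuming (not confirming) that `openclaw.run.duration_ms` is exported as a histogram. Both helper names are hypothetical.

```python
import re


def to_prom_name(otel_name: str, monotonic_counter: bool = False) -> str:
    """Approximate the Prometheus-style name for an OTel metric or attribute."""
    # Anything outside [a-zA-Z0-9_:] becomes an underscore.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # Counters commonly gain a _total suffix (assumption about the backend).
    if monotonic_counter and not name.endswith("_total"):
        name += "_total"
    return name


def p95_query(otel_histogram: str, by_attr: str, window: str = "5m") -> str:
    """Build a PromQL p95 query, assuming the metric is a histogram."""
    metric = to_prom_name(otel_histogram) + "_bucket"
    label = to_prom_name(by_attr)
    return (f"histogram_quantile(0.95, sum by (le, {label}) "
            f"(rate({metric}[{window}])))")
```

For example, `to_prom_name("openclaw.run.duration_ms")` gives `openclaw_run_duration_ms`. Treat the output as a starting guess and confirm the real names in Grafana's metrics browser before saving panels.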