Monitoring — Prometheus + Grafana

The sandbox exposes a Prometheus surface on an optional listener. This page is the operator-facing end-to-end guide: get Prometheus scraping the sandbox, import the two Grafana dashboards shipped in the repo, and act on what they show.

  • Start the sandbox with -metrics-addr=:9090 (or any port) to turn on the /metrics listener.
  • Point a Prometheus scrape config at that port.
  • Import deploy/grafana/agent-health.json + deploy/grafana/agent-arena.json into Grafana; pick your Prometheus datasource.

Full setup below.

codegen-sandbox \
  -addr=:8080 \
  -metrics-addr=:9090 \
  -workspace=/workspace

See Metrics for every knob the metrics listener accepts, and for what each emitted metric family means.

Minimal static-config scrape:

# prometheus.yml
scrape_configs:
  - job_name: codegen-sandbox
    scrape_interval: 30s
    static_configs:
      - targets: ["codegen-sandbox:9090"]

Kubernetes / PodMonitor:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: codegen-sandbox
spec:
  selector:
    matchLabels:
      app: codegen-sandbox
  podMetricsEndpoints:
    - port: metrics  # matches the containerPort named "metrics" on 9090
      interval: 30s

The /metrics listener is unauthenticated on purpose — restrict access at the network layer (firewall / NetworkPolicy / mesh). See Metrics § Enabling the endpoint for the rationale.
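
Before pointing Prometheus at the listener, it can help to confirm that it answers and to see which metric families it emits. The following is a minimal sketch, not part of the sandbox: it assumes the listener is reachable at http://localhost:9090/metrics (a hypothetical address) and serves the standard Prometheus text exposition format. The helper names `fetch_metrics` and `metric_families` are illustrative.

```python
import urllib.request


def metric_families(exposition: str) -> set[str]:
    """Collect family names from '# TYPE <name> <type>' lines."""
    names = set()
    for line in exposition.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            names.add(parts[2])
    return names


def fetch_metrics(url: str = "http://localhost:9090/metrics",
                  timeout: float = 5.0) -> str:
    """Fetch the raw exposition text. The default URL is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (requires a running sandbox started with -metrics-addr=:9090):
#   for name in sorted(metric_families(fetch_metrics())):
#       print(name)
```

If the call fails with a connection error, check that the sandbox was started with -metrics-addr and that the network-layer restrictions allow your host.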

Two JSON files live under deploy/grafana/:

| File | Audience | Purpose |
| --- | --- | --- |
| agent-health.json | Operators / on-call | “Is my agent having a bad day?” |
| agent-arena.json | Sales demos, leadership, curiosity | “Look what’s happening right now” |

In Grafana: Dashboards → New → Import. Upload the JSON (or paste its contents), then pick your Prometheus datasource for the DS_PROMETHEUS variable.

Both dashboards target Grafana 11.x (schema version 39) and use templated datasource variables so the same JSON imports cleanly across environments.

Full import recipes (including curl-based API import and dashboard-as-code via grafana-cli) are in the dashboard README.
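
For scripted imports, one possible shape is sketched below. It assumes Grafana's dashboard import endpoint (POST /api/dashboards/import) and a service-account token, and it fills the DS_PROMETHEUS input with a datasource identifier you supply. Whether your Grafana version expects the datasource UID or its name there is an assumption to verify. The function names are illustrative, and the dashboard README remains the authoritative recipe.

```python
import json
import urllib.request


def build_import_payload(dashboard: dict, datasource: str) -> dict:
    """Build the request body, wiring DS_PROMETHEUS to the given datasource."""
    return {
        "dashboard": dashboard,
        "overwrite": True,
        "inputs": [
            {
                "name": "DS_PROMETHEUS",
                "type": "datasource",
                "pluginId": "prometheus",
                "value": datasource,  # assumption: UID or name, version-dependent
            }
        ],
    }


def import_dashboard(grafana_url: str, token: str, path: str, datasource: str) -> dict:
    """POST one dashboard JSON file to Grafana and return the decoded response."""
    with open(path, encoding="utf-8") as fh:
        dashboard = json.load(fh)
    body = json.dumps(build_import_payload(dashboard, datasource)).encode("utf-8")
    req = urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/dashboards/import",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Usage (hypothetical URL, token, and datasource identifier):
#   for f in ("deploy/grafana/agent-health.json", "deploy/grafana/agent-arena.json"):
#       import_dashboard("http://grafana:3000", "<token>", f, "<datasource>")
```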

Each panel in agent-health.json maps to a specific on-call action:

| Panel | What it tells you | Suggested action |
| --- | --- | --- |
| Time since last green | Seconds since the last clean run_tests / run_lint / run_typecheck. | > 5 min: check the agent transcript. > 30 min: agent is probably stuck in a fix-one-thing-break-two cycle. |
| Tool error rate | Errored tool calls / total, rolling window (default 100 calls). | > 5%: check which tool is failing on the per-tool timeseries. > 20%: the agent is flailing; interrupt it. |
| Test-failure streak | Consecutive run_tests whose failure count didn’t decrease. | > 3: agent isn’t making progress; > 6: probable thrash. Reset via a snapshot_restore or agent interrupt. |
| Tool latency p95 / p99 | Per-tool latency histogram. | p95 > 5s on run_tests is normal; p95 > 5s on Read / Write / Edit is a signal (disk / network issue). |
| Denylist hits (last 1h) | Bash commands rejected by the denylist, grouped by matched token (sudo, mkfs, …). | Any hit is worth a look. Pattern of hits = check the agent prompt; the sandbox is holding the line but the agent is probing. |
| Scrub hits (last 1h) | Scrub-middleware matches by pattern. | Pattern spikes = content with that shape is flowing through. Good signal that scrub is earning its keep. |
| Path-containment violations | Rejected writes / reads that resolved outside the workspace. | Non-zero = the agent is trying to escape the sandbox. Investigate. |
| Workspace size | Bytes in the workspace volume (excluding .git / node_modules). | Unbounded growth = stuck build loop, leaked artefacts, or agent writing the same file over and over. |
| Tool-repetition bursts | (tool, args) tuples seen more than N times in a window. | Any entry = “ping-pong” signal: the agent is calling the same thing repeatedly. Check the targeted file. |
| Bash exit mix | Bash foreground exit codes bucketed. | Dominant non-zero = builds / tests are failing; a sustained exit=124 (timeout) row = runaway commands. |
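
The thresholds in the table can also be applied mechanically, for example in a chat-ops helper. The sketch below encodes four of the rows as plain comparisons. `HealthSnapshot` and `triage` are hypothetical names, and the inputs are assumed to be values you have already read off the panels or queried yourself; nothing here talks to Prometheus.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    """Hypothetical container for values read off the agent-health panels."""
    seconds_since_green: float
    tool_error_rate: float      # fraction in [0, 1], not a percentage
    test_failure_streak: int
    path_violations: int


def triage(s: HealthSnapshot) -> list[str]:
    """Return the suggested actions whose thresholds are exceeded."""
    actions = []
    if s.seconds_since_green > 30 * 60:
        actions.append("time since green > 30 min: agent is probably stuck")
    elif s.seconds_since_green > 5 * 60:
        actions.append("time since green > 5 min: check the agent transcript")
    if s.tool_error_rate > 0.20:
        actions.append("error rate > 20%: interrupt the agent")
    elif s.tool_error_rate > 0.05:
        actions.append("error rate > 5%: check the per-tool timeseries")
    if s.test_failure_streak > 6:
        actions.append("failure streak > 6: probable thrash, restore or interrupt")
    elif s.test_failure_streak > 3:
        actions.append("failure streak > 3: agent isn't making progress")
    if s.path_violations > 0:
        actions.append("path-containment violations: investigate")
    return actions
```

Each panel reports at most one action, the most severe one that applies, so a snapshot that trips every panel yields four entries.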

agent-arena.json is built for demos and curiosity, not alerting. Its panels show:

  • Vibe (stat, huge, green ↔ red) — (1 − error_rate) × (1 − clamp(time_since_green / 600, 0, 1)). Vibes-based by design; owns its unseriousness.
  • Sessions online, characters/minute vs average human typing speed (overlaid reference line at ~400 cpm), words per minute (CPM ÷ 5).
  • Tools per minute (stacked) — “busyness” chart showing which tools the agent is leaning on.
  • Read : write ratio — the “measure twice, cut once” index. Higher = more deliberation per byte changed.
  • Dangerous-command attempts blocked — lifetime denylist trophy case.
  • Language of the day — donut chart of sandbox_tool_calls_total{language} over the last hour.
  • Scrub leaderboard — most-caught secret type.
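
To make the Vibe formula and the words-per-minute conversion concrete, here is a small worked sketch. It assumes error_rate is a fraction in [0, 1] and time_since_green is in seconds, which is how the 600 divisor reads. The function names are illustrative and are not taken from the dashboard JSON.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def vibe(error_rate: float, time_since_green_s: float) -> float:
    """(1 - error_rate) * (1 - clamp(time_since_green / 600, 0, 1))."""
    staleness = clamp(time_since_green_s / 600, 0, 1)
    return (1 - error_rate) * (1 - staleness)


def words_per_minute(chars_per_minute: float) -> float:
    """The dashboard's conversion: CPM divided by 5."""
    return chars_per_minute / 5


# A clean agent that just went green scores 1.0. The score reaches 0 once
# ten minutes (600 s) pass without a green run, whatever the error rate is.
print(vibe(0.0, 0))      # 1.0
print(vibe(0.0, 900))    # 0.0
print(words_per_minute(400))  # 80.0, the reference line expressed in WPM
```

Because the two factors are multiplied, either a high error rate or a long time since green is enough to pull the stat toward red on its own.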

What the dashboards deliberately don’t cover

Some panels from the original #29 sketch depend on metrics the sandbox doesn’t yet expose:

  • Per-session breakouts (top 10 sessions by time-since-green, ping-pong by file). Today’s agent-health gauges are process-scoped, not session-scoped.
  • Most-grepped string / most-read file / largest write bounded-label trackers.
  • Agent personality (avg function length via tree-sitter, comment density).

They’re parked for follow-up instrumentation rather than baked in as broken queries. See the dashboard README for the full list.

  • Metrics — metric inventory, scrape configuration.
  • Agent health — the thinking behind sandbox_agent_* gauges.
  • Tracing — OpenTelemetry complement to the Prometheus surface.