Skip to content

Agent Health Metrics

Counter metrics (sandbox_tool_calls_total, sandbox_edit_bytes_total, …) answer what happened. Agent-health metrics answer whether the agent is making progress: a failing-test streak that never shrinks, a rising tool error rate, or the same tool invocation repeating inside a short window each signal the agent is stuck even when individual tool calls look fine.

The agent-health surface is a strict extension of the Prometheus pipeline: when -metrics-addr is set, the health tracker is constructed automatically and every tool call flows into it through the existing observability middleware. There is no separate enable flag.

All four live on the existing /metrics endpoint, prefixed sandbox_agent_.

MetricTypeLabelsMeaning
sandbox_agent_test_failure_streakgaugeConsecutive run_tests invocations whose failure count did not decrease. Reset on decrease or exit=0.
sandbox_agent_time_since_last_green_secondsgaugeSeconds since the last run_tests / run_lint / run_typecheck that exited 0. Reads 0 on process start so “never green” never looks like “green at epoch”.
sandbox_agent_tool_error_rategaugeErrored tool calls divided by total tool calls over the configured rolling window (default 100 calls). Value is clamped to [0, 1].
sandbox_agent_tool_repetition_totalcountertoolIncrements once per burst when the same (tool, args-hash) pair repeats at least -metrics-tool-repetition-threshold times inside -metrics-tool-repetition-window.

The tool label on the repetition counter is drawn from the closed MCP tool list — no argument value or file path ever becomes a label, so the active series count stays bounded.

Three flags sit next to -metrics-addr in cmd/sandbox/main.go:

Terminal window
codegen-sandbox \
-metrics-addr=:9090 \
-metrics-tool-repetition-window=10m \
-metrics-tool-repetition-threshold=3 \
-metrics-error-rate-window=100
  • -metrics-tool-repetition-window (time.Duration, default 10m) — how far back repeat counts are considered.
  • -metrics-tool-repetition-threshold (int, default 3) — minimum repeats within the window before the counter increments.
  • -metrics-error-rate-window (int, default 100) — size of the rolling outcome buffer feeding the error-rate gauge; expressed as a count of recent tool calls, not a duration.

The gauges initialise to 0 on process start. The time-since-last-green gauge advances every second via a lightweight ticker so dashboards see a smooth line regardless of Prometheus scrape cadence.

sandbox_agent_test_failure_streak climbing steadily

The agent keeps running run_tests but the failing set never shrinks — classic “thrashing” signature. Correlate with sandbox_tool_calls_total{tool="run_tests"} to confirm the streak comes from actual test runs rather than stale state.

sandbox_agent_time_since_last_green_seconds crossing a multi-minute threshold

The agent hasn’t produced a clean verify run in that interval. Combined with elevated sandbox_tool_calls_total{tool="Edit"}, it’s a strong signal the agent is editing without compiling — the right time to interrupt and checkpoint.

sandbox_agent_tool_error_rate sustained > 0.3

Roughly one in three tool calls is erroring. Often correlates with path violations (sandbox_path_violations_total), denylist hits (sandbox_denylist_hits_total), or tool-argument mis-understandings. Alert thresholds depend on the agent’s typical baseline — in steady state, a well-behaved agent stays below ~0.05.

sandbox_agent_tool_repetition_total{tool="Read"} rising fast

The agent is reading the same file with the same arguments repeatedly — usually a prompt-loop symptom. The counter increments once per burst, so a steady rise means many distinct bursts, not one long one.

  • The tracker lives in internal/metrics/health and is wired into internal/server through the existing observabilityRegistrar. Only one Prometheus registry exists.
  • Args hashing is sha256(tool + "\n" + canonicalJSON(args))[:8]. Map keys are sorted before encoding so argument order can’t create spurious duplicates.
  • Every tracker method is safe on a nil receiver, so call sites (server, verify tools) never need a nil-guard.
  • The verify tools (run_tests, run_lint, run_typecheck) call ObserveGreen on clean exits; run_tests additionally calls ObserveTestResult with the parsed failure count so the streak gauge can advance per language-specific parser.