Agent Health Metrics
Counter metrics (sandbox_tool_calls_total, sandbox_edit_bytes_total, …) answer what happened. Agent-health metrics answer whether the agent is making progress: a failing-test streak that never shrinks, a rising tool error rate, or the same tool invocation repeating inside a short window each signal the agent is stuck even when individual tool calls look fine.
The agent-health surface is a strict extension of the Prometheus pipeline: when -metrics-addr is set, the health tracker is constructed automatically and every tool call flows into it through the existing observability middleware. There is no separate enable flag.
Emitted metrics
Section titled “Emitted metrics”All four live on the existing /metrics endpoint, prefixed sandbox_agent_.
| Metric | Type | Labels | Meaning |
|---|---|---|---|
sandbox_agent_test_failure_streak | gauge | — | Consecutive run_tests invocations whose failure count did not decrease. Reset on decrease or exit=0. |
sandbox_agent_time_since_last_green_seconds | gauge | — | Seconds since the last run_tests / run_lint / run_typecheck that exited 0. Reads 0 on process start so “never green” never looks like “green at epoch”. |
sandbox_agent_tool_error_rate | gauge | — | Errored tool calls divided by total tool calls over the configured rolling window (default 100 calls). Value is clamped to [0, 1]. |
sandbox_agent_tool_repetition_total | counter | tool | Increments once per burst when the same (tool, args-hash) pair repeats at least -metrics-tool-repetition-threshold times inside -metrics-tool-repetition-window. |
The tool label on the repetition counter is drawn from the closed MCP tool list — no argument value or file path ever becomes a label, so the active series count stays bounded.
Tunables
Section titled “Tunables”Three flags sit next to -metrics-addr in cmd/sandbox/main.go:
codegen-sandbox \ -metrics-addr=:9090 \ -metrics-tool-repetition-window=10m \ -metrics-tool-repetition-threshold=3 \ -metrics-error-rate-window=100-metrics-tool-repetition-window(time.Duration, default10m) — how far back repeat counts are considered.-metrics-tool-repetition-threshold(int, default3) — minimum repeats within the window before the counter increments.-metrics-error-rate-window(int, default100) — size of the rolling outcome buffer feeding the error-rate gauge; expressed as a count of recent tool calls, not a duration.
The gauges initialise to 0 on process start. The time-since-last-green gauge advances every second via a lightweight ticker so dashboards see a smooth line regardless of Prometheus scrape cadence.
Spike patterns to watch
Section titled “Spike patterns to watch”sandbox_agent_test_failure_streak climbing steadily
The agent keeps running
run_testsbut the failing set never shrinks — classic “thrashing” signature. Correlate withsandbox_tool_calls_total{tool="run_tests"}to confirm the streak comes from actual test runs rather than stale state.
sandbox_agent_time_since_last_green_seconds crossing a multi-minute threshold
The agent hasn’t produced a clean verify run in that interval. Combined with elevated
sandbox_tool_calls_total{tool="Edit"}, it’s a strong signal the agent is editing without compiling — the right time to interrupt and checkpoint.
sandbox_agent_tool_error_rate sustained > 0.3
Roughly one in three tool calls is erroring. Often correlates with path violations (
sandbox_path_violations_total), denylist hits (sandbox_denylist_hits_total), or tool-argument mis-understandings. Alert thresholds depend on the agent’s typical baseline — in steady state, a well-behaved agent stays below ~0.05.
sandbox_agent_tool_repetition_total{tool="Read"} rising fast
The agent is reading the same file with the same arguments repeatedly — usually a prompt-loop symptom. The counter increments once per burst, so a steady rise means many distinct bursts, not one long one.
Implementation notes
Section titled “Implementation notes”- The tracker lives in
internal/metrics/healthand is wired intointernal/serverthrough the existingobservabilityRegistrar. Only one Prometheus registry exists. - Args hashing is
sha256(tool + "\n" + canonicalJSON(args))[:8]. Map keys are sorted before encoding so argument order can’t create spurious duplicates. - Every tracker method is safe on a nil receiver, so call sites (server, verify tools) never need a nil-guard.
- The verify tools (
run_tests,run_lint,run_typecheck) callObserveGreenon clean exits;run_testsadditionally callsObserveTestResultwith the parsed failure count so the streak gauge can advance per language-specific parser.