Telemetry¶

Bubo emits OpenTelemetry metrics and traces, so dashboard rollups stay outside the poller. All metrics are namespaced llm_review.* and registered in src/bubo/telemetry/metrics.py.

Emitted metrics¶

Metric	Type	What it counts	Key attributes	Example sample (OTLP)
`llm_review.runs`	counter	One increment per completed review run (a single MR/PR worker exiting).	`repo`, `model`, `status` (`success`/`no_findings`/`failed`/`skipped`), `review_mode` (`poller`/`manual`), `dry_run`	`llm_review.runs{repo="g/r", model="gpt-5.5", status="success", review_mode="poller", dry_run="false"} 1`
`llm_review.findings`	counter	One increment per finding lifecycle event — initial post AND every outcome transition picked up by `--sync-outcomes`.	`repo`, `status` (`planned`/`posted`/`skipped`/`pending_external_id`/`resolved`/`disputed`/`false_positive`/`duplicate`/`deleted`/`developer_replied`), `dry_run`, `finding_type`, `severity`, `category`	`llm_review.findings{repo="g/r", status="posted", severity="blocking", category="correctness"} 1`
`llm_review.tokens`	counter	LLM token consumption per review, split into four streams.	`repo`, `model`, `status`, `review_mode`, `dry_run`, `operation` (`input`/`output`/`cached`/`total`)	`llm_review.tokens{repo="g/r", model="gpt-5.5", operation="total"} 65926`
`llm_review.cost.usd`	counter	Estimated provider cost in USD per review, summed from the configured `[telemetry.pricing.*]` rates.	`repo`, `model`, `status`, `review_mode`, `dry_run`	`llm_review.cost.usd{repo="g/r", model="gpt-5.5"} 0.32963`
`llm_review.failures`	counter	One increment per failed pipeline stage.	`repo`, `error_type` (Python exception class), `operation` (`review` for a failed worker, `outcome_sync` for a failed `--sync-outcomes` fetch)	`llm_review.failures{repo="g/r", operation="outcome_sync", error_type="HTTPError"} 1`
`llm_review.latency.review_seconds`	histogram	Per-review wall-clock from worker start to finish.	`repo`, `model`, `status`, `review_mode`, `dry_run`	`llm_review.latency.review_seconds_sum{repo="g/r"} 412.3` `llm_review.latency.review_seconds_count{repo="g/r"} 3`
`llm_review.latency.queue_seconds`	histogram	Time from when a job is written to the queue to when a worker picks it up — your saturation signal.	`repo`	`llm_review.latency.queue_seconds_sum{repo="g/r"} 6.1` `llm_review.latency.queue_seconds_count{repo="g/r"} 3`

Two switches in config/env.toml let you trim emission cost:

[telemetry].emit_finding_events — drop the per-finding counter entirely if you only care about run-level rollups.
[telemetry].emit_outcome_sync — drop the outcome-derived increments on llm_review.findings (resolved/disputed/false_positive/etc.) and keep only the initial-post events.

Common dashboards¶

Throughput: rate(llm_review.runs[5m]) grouped by repo + status.
Saturation: histogram_quantile(0.95, llm_review.latency.queue_seconds_bucket) — if this climbs, raise max_merge_requests_per_poll.
Cost: sum(rate(llm_review.cost.usd[1d])) by (repo, model).
Quality / ROI: ratio of llm_review.findings{status="resolved"} over llm_review.findings{status="posted"}, grouped by severity.
Reliability: rate(llm_review.failures[5m]) broken out by operation.

Tracing — the per-review span tree¶

Each review is one trace. The llm_review.run span has a child span per stage, so a trace shows exactly where a review's wall-clock went (a slow review is almost always the agent stage, not Bubo):

llm_review.run                      repo, mr_iid, sha, run_id
├─ llm_review.checkout              repo, sha
├─ llm_review.provenance            repo            (only when governance is on)
├─ llm_review.agent                 repo, model, tokens_input/output/total,
│                                   cost_usd, exit_code
└─ llm_review.post                  repo, dry_run, findings_posted/planned/skipped

The attributes ride on the spans, so any OpenTelemetry-aware backend (Tempo, Jaeger, Honeycomb, Grafana, Datadog, …) can break latency, tokens, and cost down by stage and by model without Bubo shipping a dashboard of its own — Bubo emits the data; you bring the tool. Spans are emitted only when [telemetry].otlp_endpoint is set; otherwise this is a no-op.

Cardinality discipline¶

Metric attributes are kept low-cardinality on purpose. MR IID, SHA, file path, line number, fingerprint, and discussion ID live in SQLite or span attributes/events only, never as metric labels — so the dashboard backend doesn't melt as the queue accelerates. Spans carry the full per-MR context for trace-level drilldown.