# Lane notes: Category 5 (Auditability & Observability)

- **Owner:** ztaylor
- **Last live-state check:** (none yet)
- **Fixtures:** reuse gateway `zeb-gateway-test` + ref server `arcade-eval-ref` for generating tool-call traffic (see `../../config/targets.yaml`; the ref-server tunnel is ephemeral, so re-establish it if down).

## Orientation (read before starting)

`../../LIVE-POC.md` → "Observability" + "Known behaviors". Key facts:

- **Logs → ELK** via the Vector daemonset (works today; engine logs visible in Kibana with `Tracing.TraceId`/`CorrelationId`/`NetCore.RequestPath`).
- **Metrics → Grafana/Mimir** via the Grafana Agent Operator (ServiceMonitor/PodMonitor scrape → remote_write to Mimir, tenant `X-Scope-OrgID: k8s-backstage-v4`). **NOT ELK.**
- **Engine OTLP metrics are dropped today**: `arcade-otel-collector:4318` doesn't resolve (no collector deployed). Confirmed in Kibana 2026-06-18.

## Plan (the three signals + admin + residency)

1. **OTEL pipeline health**: `kubectl -n arcade get svc,deploy,pod | grep -i otel`; check engine `OTEL_EXPORTER_OTLP_*` env + chart OTEL collector values. Confirm the drop.
2. **Metrics export remediation (primary objective; with the user, since it touches `apps/arcade`)**: deploy/enable a collector so `arcade-otel-collector:4318` resolves, then bridge into Prometheus/Mimir by one of:
   - (idiomatic) collector `prometheus` exporter `/metrics` + a `ServiceMonitor` (label `release: prometheus-operator`, NOT `grafana-agent: external`), or
   - (push) `prometheusremotewrite` exporter → `http://mimir-nginx.mimir.observability-wus2/api/v1/push` + `X-Scope-OrgID: k8s-backstage-v4`.

   Then generate tool-call traffic and confirm per-tool/per-user metrics appear in Grafana.
3. **Execution audit (logs)**: make tool calls; query ELK for records with user/tool/ts/outcome; assess field completeness. (Arcade's own audit log covers admin actions only, by design.)
4. **Trace propagation**: send a call with trace context; check it joins agent→tool (engine already emits TraceId in ELK; test whether OTEL traces export + join).
5. **Admin audit log**: make an admin change (update a gateway); confirm it's logged in Arcade.
6. **Data residency**: confirm no telemetry egresses to Arcade when self-hosted (collector/exporter targets ST-internal only).
7. **InfoSec sign-off (Dane)**: gate dependency, not ours to execute; record status.

## Log

- (start here)
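
## Sketch: field completeness for plan step 3

Step 3 of the plan asks for an assessment of field completeness across user/tool/ts/outcome. The following is a minimal sketch of one way to score that once records have been exported from ELK as a list of dicts. The key names (`user`, `tool`, `ts`, `outcome`), the helper names, and the sample records are all hypothetical; the real documents may nest these values under other keys, in which case `REQUIRED_FIELDS` and the lookup would need adjusting.

```python
from collections import Counter

# Hypothetical field names: the real ELK documents may use different keys.
REQUIRED_FIELDS = ("user", "tool", "ts", "outcome")


def is_missing(value):
    """Treat an absent key (None) or an empty string as missing."""
    return value is None or value == ""


def field_completeness(records, required=REQUIRED_FIELDS):
    """Return {field: fraction of records carrying a non-empty value}."""
    present = Counter()
    for record in records:
        for field in required:
            if not is_missing(record.get(field)):
                present[field] += 1
    total = len(records)
    return {f: (present[f] / total if total else 0.0) for f in required}


def incomplete_records(records, required=REQUIRED_FIELDS):
    """Return the records that lack at least one required field."""
    return [
        r for r in records
        if any(is_missing(r.get(f)) for f in required)
    ]


# Made-up sample records, for illustration only.
sample = [
    {"user": "u1", "tool": "t1", "ts": "2026-06-18T10:00:00Z", "outcome": "ok"},
    {"user": "u2", "tool": "t1", "ts": "2026-06-18T10:01:00Z"},
    {"user": "", "tool": "t2", "ts": "2026-06-18T10:02:00Z", "outcome": "error"},
]

coverage = field_completeness(sample)
gaps = incomplete_records(sample)
```

Reporting a per-field fraction rather than a single pass/fail makes it easier to record in the log which of the four fields is the weak one; `gaps` keeps the offending records at hand for follow-up.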