cat1: FINALIZE scorecard (draft 4/5); STATUS + cat-5 NOTES ready for fresh-session handoff

2026-06-22 09:55:01 -04:00
parent 8b48f5813e
commit 53f960409e
5 changed files with 95 additions and 24 deletions
+35
@@ -0,0 +1,35 @@
# Lane notes — Category 5 (Auditability & Observability)
- **Owner:** ztaylor
- **Last live-state check:** —
- **Fixtures:** reuse gateway `zeb-gateway-test` + ref server `arcade-eval-ref` for generating tool-call traffic (see `../../config/targets.yaml`; ref-server tunnel is ephemeral — re-establish if down).
## Orientation (read before starting)
`../../LIVE-POC.md` → "Observability" + "Known behaviors". Key facts:
- **Logs → ELK** via the Vector daemonset (works today; engine logs visible in Kibana with
`Tracing.TraceId`/`CorrelationId`/`NetCore.RequestPath`).
- **Metrics → Grafana/Mimir** via the Grafana Agent Operator (ServiceMonitor/PodMonitor scrape →
remote_write to Mimir, tenant `X-Scope-OrgID: k8s-backstage-v4`). **NOT ELK.**
- **Engine OTLP metrics are dropped today** — `arcade-otel-collector:4318` doesn't resolve (no
collector deployed). Confirmed in Kibana 2026-06-18.
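The unresolved collector endpoint can be re-checked with a short name-resolution probe. This is a minimal sketch, assuming it runs from a pod in the engine's namespace (the short name `arcade-otel-collector` only resolves through cluster DNS) and that Python is available there. `resolves` is a hypothetical helper, not part of Arcade.

```python
import socket


def resolves(host: str, port: int) -> bool:
    """Return True if `host` resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)) > 0
    except socket.gaierror:
        # Name lookup failed: no such service/record from this vantage point.
        return False


if __name__ == "__main__":
    # Host and port are taken from the note above. Run from a pod in the
    # engine's namespace; elsewhere the short service name cannot resolve.
    host, port = "arcade-otel-collector", 4318
    print(f"{host}:{port} resolves: {resolves(host, port)}")
```

A `False` result only shows that the name lookup fails from where the probe ran; it does not by itself prove that no collector is deployed.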
## Plan (the three signals + admin + residency)
1. **OTEL pipeline health** — `kubectl -n arcade get svc,deploy,pod | grep -i otel`; check engine
`OTEL_EXPORTER_OTLP_*` env + chart OTEL collector values. Confirm the drop.
2. **Metrics export remediation (primary objective; with the user — touches `apps/arcade`)** —
deploy/enable a collector so `arcade-otel-collector:4318` resolves, then bridge into Prometheus/Mimir:
EITHER (idiomatic) collector `prometheus` exporter `/metrics` + a `ServiceMonitor` (label
`release: prometheus-operator`, NOT `grafana-agent: external`), OR (push) `prometheusremotewrite`
exporter → `http://mimir-nginx.mimir.observability-wus2/api/v1/push` + `X-Scope-OrgID: k8s-backstage-v4`.
Then generate tool-call traffic and confirm per-tool/per-user metrics appear in Grafana.
3. **Execution audit (logs)** — make tool calls; query ELK for records with user/tool/ts/outcome;
assess field completeness. (Arcade's own audit log covers admin actions only, by design.)
4. **Trace propagation** — send a call with trace context; check it joins agent→tool (engine already
emits TraceId in ELK; test whether OTEL traces export + join).
5. **Admin audit log** — make an admin change (update a gateway); confirm it's logged in Arcade.
6. **Data residency** — confirm no telemetry egresses to Arcade when self-hosted (collector/exporter
targets ST-internal only).
7. **InfoSec sign-off (Dane)** — gate dependency, not ours to execute; record status.
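
Steps 3 and 4 of the plan can be sketched in code. The block below builds a W3C Trace Context `traceparent` header to attach to a test tool call, and scores how completely a batch of log records carries the audit fields. It rests on two assumptions: that the engine honours W3C `traceparent` (not confirmed in these notes), and that the ELK documents expose fields named `user`, `tool`, `ts`, and `outcome`. Those field names, both helper functions, and the sample records are hypothetical; swap in whatever the real records use.

```python
from __future__ import annotations

import secrets
from typing import Iterable, Mapping

# Hypothetical names for the user/tool/ts/outcome audit fields;
# substitute the real ELK field names once they are known.
REQUIRED_FIELDS = ("user", "tool", "ts", "outcome")


def make_traceparent() -> tuple[str, str]:
    """Return (trace_id, traceparent header value) in W3C Trace Context form.

    Layout: version "00", 16-byte trace-id, 8-byte parent-id, flags "01"
    (sampled), each as lowercase hex, joined by "-".
    """
    trace_id = secrets.token_hex(16)
    parent_id = secrets.token_hex(8)
    return trace_id, f"00-{trace_id}-{parent_id}-01"


def field_completeness(
    records: Iterable[Mapping[str, object]],
    required: tuple[str, ...] = REQUIRED_FIELDS,
) -> dict[str, float]:
    """Return, per required field, the fraction of records with a non-empty value."""
    rows = list(records)
    if not rows:
        return {name: 0.0 for name in required}
    return {
        name: sum(1 for r in rows if r.get(name) not in (None, "")) / len(rows)
        for name in required
    }


if __name__ == "__main__":
    trace_id, header = make_traceparent()
    # Send `header` as the `traceparent` request header on a tool call,
    # then search the logs for `trace_id`.
    print("traceparent:", header)

    # Made-up records for illustration only, not real log output.
    sample = [
        {"user": "u1", "tool": "t1", "ts": "2026-06-22T13:55:01Z", "outcome": "ok"},
        {"user": "u1", "tool": "t1", "ts": "2026-06-22T13:55:02Z"},
    ]
    print(field_completeness(sample))
```

Any field scoring below 1.0 is a candidate gap to record in the log section; whether the generated trace ID shows up under `Tracing.TraceId` depends on how the engine maps incoming trace context, which the test itself should establish.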
## Log
- (start here)