cat1: FINALIZE scorecard (draft 4/5); STATUS + cat-5 NOTES ready for fresh-session handoff
@@ -0,0 +1,35 @@
# Lane notes — Category 5 (Auditability & Observability)

- **Owner:** ztaylor
- **Last live-state check:** —
- **Fixtures:** reuse gateway `zeb-gateway-test` and ref server `arcade-eval-ref` to generate tool-call traffic (see `../../config/targets.yaml`). The ref-server tunnel is ephemeral; re-establish it if it is down.

## Orientation (read before starting)
See `../../LIVE-POC.md`, sections "Observability" and "Known behaviors". Key facts:
- **Logs → ELK** via the Vector daemonset (works today; engine logs visible in Kibana with
  `Tracing.TraceId`/`CorrelationId`/`NetCore.RequestPath`).
- **Metrics → Grafana/Mimir** via the Grafana Agent Operator (ServiceMonitor/PodMonitor scrape →
  `remote_write` to Mimir, tenant `X-Scope-OrgID: k8s-backstage-v4`). **NOT ELK.**
- **Engine OTLP metrics are dropped today:** `arcade-otel-collector:4318` doesn't resolve (no
  collector deployed). Confirmed in Kibana 2026-06-18.
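The third fact above (the collector name does not resolve) can be re-checked quickly before any remediation work. The following is a minimal sketch, not part of the existing tooling: the helper name `resolves` and its injectable `resolver` parameter are hypothetical, and it only tests name resolution, not whether anything accepts OTLP on the port.

```python
import socket
from typing import Callable


def resolves(host: str, port: int, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if `host` resolves to at least one address for `port`.

    `resolver` defaults to socket.getaddrinfo; it is injectable (an
    assumption of this sketch) so the check can be exercised offline.
    """
    try:
        # getaddrinfo raises socket.gaierror when the name cannot be resolved.
        return len(resolver(host, port)) > 0
    except socket.gaierror:
        return False
```

Calling `resolves("arcade-otel-collector", 4318)` is only meaningful from a pod in the `arcade` namespace, because the short service name depends on the cluster DNS search path. A `False` result matches the drop described above; a `True` result means a Service by that name now exists, which is necessary but not sufficient for metrics to flow.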

## Plan (the three signals + admin + residency)
1. **OTEL pipeline health:** run `kubectl -n arcade get svc,deploy,pod | grep -i otel`; check the engine's
   `OTEL_EXPORTER_OTLP_*` env vars and the chart's OTEL collector values. Confirm the drop.
2. **Metrics export remediation (primary objective; do with the user, touches `apps/arcade`):**
   deploy/enable a collector so `arcade-otel-collector:4318` resolves, then bridge into Prometheus/Mimir.
   EITHER (idiomatic) use the collector's `prometheus` exporter on `/metrics` plus a `ServiceMonitor` (label
   `release: prometheus-operator`, NOT `grafana-agent: external`), OR (push) use the `prometheusremotewrite`
   exporter → `http://mimir-nginx.mimir.observability-wus2/api/v1/push` with header `X-Scope-OrgID: k8s-backstage-v4`.
   Then generate tool-call traffic and confirm per-tool/per-user metrics appear in Grafana.
3. **Execution audit (logs):** make tool calls; query ELK for records with user/tool/timestamp/outcome;
   assess field completeness. (Arcade's own audit log covers admin actions only, by design.)
4. **Trace propagation:** send a call with trace context; check that it joins agent→tool (the engine already
   emits TraceId in ELK; test whether OTEL traces export and join).
5. **Admin audit log:** make an admin change (update a gateway); confirm it's logged in Arcade.
6. **Data residency:** confirm no telemetry egresses to Arcade when self-hosted (collector/exporter
   targets are ST-internal only).
7. **InfoSec sign-off (Dane):** gate dependency, not ours to execute; record status.
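For step 4, one way to make the join checkable is to send a call carrying a known trace ID and then search ELK for that same ID under `Tracing.TraceId`. The sketch below builds a W3C Trace Context `traceparent` header value. It assumes the engine honors W3C `traceparent` propagation, which these notes do not confirm; the helper names `new_traceparent` and `trace_id_of` are hypothetical.

```python
import re
import secrets
from typing import Optional, Tuple

# W3C Trace Context, version 00: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>"
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> Tuple[str, str]:
    """Return (trace_id, traceparent header value) with fresh random IDs."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 lowercase hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 lowercase hex chars
    flags = "01" if sampled else "00"  # bit 0 of the flags byte is "sampled"
    return trace_id, f"00-{trace_id}-{span_id}-{flags}"


def trace_id_of(header: str) -> Optional[str]:
    """Extract the trace ID from a traceparent value, or None if it is invalid."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        return None
    # The spec treats all-zero trace IDs and parent IDs as invalid.
    if set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        return None
    return match.group(1)
```

Send the header value as `traceparent` on the tool call and record the returned `trace_id` in the Log section. If the same ID shows up on both the agent-side and tool-side records, propagation joins; if the engine logs a different `Tracing.TraceId`, it is starting a new trace rather than continuing the incoming one.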

## Log
- (start here)