Lane notes — Category 5 (Auditability & Observability)
- Owner: ztaylor
- Last live-state check: —
- Fixtures: reuse gateway zeb-gateway-test + ref server arcade-eval-ref for generating tool-call traffic (see ../../config/targets.yaml; ref-server tunnel is ephemeral — re-establish if down).
Orientation (read before starting)
../../LIVE-POC.md → "Observability" + "Known behaviors". Key facts:
- Logs → ELK via the Vector daemonset (works today; engine logs visible in Kibana with Tracing.TraceId / CorrelationId / NetCore.RequestPath).
- Metrics → Grafana/Mimir via the Grafana Agent Operator (ServiceMonitor/PodMonitor scrape → remote_write to Mimir, tenant X-Scope-OrgID: k8s-backstage-v4). NOT ELK.
- Engine OTLP metrics are dropped today — arcade-otel-collector:4318 doesn't resolve (no collector deployed). Confirmed in Kibana 2026-06-18.
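The "doesn't resolve" finding can be re-checked from inside the cluster with a plain DNS lookup. The sketch below is only an illustration and is not part of the engine or the chart: the helper name `resolves` is hypothetical, and it assumes the script runs in a pod in the arcade namespace, where the short service name would resolve if a collector Service existed.

```python
import socket


def resolves(host: str, port: int, resolver=socket.getaddrinfo) -> bool:
    """Return True if host:port resolves to at least one address.

    `resolver` is injectable so the check can be exercised without DNS.
    """
    try:
        return len(resolver(host, port, proto=socket.IPPROTO_TCP)) > 0
    except socket.gaierror:
        # Name lookup failed: no Service (or no DNS record) behind this name.
        return False


if __name__ == "__main__":
    # Assumption: run from a pod in the arcade namespace, so the short
    # service name is resolvable when a collector Service is deployed.
    print(resolves("arcade-otel-collector", 4318))
```

A False result matches the drop recorded above; a True result only shows that the name resolves, not that the collector accepts OTLP on that port.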
Plan (the three signals + admin + residency)
- OTEL pipeline health — kubectl -n arcade get svc,deploy,pod | grep -i otel; check engine OTEL_EXPORTER_OTLP_* env + chart OTEL collector values. Confirm the drop.
- Metrics export remediation (primary objective; with the user — touches apps/arcade) — deploy/enable a collector so arcade-otel-collector:4318 resolves, then bridge into Prometheus/Mimir: EITHER (idiomatic) collector prometheus exporter /metrics + a ServiceMonitor (label release: prometheus-operator, NOT grafana-agent: external), OR (push) prometheusremotewrite exporter → http://mimir-nginx.mimir.observability-wus2/api/v1/push + X-Scope-OrgID: k8s-backstage-v4. Then generate tool-call traffic and confirm per-tool/per-user metrics appear in Grafana.
- Execution audit (logs) — make tool calls; query ELK for records with user/tool/ts/outcome; assess field completeness. (Arcade's own audit log covers admin actions only, by design.)
- Trace propagation — send a call with trace context; check it joins agent→tool (engine already emits TraceId in ELK; test whether OTEL traces export + join).
- Admin audit log — make an admin change (update a gateway); confirm it's logged in Arcade.
- Data residency — confirm no telemetry egresses to Arcade when self-hosted (collector/exporter targets ST-internal only).
- InfoSec sign-off (Dane) — gate dependency, not ours to execute; record status.
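For the execution-audit step, field completeness can be scored once the ELK hits are exported as JSON documents. This is a minimal sketch under stated assumptions: the flat field names user, tool, ts and outcome are placeholders taken from the plan wording, not confirmed ELK field names, and the function name `field_completeness` is hypothetical. Map the real index fields in before relying on the numbers.

```python
# Placeholder field names from the plan; replace with the real ELK fields.
REQUIRED_FIELDS = ("user", "tool", "ts", "outcome")


def field_completeness(records, required=REQUIRED_FIELDS):
    """Return {field: fraction of records carrying a non-empty value}."""
    total = len(records)
    if total == 0:
        # No records at all: report zero coverage instead of dividing by zero.
        return {field: 0.0 for field in required}
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in required
    }


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"user": "u1", "tool": "t1", "ts": "2026-01-01T00:00:00Z", "outcome": "ok"},
        {"user": "u2", "tool": "", "ts": "2026-01-01T00:00:05Z", "outcome": None},
    ]
    for field, ratio in field_completeness(sample).items():
        print(f"{field}: {ratio:.0%}")
```

Any field scoring below 100% points at tool calls whose audit record would not answer who called what, when, and with what result.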
Log
- (start here)