Lane notes — Category 5 (Auditability & Observability)

  • Owner: ztaylor
  • Last live-state check:
  • Fixtures: reuse gateway zeb-gateway-test + ref server arcade-eval-ref for generating tool-call traffic (see ../../config/targets.yaml; ref-server tunnel is ephemeral — re-establish if down).

Orientation (read before starting)

../../LIVE-POC.md → "Observability" + "Known behaviors". Key facts:

  • Logs → ELK via the Vector daemonset (works today; engine logs visible in Kibana with Tracing.TraceId/CorrelationId/NetCore.RequestPath).
  • Metrics → Grafana/Mimir via the Grafana Agent Operator (ServiceMonitor/PodMonitor scrape → remote_write to Mimir, tenant X-Scope-OrgID: k8s-backstage-v4). NOT ELK.
  • Engine OTLP metrics are dropped today: arcade-otel-collector:4318 doesn't resolve (no collector deployed). Confirmed in Kibana 2026-06-18.
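
A minimal sketch for re-checking the "doesn't resolve" fact without going through Kibana. It assumes Python is available in a pod in the arcade namespace (so the short service name goes through the cluster DNS search path); the function name `resolves` and the injectable `lookup` parameter are illustrative, not part of any Arcade tooling.

```python
import socket


def resolves(host: str, port: int = 4318, lookup=socket.getaddrinfo) -> bool:
    """Return True if `host` resolves to at least one address.

    `lookup` defaults to the real resolver; it is injectable only so the
    check can be exercised without network access (an assumption of this
    sketch, not something the notes require).
    """
    try:
        # getaddrinfo raises socket.gaierror when the name cannot be resolved.
        return len(lookup(host, port, proto=socket.IPPROTO_TCP)) > 0
    except socket.gaierror:
        return False
```

Calling `resolves("arcade-otel-collector")` from inside the namespace should return False while no collector Service exists, and True once one is deployed. A True result only shows the name resolves; it does not show that anything is listening on 4318.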

Plan (the three signals + admin + residency)

  1. OTEL pipeline health: kubectl -n arcade get svc,deploy,pod | grep -i otel; check engine OTEL_EXPORTER_OTLP_* env + chart OTEL collector values. Confirm the drop.
  2. Metrics export remediation (primary objective; with the user, since this touches apps/arcade): deploy/enable a collector so arcade-otel-collector:4318 resolves, then bridge into Prometheus/Mimir one of two ways:
       • Pull (idiomatic): collector prometheus exporter /metrics + a ServiceMonitor (label release: prometheus-operator, NOT grafana-agent: external).
       • Push: prometheusremotewrite exporter → http://mimir-nginx.mimir.observability-wus2/api/v1/push + X-Scope-OrgID: k8s-backstage-v4.
     Then generate tool-call traffic and confirm per-tool/per-user metrics appear in Grafana.
  3. Execution audit (logs) — make tool calls; query ELK for records with user/tool/ts/outcome; assess field completeness. (Arcade's own audit log covers admin actions only, by design.)
  4. Trace propagation — send a call with trace context; check it joins agent→tool (engine already emits TraceId in ELK; test whether OTEL traces export + join).
  5. Admin audit log — make an admin change (update a gateway); confirm it's logged in Arcade.
  6. Data residency — confirm no telemetry egresses to Arcade when self-hosted (collector/exporter targets ST-internal only).
  7. InfoSec sign-off (Dane) — gate dependency, not ours to execute; record status.
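
For step 4, a known trace ID makes the join easy to verify: generate the ID up front, send it with the tool call, then search ELK for that exact value. The sketch below builds a W3C Trace Context `traceparent` header (version `00`, 16-byte trace ID, 8-byte parent span ID, flags). That the engine honors an incoming `traceparent` and surfaces the same ID as Tracing.TraceId is an assumption to test, not a confirmed behavior; the helper names are illustrative.

```python
import secrets


def _nonzero_hex(nbytes: int) -> str:
    """Random lowercase hex of `nbytes` bytes; all-zero IDs are invalid in W3C Trace Context."""
    while True:
        value = secrets.token_hex(nbytes)
        if int(value, 16) != 0:
            return value


def make_traceparent(sampled: bool = True) -> tuple[str, str]:
    """Return (trace_id, traceparent header value).

    Format: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>,
    where flags 01 marks the trace as sampled.
    """
    trace_id = _nonzero_hex(16)
    parent_id = _nonzero_hex(8)
    flags = "01" if sampled else "00"
    return trace_id, f"00-{trace_id}-{parent_id}-{flags}"
```

Send the second element as the `traceparent` request header on the tool call, record the first element in the Log section, and query Kibana for it. If the engine log line carries the same ID but no spans show up downstream, that points at trace export rather than propagation.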

Log

  • (start here)