docs: _TEMPLATE + all-10 criteria-section stubs (verbatim criteria)

2026-06-18 10:10:17 -04:00
parent 29c5b2c8be
commit 593e1e63b6
13 changed files with 510 additions and 0 deletions
@@ -0,0 +1,54 @@
+# Category 5 — Auditability and Observability (weight 12)
+
+> Verbatim criteria/gates from the criteria Google Doc. Fill in Score/Evidence locally; **the human
+> pastes** the results back into the Doc. Scale is 1–5, with anchors at 1/3/5.
+
+**How tool execution logging works (verbatim, confirmed with Arcade, Jun 15):** Arcade's built-in
+audit log covers administrative operations only (gateway creation, server registration, API key
+management) — this is by design, not a gap. Tool execution observability is handled via
+OpenTelemetry (OTEL): when deploying the Arcade image to Kubernetes, OTEL can be enabled to ship
+telemetry to any observability collector (Datadog, ELK Stack, etc.). When self-hosted, no telemetry
+flows back to Arcade — all data stays in ServiceTitan's infrastructure. This is the path to satisfy
+InfoSec's execution audit requirement.
+
+**ServiceTitan reality (this deployment — see ../../LIVE-POC.md):** logs → ELK (Vector daemonset);
+**metrics → Grafana/Mimir** (Grafana Agent scrapes ServiceMonitors → remote_write to Mimir). The
+engine emits OTLP metrics, but they are **dropped** today: `arcade-otel-collector:4318` does not
+resolve because no collector is deployed. Remediation: deploy a collector and bridge it into Prometheus/Mimir.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | OTEL enabled on the self-hosted Arcade deployment — execution telemetry ships to ServiceTitan's observability stack (Datadog or ELK). |  |  |
+| 2 | Every tool call produces a log record with: user, tool invoked, timestamp, outcome — queryable in Datadog or ELK. |  |  |
+| 3 | Admin audit log — all configuration changes (gateways, servers, API keys, policies) are logged in Arcade. |  |  |
+| 4 | Per-tool and per-user usage metrics (call counts, error rates, latency) visible in the observability stack. |  |  |
+| 5 | Trace propagation — tool call traces joinable to agent and application traces via OTEL. |  |  |
+| 6 | No telemetry data leaves ServiceTitan's infrastructure to Arcade when self-hosted. |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — No OTEL support; no execution telemetry available outside Arcade's dashboard
+- **3** — OTEL works but configuration is manual or underdocumented; trace propagation requires custom work
+- **5** — OTEL is documented and easy to enable; full execution telemetry in Datadog/ELK; trace propagation works end-to-end
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | Enable OTEL on the self-hosted Arcade Kubernetes deployment. Make a tool call. Verify a record appears in Datadog (or ELK) with: user_id, tool name, timestamp, outcome. |  |  |
+| 2 | Make an administrative change (update a gateway). Verify the change appears in Arcade's admin audit log. |  |  |
+| 3 | Propagate a trace ID from an agent call through to the tool execution. Verify the trace is end-to-end visible in the observability stack. |  |  |
+| 4 | Confirm no tool execution telemetry is transmitted to Arcade's own systems when running self-hosted. |  |  |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| OTEL integration | OTEL enabled on self-hosted deployment; execution telemetry flows to Datadog or ELK |  |  |
+| Execution audit | Every tool call produces a retrievable record with user, tool, timestamp, outcome in ServiceTitan's observability stack |  |  |
+| Admin audit | All Arcade configuration changes are logged in the admin audit log |  |  |
+| Data residency | No tool execution telemetry transmitted to Arcade when self-hosted — confirmed |  |  |
+| InfoSec sign-off | Dane Snyder confirms the OTEL-based execution audit satisfies the access audit requirement |  |  |
+
+## Findings
+-
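
For benchmark test 1 and the "Execution audit" gate, the per-record check could be sketched roughly as follows. This is a minimal sketch, not Arcade's or ServiceTitan's tooling: it assumes a log record has already been pulled from ELK (or Datadog) as a plain dictionary. The key names `user_id`, `tool_name`, `timestamp`, and `outcome` are hypothetical, as is the ISO 8601 timestamp format; the real names depend on how the OTEL export and the log pipeline map their attributes.

```python
from datetime import datetime

# Hypothetical field names (assumption): the real keys depend on how the
# OTEL export and the ELK/Datadog pipeline name their attributes.
REQUIRED_FIELDS = ("user_id", "tool_name", "timestamp", "outcome")


def missing_audit_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in one log record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def has_valid_timestamp(record: dict) -> bool:
    """True if the record's timestamp parses as ISO 8601 (assumed format)."""
    try:
        datetime.fromisoformat(str(record.get("timestamp", "")))
        return True
    except ValueError:
        return False


def is_audit_complete(record: dict) -> bool:
    """A record passes only if all four fields are present and the timestamp parses."""
    return not missing_audit_fields(record) and has_valid_timestamp(record)


if __name__ == "__main__":
    # Illustrative record only; not real telemetry from this deployment.
    incomplete = {"user_id": "u-123", "outcome": "error"}
    print(missing_audit_fields(incomplete))  # ['tool_name', 'timestamp']
```

Running a check like this over a sample of tool-call records gives concrete material for the Evidence column: the list of missing fields per record shows exactly which part of the "user, tool, timestamp, outcome" requirement is not yet met.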