Category 5 — Auditability and Observability (weight 12)
Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; the human pastes. 1–5 scale; anchors at 1/3/5.
How tool execution logging works (verbatim, confirmed with Arcade, Jun 15): Arcade's built-in audit log covers administrative operations only (gateway creation, server registration, API key management) — this is by design, not a gap. Tool execution observability is handled via OpenTelemetry (OTEL): when deploying the Arcade image to Kubernetes, OTEL can be enabled to ship telemetry to any observability collector (Datadog, ELK Stack, etc.). When self-hosted, no telemetry flows back to Arcade — all data stays in ServiceTitan's infrastructure. This is the path to satisfy InfoSec's execution audit requirement.
ServiceTitan reality (this deployment — see ../../LIVE-POC.md): logs → ELK (Vector daemonset);
metrics → Grafana/Mimir (Grafana Agent scrapes ServiceMonitors → remote_write to Mimir). The
engine emits OTLP metrics but they are dropped today — arcade-otel-collector:4318 does not
resolve (no collector deployed). Remediation = deploy a collector + bridge it into Prometheus/Mimir.
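The "does not resolve" finding can be re-checked before and after a collector is deployed. The sketch below is a minimal probe, not part of Arcade or the deployment: `check_endpoint` is a hypothetical helper that separates a DNS failure from a closed port, and it assumes it is run from a pod that shares the engine's namespace and DNS search path.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify an endpoint as 'unresolved', 'unreachable', or 'ok'."""
    try:
        # Resolve first so a DNS failure is reported separately from a refused connection.
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "unresolved"
    for family, socktype, proto, _canonname, sockaddr in addresses:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            continue  # try the next resolved address, if any
    return "unreachable"
```

Calling `check_endpoint("arcade-otel-collector", 4318)` should report `unresolved` in the state described above. A result of `ok` only shows that something accepts TCP connections on that port, not that metrics reach Mimir.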
Scores
| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
|---|---|---|---|
| 1 | OTEL enabled on the self-hosted Arcade deployment — execution telemetry ships to ServiceTitan's observability stack (Datadog or ELK). | ||
| 2 | Every tool call produces a log record with: user, tool invoked, timestamp, outcome — queryable in Datadog or ELK. | ||
| 3 | Admin audit log — all configuration changes (gateways, servers, API keys, policies) are logged in Arcade. | ||
| 4 | Per-tool and per-user usage metrics (call counts, error rates, latency) visible in the observability stack. | ||
| 5 | Trace propagation — tool call traces joinable to agent and application traces via OTEL. | ||
| 6 | No telemetry data leaves ServiceTitan's infrastructure to Arcade when self-hosted. | ||
Average: ___ Category score: ___
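For filling in the Average field locally, a small helper can guard against out-of-range entries. This is only a sketch: `category_average` is a hypothetical name, and it computes the plain mean of the six criterion scores. How the category score is derived from that average and the weight of 12 is not specified here, so it is left out.

```python
from statistics import mean


def category_average(scores: list[int]) -> float:
    """Return the mean of the criterion scores, rounded to two decimals."""
    if not scores:
        raise ValueError("no scores filled in yet")
    for score in scores:
        # The worksheet uses a 1-5 scale; reject anything outside it.
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} is outside the 1-5 scale")
    return round(mean(scores), 2)
```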
Score anchors
- 1 — No OTEL support; no execution telemetry available outside Arcade's dashboard
- 3 — OTEL works but configuration is manual or underdocumented; trace propagation requires custom work
- 5 — OTEL is documented and easy to enable; full execution telemetry in Datadog/ELK; trace propagation works end-to-end
Benchmark tests
| # | Test (verbatim) | Result | Evidence |
|---|---|---|---|
| 1 | Enable OTEL on the self-hosted Arcade Kubernetes deployment. Make a tool call. Verify a record appears in Datadog (or ELK) with: user_id, tool name, timestamp, outcome. | ||
| 2 | Make an administrative change (update a gateway). Verify the change appears in Arcade's admin audit log. | ||
| 3 | Propagate a trace ID from an agent call through to the tool execution. Verify the trace is end-to-end visible in the observability stack. | ||
| 4 | Confirm no tool execution telemetry is transmitted to Arcade's own systems when running self-hosted. | ||
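For benchmark test 3, one way to confirm that a tool execution span belongs to the agent's trace is to compare trace IDs. The sketch below assumes the agent sends a W3C Trace Context `traceparent` header (version `00`), which is the default propagation format in OpenTelemetry. Whether Arcade honours that header is exactly what the test needs to establish. The helper names are hypothetical.

```python
import re

# W3C Trace Context, version 00: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>".
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def trace_id_from_traceparent(header: str) -> str:
    """Extract the trace-id from a version-00 traceparent header."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"not a valid version-00 traceparent header: {header!r}")
    trace_id = match.group(1)
    if trace_id == "0" * 32:
        raise ValueError("a trace-id of all zeros is invalid")
    return trace_id


def same_trace(agent_traceparent: str, tool_span_trace_id: str) -> bool:
    """True if the tool span's trace ID matches the one the agent sent."""
    return trace_id_from_traceparent(agent_traceparent) == tool_span_trace_id.lower()
```

In practice the `tool_span_trace_id` argument would be copied from the span as it appears in the observability stack. A mismatch means the tool execution started a new trace instead of joining the agent's.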
Suggested pass/fail gates
| Gate | Pass condition (verbatim) | Result | Evidence |
|---|---|---|---|
| OTEL integration | OTEL enabled on self-hosted deployment; execution telemetry flows to Datadog or ELK | ||
| Execution audit | Every tool call produces a retrievable record with user, tool, timestamp, outcome in ServiceTitan's observability stack | ||
| Admin audit | All Arcade configuration changes are logged in the admin audit log | ||
| Data residency | No tool execution telemetry transmitted to Arcade when self-hosted — confirmed | ||
| InfoSec sign-off | Dane Snyder confirms the OTEL-based execution audit satisfies the access audit requirement | ||
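When collecting evidence for the Execution audit gate, records exported from the observability stack can be checked mechanically for the four required fields. The sketch below is illustrative only: the key names `user_id`, `tool_name`, `timestamp`, and `outcome` are assumptions, since the actual field names in Arcade's telemetry are not given here, and the timestamp is assumed to be ISO 8601.

```python
from datetime import datetime

# Hypothetical key names; adjust to whatever the exported records actually use.
REQUIRED_FIELDS = ("user_id", "tool_name", "timestamp", "outcome")


def missing_audit_fields(record: dict) -> list[str]:
    """Return the required fields that are absent, empty, or unusable in one record."""
    missing = [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]
    timestamp = record.get("timestamp")
    if timestamp not in (None, ""):
        try:
            # Accept a trailing "Z" by rewriting it as an explicit UTC offset.
            datetime.fromisoformat(str(timestamp).replace("Z", "+00:00"))
        except ValueError:
            missing.append("timestamp")  # present but not parseable
    return missing
```

An empty list for every sampled tool call supports a pass. Any non-empty result names the fields to raise with Arcade or to fix in the collector pipeline.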