docs: _TEMPLATE + all-10 criteria-section stubs (verbatim criteria)

2026-06-18 10:10:17 -04:00
parent 29c5b2c8be
commit 593e1e63b6
13 changed files with 510 additions and 0 deletions
@@ -0,0 +1,10 @@
+# Lane notes — Category N
+
+Working scratchpad for this lane. Keep terse; the scored deliverable is `criteria-section-N.md`.
+
+- **Owner:**
+- **Last live-state check:**
+- **Fixtures used:** (gateway slug, server, user_ids — see `../../config/targets.yaml`)
+
+## Log
+- (date) — what was done / found
@@ -0,0 +1,29 @@
+# Category N — <Name> (weight W)
+
+> Verbatim criteria / gates / questions from the criteria Google Doc. Fill Score / Evidence /
+> Findings / Answers locally; **the human pastes** into the Google Doc. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | <verbatim criterion> |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — <anchor>
+- **3** — <anchor>
+- **5** — <anchor>
+
+## Benchmark questions / tests
+| # | Question / test (verbatim) | Answer / result | Evidence |
+|---|---|---|---|
+| 1 | <verbatim> |  |  |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| <gate> | <verbatim> |  |  |
+
+## Findings
+- 
@@ -0,0 +1,43 @@
+# Category 1 — Functional MCP Gateway Capability (weight 8)
+
+> Verbatim criteria / gates / questions from the criteria Google Doc. Fill Score / Evidence /
+> Findings / Answers locally; **the human pastes** into the Google Doc. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Implements MCP protocol correctly — tool listing, tool invocation, error responses. |  |  |
+| 2 | Gateway tool curation — ability to expose a subset of tools from underlying servers to a given doorway. |  |  |
+| 3 | Per-user tool scoping — different users see different tool lists based on their explicit grants. |  |  |
+| 4 | Supports all required MCP clients without custom adapters (Claude Code, Cursor, LangGraph, internal agent frameworks). |  |  |
+| 5 | Tool execution isolation — one user's tool call cannot access another user's tokens or context. |  |  |
+| 6 | Supports mixing prebuilt (global catalog) and custom (self-hosted) servers behind a single gateway URL. |  |  |
+| 7 | Gateway is pure metadata — adding or removing tools does not require server redeployment. |  |  |
+| 8 | Dynamic tool registration — new tools become available without gateway restart. |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — Basic MCP server, no per-user scoping or curation
+- **3** — Gateway curation works; per-user scoping requires workarounds
+- **5** — Full per-user tool scoping, mixed-server gateways, zero-config for MCP clients
+
+## Benchmark questions
+| # | Question (verbatim) | Answer | Evidence |
+|---|---|---|---|
+| 1 | Can a Claude Code client connect to the gateway and see only the tools granted to the current user? |  |  |
+| 2 | Can the same gateway URL serve two different users with different tool lists? |  |  |
+| 3 | Can we add a tool to the gateway without restarting any server or the Engine? |  |  |
+| 4 | Can we expose tools from both a prebuilt connector and a custom self-hosted server through one gateway endpoint? |  |  |
+| 5 | What happens when a client requests a tool the user has not been granted? |  |  |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| MCP protocol compliance | Any compliant MCP client connects without custom adapters |  |  |
+| Tool curation | Gateway tool list matches exactly the configured allow-list |  |  |
+| Per-user isolation | User A cannot see or invoke tools granted only to User B |  |  |
+| Mixed server gateway | Prebuilt and custom server tools coexist behind one gateway URL |  |  |
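The "Tool curation" and "Per-user isolation" gates above both reduce to set comparisons on tool names, so the evidence can be checked mechanically. A minimal sketch, assuming the tool names have already been collected from the client's tool listing by some other means; the function names and the tool names in the usage note are hypothetical, not part of any Arcade API.

```python
def compare_tool_lists(listed, allowed):
    """Compare the tools a client actually sees against the configured allow-list.

    Both arguments are iterables of tool-name strings. The gate passes only
    when both returned lists are empty (an exact match).
    """
    listed_set, allowed_set = set(listed), set(allowed)
    return {
        "unexpected": sorted(listed_set - allowed_set),  # visible but not allowed
        "missing": sorted(allowed_set - listed_set),     # allowed but not visible
    }


def leaked_tools(user_a_listed, user_b_only):
    """Tools granted only to User B that nevertheless appear in User A's list."""
    return sorted(set(user_a_listed) & set(user_b_only))
```

For example, `compare_tool_lists(["github.create_issue"], ["github.create_issue", "jira.search"])` reports `jira.search` as missing, which would fail the exact-match gate.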
+
+## Findings
+- 
@@ -0,0 +1,49 @@
+# Category 10 — Product Fit — Tools Catalog and Multi-Tenancy (weight 5)
+
+> *Scored only if the engineering team proceeds to evaluate Arcade as the MCP gateway layer for
+> ServiceTitan's customer-facing tools catalog.* Verbatim criteria/gates from the criteria Google
+> Doc. Fill Score/Evidence locally; **the human pastes**. 1–5 scale; anchors at 1/3/5.
+
+**The multi-tenancy problem (verbatim):** ServiceTitan is a multi-tenant SaaS serving tens of
+thousands of business tenants. Creating one Arcade project per tenant is not a viable architecture.
+The requirement is a single shared Arcade deployment where tenant isolation is enforced within it:
+Tenant A's users cannot access Tenant B's tokens, tool grants, or data. Arcade's native isolation
+boundary is the **project**; within a project, isolation is at the `user_id` level.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Native multi-tenant isolation within a single project — Tenant A's tokens, tool grants, and policy are fully isolated from Tenant B's without separate projects. |  |  |
+| 2 | Per-tenant tool access policies — different tenants can have different tool allowlists and Contextual Access rules. |  |  |
+| 3 | Per-tenant quota and rate limits — one tenant's usage cannot degrade another's. |  |  |
+| 4 | Cross-tenant token isolation — provably no path for Tenant A's token to be served on a Tenant B tool call. |  |  |
+| 5 | New tenants can be provisioned programmatically via API — no manual steps, no UI clicks. |  |  |
+| 6 | Gateway configuration is API-driven to support programmatic tenant onboarding at scale. |  |  |
+| 7 | Custom servers built for internal use can be reused for the product use case without re-architecting. |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — No multi-tenant model; one project per tenant is the only isolation path — does not scale
+- **3** — user_id-level token isolation works within a project; tenant-level policy and quota require significant custom work
+- **5** — Native multi-tenant model within a single deployment — per-tenant isolation, policy, quota, and API-driven onboarding all supported
+
+## Benchmark questions
+| # | Question (verbatim) | Answer | Evidence |
+|---|---|---|---|
+| 1 | Does Arcade have a native multi-tenancy model within a single project, or does tenant isolation require one project per tenant? |  |  |
+| 2 | If `tenant_id:user_id` is used as the user_id, does Arcade enforce any tenant-level policy or quota boundaries, or is it purely token isolation? |  |  |
+| 3 | Can per-tenant tool access policies (different tool lists per tenant) be managed via API? |  |  |
+| 4 | Can a new tenant be onboarded — token vault initialized, tool grants set, gateway access configured — entirely via API with no manual steps? |  |  |
+| 5 | What is the recommended architecture for serving tens of thousands of tenants from a single Arcade deployment? |  |  |
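Question 2 above refers to using `tenant_id:user_id` as the `user_id`. A minimal sketch of that naming convention on the caller side, assuming a colon separator and tenant IDs that never contain one; the helper names are hypothetical. It only builds and parses the composite string. Whether Arcade enforces anything at the tenant level is exactly what the question asks, and this code does not answer it.

```python
SEPARATOR = ":"


def compose_user_id(tenant_id: str, user_id: str) -> str:
    """Build the composite `tenant_id:user_id` string."""
    if not tenant_id or not user_id:
        raise ValueError("tenant_id and user_id must be non-empty")
    if SEPARATOR in tenant_id:
        # A separator inside the tenant part would make parsing ambiguous.
        raise ValueError("tenant_id must not contain the separator")
    return f"{tenant_id}{SEPARATOR}{user_id}"


def split_user_id(composite: str) -> tuple[str, str]:
    """Split a composite ID on the first separator into (tenant_id, user_id)."""
    tenant_id, sep, user_id = composite.partition(SEPARATOR)
    if not sep or not tenant_id or not user_id:
        raise ValueError(f"not a composite user id: {composite!r}")
    return tenant_id, user_id


def same_tenant(composite_a: str, composite_b: str) -> bool:
    """True when both composite IDs carry the same tenant prefix."""
    return split_user_id(composite_a)[0] == split_user_id(composite_b)[0]
```

Splitting on the first separator keeps user IDs that themselves contain a colon intact, at the cost of forbidding colons in tenant IDs.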
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| Multi-tenant isolation | Tenant A's tokens and tool grants are provably inaccessible to Tenant B within a single deployment |  |  |
+| No per-tenant project | Tenant isolation does not require one Arcade project per tenant |  |  |
+| API-driven onboarding | A new tenant can be fully provisioned via API with no manual steps |  |  |
+| Per-tenant policy | Different tenants can have different tool allowlists managed programmatically |  |  |
+
+## Findings
+- 
@@ -0,0 +1,49 @@
+# Category 2 — Delegated Authorization and Identity (weight 20)
+
+> The load-bearing category: every tool call executes as the calling user, using that user's own
+> credentials, and the agent code never sees the token. Verbatim criteria/gates from the criteria
+> Google Doc. Fill Score/Evidence locally; **the human pastes**. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Per-user OAuth token vault — tokens are stored and refreshed per user, per service, per scope. |  |  |
+| 2 | Tool calls execute as the calling user — not a shared service account or bot credential. |  |  |
+| 3 | Okta (OIDC/SAML) integration as the primary IDP for gateway access. |  |  |
+| 4 | Custom OAuth provider support — ability to register non-standard OAuth providers (Snowflake, Workday, TenantTalk via Okta). |  |  |
+| 5 | Token refresh is handled automatically without requiring user re-authentication on every call. |  |  |
+| 6 | The LLM and agent code never see raw tokens — token injection happens server-side in the Engine. |  |  |
+| 7 | Token vault is project-scoped — no cross-project token leakage. |  |  |
+| 8 | Admin consent — ability for an admin to pre-authorize a scope on behalf of a class of users. |  |  |
+| 9 | Admin-initiated token revocation — an admin can invalidate all vault tokens for a specific user directly in Arcade, without touching any downstream provider. Primary use case: employee offboarding or security incident response. |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — Shared API keys or service accounts only; no per-user identity
+- **3** — Per-user OAuth works for prebuilt connectors; custom providers require undocumented manual steps; revocation requires going to each provider individually
+- **5** — Full per-user vault, Okta integration, custom OAuth providers documented and working, token refresh transparent, admin-initiated revocation works from one place
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | Call a tool as User A. Verify it executes with User A's credentials by checking the downstream system's own audit log (e.g., GitHub shows the call as User A, not a service account). |  |  |
+| 2 | Revoke User A's OAuth token in the provider. Verify the next tool call triggers a consent/re-auth flow rather than silently failing or falling back to a shared credential. |  |  |
+| 3 | Configure a custom OAuth provider (Snowflake or Workday). Complete a full per-user token flow end-to-end: authorize → vault stores token → tool call executes as that user. |  |  |
+| 4 | Configure TenantTalk authentication via Okta as a custom OAuth provider. Verify the Engine brokers the token correctly. |  |  |
+| 5 | Verify token refresh: let a token expire. Confirm the next call either refreshes transparently or returns a clear re-auth prompt. |  |  |
+| 6 | Admin-initiated revocation: as an admin, invalidate all vault tokens for User A in Arcade directly (no downstream provider action). Verify User A's next tool call fails or triggers re-auth, across all connected systems simultaneously. |  |  |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| Per-user execution | Tool calls provably execute as the calling user (verifiable in the downstream system's own logs) |  |  |
+| No shared credentials | No service account or shared token is used in any tool call path |  |  |
+| Okta integration | Gateway access works end-to-end through Okta OIDC/SAML |  |  |
+| Custom OAuth | At least one custom provider (Snowflake or Workday) configured and functional |  |  |
+| Token isolation | No user's token is accessible by, or executed as, another user |  |  |
+| Downstream revocation | Revoking a token at the provider level triggers re-auth on the next call — no silent fallback |  |  |
+| Admin-initiated revocation | An admin can invalidate all of a specific user's vault tokens in Arcade directly, taking effect across all connected systems without touching each provider individually |  |  |
+
+## Findings
+- Note (deployment): live POC upstream IdP is **Entra ID**, not Okta yet — score criterion 3 against that gap.
@@ -0,0 +1,44 @@
+# Category 3 — Tool-Level Access Control and Policy (weight 15)
+
+> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human
+> pastes**. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Tool-level allow-list per user — a user can only call tools explicitly granted to them; the gateway enforces this, not the client. |  |  |
+| 2 | Contextual Access rules — per-user tool visibility and invocation policy layered on top of the gateway allow-list. |  |  |
+| 3 | Input filtering — ability to block or rewrite tool inputs based on policy before execution reaches the server. |  |  |
+| 4 | Output redaction — ability to mask or strip sensitive fields from tool outputs before they reach the agent. |  |  |
+| 5 | Policy is enforced at the Engine, not the client — a malicious or compromised client cannot bypass it. |  |  |
+| 6 | All policy decisions (allow, block, redact) are logged. |  |  |
+| 7 | Per-user tool grants can be updated without restarting the gateway or any server. |  |  |
+| 8 | Gateway scopes map to Okta groups — access managed in Okta, not a separate system. |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — Gateway-level tool list only; no per-user scoping or input/output policy
+- **3** — Per-user grants work; Contextual Access input/output rules require significant manual work
+- **5** — Full per-user policy, Contextual Access input/output rules, Okta-managed scopes, all decisions audited
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | Grant User A access to GitHub tools and User B access to Atlassian tools. Verify User A cannot invoke Atlassian tools even if they know the tool name. |  |  |
+| 2 | Write a Contextual Access rule that blocks inputs containing a specific pattern (e.g., a mock SSN). Send a matching input — verify it is blocked before execution and logged. |  |  |
+| 3 | Write a Contextual Access rule that redacts a field from tool outputs. Verify the field is absent from the agent's response. |  |  |
+| 4 | Update User A's tool grants (add a new tool). Verify the change takes effect without restarting anything. |  |  |
+| 5 | Confirm policy enforcement point: attempt to bypass Contextual Access by calling the server directly (bypassing the Engine). Confirm this is architecturally prevented or explicitly documented as a known boundary. |  |  |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| Tool isolation | Cross-user tool calls are rejected at the Engine regardless of client behavior |  |  |
+| Input policy | Blocked inputs are rejected before execution, not after |  |  |
+| Output policy | Redacted fields are absent from the agent's response |  |  |
+| Audit | Every policy decision (allow/block/redact) produces a retrievable log entry |  |  |
+| Dynamic grants | Tool grant updates take effect without service restart |  |  |
+
+## Findings
+- 
@@ -0,0 +1,56 @@
+# Category 4 — Connector Coverage and Custom Server Development (weight 10)
+
+> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human
+> pastes**. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Prebuilt catalog covers required systems (GitHub, Salesforce, Atlassian/Jira). |  |  |
+| 2 | Python SDK (arcade-mcp) supports building custom servers with minimal boilerplate. |  |  |
+| 3 | Tool schema is auto-derived from Python type annotations — no manual schema authoring. |  |  |
+| 4 | Local development loop works without cloud infrastructure (stdio mode). |  |  |
+| 5 | Custom servers can be registered as self-hosted (HTTPS endpoint) and routed by the Engine. |  |  |
+| 6 | Custom OAuth provider registration — Engine brokers per-user tokens for custom systems. |  |  |
+| 7 | Custom servers can be versioned and updated without gateway downtime. |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — No SDK; custom integration requires raw HTTP server and manual schema
+- **3** — SDK works for basic cases; custom OAuth is underdocumented; some systems blocked
+- **5** — SDK is productive, custom OAuth providers are documented and straightforward, all six systems have a working path
+
+## Coverage of required systems (verbatim)
+| System | Prebuilt? | Path |
+|---|---|---|
+| GitHub | Yes | Prebuilt (global catalog) |
+| Salesforce | Yes | Prebuilt (global catalog) |
+| Atlassian / Jira | Partial | Prebuilt; confirm Confluence coverage |
+| HubSpot | Yes | Prebuilt (global catalog) |
+| QuickBooks | No | Custom server + custom OAuth provider |
+| Sage | No | Custom server + custom OAuth provider |
+| Snowflake | No | Custom server + custom OAuth provider |
+| Workday | No | Custom server + custom OAuth provider |
+| TenantTalk | No (internal) | Custom server + Okta-backed OAuth |
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | Build a minimal custom server for one internal API using the arcade-mcp SDK. Measure time from schema to first successful local tool call. Target: under 2 hours. |  |  |
+| 2 | Register the custom server as self-hosted (HTTPS endpoint). Verify Engine routing works and tool calls reach the server. |  |  |
+| 3 | Configure Snowflake (or equivalent) as a custom OAuth provider. Complete per-user token flow end-to-end. |  |  |
+| 4 | Verify tool schema is auto-derived from Python type annotations. Confirm no manual JSON schema authoring is required. |  |  |
+| 5 | Update a custom tool's implementation. Verify the change takes effect without restarting the gateway. |  |  |
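To make test 4 concrete, it helps to know what "auto-derived from Python type annotations" looks like in general. The sketch below is a generic illustration built on the standard library only. It is not how arcade-mcp derives schemas, and the output shape is an assumption chosen for readability; compare the SDK's real output against the annotated function by hand.

```python
import inspect
from typing import get_type_hints

# Mapping from simple Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def derive_schema(func):
    """Derive a JSON-Schema-like description from a function's annotations."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        # Anything outside the simple mapping falls back to "object" here.
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "object")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value means the caller must supply it
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }
```

If the SDK's listed schema for a tool disagrees with the annotations on the function (a missing parameter, a wrong type), that is a finding for criterion 3.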
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| SDK productivity | Custom server from scratch to first local tool call in under 2 hours |  |  |
+| Self-hosted registration | HTTPS endpoint registers and routes correctly through the Engine |  |  |
+| Custom OAuth | At least one non-standard OAuth provider configured end-to-end |  |  |
+| Schema derivation | Tool schema is auto-derived; no manual JSON schema authoring required |  |  |
+| All systems covered | A working path (prebuilt or custom) exists for all six required systems |  |  |
+
+## Findings
+- 
@@ -0,0 +1,54 @@
+# Category 5 — Auditability and Observability (weight 12)
+
+> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human
+> pastes**. 1–5 scale; anchors at 1/3/5.
+
+**How tool execution logging works (verbatim, confirmed with Arcade, Jun 15):** Arcade's built-in
+audit log covers administrative operations only (gateway creation, server registration, API key
+management) — this is by design, not a gap. Tool execution observability is handled via
+OpenTelemetry (OTEL): when deploying the Arcade image to Kubernetes, OTEL can be enabled to ship
+telemetry to any observability collector (Datadog, ELK Stack, etc.). When self-hosted, no telemetry
+flows back to Arcade — all data stays in ServiceTitan's infrastructure. This is the path to satisfy
+InfoSec's execution audit requirement.
+
+**ServiceTitan reality (this deployment; see ../../LIVE-POC.md):** logs → ELK (Vector daemonset);
+**metrics → Grafana/Mimir** (Grafana Agent scrapes ServiceMonitors → remote_write to Mimir). The
+Engine emits OTLP metrics, but they are **dropped** today because `arcade-otel-collector:4318` does
+not resolve (no collector is deployed). Remediation: deploy a collector and bridge it into Prometheus/Mimir.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | OTEL enabled on the self-hosted Arcade deployment — execution telemetry ships to ServiceTitan's observability stack (Datadog or ELK). |  |  |
+| 2 | Every tool call produces a log record with: user, tool invoked, timestamp, outcome — queryable in Datadog or ELK. |  |  |
+| 3 | Admin audit log — all configuration changes (gateways, servers, API keys, policies) are logged in Arcade. |  |  |
+| 4 | Per-tool and per-user usage metrics (call counts, error rates, latency) visible in the observability stack. |  |  |
+| 5 | Trace propagation — tool call traces joinable to agent and application traces via OTEL. |  |  |
+| 6 | No telemetry data leaves ServiceTitan's infrastructure to Arcade when self-hosted. |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — No OTEL support; no execution telemetry available outside Arcade's dashboard
+- **3** — OTEL works but configuration is manual or underdocumented; trace propagation requires custom work
+- **5** — OTEL is documented and easy to enable; full execution telemetry in Datadog/ELK; trace propagation works end-to-end
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | Enable OTEL on the self-hosted Arcade Kubernetes deployment. Make a tool call. Verify a record appears in Datadog (or ELK) with: user_id, tool name, timestamp, outcome. |  |  |
+| 2 | Make an administrative change (update a gateway). Verify the change appears in Arcade's admin audit log. |  |  |
+| 3 | Propagate a trace ID from an agent call through to the tool execution. Verify the trace is end-to-end visible in the observability stack. |  |  |
+| 4 | Confirm no tool execution telemetry is transmitted to Arcade's own systems when running self-hosted. |  |  |
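Test 1 above requires each execution record to carry user_id, tool name, timestamp, and outcome. A minimal sketch of a checker for records exported from the observability stack, assuming they arrive as flat dictionaries; the key names (`user_id`, `tool`, `timestamp`, `outcome`) are hypothetical and must be mapped to whatever attribute names the telemetry actually uses.

```python
from datetime import datetime

# Hypothetical key names; adjust to the real attribute names in the telemetry.
REQUIRED_FIELDS = ("user_id", "tool", "timestamp", "outcome")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required keys that are absent or empty in one record."""
    return [key for key in required if record.get(key) in (None, "")]


def has_valid_timestamp(record):
    """True when the record's timestamp parses as an ISO 8601 datetime."""
    try:
        datetime.fromisoformat(str(record.get("timestamp", "")))
    except ValueError:
        return False
    return True
```

Running `missing_fields` over a sample of exported records gives a quick count of how many tool calls would fail the "Execution audit" gate.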
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| OTEL integration | OTEL enabled on self-hosted deployment; execution telemetry flows to Datadog or ELK |  |  |
+| Execution audit | Every tool call produces a retrievable record with user, tool, timestamp, outcome in ServiceTitan's observability stack |  |  |
+| Admin audit | All Arcade configuration changes are logged in the admin audit log |  |  |
+| Data residency | No tool execution telemetry transmitted to Arcade when self-hosted — confirmed |  |  |
+| InfoSec sign-off | Dane Snyder confirms the OTEL-based execution audit satisfies the access audit requirement |  |  |
+
+## Findings
+- 
@@ -0,0 +1,50 @@
+# Category 6 — Security and Compliance (weight 10)
+
+> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human
+> pastes**. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | PII masking or redaction at the gateway layer — without changes to tool code. |  |  |
+| 2 | Input blocking — Contextual Access policy can block tool calls based on content. |  |  |
+| 3 | MCPs can be scaled to less than human access. |  |  |
+| 4 | Output redaction — sensitive fields removed from responses before reaching the agent. |  |  |
+| 5 | Data processing agreement (DPA) and sub-processor disclosure in place. |  |  |
+| 6 | SOC 2 / ISO 27001 certification (or equivalent) confirmed. |  |  |
+| 7 | Data boundary acceptable to InfoSec — tool call payloads route through Arcade's Engine; execution stays in ServiceTitan's infrastructure. |  |  |
+| 8 | Raw OAuth tokens are never exposed to the LLM, agent code, or logs. |  |  |
+| 9 | Secrets management integration (Azure Key Vault or equivalent) for API key storage. |  |  |
+| 10 | Potential for log forwarding for telemetry, alerting |  |  |
+| 11 | Potential integration for DLP tooling if possible |  |  |
+| 12 | Data boundary guardrails (able to block querying all records from a table) |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — No policy enforcement; payloads flow unmodified; DPA and certifications unconfirmed
+- **3** — Some policy controls exist; DPA in progress; compliance posture requires follow-up
+- **5** — Full policy enforcement, DPA executed, compliant data boundary, tokens never exposed
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | Send a tool input containing a mock SSN. Verify it is redacted before reaching the tool function via a Contextual Access rule. |  |  |
+| 2 | Send a tool output containing a mock API key string. Verify it is redacted before reaching the agent. |  |  |
+| 3 | Attempt a tool call with an expired or revoked credential. Verify rejection with a clean error — no fallback to a shared credential. |  |  |
+| 4 | Attempt to call a tool that has been restricted by the MCP gateway that the person usually can perform |  |  |
+| 5 | Attempt to pull all records from an MCP integration, instead of focused data |  |  |
+| 6 | Review the DPA and sub-processor list against ServiceTitan's data governance requirements. |  |  |
+| 7 | Confirm in the Engine architecture that raw tokens never appear in logs, traces, or agent responses. |  |  |
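For tests 1 and 2, the evidence is whether the mock value is still present in what the tool function or the agent received. A minimal sketch of a scanner for captured payloads, assuming the mock SSN uses the common `NNN-NN-NNNN` shape and that payloads are JSON-like structures. This is a local verification helper; it is not Contextual Access rule syntax.

```python
import re

# Matches the NNN-NN-NNNN shape used for the mock SSN in the test inputs.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssns(value):
    """Recursively collect SSN-shaped strings from a JSON-like payload."""
    if isinstance(value, str):
        return SSN_PATTERN.findall(value)
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, (list, tuple)):
        found = []
        for item in value:
            found.extend(find_ssns(item))
        return found
    return []  # numbers, booleans, None carry no string content to scan


def is_redacted(payload):
    """True when no SSN-shaped string survives anywhere in the payload."""
    return not find_ssns(payload)
```

The same walker can be pointed at a mock API key pattern for test 2 by swapping the regular expression.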
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| Data boundary | Tool call payloads through Arcade Engine + execution in ServiceTitan infrastructure — acceptable to InfoSec |  |  |
+| No token exposure | Raw OAuth tokens are never visible in logs, traces, or agent responses |  |  |
+| DPA | Data processing agreement is executed before the pilot ends |  |  |
+| PII policy | At least one PII redaction rule works end-to-end |  |  |
+| Compliance | SOC 2 or equivalent certification confirmed |  |  |
+
+## Findings
+- 
@@ -0,0 +1,42 @@
+# Category 7 — Performance and Availability (weight 8)
+
+> Because every gateway-mediated tool call routes through the Arcade Engine — even when the custom
+> server is self-hosted — Engine latency and availability are a floor on the entire agent stack.
+> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human
+> pastes**. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Engine-added latency per tool call is within acceptable bounds for interactive agent use. |  |  |
+| 2 | Engine SLA — defined uptime guarantees with incident response process. |  |  |
+| 3 | Failure behavior when Engine is unavailable: fail-closed with a clean, catchable error. |  |  |
+| 4 | Self-hosted server HA — multi-replica, pod failure handling, no dropped calls on restart. |  |  |
+| 5 | Multi-region failover design — documented and validated. |  |  |
+| 6 | Engine geographic placement and round-trip latency from ServiceTitan's primary region. |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — Engine SLA undocumented; failure behavior is a hang or silent failure; no HA guidance
+- **3** — SLA documented; HA works with manual configuration; failure behavior is known but requires client-side handling
+- **5** — SLA with incident response in writing; HA is the documented default; failure behavior is clean and observable
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | Make 100 tool calls through the Engine to a self-hosted server. Measure P50, P95, P99 round-trip latency. Compare against a direct server call (bypassing the Engine) to isolate Engine-added overhead. |  |  |
+| 2 | Simulate Engine unavailability (block the Engine endpoint). Confirm tool calls fail with a clean, catchable error — not a hang or silent failure. |  |  |
+| 3 | Deploy the custom server with multiple replicas. Kill one pod. Confirm tool calls continue without dropped requests. |  |  |
+| 4 | Confirm Engine SLA documentation: uptime percentage, response time commitment, and P0 escalation path. |  |  |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| Engine overhead | P95 Engine-added latency is under 500ms for standard (non-streaming) tool calls |  |  |
+| SLA documented | Engine uptime SLA and incident response process confirmed in writing |  |  |
+| HA | Self-hosted server survives pod failure; no tool calls dropped during pod restart |  |  |
+| Fail behavior | Engine outage produces a clean, catchable error to the agent — no hangs |  |  |
+
+## Findings
+- 
@@ -0,0 +1,41 @@
+# Category 8 — Deployment and Operations (weight 7)
+
+> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human
+> pastes**. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Helm chart available and documented for self-hosted server deployment in Kubernetes. |  |  |
+| 2 | Zero-downtime configuration updates — gateway and policy changes do not interrupt in-flight calls. |  |  |
+| 3 | GitOps-compatible — gateway, server, and policy configuration is expressible as code. |  |  |
+| 4 | Upgrade and rollback process is documented and tested. |  |  |
+| 5 | Runbooks for common failure scenarios. |  |  |
+| 6 | Vendor support model during the pilot: dedicated solutions engineer, response SLA. |  |  |
+| 7 | P0/P1 escalation path after the pilot, in production. |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — No Helm chart; manual deployment only; no dedicated support
+- **3** — Helm chart works with gaps; zero-downtime config updates unverified; support exists but is not dedicated
+- **5** — Helm-native, GitOps-compatible, zero-downtime config, dedicated SE, documented escalation path
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | Deploy a self-hosted custom server to a Kubernetes namespace via Helm chart. Measure time from clean namespace to first successful tool call. Target: under 1 day. |  |  |
+| 2 | Update a gateway configuration (add a tool). Verify in-flight calls are not dropped. |  |  |
+| 3 | Simulate a configuration rollback. Verify the rollback completes cleanly and the prior configuration is restored. |  |  |
+| 4 | Stage a K8s namespace and confirm the Helm deployment matches the architecture recommended for production. |  |  |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| Helm deployment | Full Kubernetes deployment via Helm chart in under 1 day |  |  |
+| Config safety | Gateway configuration changes are zero-downtime |  |  |
+| Rollback | Prior configuration can be restored cleanly |  |  |
+| Dedicated SE | A dedicated solutions engineer is available and responsive during the pilot |  |  |
+
+## Findings
+- Note: a live deployment already exists (`k8s-backstage-v2/apps/arcade`, chart 1.8.8, Flux/GitOps) — a head start for this category's evidence.
@@ -0,0 +1,43 @@
+# Category 9 — Developer Experience (weight 5)
+
+> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human
+> pastes**. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Local development loop is productive — stdio mode enables tool development without cloud infrastructure (Stage 1: code runs locally, MCP client spawns the server directly). |  |  |
+| 2 | Tunnel-based development loop is supported — a developer can expose their locally running MCP server through a tunnel (Cloudflare, ngrok) and register it against a shared dev Arcade instance to exercise the full request chain (gateway → Engine → tunnel → local server) without deploying to Kubernetes. This is the primary development pattern for custom server authors. |  |  |
+| 3 | A shared dev Arcade instance is available for ServiceTitan developers to register tunnel endpoints against — no need to provision a personal Arcade org for every developer. |  |  |
+| 4 | Cloudflare tunnel (or equivalent) is the standardized proxy mechanism — documented, with a permanent named-tunnel option so the registered server URL does not change on every session restart. |  |  |
+| 5 | SDK documentation is complete, accurate, and has working examples. |  |  |
+| 6 | Error messages are actionable — auth failures, misconfigurations, and policy blocks identify the root cause. |  |  |
+| 7 | MCP client integration requires no custom adapters or wrappers. |  |  |
+| 8 | Gateway and server management are automatable via API. |  |  |
+
+**Average:** ___   **Category score:** ___
+
+## Score anchors
+- **1** — No tunnel support; local development requires full Kubernetes deployment to test the gateway chain
+- **3** — Tunnel registration works but is underdocumented; no shared dev instance; engineers figure it out individually
+- **5** — Tunnel loop is documented with a standard Cloudflare recipe; shared dev instance available; engineers are productive without platform hand-holding
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | **Stage 1 — local stdio:** Time an engineer from SDK install to first successful local tool call (stdio mode, no Arcade infrastructure). Target: under 2 hours. |  |  |
+| 2 | **Stage 2 — tunnel registration (the key test):** Developer runs a local MCP server in HTTP mode, opens a Cloudflare tunnel, registers the tunnel URL as a self-hosted server in the dev Arcade instance, and makes a tool call that flows: Claude Code → gateway → Engine → Cloudflare tunnel → local server. Verify the full chain works including auth and Contextual Access. Measure time from working local server to first successful gateway-mediated call. Target: under 1 day. |  |  |
+| 3 | Verify Cloudflare named tunnel (permanent hostname) — confirm the registered URL survives session restarts without re-editing the server registration. |  |  |
+| 4 | Intentionally misconfigure an OAuth provider. Measure how quickly the error message identifies the root cause. |  |  |
+| 5 | Integrate with Claude Code from scratch — time from gateway URL to working tool invocation in Claude Code. Target: under 30 minutes. |  |  |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| Tunnel loop | Full gateway → Engine → Cloudflare tunnel → local server chain works end-to-end |  |  |
+| Permanent tunnel URL | Named tunnel hostname persists across session restarts without re-registration |  |  |
+| Shared dev instance | ServiceTitan developers can register local servers against a shared dev Arcade org without individual account provisioning |  |  |
+| Time to first call | Engineer reaches a working gateway-mediated tool call in under 1 day from scratch |  |  |
+
+## Findings
+-
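The category weights in the headings above (8, 20, 15, 10, 12, 10, 8, 7, 5, 5) sum to 100, which suggests a weighted roll-up of the per-category averages. The stubs do not define how "Category score" is derived, so the sketch below rests on two assumptions: a category's score is the plain mean of its filled-in criteria, and the overall result is a weighted mean on the same 1–5 scale. Categories with no score yet (for example Category 10, which is scored only conditionally) are left out and the remaining weights renormalized.

```python
from statistics import mean

# Weights copied from the category headings; they sum to 100.
WEIGHTS = {1: 8, 2: 20, 3: 15, 4: 10, 5: 12, 6: 10, 7: 8, 8: 7, 9: 5, 10: 5}


def category_average(scores):
    """Mean of the filled-in 1-5 criterion scores; blanks (None) are skipped."""
    filled = [s for s in scores if s is not None]
    if any(not 1 <= s <= 5 for s in filled):
        raise ValueError("scores must be on the 1-5 scale")
    return mean(filled) if filled else None


def weighted_total(averages):
    """Weighted mean over categories that have an average (assumed roll-up)."""
    scored = {cat: avg for cat, avg in averages.items() if avg is not None}
    total_weight = sum(WEIGHTS[cat] for cat in scored)
    if total_weight == 0:
        return None
    return sum(WEIGHTS[cat] * avg for cat, avg in scored.items()) / total_weight
```

The numbers this produces are a convenience for local tracking only; the scores pasted into the Google Doc remain the record.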