From 593e1e63b6a3047dc4cd4cfa2495e78563d0e501 Mon Sep 17 00:00:00 2001
From: iztaylor
Date: Thu, 18 Jun 2026 10:10:17 -0400
Subject: [PATCH] docs: _TEMPLATE + all-10 criteria-section stubs (verbatim criteria)

---
 categories/_TEMPLATE/NOTES.md | 10 ++++
 categories/_TEMPLATE/criteria-section.md | 29 ++++++++++
 categories/_TEMPLATE/tests/.gitkeep | 0
 .../cat1-functional/criteria-section-1.md | 43 ++++++++++++++
 .../cat10-product-fit/criteria-section-10.md | 49 ++++++++++++++++
 .../criteria-section-2.md | 49 ++++++++++++++++
 .../cat3-access-policy/criteria-section-3.md | 44 +++++++++++++++
 .../cat4-connectors/criteria-section-4.md | 56 +++++++++++++++++++
 .../cat5-auditability/criteria-section-5.md | 54 ++++++++++++++++++
 .../cat6-security/criteria-section-6.md | 50 +++++++++++++++++
 .../cat7-performance/criteria-section-7.md | 42 ++++++++++++++
 .../cat8-deployment-ops/criteria-section-8.md | 41 ++++++++++++++
 categories/cat9-devex/criteria-section-9.md | 43 ++++++++++++++
 13 files changed, 510 insertions(+)
 create mode 100644 categories/_TEMPLATE/NOTES.md
 create mode 100644 categories/_TEMPLATE/criteria-section.md
 create mode 100644 categories/_TEMPLATE/tests/.gitkeep
 create mode 100644 categories/cat1-functional/criteria-section-1.md
 create mode 100644 categories/cat10-product-fit/criteria-section-10.md
 create mode 100644 categories/cat2-delegated-authz/criteria-section-2.md
 create mode 100644 categories/cat3-access-policy/criteria-section-3.md
 create mode 100644 categories/cat4-connectors/criteria-section-4.md
 create mode 100644 categories/cat5-auditability/criteria-section-5.md
 create mode 100644 categories/cat6-security/criteria-section-6.md
 create mode 100644 categories/cat7-performance/criteria-section-7.md
 create mode 100644 categories/cat8-deployment-ops/criteria-section-8.md
 create mode 100644 categories/cat9-devex/criteria-section-9.md

diff --git a/categories/_TEMPLATE/NOTES.md b/categories/_TEMPLATE/NOTES.md
new file mode 100644
index 0000000..35e8940
--- /dev/null
+++ b/categories/_TEMPLATE/NOTES.md
@@ -0,0 +1,10 @@
+# Lane notes — Category N
+
+Working scratchpad for this lane. Keep terse; the scored deliverable is `criteria-section-N.md`.
+
+- **Owner:**
+- **Last live-state check:**
+- **Fixtures used:** (gateway slug, server, user_ids — see `../../config/targets.yaml`)
+
+## Log
+- (date) — what was done / found
diff --git a/categories/_TEMPLATE/criteria-section.md b/categories/_TEMPLATE/criteria-section.md
new file mode 100644
index 0000000..fe4e4ca
--- /dev/null
+++ b/categories/_TEMPLATE/criteria-section.md
@@ -0,0 +1,29 @@
+# Category N — (weight W)
+
+> Verbatim criteria / gates / questions from the criteria Google Doc. Fill Score / Evidence /
+> Findings / Answers locally; **the human pastes** into the Google Doc. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | | | |
+
+**Average:** ___ **Category score:** ___
+
+## Score anchors
+- **1** —
+- **3** —
+- **5** —
+
+## Benchmark questions / tests
+| # | Question / test (verbatim) | Answer / result | Evidence |
+|---|---|---|---|
+| 1 | | | |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| | | | |
+
+## Findings
+-
diff --git a/categories/_TEMPLATE/tests/.gitkeep b/categories/_TEMPLATE/tests/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/categories/cat1-functional/criteria-section-1.md b/categories/cat1-functional/criteria-section-1.md
new file mode 100644
index 0000000..8ee554f
--- /dev/null
+++ b/categories/cat1-functional/criteria-section-1.md
@@ -0,0 +1,43 @@
+# Category 1 — Functional MCP Gateway Capability (weight 8)
+
+> Verbatim criteria / gates / questions from the criteria Google Doc. Fill Score / Evidence /
+> Findings / Answers locally; **the human pastes** into the Google Doc. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Implements MCP protocol correctly — tool listing, tool invocation, error responses. | | |
+| 2 | Gateway tool curation — ability to expose a subset of tools from underlying servers to a given doorway. | | |
+| 3 | Per-user tool scoping — different users see different tool lists based on their explicit grants. | | |
+| 4 | Supports all required MCP clients without custom adapters (Claude Code, Cursor, LangGraph, internal agent frameworks). | | |
+| 5 | Tool execution isolation — one user's tool call cannot access another user's tokens or context. | | |
+| 6 | Supports mixing prebuilt (global catalog) and custom (self-hosted) servers behind a single gateway URL. | | |
+| 7 | Gateway is pure metadata — adding or removing tools does not require server redeployment. | | |
+| 8 | Dynamic tool registration — new tools become available without gateway restart. | | |
+
+**Average:** ___ **Category score:** ___
+
+## Score anchors
+- **1** — Basic MCP server, no per-user scoping or curation
+- **3** — Gateway curation works; per-user scoping requires workarounds
+- **5** — Full per-user tool scoping, mixed-server gateways, zero-config for MCP clients
+
+## Benchmark questions
+| # | Question (verbatim) | Answer | Evidence |
+|---|---|---|---|
+| 1 | Can a Claude Code client connect to the gateway and see only the tools granted to the current user? | | |
+| 2 | Can the same gateway URL serve two different users with different tool lists? | | |
+| 3 | Can we add a tool to the gateway without restarting any server or the Engine? | | |
+| 4 | Can we expose tools from both a prebuilt connector and a custom self-hosted server through one gateway endpoint? | | |
+| 5 | What happens when a client requests a tool the user has not been granted? | | |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| MCP protocol compliance | Any compliant MCP client connects without custom adapters | | |
+| Tool curation | Gateway tool list matches exactly the configured allow-list | | |
+| Per-user isolation | User A cannot see or invoke tools granted only to User B | | |
+| Mixed server gateway | Prebuilt and custom server tools coexist behind one gateway URL | | |
+
+## Findings
+-
diff --git a/categories/cat10-product-fit/criteria-section-10.md b/categories/cat10-product-fit/criteria-section-10.md
new file mode 100644
index 0000000..7cc8b34
--- /dev/null
+++ b/categories/cat10-product-fit/criteria-section-10.md
@@ -0,0 +1,49 @@
+# Category 10 — Product Fit — Tools Catalog and Multi-Tenancy (weight 5)
+
+> *Scored only if the engineering team proceeds to evaluate Arcade as the MCP gateway layer for
+> ServiceTitan's customer-facing tools catalog.* Verbatim criteria/gates from the criteria Google
+> Doc. Fill Score/Evidence locally; **the human pastes**. 1–5 scale; anchors at 1/3/5.
+
+**The multi-tenancy problem (verbatim):** ServiceTitan is a multi-tenant SaaS serving tens of
+thousands of business tenants. Creating one Arcade project per tenant is not a viable architecture.
+The requirement is a single shared Arcade deployment where tenant isolation is enforced within it:
+Tenant A's users cannot access Tenant B's tokens, tool grants, or data. Arcade's native isolation
+boundary is the **project**; within a project, isolation is at the `user_id` level.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Native multi-tenant isolation within a single project — Tenant A's tokens, tool grants, and policy are fully isolated from Tenant B's without separate projects. | | |
+| 2 | Per-tenant tool access policies — different tenants can have different tool allowlists and Contextual Access rules. | | |
+| 3 | Per-tenant quota and rate limits — one tenant's usage cannot degrade another's. | | |
+| 4 | Cross-tenant token isolation — provably no path for Tenant A's token to be served on a Tenant B tool call. | | |
+| 5 | New tenants can be provisioned programmatically via API — no manual steps, no UI clicks. | | |
+| 6 | Gateway configuration is API-driven to support programmatic tenant onboarding at scale. | | |
+| 7 | Custom servers built for internal use can be reused for the product use case without re-architecting. | | |
+
+**Average:** ___ **Category score:** ___
+
+## Score anchors
+- **1** — No multi-tenant model; one project per tenant is the only isolation path — does not scale
+- **3** — user_id-level token isolation works within a project; tenant-level policy and quota require significant custom work
+- **5** — Native multi-tenant model within a single deployment — per-tenant isolation, policy, quota, and API-driven onboarding all supported
+
+## Benchmark questions
+| # | Question (verbatim) | Answer | Evidence |
+|---|---|---|---|
+| 1 | Does Arcade have a native multi-tenancy model within a single project, or does tenant isolation require one project per tenant? | | |
+| 2 | If `tenant_id:user_id` is used as the user_id, does Arcade enforce any tenant-level policy or quota boundaries, or is it purely token isolation? | | |
+| 3 | Can per-tenant tool access policies (different tool lists per tenant) be managed via API? | | |
+| 4 | Can a new tenant be onboarded — token vault initialized, tool grants set, gateway access configured — entirely via API with no manual steps? | | |
+| 5 | What is the recommended architecture for serving tens of thousands of tenants from a single Arcade deployment? | | |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| Multi-tenant isolation | Tenant A's tokens and tool grants are provably inaccessible to Tenant B within a single deployment | | |
+| No per-tenant project | Tenant isolation does not require one Arcade project per tenant | | |
+| API-driven onboarding | A new tenant can be fully provisioned via API with no manual steps | | |
+| Per-tenant policy | Different tenants can have different tool allowlists managed programmatically | | |
+
+## Findings
+-
diff --git a/categories/cat2-delegated-authz/criteria-section-2.md b/categories/cat2-delegated-authz/criteria-section-2.md
new file mode 100644
index 0000000..8653ee0
--- /dev/null
+++ b/categories/cat2-delegated-authz/criteria-section-2.md
@@ -0,0 +1,49 @@
+# Category 2 — Delegated Authorization and Identity (weight 20)
+
+> The load-bearing category: every tool call executes as the calling user, using that user's own
+> credentials, and the agent code never sees the token. Verbatim criteria/gates from the criteria
+> Google Doc. Fill Score/Evidence locally; **the human pastes**. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Per-user OAuth token vault — tokens are stored and refreshed per user, per service, per scope. | | |
+| 2 | Tool calls execute as the calling user — not a shared service account or bot credential. | | |
+| 3 | Okta (OIDC/SAML) integration as the primary IDP for gateway access. | | |
+| 4 | Custom OAuth provider support — ability to register non-standard OAuth providers (Snowflake, Workday, TenantTalk via Okta). | | |
+| 5 | Token refresh is handled automatically without requiring user re-authentication on every call. | | |
+| 6 | The LLM and agent code never see raw tokens — token injection happens server-side in the Engine. | | |
+| 7 | Token vault is project-scoped — no cross-project token leakage. | | |
+| 8 | Admin consent — ability for an admin to pre-authorize a scope on behalf of a class of users. | | |
+| 9 | Admin-initiated token revocation — an admin can invalidate all vault tokens for a specific user directly in Arcade, without touching any downstream provider. Primary use case: employee offboarding or security incident response. | | |
+
+**Average:** ___ **Category score:** ___
+
+## Score anchors
+- **1** — Shared API keys or service accounts only; no per-user identity
+- **3** — Per-user OAuth works for prebuilt connectors; custom providers require undocumented manual steps; revocation requires going to each provider individually
+- **5** — Full per-user vault, Okta integration, custom OAuth providers documented and working, token refresh transparent, admin-initiated revocation works from one place
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | Call a tool as User A. Verify it executes with User A's credentials by checking the downstream system's own audit log (e.g., GitHub shows the call as User A, not a service account). | | |
+| 2 | Revoke User A's OAuth token in the provider. Verify the next tool call triggers a consent/re-auth flow rather than silently failing or falling back to a shared credential. | | |
+| 3 | Configure a custom OAuth provider (Snowflake or Workday). Complete a full per-user token flow end-to-end: authorize → vault stores token → tool call executes as that user. | | |
+| 4 | Configure TenantTalk authentication via Okta as a custom OAuth provider. Verify the Engine brokers the token correctly. | | |
+| 5 | Verify token refresh: let a token expire. Confirm the next call either refreshes transparently or returns a clear re-auth prompt. | | |
+| 6 | Admin-initiated revocation: as an admin, invalidate all vault tokens for User A in Arcade directly (no downstream provider action). Verify User A's next tool call fails or triggers re-auth, across all connected systems simultaneously. | | |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| Per-user execution | Tool calls provably execute as the calling user (verifiable in the downstream system's own logs) | | |
+| No shared credentials | No service account or shared token is used in any tool call path | | |
+| Okta integration | Gateway access works end-to-end through Okta OIDC/SAML | | |
+| Custom OAuth | At least one custom provider (Snowflake or Workday) configured and functional | | |
+| Token isolation | No user's token is accessible by, or executed as, another user | | |
+| Downstream revocation | Revoking a token at the provider level triggers re-auth on the next call — no silent fallback | | |
+| Admin-initiated revocation | An admin can invalidate all of a specific user's vault tokens in Arcade directly, taking effect across all connected systems without touching each provider individually | | |
+
+## Findings
+- Note (deployment): live POC upstream IdP is **Entra ID**, not Okta yet — score criterion 3 against that gap.
diff --git a/categories/cat3-access-policy/criteria-section-3.md b/categories/cat3-access-policy/criteria-section-3.md
new file mode 100644
index 0000000..c117e8c
--- /dev/null
+++ b/categories/cat3-access-policy/criteria-section-3.md
@@ -0,0 +1,44 @@
+# Category 3 — Tool-Level Access Control and Policy (weight 15)
+
+> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human
+> pastes**. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Tool-level allow-list per user — a user can only call tools explicitly granted to them; the gateway enforces this, not the client. | | |
+| 2 | Contextual Access rules — per-user tool visibility and invocation policy layered on top of the gateway allow-list. | | |
+| 3 | Input filtering — ability to block or rewrite tool inputs based on policy before execution reaches the server. | | |
+| 4 | Output redaction — ability to mask or strip sensitive fields from tool outputs before they reach the agent. | | |
+| 5 | Policy is enforced at the Engine, not the client — a malicious or compromised client cannot bypass it. | | |
+| 6 | All policy decisions (allow, block, redact) are logged. | | |
+| 7 | Per-user tool grants can be updated without restarting the gateway or any server. | | |
+| 8 | Gateway scopes map to Okta groups — access managed in Okta, not a separate system. | | |
+
+**Average:** ___ **Category score:** ___
+
+## Score anchors
+- **1** — Gateway-level tool list only; no per-user scoping or input/output policy
+- **3** — Per-user grants work; Contextual Access input/output rules require significant manual work
+- **5** — Full per-user policy, Contextual Access input/output rules, Okta-managed scopes, all decisions audited
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | Grant User A access to GitHub tools and User B access to Atlassian tools. Verify User A cannot invoke Atlassian tools even if they know the tool name. | | |
+| 2 | Write a Contextual Access rule that blocks inputs containing a specific pattern (e.g., a mock SSN). Send a matching input — verify it is blocked before execution and logged. | | |
+| 3 | Write a Contextual Access rule that redacts a field from tool outputs. Verify the field is absent from the agent's response. | | |
+| 4 | Update User A's tool grants (add a new tool). Verify the change takes effect without restarting anything. | | |
+| 5 | Confirm policy enforcement point: attempt to bypass Contextual Access by calling the server directly (bypassing the Engine). Confirm this is architecturally prevented or explicitly documented as a known boundary. | | |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| Tool isolation | Cross-user tool calls are rejected at the Engine regardless of client behavior | | |
+| Input policy | Blocked inputs are rejected before execution, not after | | |
+| Output policy | Redacted fields are absent from the agent's response | | |
+| Audit | Every policy decision (allow/block/redact) produces a retrievable log entry | | |
+| Dynamic grants | Tool grant updates take effect without service restart | | |
+
+## Findings
+-
diff --git a/categories/cat4-connectors/criteria-section-4.md b/categories/cat4-connectors/criteria-section-4.md
new file mode 100644
index 0000000..facd0d3
--- /dev/null
+++ b/categories/cat4-connectors/criteria-section-4.md
@@ -0,0 +1,56 @@
+# Category 4 — Connector Coverage and Custom Server Development (weight 10)
+
+> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human
+> pastes**. 1–5 scale; anchors at 1/3/5.
+
+## Scores
+| # | Criterion (verbatim) | Score (1–5) | Evidence / note |
+|---|---|---|---|
+| 1 | Prebuilt catalog covers required systems (GitHub, Salesforce, Atlassian/Jira). | | |
+| 2 | Python SDK (arcade-mcp) supports building custom servers with minimal boilerplate. | | |
+| 3 | Tool schema is auto-derived from Python type annotations — no manual schema authoring. | | |
+| 4 | Local development loop works without cloud infrastructure (stdio mode). | | |
+| 5 | Custom servers can be registered as self-hosted (HTTPS endpoint) and routed by the Engine. | | |
+| 6 | Custom OAuth provider registration — Engine brokers per-user tokens for custom systems. | | |
+| 7 | Custom servers can be versioned and updated without gateway downtime. | | |
+
+**Average:** ___ **Category score:** ___
+
+## Score anchors
+- **1** — No SDK; custom integration requires raw HTTP server and manual schema
+- **3** — SDK works for basic cases; custom OAuth is underdocumented; some systems blocked
+- **5** — SDK is productive, custom OAuth providers are documented and straightforward, all six systems have a working path
+
+## Coverage of required systems (verbatim)
+| System | Prebuilt? | Path |
+|---|---|---|
+| GitHub | Yes | Prebuilt (global catalog) |
+| Salesforce | Yes | Prebuilt (global catalog) |
+| Atlassian / Jira | Partial | Prebuilt; confirm Confluence coverage |
+| HubSpot | Yes | Prebuilt (global catalog) |
+| QuickBooks | No | Custom server + custom OAuth provider |
+| Sage | No | Custom server + custom OAuth provider |
+| Snowflake | No | Custom server + custom OAuth provider |
+| Workday | No | Custom server + custom OAuth provider |
+| TenantTalk | No (internal) | Custom server + Okta-backed OAuth |
+
+## Benchmark tests
+| # | Test (verbatim) | Result | Evidence |
+|---|---|---|---|
+| 1 | Build a minimal custom server for one internal API using the arcade-mcp SDK. Measure time from schema to first successful local tool call. Target: under 2 hours. | | |
+| 2 | Register the custom server as self-hosted (HTTPS endpoint). Verify Engine routing works and tool calls reach the server. | | |
+| 3 | Configure Snowflake (or equivalent) as a custom OAuth provider. Complete per-user token flow end-to-end. | | |
+| 4 | Verify tool schema is auto-derived from Python type annotations. Confirm no manual JSON schema authoring is required. | | |
+| 5 | Update a custom tool's implementation. Verify the change takes effect without restarting the gateway. | | |
+
+## Suggested pass/fail gates
+| Gate | Pass condition (verbatim) | Result | Evidence |
+|---|---|---|---|
+| SDK productivity | Custom server from scratch to first local tool call in under 2 hours | | |
+| Self-hosted registration | HTTPS endpoint registers and routes correctly through the Engine | | |
+| Custom OAuth | At least one non-standard OAuth provider configured end-to-end | | |
+| Schema derivation | Tool schema is auto-derived; no manual JSON schema authoring required | | |
+| All systems covered | A working path (prebuilt or custom) exists for all six required systems | | |
+
+## Findings
+-
diff --git a/categories/cat5-auditability/criteria-section-5.md b/categories/cat5-auditability/criteria-section-5.md
new file mode 100644
index 0000000..a33c0b2
--- /dev/null
+++ b/categories/cat5-auditability/criteria-section-5.md
@@ -0,0 +1,54 @@
+# Category 5 — Auditability and Observability (weight 12)
+
+> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human
+> pastes**. 1–5 scale; anchors at 1/3/5.
+
+**How tool execution logging works (verbatim, confirmed with Arcade, Jun 15):** Arcade's built-in
+audit log covers administrative operations only (gateway creation, server registration, API key
+management) — this is by design, not a gap. Tool execution observability is handled via
+OpenTelemetry (OTEL): when deploying the Arcade image to Kubernetes, OTEL can be enabled to ship
+telemetry to any observability collector (Datadog, ELK Stack, etc.). When self-hosted, no telemetry
+flows back to Arcade — all data stays in ServiceTitan's infrastructure. This is the path to satisfy
+InfoSec's execution audit requirement.
+
+**ServiceTitan reality (this deployment — see ../../LIVE-POC.md):** logs → ELK (Vector daemonset);
+**metrics → Grafana/Mimir** (Grafana Agent scrapes ServiceMonitors → remote_write to Mimir). The
The +engine emits OTLP metrics but they are **dropped** today — `arcade-otel-collector:4318` does not +resolve (no collector deployed). Remediation = deploy a collector + bridge it into Prometheus/Mimir. + +## Scores +| # | Criterion (verbatim) | Score (1–5) | Evidence / note | +|---|---|---|---| +| 1 | OTEL enabled on the self-hosted Arcade deployment — execution telemetry ships to ServiceTitan's observability stack (Datadog or ELK). | | | +| 2 | Every tool call produces a log record with: user, tool invoked, timestamp, outcome — queryable in Datadog or ELK. | | | +| 3 | Admin audit log — all configuration changes (gateways, servers, API keys, policies) are logged in Arcade. | | | +| 4 | Per-tool and per-user usage metrics (call counts, error rates, latency) visible in the observability stack. | | | +| 5 | Trace propagation — tool call traces joinable to agent and application traces via OTEL. | | | +| 6 | No telemetry data leaves ServiceTitan's infrastructure to Arcade when self-hosted. | | | + +**Average:** ___ **Category score:** ___ + +## Score anchors +- **1** — No OTEL support; no execution telemetry available outside Arcade's dashboard +- **3** — OTEL works but configuration is manual or underdocumented; trace propagation requires custom work +- **5** — OTEL is documented and easy to enable; full execution telemetry in Datadog/ELK; trace propagation works end-to-end + +## Benchmark tests +| # | Test (verbatim) | Result | Evidence | +|---|---|---|---| +| 1 | Enable OTEL on the self-hosted Arcade Kubernetes deployment. Make a tool call. Verify a record appears in Datadog (or ELK) with: user_id, tool name, timestamp, outcome. | | | +| 2 | Make an administrative change (update a gateway). Verify the change appears in Arcade's admin audit log. | | | +| 3 | Propagate a trace ID from an agent call through to the tool execution. Verify the trace is end-to-end visible in the observability stack. 
| | | +| 4 | Confirm no tool execution telemetry is transmitted to Arcade's own systems when running self-hosted. | | | + +## Suggested pass/fail gates +| Gate | Pass condition (verbatim) | Result | Evidence | +|---|---|---|---| +| OTEL integration | OTEL enabled on self-hosted deployment; execution telemetry flows to Datadog or ELK | | | +| Execution audit | Every tool call produces a retrievable record with user, tool, timestamp, outcome in ServiceTitan's observability stack | | | +| Admin audit | All Arcade configuration changes are logged in the admin audit log | | | +| Data residency | No tool execution telemetry transmitted to Arcade when self-hosted — confirmed | | | +| InfoSec sign-off | Dane Snyder confirms the OTEL-based execution audit satisfies the access audit requirement | | | + +## Findings +- diff --git a/categories/cat6-security/criteria-section-6.md b/categories/cat6-security/criteria-section-6.md new file mode 100644 index 0000000..617f68f --- /dev/null +++ b/categories/cat6-security/criteria-section-6.md @@ -0,0 +1,50 @@ +# Category 6 — Security and Compliance (weight 10) + +> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human +> pastes**. 1–5 scale; anchors at 1/3/5. + +## Scores +| # | Criterion (verbatim) | Score (1–5) | Evidence / note | +|---|---|---|---| +| 1 | PII masking or redaction at the gateway layer — without changes to tool code. | | | +| 2 | Input blocking — Contextual Access policy can block tool calls based on content. | | | +| 3 | MCPs can be scaled to less than human access. | | | +| 4 | Output redaction — sensitive fields removed from responses before reaching the agent. | | | +| 5 | Data processing agreement (DPA) and sub-processor disclosure in place. | | | +| 6 | SOC 2 / ISO 27001 certification (or equivalent) confirmed. | | | +| 7 | Data boundary acceptable to InfoSec — tool call payloads route through Arcade's Engine; execution stays in ServiceTitan's infrastructure. 
| | | +| 8 | Raw OAuth tokens are never exposed to the LLM, agent code, or logs. | | | +| 9 | Secrets management integration (Azure Key Vault or equivalent) for API key storage. | | | +| 10 | Potential for log forwarding for telemetry, alerting | | | +| 11 | Potential integration for DLP tooling if possible | | | +| 12 | Data boundary guardrails (able to block querying all records from a table) | | | + +**Average:** ___ **Category score:** ___ + +## Score anchors +- **1** — No policy enforcement; payloads flow unmodified; DPA and certifications unconfirmed +- **3** — Some policy controls exist; DPA in progress; compliance posture requires follow-up +- **5** — Full policy enforcement, DPA executed, compliant data boundary, tokens never exposed + +## Benchmark tests +| # | Test (verbatim) | Result | Evidence | +|---|---|---|---| +| 1 | Send a tool input containing a mock SSN. Verify it is redacted before reaching the tool function via a Contextual Access rule. | | | +| 2 | Send a tool output containing a mock API key string. Verify it is redacted before reaching the agent. | | | +| 3 | Attempt a tool call with an expired or revoked credential. Verify rejection with a clean error — no fallback to a shared credential. | | | +| 4 | Attempt to call a tool that has been restricted by the MCP gateway that the person usually can perform | | | +| 5 | Attempt to pull all records from an MCP integration, instead of focused data | | | +| 6 | Review the DPA and sub-processor list against ServiceTitan's data governance requirements. | | | +| 7 | Confirm in the Engine architecture that raw tokens never appear in logs, traces, or agent responses. 
| | | + +## Suggested pass/fail gates +| Gate | Pass condition (verbatim) | Result | Evidence | +|---|---|---|---| +| Data boundary | Tool call payloads through Arcade Engine + execution in ServiceTitan infrastructure — acceptable to InfoSec | | | +| No token exposure | Raw OAuth tokens are never visible in logs, traces, or agent responses | | | +| DPA | Data processing agreement is executed before the pilot ends | | | +| PII policy | At least one PII redaction rule works end-to-end | | | +| Compliance | SOC 2 or equivalent certification confirmed | | | + +## Findings +- diff --git a/categories/cat7-performance/criteria-section-7.md b/categories/cat7-performance/criteria-section-7.md new file mode 100644 index 0000000..2ffbd34 --- /dev/null +++ b/categories/cat7-performance/criteria-section-7.md @@ -0,0 +1,42 @@ +# Category 7 — Performance and Availability (weight 8) + +> Because every gateway-mediated tool call routes through the Arcade Engine — even when the custom +> server is self-hosted — Engine latency and availability are a floor on the entire agent stack. +> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human +> pastes**. 1–5 scale; anchors at 1/3/5. + +## Scores +| # | Criterion (verbatim) | Score (1–5) | Evidence / note | +|---|---|---|---| +| 1 | Engine-added latency per tool call is within acceptable bounds for interactive agent use. | | | +| 2 | Engine SLA — defined uptime guarantees with incident response process. | | | +| 3 | Failure behavior when Engine is unavailable: fail-closed with a clean, catchable error. | | | +| 4 | Self-hosted server HA — multi-replica, pod failure handling, no dropped calls on restart. | | | +| 5 | Multi-region failover design — documented and validated. | | | +| 6 | Engine geographic placement and round-trip latency from ServiceTitan's primary region. 
| | | + +**Average:** ___ **Category score:** ___ + +## Score anchors +- **1** — Engine SLA undocumented; failure behavior is a hang or silent failure; no HA guidance +- **3** — SLA documented; HA works with manual configuration; failure behavior is known but requires client-side handling +- **5** — SLA with incident response in writing; HA is the documented default; failure behavior is clean and observable + +## Benchmark tests +| # | Test (verbatim) | Result | Evidence | +|---|---|---|---| +| 1 | Make 100 tool calls through the Engine to a self-hosted server. Measure P50, P95, P99 round-trip latency. Compare against a direct server call (bypassing the Engine) to isolate Engine-added overhead. | | | +| 2 | Simulate Engine unavailability (block the Engine endpoint). Confirm tool calls fail with a clean, catchable error — not a hang or silent failure. | | | +| 3 | Deploy the custom server with multiple replicas. Kill one pod. Confirm tool calls continue without dropped requests. | | | +| 4 | Confirm Engine SLA documentation: uptime percentage, response time commitment, and P0 escalation path. 
| | | + +## Suggested pass/fail gates +| Gate | Pass condition (verbatim) | Result | Evidence | +|---|---|---|---| +| Engine overhead | P95 Engine-added latency is under 500ms for standard (non-streaming) tool calls | | | +| SLA documented | Engine uptime SLA and incident response process confirmed in writing | | | +| HA | Self-hosted server survives pod failure; no tool calls dropped during pod restart | | | +| Fail behavior | Engine outage produces a clean, catchable error to the agent — no hangs | | | + +## Findings +- diff --git a/categories/cat8-deployment-ops/criteria-section-8.md b/categories/cat8-deployment-ops/criteria-section-8.md new file mode 100644 index 0000000..b6a5da3 --- /dev/null +++ b/categories/cat8-deployment-ops/criteria-section-8.md @@ -0,0 +1,41 @@ +# Category 8 — Deployment and Operations (weight 7) + +> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human +> pastes**. 1–5 scale; anchors at 1/3/5. + +## Scores +| # | Criterion (verbatim) | Score (1–5) | Evidence / note | +|---|---|---|---| +| 1 | Helm chart available and documented for self-hosted server deployment in Kubernetes. | | | +| 2 | Zero-downtime configuration updates — gateway and policy changes do not interrupt in-flight calls. | | | +| 3 | GitOps-compatible — gateway, server, and policy configuration is expressible as code. | | | +| 4 | Upgrade and rollback process is documented and tested. | | | +| 5 | Runbooks for common failure scenarios. | | | +| 6 | Vendor support model during the pilot: dedicated solutions engineer, response SLA. | | | +| 7 | P0/P1 escalation path after the pilot, in production. 
| | | + +**Average:** ___ **Category score:** ___ + +## Score anchors +- **1** — No Helm chart; manual deployment only; no dedicated support +- **3** — Helm chart works with gaps; zero-downtime config updates unverified; support exists but is not dedicated +- **5** — Helm-native, GitOps-compatible, zero-downtime config, dedicated SE, documented escalation path + +## Benchmark tests +| # | Test (verbatim) | Result | Evidence | +|---|---|---|---| +| 1 | Deploy a self-hosted custom server to a Kubernetes namespace via Helm chart. Measure time from clean namespace to first successful tool call. Target: under 1 day. | | | +| 2 | Update a gateway configuration (add a tool). Verify in-flight calls are not dropped. | | | +| 3 | Simulate a configuration rollback. Verify the rollback completes cleanly and the prior configuration is restored. | | | +| 4 | Stage a K8s namespace and confirm the Helm deployment matches the architecture recommended for production. | | | + +## Suggested pass/fail gates +| Gate | Pass condition (verbatim) | Result | Evidence | +|---|---|---|---| +| Helm deployment | Full Kubernetes deployment via Helm chart in under 1 day | | | +| Config safety | Gateway configuration changes are zero-downtime | | | +| Rollback | Prior configuration can be restored cleanly | | | +| Dedicated SE | A dedicated solutions engineer is available and responsive during the pilot | | | + +## Findings +- Note: a live deployment already exists (`k8s-backstage-v2/apps/arcade`, chart 1.8.8, Flux/GitOps) — a head start for this category's evidence. diff --git a/categories/cat9-devex/criteria-section-9.md b/categories/cat9-devex/criteria-section-9.md new file mode 100644 index 0000000..6e30dff --- /dev/null +++ b/categories/cat9-devex/criteria-section-9.md @@ -0,0 +1,43 @@ +# Category 9 — Developer Experience (weight 5) + +> Verbatim criteria/gates from the criteria Google Doc. Fill Score/Evidence locally; **the human +> pastes**. 1–5 scale; anchors at 1/3/5. 
+ +## Scores +| # | Criterion (verbatim) | Score (1–5) | Evidence / note | +|---|---|---|---| +| 1 | Local development loop is productive — stdio mode enables tool development without cloud infrastructure (Stage 1: code runs locally, MCP client spawns the server directly). | | | +| 2 | Tunnel-based development loop is supported — a developer can expose their locally running MCP server through a tunnel (Cloudflare, ngrok) and register it against a shared dev Arcade instance to exercise the full request chain (gateway → Engine → tunnel → local server) without deploying to Kubernetes. This is the primary development pattern for custom server authors. | | | +| 3 | A shared dev Arcade instance is available for ServiceTitan developers to register tunnel endpoints against — no need to provision a personal Arcade org for every developer. | | | +| 4 | Cloudflare tunnel (or equivalent) is the standardized proxy mechanism — documented, with a permanent named-tunnel option so the registered server URL does not change on every session restart. | | | +| 5 | SDK documentation is complete, accurate, and has working examples. | | | +| 6 | Error messages are actionable — auth failures, misconfigurations, and policy blocks identify the root cause. | | | +| 7 | MCP client integration requires no custom adapters or wrappers. | | | +| 8 | Gateway and server management are automatable via API. 
| | | + +**Average:** ___ **Category score:** ___ + +## Score anchors +- **1** — No tunnel support; local development requires full Kubernetes deployment to test the gateway chain +- **3** — Tunnel registration works but is underdocumented; no shared dev instance; engineers figure it out individually +- **5** — Tunnel loop is documented with a standard Cloudflare recipe; shared dev instance available; engineers are productive without platform hand-holding + +## Benchmark tests +| # | Test (verbatim) | Result | Evidence | +|---|---|---|---| +| 1 | **Stage 1 — local stdio:** Time an engineer from SDK install to first successful local tool call (stdio mode, no Arcade infrastructure). Target: under 2 hours. | | | +| 2 | **Stage 2 — tunnel registration (the key test):** Developer runs a local MCP server in HTTP mode, opens a Cloudflare tunnel, registers the tunnel URL as a self-hosted server in the dev Arcade instance, and makes a tool call that flows: Claude Code → gateway → Engine → Cloudflare tunnel → local server. Verify the full chain works including auth and Contextual Access. Measure time from working local server to first successful gateway-mediated call. Target: under 1 day. | | | +| 3 | Verify Cloudflare named tunnel (permanent hostname) — confirm the registered URL survives session restarts without re-editing the server registration. | | | +| 4 | Intentionally misconfigure an OAuth provider. Measure how quickly the error message identifies the root cause. | | | +| 5 | Integrate with Claude Code from scratch — time from gateway URL to working tool invocation in Claude Code. Target: under 30 minutes. 
| | | + +## Suggested pass/fail gates +| Gate | Pass condition (verbatim) | Result | Evidence | +|---|---|---|---| +| Tunnel loop | Full gateway → Engine → Cloudflare tunnel → local server chain works end-to-end | | | +| Permanent tunnel URL | Named tunnel hostname persists across session restarts without re-registration | | | +| Shared dev instance | ServiceTitan developers can register local servers against a shared dev Arcade org without individual account provisioning | | | +| Time to first call | Engineer reaches a working gateway-mediated tool call in under 1 day from scratch | | | + +## Findings +-