# arcade-eval

Evaluation workspace for **Arcade.dev** as a self-hosted, governed **MCP gateway** for ServiceTitan, measured against the internal *MCP Gateway Benchmark Criteria* (10 weighted categories, hard gates). Multiple lanes (one per category) run **in parallel**; this repo is the shared, tool-agnostic source of truth.

The question: can Arcade let AI agents act **as the calling user** (no shared credentials, auditable, per-user tool scoping) with operational characteristics we can run in production?

## Start here (any tool: Claude Code, Cursor, a human)

1. `git pull`
2. Read **`STATUS.md`** → **`LIVE-POC.md`** → **`GROUND-RULES.md`** (in that order).
3. Run the **live-state check** (see GROUND-RULES) before trusting the live instance.
4. Go to your `categories/catN-*/` and work only inside it.

Per-tool entry pointers (all say the same thing, no duplicated content):

- **Claude Code:** the in-repo skill `arcade-gateway-eval` (auto-discovered here).
- **Cursor:** `.cursor/rules/arcade-eval.mdc` (auto-attaches) + `.cursor/mcp.json.example`.
- **Any agent tool:** `AGENTS.md`.

## The request chain (what we're testing)

```
MCP client → Gateway (curated tool list) → Engine (auth/vault/policy/audit) → Server → External API
```

Live endpoints: gateway `https://api.arcade.st.dev/mcp/{slug}`, dashboard `https://dashboard.arcade.st.dev`. See `LIVE-POC.md` for the full deployment snapshot.

## How lanes work (parallel-session safety)

- Each category is a **lane** owning `categories/catN-*/` + its own `STATUS.md` section.
- Shared files (`config/targets.yaml`, `lib/`, top-level docs) are append-mostly; `git pull --rebase` before every push. See the ownership table in `GROUND-RULES.md`.
- The harness `lib/` is plain Python (`uv`) and tool-agnostic.

## Starting a new category lane

1. `git clone … && cd arcade-eval`; `cp config/.env.example .env`, fill in creds.
2. Invoke the bootstrap skill / read the Start-here docs above; run the live-state check.
3. Open your `categories/catN-*/`: `criteria-section-N.md` is **already pre-seeded** with your verbatim criteria/gates/anchors. Copy `categories/_TEMPLATE/`'s `NOTES.md` + `tests/` in.
4. Claim your section in `STATUS.md`. Work only inside your subtree; `git pull --rebase` before push.

## Categories → reviewer clusters (from the criteria doc)

| Cluster | Question | Categories | Reviewers |
|---|---|---|---|
| Platform | Does it work and stay up? | 1 Functional · 7 Performance · 8 Deployment/Ops | Nawaz / SRE |
| Security | Can we control and see it? | 2 Delegated authz · 3 Access policy · 5 Auditability · 6 Security | Dane / Chandu |
| Adopt/Operate | Can we adopt and operate it? | 4 Connectors · 9 Developer experience · 10 Product fit | Paul / Chandu |

**ztaylor owns categories 1, 5, 9** (one per cluster).

## Weights

1=8 · 2=20 · 3=15 · 4=10 · 5=12 · 6=10 · 7=8 · 8=7 · 9=5 · 10=5 (total 100)

## Layout

```
config/      targets.yaml · .env.example
lib/         mcp_client.py · mcp_server/ (shared reference server) · helpers
categories/  _TEMPLATE/ + cat1..cat10 (each: criteria-section-N.md [+ tests/ NOTES.md when active])
results/     git-ignored run artifacts
```
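
As a quick sanity check on the Weights line and the reviewer-cluster table above, here is a minimal Python sketch that confirms the weights sum to 100, that every category belongs to exactly one cluster, and how much weight each cluster carries. The names `WEIGHTS`, `CLUSTERS`, and `cluster_totals` are illustrative assumptions, not part of the repo's `lib/`; the numbers are copied from this README.

```python
# Category weights as listed under "Weights" (category number -> weight).
WEIGHTS = {1: 8, 2: 20, 3: 15, 4: 10, 5: 12, 6: 10, 7: 8, 8: 7, 9: 5, 10: 5}

# Reviewer clusters as listed in the clusters table (cluster -> category numbers).
CLUSTERS = {
    "Platform": [1, 7, 8],
    "Security": [2, 3, 5, 6],
    "Adopt/Operate": [4, 9, 10],
}


def cluster_totals(weights=WEIGHTS, clusters=CLUSTERS):
    """Return the summed weight of each cluster (hypothetical helper)."""
    return {name: sum(weights[c] for c in cats) for name, cats in clusters.items()}


if __name__ == "__main__":
    # The README states the weights total 100.
    assert sum(WEIGHTS.values()) == 100
    # Every category should appear in exactly one cluster.
    covered = sorted(c for cats in CLUSTERS.values() for c in cats)
    assert covered == sorted(WEIGHTS)
    for name, total in cluster_totals().items():
        print(f"{name}: {total}")
```

With the numbers above, the Security cluster accounts for 57 of the 100 points, Platform for 23, and Adopt/Operate for 20.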