arcade-eval
Evaluation workspace for Arcade.dev as a self-hosted, governed MCP gateway for ServiceTitan — measured against the internal MCP Gateway Benchmark Criteria (10 weighted categories, hard gates). Multiple lanes (one per category) run in parallel; this repo is the shared, tool-agnostic source of truth.
The question: can Arcade let AI agents act as the calling user (no shared credentials, auditable, per-user tool scoping) with operational characteristics we can run in production?
Start here (any tool — Claude Code, Cursor, a human)
1. git pull
2. Read STATUS.md → LIVE-POC.md → GROUND-RULES.md (in that order).
3. Run the live-state check (see GROUND-RULES) before trusting the live instance.
4. Go to your categories/catN-*/ and work only inside it.
Per-tool entry pointers (all say the same thing, no duplicated content):
- Claude Code: the in-repo skill arcade-gateway-eval (auto-discovered here).
- Cursor: .cursor/rules/arcade-eval.mdc (auto-attaches) + .cursor/mcp.json.example.
- Any agent tool: AGENTS.md.
The request chain (what we're testing)
MCP client → Gateway (curated tool list) → Engine (auth/vault/policy/audit) → Server → External API
Live endpoints: gateway https://api.arcade.st.dev/mcp/{slug}, dashboard https://dashboard.arcade.st.dev. See LIVE-POC.md for the full deployment snapshot.
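As a small illustration of the {slug} placeholder in the gateway endpoint, the sketch below expands the template into a concrete URL. The helper name gateway_url and the example slug are hypothetical and are not taken from lib/; real slugs come from the deployment described in LIVE-POC.md.

```python
from urllib.parse import quote

# Template copied from the "Live endpoints" line above.
GATEWAY_TEMPLATE = "https://api.arcade.st.dev/mcp/{slug}"


def gateway_url(slug: str) -> str:
    """Return the gateway MCP endpoint for one gateway slug.

    Hypothetical helper: rejects empty slugs and slugs containing '/',
    then percent-encodes anything that is not URL-safe.
    """
    if not slug or "/" in slug:
        raise ValueError(f"invalid gateway slug: {slug!r}")
    return GATEWAY_TEMPLATE.format(slug=quote(slug, safe=""))


if __name__ == "__main__":
    # "cat1-demo" is a made-up slug used only for illustration.
    print(gateway_url("cat1-demo"))
```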
How lanes work (parallel-session safety)
- Each category is a lane owning categories/catN-*/ + its own STATUS.md section.
- Shared files (config/targets.yaml, lib/, top-level docs) are append-mostly; git pull --rebase before every push. See the ownership table in GROUND-RULES.md.
- The harness lib/ is plain Python (uv), so it is tool-agnostic.
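The lane rule above can be checked mechanically before a push. The following is a minimal sketch, not part of the repo's harness: the function name outside_lane, the lane directory names, and the choice to allow STATUS.md as the only shared file are all assumptions made for illustration. The authoritative ownership table is in GROUND-RULES.md.

```python
from pathlib import PurePosixPath


def outside_lane(changed_paths, lane_dir, allowed_shared=("STATUS.md",)):
    """Return the changed paths that fall outside a lane's subtree.

    changed_paths:  repo-relative paths, e.g. from `git diff --name-only`.
    lane_dir:       the lane's directory, e.g. "categories/cat5-audit/"
                    (hypothetical name; the real suffix is per category).
    allowed_shared: shared files a lane may still touch (assumption: only
                    STATUS.md, where each lane owns its own section).
    """
    lane = PurePosixPath(lane_dir)
    offenders = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        in_lane = lane in path.parents
        if not in_lane and raw not in allowed_shared:
            offenders.append(raw)
    return offenders


if __name__ == "__main__":
    changed = [
        "categories/cat5-audit/tests/test_log.py",
        "STATUS.md",
        "lib/mcp_client.py",
    ]
    print(outside_lane(changed, "categories/cat5-audit/"))
```

A non-empty result does not necessarily mean the change is wrong, since shared files are append-mostly rather than frozen; it only flags edits worth a second look before pushing.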
Starting a new category lane
1. git clone … && cd arcade-eval; cp config/.env.example .env, fill in creds.
2. Invoke the bootstrap skill / read the Start-here docs above; run the live-state check.
3. Open your categories/catN-*/: criteria-section-N.md is already pre-seeded with your verbatim criteria/gates/anchors. Copy NOTES.md + tests/ in from categories/_TEMPLATE/.
4. Claim your section in STATUS.md. Work only inside your subtree; git pull --rebase before push.
Categories → reviewer clusters (from the criteria doc)
| Cluster | Question | Categories | Reviewers |
|---|---|---|---|
| Platform | Does it work and stay up? | 1 Functional · 7 Performance · 8 Deployment/Ops | Nawaz / SRE |
| Security | Can we control and see it? | 2 Delegated authz · 3 Access policy · 5 Auditability · 6 Security | Dane / Chandu |
| Adopt/Operate | Can we adopt and operate it? | 4 Connectors · 9 Developer experience · 10 Product fit | Paul / Chandu |
ztaylor owns categories 1, 5, 9 (one per cluster).
Weights
1=8 · 2=20 · 3=15 · 4=10 · 5=12 · 6=10 · 7=8 · 8=7 · 9=5 · 10=5 (total 100)
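To see how the weights combine, the sketch below sums them per reviewer cluster and computes a weighted total. The weights and cluster membership are copied from this README; the 0.0 to 1.0 per-category score scale and the function names are assumptions, and hard gates are deliberately left out because their exact effect is defined in the criteria doc, not here.

```python
# Category number -> weight, as listed under "Weights".
WEIGHTS = {1: 8, 2: 20, 3: 15, 4: 10, 5: 12, 6: 10, 7: 8, 8: 7, 9: 5, 10: 5}

# Cluster -> categories, as listed in the reviewer-cluster table.
CLUSTERS = {
    "Platform": (1, 7, 8),
    "Security": (2, 3, 5, 6),
    "Adopt/Operate": (4, 9, 10),
}


def cluster_weights():
    """Total weight carried by each reviewer cluster."""
    return {name: sum(WEIGHTS[c] for c in cats) for name, cats in CLUSTERS.items()}


def weighted_total(scores):
    """Weighted score out of 100.

    scores maps category number -> fraction in [0.0, 1.0] (assumed scale);
    categories missing from `scores` count as 0.
    """
    for cat, value in scores.items():
        if cat not in WEIGHTS:
            raise KeyError(f"unknown category: {cat}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for category {cat} out of range: {value}")
    return sum(WEIGHTS[cat] * scores.get(cat, 0.0) for cat in WEIGHTS)


if __name__ == "__main__":
    print(sum(WEIGHTS.values()))  # 100
    print(cluster_weights())      # {'Platform': 23, 'Security': 57, 'Adopt/Operate': 20}
```

Under these weights the Security cluster carries 57 of the 100 points, more than the other two clusters combined.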
Layout
config/ targets.yaml · .env.example
lib/ mcp_client.py · mcp_server/ (shared reference server) · helpers
categories/ _TEMPLATE/ + cat1..cat10 (each: criteria-section-N.md [+ tests/ NOTES.md when active])
results/ git-ignored run artifacts