
Zeb Taylor 9009237a14 deploy: containerize arcade-eval-ref MCP server + ACR build/push action (#4)

Replace the cloudflared quick-tunnel dev pattern with a permanent in-cluster
deployment so the self-hosted Arcade engine reaches the echo/add/whoami reference
server over stable cluster DNS.

- lib/mcp_server/Dockerfile: python:3.12-slim, pip install ., HTTP transport via
  ARCADE_SERVER_{TRANSPORT,HOST,PORT} env overrides (no server.py change needed),
  non-root user, port 8000.
- .github/workflows/build-push-acr.yml: build + push
  servicetitandev.azurecr.io/arcade-eval-ref:1.0.<run_number>. Adapted from
  servicetitan/mem0; needs repo secrets ACR_DEV_USERNAME / ACR_DEV_PASSWORD.
- docs/superpowers/specs: design record.

K8s manifests live in k8s-backstage-v2 apps/mcp/arcade-eval-ref/ (separate branch).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-22 11:29:23 -04:00

.claude/skills/arcade-gateway-eval

feat: in-repo arcade-gateway-eval bootstrap skill

2026-06-18 10:07:47 -04:00

.cursor

docs: record confirmed headless auth headers (Authorization + Arcade-User-ID)

2026-06-18 10:13:15 -04:00

.github/workflows

deploy: containerize arcade-eval-ref MCP server + ACR build/push action (#4)

2026-06-22 11:29:23 -04:00

.vscode

cat1: FINALIZE scorecard (draft 4/5); STATUS + cat-5 NOTES ready for fresh-session handoff

2026-06-22 09:55:01 -04:00

README.md

arcade-eval

Evaluation workspace for Arcade.dev as a self-hosted, governed MCP gateway for ServiceTitan — measured against the internal MCP Gateway Benchmark Criteria (10 weighted categories, hard gates). Multiple lanes (one per category) run in parallel; this repo is the shared, tool-agnostic source of truth.

The question: can Arcade let AI agents act as the calling user (no shared credentials, auditable, per-user tool scoping) with operational characteristics we can run in production?

Start here (any tool — Claude Code, Cursor, a human)

git pull
Read STATUS.md → LIVE-POC.md → GROUND-RULES.md (in that order).
Run the live-state check (see GROUND-RULES) before trusting the live instance.
Go to your categories/catN-*/ and work only inside it.

Per-tool entry pointers (all say the same thing, no duplicated content):

Claude Code: the in-repo skill arcade-gateway-eval (auto-discovered here).
Cursor: .cursor/rules/arcade-eval.mdc (auto-attaches) + .cursor/mcp.json.example.
Any agent tool: AGENTS.md.

The request chain (what we're testing)

MCP client → Gateway (curated tool list) → Engine (auth/vault/policy/audit) → Server → External API

Live endpoints: gateway https://api.arcade.st.dev/mcp/{slug}, dashboard https://dashboard.arcade.st.dev. See LIVE-POC.md for the full deployment snapshot.
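A minimal sketch of how a headless client might address a gateway, assuming the URL pattern above. The two header names are the ones this repo records as confirmed (Authorization + Arcade-User-ID); the `Bearer` scheme, the helper names, and the slug `example-slug` are assumptions for illustration and are not taken from lib/mcp_client.py.

```python
from urllib.parse import quote

# Base of the gateway URL pattern https://api.arcade.st.dev/mcp/{slug}
GATEWAY_BASE = "https://api.arcade.st.dev/mcp"


def gateway_url(slug: str) -> str:
    """Build the MCP gateway URL for one curated gateway slug."""
    if not slug or "/" in slug:
        raise ValueError("slug must be a single non-empty path segment")
    return f"{GATEWAY_BASE}/{quote(slug, safe='')}"


def headless_headers(api_key: str, user_id: str) -> dict[str, str]:
    """Headers for a headless (non-browser) call made on behalf of one user.

    Assumption: the Authorization value uses the Bearer scheme. Only the
    header names themselves are recorded in this repo.
    """
    return {
        "Authorization": f"Bearer {api_key}",
        "Arcade-User-ID": user_id,
    }


if __name__ == "__main__":
    # "example-slug" and the credentials below are placeholders.
    print(gateway_url("example-slug"))
    print(headless_headers("<api-key>", "<user-id>"))
```

Keeping the user identity in its own header is what lets the Engine apply per-user scoping and audit to each call; check LIVE-POC.md before pointing this at the live instance.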

How lanes work (parallel-session safety)

Each category is a lane owning categories/catN-*/ + its own STATUS.md section.
Shared files (config/targets.yaml, lib/, top-level docs) are append-mostly; git pull --rebase before every push. See the ownership table in GROUND-RULES.md.
The harness lib/ is plain Python (uv) — tool-agnostic.

Starting a new category lane

git clone … && cd arcade-eval; cp config/.env.example .env, fill in creds.
Invoke the bootstrap skill / read the Start-here docs above; run the live-state check.
Open your categories/catN-*/. Its criteria-section-N.md is already pre-seeded with your verbatim criteria/gates/anchors. Copy NOTES.md and tests/ from categories/_TEMPLATE/ into it.
Claim your section in STATUS.md. Work only inside your subtree; git pull --rebase before push.

Categories → reviewer clusters (from the criteria doc)

| Cluster | Question | Categories | Reviewers |
| --- | --- | --- | --- |
| Platform | Does it work and stay up? | 1 Functional · 7 Performance · 8 Deployment/Ops | Nawaz / SRE |
| Security | Can we control and see it? | 2 Delegated authz · 3 Access policy · 5 Auditability · 6 Security | Dane / Chandu |
| Adopt/Operate | Can we adopt and operate it? | 4 Connectors · 9 Developer experience · 10 Product fit | Paul / Chandu |

ztaylor owns categories 1, 5, 9 (one per cluster).

Weights

1=8 · 2=20 · 3=15 · 4=10 · 5=12 · 6=10 · 7=8 · 8=7 · 9=5 · 10=5 (total 100)
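As a rough illustration of how these weights could roll up, the sketch below sums them per reviewer cluster and computes a weighted total. The weights and cluster membership come from this README; the 0 to 5 per-category score scale and the function names are assumptions, and hard gates are deliberately not modeled.

```python
# Category weights from this README (they sum to 100).
WEIGHTS = {1: 8, 2: 20, 3: 15, 4: 10, 5: 12, 6: 10, 7: 8, 8: 7, 9: 5, 10: 5}

# Cluster membership from the reviewer-cluster table.
CLUSTERS = {
    "Platform": (1, 7, 8),
    "Security": (2, 3, 5, 6),
    "Adopt/Operate": (4, 9, 10),
}


def cluster_weights() -> dict[str, int]:
    """Total weight carried by each reviewer cluster."""
    return {name: sum(WEIGHTS[c] for c in cats) for name, cats in CLUSTERS.items()}


def weighted_total(scores: dict[int, float], max_score: float = 5.0) -> float:
    """Weighted score out of 100.

    Assumption: each category is scored from 0 to max_score. Hard gates
    (pass/fail criteria) are outside the scope of this sketch.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for categories: {sorted(missing)}")
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS) / max_score


if __name__ == "__main__":
    assert sum(WEIGHTS.values()) == 100
    print(cluster_weights())
    print(weighted_total({c: 4 for c in WEIGHTS}))
```

With these numbers the clusters carry 23 (Platform), 57 (Security), and 20 (Adopt/Operate) of the 100 points, so the Security categories dominate the total.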

Layout

config/      targets.yaml · .env.example
lib/         mcp_client.py · mcp_server/ (shared reference server) · helpers
categories/  _TEMPLATE/ + cat1..cat10 (each: criteria-section-N.md [+ tests/ NOTES.md when active])
results/     git-ignored run artifacts