arcade-eval/docs/superpowers/specs/2026-06-22-deploy-mcp-to-k8s-design.md
ztaylor e78795bf4f docs: update deploy design for public-ingress pivot + publicOnlyTransport finding
Records that the in-cluster Service DNS could not be used for a dashboard-registered
worker (engine publicOnlyTransport SSRF guard blocks internal addresses), the pivot to
st-app chart + public ingress at arcade-eval-ref.st.dev (CNAME -> k8s-backstage.st.dev),
and the verified end-to-end whoami result.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 12:44:55 -04:00


Deploy arcade-eval reference MCP server to backstage k8s

Date: 2026-06-22 Status: DONE — deployed and verified end-to-end.

Goal

Replace the ephemeral cloudflared quick tunnel (used to register the arcade-eval-ref server with the self-hosted Arcade engine) with a permanent deployment on backstage-wus2-v4, so the engine reaches the server over a stable URL instead of a trycloudflare.com URL that dies on restart.

Relevant eval categories: cat-4 (custom server dev), cat-8 (deployment), cat-9 (DX).

Key finding that shaped the final design

The first attempt registered the in-cluster Service DNS (http://arcade-eval-ref.arcade-eval-ref.svc.cluster.local:8000) as a dashboard worker. Health went green but 0 tools loaded. Engine logs showed:

Failed to get worker tools: Get ".../worker/tools":
  dial tcp 10.0.192.27:8000: publicOnlyTransport: blocked connection to internal address

The Arcade engine has an SSRF guard (publicOnlyTransport) that blocks dashboard-registered worker URIs resolving to internal/private (RFC1918) addresses. Only workers declared in the engine config file (e.g. the bundled arcade-worker-main at http://arcade-worker-main:8001) may use internal URIs. Health checks aren't guarded (hence green), but the authenticated /worker/tools discovery is. The cloudflared tunnel worked only because it was a public URL.

⇒ A dashboard-registered in-cluster worker must be exposed on a public URL. (The worker secret was a red herring: the guard blocks the connection before any authentication is attempted.)
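The kind of check that produces the log line above can be sketched in a few lines of Python. This is an illustration only, not the engine's publicOnlyTransport implementation; the function names `is_internal` and `resolves_to_internal` are hypothetical, and the exact set of address ranges the engine blocks is an assumption.

```python
import ipaddress
import socket


def is_internal(address: str) -> bool:
    """Return True if the IP literal is private, loopback, or link-local."""
    ip = ipaddress.ip_address(address)
    return ip.is_private or ip.is_loopback or ip.is_link_local


def resolves_to_internal(host: str, port: int) -> bool:
    """Resolve host and report whether any resulting address is internal.

    Hypothetical helper: an SSRF guard of this style checks the resolved
    address at dial time, so a Service DNS name is judged by its ClusterIP.
    """
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return any(is_internal(info[4][0]) for info in infos)


# The address from the engine log is inside 10.0.0.0/8 (RFC1918).
print(is_internal("10.0.192.27"))  # True
```

Under this model the Service DNS name fails because it resolves to a cluster-internal address, while a public hostname passes, which matches the behavior seen with the cloudflared tunnel.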

Architecture / data flow (final)

Claude Code ──▶ gateway zeb-gateway-test ──▶ Arcade engine ──HTTPS /worker/*──▶
   https://arcade-eval-ref.st.dev  (Cloudflare CNAME → k8s-backstage.st.dev → nginx ingress)
      └─▶ Service → Deployment: python:3.12 running mcp_server.server over HTTP :8000
          (echo / add / whoami).  /mcp also served; /worker/* auth = ARCADE_WORKER_SECRET.

Runtime facts (verified by introspecting arcade-mcp-server 1.17)

  • app.run() honors env overrides via _get_configuration_overrides(): ARCADE_SERVER_TRANSPORT=http, ARCADE_SERVER_HOST=0.0.0.0, ARCADE_SERVER_PORT=8000 — so the hardcoded 127.0.0.1 in server.py is overridden at runtime (no code change).
  • ARCADE_WORKER_SECRET enables worker routes at /worker/*; the engine authenticates with an HS256 JWT (aud=worker, ver=1) signed with that secret. MCP is served at /mcp.
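The override behavior in the first bullet can be pictured as "environment wins over hardcoded defaults". The sketch below is not the body of `_get_configuration_overrides()`; `resolve_settings` is a hypothetical helper, and the `stdio` default transport is an assumption used only to show a value being replaced. Only the three environment variable names and the 127.0.0.1 default come from the notes above.

```python
import os

# Environment variable names as listed in the runtime facts.
ENV_NAMES = {
    "transport": "ARCADE_SERVER_TRANSPORT",
    "host": "ARCADE_SERVER_HOST",
    "port": "ARCADE_SERVER_PORT",
}


def resolve_settings(defaults: dict, environ=os.environ) -> dict:
    """Return defaults with any non-empty environment override applied."""
    settings = dict(defaults)
    for key, env_name in ENV_NAMES.items():
        value = environ.get(env_name)
        if value:
            # Ports arrive as strings in the environment; normalize to int.
            settings[key] = int(value) if key == "port" else value
    return settings


# Hardcoded values standing in for server.py ("stdio" is an assumed default).
hardcoded = {"transport": "stdio", "host": "127.0.0.1", "port": 8000}
container_env = {
    "ARCADE_SERVER_TRANSPORT": "http",
    "ARCADE_SERVER_HOST": "0.0.0.0",
    "ARCADE_SERVER_PORT": "8000",
}
print(resolve_settings(hardcoded, container_env))
```

With the container environment set this way, the server binds to 0.0.0.0:8000 over HTTP even though server.py still names 127.0.0.1.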

Components (three repos)

1. arcade-eval — image

  • lib/mcp_server/Dockerfile: python:3.12-slim, pip install ., HTTP transport via env, non-root, port 8000.
  • .github/workflows/build-push-acr.yml — pushes servicetitandev.azurecr.io/arcade-eval-ref:1.0.<run_number> (secrets ACR_DEV_USERNAME/ACR_DEV_PASSWORD). Adapted from servicetitan/mem0.

2. k8s-backstage-v2: apps/mcp/arcade-eval-ref/

  • namespace.yaml — ns arcade-eval-ref.
  • server.yaml: st-app HelmRelease (chart 2.0.72) with image pinned to 1.0.1, service.internalPort: 8000, ingress.enabled with host arcade-eval-ref.st.dev and class nginx, oAuth.enabled: false (no SSO wall over /worker/* or /mcp), worker secret via envFrom from the SealedSecret, probes off. TLS = ingress default *.st.dev wildcard cert.
  • sealedsecret.yaml: arcade-eval-ref-worker-secret (key ARCADE_WORKER_SECRET), strict scope, sealed with the backstage-wus2-v4 sealed-secrets cert.

3. iac-terraform-workspaces — DNS

  • CNAME arcade-eval-ref.st.dev → k8s-backstage.st.dev (st.dev zone), mirroring the anvil/alerts pattern.

Registration (dashboard)

Add/repoint the worker: URI https://arcade-eval-ref.st.dev, Secret = the worker-secret plaintext (git-ignored at results/arcade-eval-ref-worker-secret.txt). The engine then fetches /worker/tools over the public URL → tools load → add to zeb-gateway-test.

Verified

  • https://arcade-eval-ref.st.dev/worker/health → 200 (valid *.st.dev LE cert); /worker/tools with a correct worker JWT → 200, tools Echo/Add/Whoami.
  • Through the gateway: ArcadeEvalRef_Whoami() → the caller's Entra sub (GvgRofe5…), proving per-user execution across the full client → gateway → engine → public URL → in-cluster pod chain.

Alternative considered (not taken)

Declare the server as a static worker in the engine config (tools.directors[].workers, like arcade-worker-main). That path allows internal URIs and avoids public exposure, but it requires editing the vendor Helm release (apps/arcade) and loses the dashboard per-project workflow. Public ingress was chosen as the lower-touch option.