docs: update deploy design for public-ingress pivot + publicOnlyTransport finding

Records that the in-cluster Service DNS could not be used for a dashboard-registered worker (engine publicOnlyTransport SSRF guard blocks internal addresses), the pivot to st-app chart + public ingress at arcade-eval-ref.st.dev (CNAME -> k8s-backstage.st.dev), and the verified end-to-end whoami result.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 12:44:55 -04:00
parent 9009237a14
commit e78795bf4f
1 changed files with 67 additions and 45 deletions
@@ -1,73 +1,95 @@
 # Deploy arcade-eval reference MCP server to backstage k8s

 **Date:** 2026-06-22
-**Status:** Approved — implementing
+**Status:** DONE — deployed and verified end-to-end.

 ## Goal

 Replace the ephemeral cloudflared **quick tunnel** (used to register the
 `arcade-eval-ref` server with the self-hosted Arcade engine) with a permanent
-in-cluster deployment on `backstage-wus2-v4`. The engine then reaches the server
-over stable cluster DNS instead of a `trycloudflare.com` URL that dies on restart.
+deployment on `backstage-wus2-v4`, so the engine reaches the server over a stable
+URL instead of a `trycloudflare.com` URL that dies on restart.

 Relevant eval categories: cat-4 (custom server dev), cat-8 (deployment), cat-9 (DX).

-## Architecture / data flow
+## Key finding that shaped the final design
+
+The first attempt registered the in-cluster **Service DNS**
+(`http://arcade-eval-ref.arcade-eval-ref.svc.cluster.local:8000`) as a dashboard
+worker. Health went green but **0 tools loaded**. Engine logs showed:

 ```
-Arcade engine (ns: arcade)  ──HTTP /worker/*──▶  Service arcade-eval-ref (ns: arcade-eval-ref)
-   registered as type "Arcade"                       └─▶ Deployment: python:3.12 running
-   URI = http://arcade-eval-ref.arcade-eval-ref            mcp_server.server over HTTP :8000
-        .svc.cluster.local:8000                           (echo / add / whoami)
-   Secret = ARCADE_WORKER_SECRET  ◀── same value ──▶  env ARCADE_WORKER_SECRET (SealedSecret)
+Failed to get worker tools: Get ".../worker/tools":
+  dial tcp 10.0.192.27:8000: publicOnlyTransport: blocked connection to internal address
+```
+
+**The Arcade engine has an SSRF guard (`publicOnlyTransport`) that blocks
+dashboard-registered worker URIs resolving to internal/private (RFC1918) addresses.**
+Only workers declared in the **engine config file** (e.g. the bundled `arcade-worker-main`
+at `http://arcade-worker-main:8001`) may use internal URIs. Health checks aren't guarded
+(hence green), but the authenticated `/worker/tools` discovery is. The cloudflared tunnel
+worked only because it was a *public* URL.
+
+⇒ A dashboard-registered in-cluster worker **must be exposed on a public URL**. (The
+worker secret was a red herring: the guard blocks the connection before any auth happens.)
+
+## Architecture / data flow (final)
+
+```
+Claude Code ──▶ gateway zeb-gateway-test ──▶ Arcade engine ──HTTPS /worker/*──▶
+   https://arcade-eval-ref.st.dev  (Cloudflare CNAME → k8s-backstage.st.dev → nginx ingress)
+      └─▶ Service → Deployment: python:3.12 running mcp_server.server over HTTP :8000
+          (echo / add / whoami).  /mcp also served; /worker/* auth = ARCADE_WORKER_SECRET.
 ```

 ### Runtime facts (verified by introspecting `arcade-mcp-server` 1.17)

 - `app.run()` honors env overrides via `_get_configuration_overrides()`:
-  `ARCADE_SERVER_TRANSPORT=http`, `ARCADE_SERVER_HOST=0.0.0.0`, `ARCADE_SERVER_PORT=8000`.
-  So the hardcoded `127.0.0.1` in `server.py`'s `__main__` is overridden at runtime —
-  **no `server.py` change needed.**
-- `ARCADE_WORKER_SECRET` (settings alias `arcade.server_secret`) → worker routes mount at
-  `/worker/*` (what the engine calls); MCP also served at `/mcp`. FastAPI app, port 8000.
+  `ARCADE_SERVER_TRANSPORT=http`, `ARCADE_SERVER_HOST=0.0.0.0`, `ARCADE_SERVER_PORT=8000`
+  — so the hardcoded `127.0.0.1` in `server.py` is overridden at runtime (no code change).
+- `ARCADE_WORKER_SECRET` enables worker routes at `/worker/*`; the engine authenticates with
+  an HS256 JWT (`aud=worker`, `ver=1`) signed with that secret. MCP is served at `/mcp`.

-## Components
+## Components (three repos)

-### 1. `arcade-eval` repo (branch off `main`)
+### 1. `arcade-eval` — image
+- `lib/mcp_server/Dockerfile` — `python:3.12-slim`, `pip install .`, HTTP transport via env,
+  non-root, port 8000.
+- `.github/workflows/build-push-acr.yml` — pushes
+  `servicetitandev.azurecr.io/arcade-eval-ref:1.0.<run_number>` (secrets
+  `ACR_DEV_USERNAME`/`ACR_DEV_PASSWORD`). Adapted from `servicetitan/mem0`.

-- **`lib/mcp_server/Dockerfile`** — `python:3.12-slim`, `pip install .` (pulls
-  `arcade-mcp-server` + `httpx`), `ENV` transport/host/port, non-root user, `EXPOSE 8000`,
-  `CMD ["python","-m","mcp_server.server"]`.
-- **`.github/workflows/build-push-acr.yml`** — adapted from `servicetitan/mem0`. Pushes
-  `servicetitandev.azurecr.io/arcade-eval-ref:1.0.<run_number>`. Login via repo secrets
-  `ACR_DEV_USERNAME` / `ACR_DEV_PASSWORD`. Triggers: `workflow_dispatch` + push to `main`
-  filtered to `lib/mcp_server/**`.
+### 2. `k8s-backstage-v2` — `apps/mcp/arcade-eval-ref/`
+- `namespace.yaml` — ns `arcade-eval-ref`.
+- `server.yaml` — **st-app HelmRelease** (chart 2.0.72): `image` pinned to `1.0.1`,
+  `service.internalPort: 8000`, **`ingress.enabled` host `arcade-eval-ref.st.dev`
+  class `nginx`, `oAuth.enabled: false`** (no SSO wall over `/worker/*` or `/mcp`),
+  worker secret via `envFrom` from the SealedSecret, probes off. TLS = ingress default
+  `*.st.dev` wildcard cert.
+- `sealedsecret.yaml` — `arcade-eval-ref-worker-secret` (key `ARCADE_WORKER_SECRET`),
+  strict scope, sealed with the backstage-wus2-v4 sealed-secrets cert.

-### 2. `k8s-backstage-v2` repo (branch off `master`)
+### 3. `iac-terraform-workspaces` — DNS
+- CNAME `arcade-eval-ref.st.dev` → `k8s-backstage.st.dev` (st.dev zone), mirroring the
+  `anvil`/`alerts` pattern.

-New dir **`apps/mcp/arcade-eval-ref/`** (Flux's `apps` Kustomization recursively applies
-everything under `apps/`; no per-dir `kustomization.yaml`):
+## Registration (dashboard)

-- **`namespace.yaml`** — ns `arcade-eval-ref` (labels per repo convention, `team: infra`).
-- **`server.yaml`** — plain `Deployment` (image
-  `servicetitandev.azurecr.io/arcade-eval-ref:1.0.1`; no imagePullSecret — the cluster has
-  native ACR pull, confirmed by other `apps/mcp/*` servers; `ARCADE_WORKER_SECRET` from
-  secretRef; TCP probes; modest resources) + `Service` (ClusterIP, 8000→8000).
-- **`sealedsecret.yaml`** — `arcade-eval-ref-worker-secret`, key `ARCADE_WORKER_SECRET`,
-  **strict** scope, sealed offline with `kubeseal --cert <backstage-wus2-v4 public cert>`.
+Add/repoint the worker: URI `https://arcade-eval-ref.st.dev`, Secret = the worker-secret
+plaintext (git-ignored at `results/arcade-eval-ref-worker-secret.txt`). The engine then
+fetches `/worker/tools` over the public URL → tools load → add to `zeb-gateway-test`.

-## Manual steps after merge
+## Verified

-1. Add `ACR_DEV_USERNAME` / `ACR_DEV_PASSWORD` repo secrets to `arcade-eval`.
-2. `workflow_dispatch` (or merge to `main`) to build/push the image — first run = tag `1.0.1`.
-3. Merge the k8s branch; Flux applies the namespace/secret/deployment.
-4. Dashboard → **Add Server → Arcade**, URI
-   `http://arcade-eval-ref.arcade-eval-ref.svc.cluster.local:8000`, Secret = the worker secret
-   plaintext (stored git-ignored at `results/arcade-eval-ref-worker-secret.txt`); re-point the
-   `zeb-gateway-test` gateway's ref tools at it and drop the tunnel. Delete the plaintext file
-   afterward.
+- `https://arcade-eval-ref.st.dev/worker/health` → 200 (valid `*.st.dev` Let's Encrypt cert);
+  `/worker/tools` with a correct worker JWT → 200, tools `Echo/Add/Whoami`.
+- Through the gateway: `ArcadeEvalRef_Whoami()` → the caller's Entra `sub`
+  (`GvgRofe5…`), proving per-user execution across the full
+  client → gateway → engine → public URL → in-cluster pod chain.

-## Out of scope (YAGNI)
+## Alternative considered (not taken)

-No ingress (internal-only ClusterIP), no HPA, no PodMonitor/metrics (separate cat-5 work),
-single replica.
+Declare the server as a static worker in the **engine config** (`tools.directors[].workers`,
+like `arcade-worker-main`). That path allows internal URIs and avoids public exposure, but it
+requires editing the vendor Helm release (`apps/arcade`) and loses the dashboard's per-project
+workflow. Public ingress was chosen as the lower-touch option.
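
The `publicOnlyTransport` finding in the design doc above can be illustrated with a small pre-flight check: resolve a candidate worker URI and flag it if any resolved address is private, loopback, or link-local. This is a sketch only, not the engine's implementation; the function names `is_internal_address`, `resolve_host`, and `would_be_blocked` are hypothetical, and the exact set of ranges the real guard rejects is an assumption.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_internal_address(ip: str) -> bool:
    """Return True for private (RFC 1918 etc.), loopback, or link-local IPs.

    Assumption: the real guard's range list is not documented here; this is
    an approximation for pre-flight checking only.
    """
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local


def resolve_host(uri: str) -> list[str]:
    """Resolve the URI's hostname to a sorted list of unique IP strings."""
    parts = urlsplit(uri)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = socket.getaddrinfo(parts.hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def would_be_blocked(uri: str, resolver=resolve_host) -> bool:
    """Guess whether an SSRF guard of this kind would reject the URI."""
    return any(is_internal_address(ip) for ip in resolver(uri))
```

Run from a pod inside the cluster, `would_be_blocked` on the Service DNS URI would be expected to return `True`, because that name resolved to `10.0.192.27` in the engine log. The `resolver` parameter exists so the check can be exercised without DNS access.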
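
To repeat the `/worker/tools` check from the Verified section by hand, a token of the shape described under Runtime facts (HS256, `aud=worker`, `ver=1`) can be minted with the standard library alone. This is a minimal sketch under stated assumptions: the `exp` claim, the encoding of `ver` as the string `"1"`, and the helper names `make_worker_jwt` and `verify_worker_jwt` are illustrative and not confirmed by the source.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> str:
    """HMAC-SHA256 over the 'header.claims' string, base64url-encoded."""
    mac = hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256)
    return _b64url(mac.digest())


def make_worker_jwt(secret: str, ttl_seconds: int = 60, now=None) -> str:
    """Build an HS256 JWT with aud=worker and ver=1.

    Assumptions: 'ver' is sent as the string "1" and an 'exp' claim is
    included; the source only states aud=worker, ver=1.
    """
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"aud": "worker", "ver": "1", "exp": issued + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_sign(signing_input, secret)}"


def verify_worker_jwt(token: str, secret: str) -> dict:
    """Check the signature and return the claims; raise ValueError on mismatch."""
    header_b64, claims_b64, sig_b64 = token.split(".")
    expected = _sign(f"{header_b64}.{claims_b64}", secret)
    if not hmac.compare_digest(expected, sig_b64):
        raise ValueError("signature mismatch")
    return json.loads(_b64url_decode(claims_b64))
```

The secret passed in would be the plaintext stored at `results/arcade-eval-ref-worker-secret.txt`. How the engine transmits the token to the worker is not stated in the doc, so no request code is shown.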