Live POC — frozen facts
Self-hosted on backstage-wus2-v4 via Flux; vendor Helm chart 1.8.8
(apps/arcade/ in k8s-backstage-v2, origin/master). Run the live-state check
(GROUND-RULES) before trusting any of this — it ages.
Deployment
- Endpoints: `api.arcade.st.dev` (MCP/engine), `coordinator.`, `dashboard.`, `experience.arcade.st.dev`. Gateway URLs: `https://api.arcade.st.dev/mcp/{slug}`.
- Upstream IdP: ServiceTitan Entra ID app registration (iac PR #4012). Not Okta yet (Okta is the criteria doc's eventual target; note the gap when scoring identity / cat 2).
- Chat/playground: disabled (`features.chatEnabled: false`); engine LLM + embeddings routed through in-cluster LiteLLM, not api.openai.com.
- Datastores: bundled in-cluster Postgres + Redis, default passwords, ephemeral.
Observability (cat 5 — confirmed)
- OTEL (evidence, Kibana 2026-06-18): the `arcade-engine` pod emits OTLP metrics by default but the target collector does not resolve. The same error repeats about every 60s: `failed to upload metrics: Post "http://arcade-otel-collector:4318/v1/metrics": dial tcp: lookup arcade-otel-collector ... no such host`. Instrumentation is ON; the collector Service `arcade-otel-collector` is not deployed/resolvable in the `arcade` ns; every metric is dropped. (Chart lists the collector image but the HelmRelease never enabled/named it.)
- Logs → ELK: Vector daemonset scrapes pod stdout/stderr cluster-wide → ELK. Engine logs already reach Kibana (that's how the above error is visible). Visible fields incl. `Tracing.TraceId`, `ContextInfo.CorrelationId`, `NetCore.RequestPath` → engine is a .NET app emitting structured logs with trace/correlation IDs (relevant to trace propagation pre-OTEL).
- Metrics pipeline (metrics ≠ logs): metrics do not go to ELK. Metrics → Grafana, via the Grafana Agent Operator (`MetricsInstance` `main`, ns `monitoring`) which scrapes all `ServiceMonitor`/`PodMonitor` CRs cluster-wide (any namespace; excludes ServiceMonitors labeled `grafana-agent: external`) and `remoteWrite`s to Grafana Mimir (`http://mimir-nginx.mimir.observability-wus2/api/v1/push`, tenant header `X-Scope-OrgID: k8s-backstage-v4`). Convention: an app exposes a Prometheus `/metrics` port + a `ServiceMonitor` (label `release: prometheus-operator`) → auto-scraped → Grafana.
- Two cat-5 gaps for metrics: (a) no collector: `arcade-otel-collector:4318` doesn't resolve; (b) no bridge from OTLP-push into the pull-based Prometheus/Mimir pipeline. Fix = an OTEL Collector in `arcade` ns that ingests the engine's OTLP and EITHER exposes a `prometheus` exporter `/metrics` scraped via a `ServiceMonitor`, OR `prometheusremotewrite` straight to the Mimir push URL+tenant above. (Chart may bundle a disabled collector subchart; verify first.)
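The two bridge options in the fix above can be sketched as collector configs. This is a minimal illustration, not the chart's actual values: the function name `collector_config`, the mode names, and the scrape port `8889` are assumptions; the Mimir URL and tenant header are the ones recorded above. The `otlp` receiver and the `prometheus` / `prometheusremotewrite` exporters are standard OpenTelemetry Collector components (the exporters ship in the contrib distribution).

```python
import json

# Values recorded in the metrics-pipeline notes above.
MIMIR_PUSH_URL = "http://mimir-nginx.mimir.observability-wus2/api/v1/push"
MIMIR_TENANT = "k8s-backstage-v4"


def collector_config(mode: str) -> dict:
    """Build an OTEL Collector config for one of the two bridge options.

    mode="scrape": expose a Prometheus /metrics endpoint for a ServiceMonitor.
    mode="push":   remote-write straight to Mimir with the tenant header.
    """
    if mode == "scrape":
        # Port 8889 is an assumption; it must match the ServiceMonitor's port.
        exporters = {"prometheus": {"endpoint": "0.0.0.0:8889"}}
    elif mode == "push":
        exporters = {
            "prometheusremotewrite": {
                "endpoint": MIMIR_PUSH_URL,
                "headers": {"X-Scope-OrgID": MIMIR_TENANT},
            }
        }
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    return {
        # Listen where the engine already pushes: OTLP over HTTP on 4318.
        "receivers": {"otlp": {"protocols": {"http": {"endpoint": "0.0.0.0:4318"}}}},
        "exporters": exporters,
        "service": {
            "pipelines": {
                "metrics": {"receivers": ["otlp"], "exporters": list(exporters)}
            }
        },
    }


if __name__ == "__main__":
    # JSON is a subset of YAML, so the output can be adapted into a collector config.
    print(json.dumps(collector_config("push"), indent=2))
```

The scrape option keeps Arcade on the cluster's existing `ServiceMonitor` convention; the push option skips the scrape hop but bypasses the Grafana Agent. Either way the collector Service would need to be named `arcade-otel-collector` (or the engine's endpoint changed) for the current lookup error to clear.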
Live fixtures (filled in Phase 1)
- Project: TBD (Task 1.1)
- API key: label / last-4 only — never the key (Task 1.1)
- Headless auth header convention (confirmed via Arcade docs 2026-06-18): MCP gateway calls use `Authorization: Bearer <ARCADE_API_KEY>` + `Arcade-User-ID: <user_id>`. The user_id is any stable string (an email works); this mode is for clients without browser auth / token refresh. Self-hosted gateway URL: `https://api.arcade.st.dev/mcp/<slug>`. (Source: docs.arcade.dev call-tool-client.)
- Baseline gateway: `zeb-gateway-test`, auth mode Arcade Headers (API key + `Arcade-User-ID`); 7 main-catalog tools (Slack ×2, GoogleDocs ×4, Brightdata ×1). See `config/targets.yaml`. Confirmed live 2026-06-18: tool list is gateway-wide (same for all `Arcade-User-ID`s).
- Shared reference server: `arcade-eval-ref` (dashboard id `military-healthy-posted-rats`), toolkit `ArcadeEvalRef`, tools Echo/Add/Whoami; self-hosted at `lib/mcp_server`, registered via a Cloudflare quick tunnel (ephemeral URL in `results/tunnel_url.txt`; re-register on restart). whoami exec-proof verified (A→user-a, B→user-b). `whoami` identity field: server reads `context.user_id` (`arcade_mcp_server` `Context`), populated by the Engine from the calling user (`Arcade-User-ID` / auth `sub`).
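The headless header convention above can be sketched as a small helper. The function names (`gateway_url`, `headless_headers`, `key_label`) are hypothetical, and treating `zeb-gateway-test` as the URL slug is an assumption; the header names and base URL are the ones recorded above. Nothing here sends a request.

```python
import os

# Self-hosted gateway base, as recorded above.
GATEWAY_BASE = "https://api.arcade.st.dev/mcp"


def gateway_url(slug: str) -> str:
    """Return the MCP gateway URL for a gateway slug."""
    return f"{GATEWAY_BASE}/{slug}"


def headless_headers(api_key: str, user_id: str) -> dict[str, str]:
    """Build the two headers used in Arcade Headers (headless) mode."""
    if not api_key or not user_id:
        raise ValueError("both api_key and user_id are required")
    return {
        "Authorization": f"Bearer {api_key}",
        # Any stable string works; an email matches the dashboard identity.
        "Arcade-User-ID": user_id,
    }


def key_label(api_key: str) -> str:
    """Last-4 label for notes and logs; never record the key itself."""
    return f"...{api_key[-4:]}"


if __name__ == "__main__":
    # The fallback value is a placeholder so the sketch runs without a real key.
    key = os.environ.get("ARCADE_API_KEY", "example-key-0000")
    headers = headless_headers(key, "user-a")
    # Print only non-secret details.
    print(gateway_url("zeb-gateway-test"), sorted(headers), key_label(key))
```

Keeping `key_label` next to the header builder makes it harder to log the full key by accident when recording fixtures.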
Known behaviors (findings)
- `arcade deploy` is cloud-only. It validates the server locally fine (health, tool + secret discovery; our ref server: 3 tools, 0 secrets), but POSTs the deployment to `api.arcade.dev` (`PROD_ENGINE_HOST`), ignoring the `arcade login --host` coordinator, so against our self-hosted instance it returns 401. `deploy` exposes no `--host`. Implication: self-hosted custom servers must be registered (run the server + dashboard "Add Server", type Arcade, URL + worker secret), using the tunnel pattern for local dev or an in-cluster deploy for prod, not `arcade deploy`. Relevant to cat-4 (SDK/deploy), cat-8 (deployment), cat-9 (DX).
- Per-user Google OAuth: two distinct issues, both cat-2 (the load-bearing category):
  - Google provider redirect-URI / secret mismatch (RESOLVED 2026-06-22 by user). Initially the consent URL was minted but no token vaulted (`tools.authorize(...)` stayed `pending`). Cause: the Google client's Authorized redirect URI / client secret didn't match the Arcade `google-docs-provider` connection (Arcade re-mints a new connection id → new redirect URI on reconfigure). Fixed by matching the redirect URI + re-pasting the secret in both consoles.
  - Identity-namespace mismatch blocks consent binding under Entra User Source (OPEN, important). With the gateway in User Source (Entra OIDC) mode, a Claude Code session resolves to the opaque Entra `sub` (`ArcadeEvalRef_Whoami` → `GvgRofe5xGzPoeS0w__hSMmBY1JkU7F6pR4yLKOP-Qk`). When the user completes the downstream Google consent in a browser signed into the Arcade dashboard as `ztaylor@servicetitan.com`, Arcade's callback refuses to bind: "Your code provided the user ID GvgRofe5… but the currently signed-in Arcade account is ztaylor@servicetitan.com." Correct safety guardrail (no cross-user token grants), but it means the gateway User Source keys user_id on the raw `sub`, while the dashboard/coordinator login resolves the same Entra person to `email`, so agent identity ≠ consent-completer identity. Likely fix: configure the Entra User Source to map user_id to the `email` / `preferred_username` claim (so `whoami` = `ztaylor@servicetitan.com`, matching the dashboard). Until aligned, downstream OAuth consent can't complete for a User-Source agent session. This is a key cat-2 / identity-mapping finding and also bears on cat-10 (what string the vault is keyed on for multi-tenancy). Headless Arcade-Headers mode is unaffected (you pass the email directly as `Arcade-User-ID`, which matches).
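The identity-namespace mismatch can be modeled in a few lines. This is an illustrative model only, not Arcade's implementation: `resolve_user_id` and `can_bind_consent` are hypothetical names, and the exact-match comparison is an assumption inferred from the callback error quoted above.

```python
def resolve_user_id(claims: dict, claim_order: tuple = ("sub",)) -> str:
    """Pick the user_id from OIDC claims, trying claim names in order."""
    for name in claim_order:
        value = claims.get(name)
        if value:
            return value
    # Fall back to `sub`, which an OIDC ID token always carries.
    return claims["sub"]


def can_bind_consent(agent_user_id: str, dashboard_user_id: str) -> bool:
    """Model of the guardrail: bind a token only when both identities match."""
    return agent_user_id == dashboard_user_id


# Claims for the same Entra person, using the values observed above.
claims = {
    "sub": "GvgRofe5xGzPoeS0w__hSMmBY1JkU7F6pR4yLKOP-Qk",
    "email": "ztaylor@servicetitan.com",
}
dashboard_identity = claims["email"]  # the dashboard login resolves to email

# Current behavior: the gateway User Source keys on the raw `sub`.
current = resolve_user_id(claims)
# Proposed mapping: prefer email, then preferred_username.
proposed = resolve_user_id(claims, ("email", "preferred_username"))

if __name__ == "__main__":
    print("current mapping binds:", can_bind_consent(current, dashboard_identity))
    print("proposed mapping binds:", can_bind_consent(proposed, dashboard_identity))
```

Under this model the current mapping fails the match and the proposed one passes, which is the behavior the likely fix aims for. Whether the Entra User Source actually exposes such a claim mapping still needs to be verified against the live instance.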