Live POC — frozen facts

Self-hosted on backstage-wus2-v4 via Flux; vendor Helm chart 1.8.8 (apps/arcade/ in k8s-backstage-v2, origin/master). Run the live-state check (GROUND-RULES) before trusting any of this — it ages.

Deployment

  • Endpoints: api.arcade.st.dev (MCP/engine), coordinator.arcade.st.dev, dashboard.arcade.st.dev, experience.arcade.st.dev. Gateway URLs: https://api.arcade.st.dev/mcp/{slug}.
  • Upstream IdP: ServiceTitan Entra ID app registration (iac PR #4012). Not Okta yet (Okta is the criteria doc's eventual target — note the gap when scoring identity / cat 2).
  • Chat/playground: disabled (features.chatEnabled: false); engine LLM + embeddings routed through in-cluster LiteLLM, not api.openai.com.
  • Datastores: bundled in-cluster Postgres + Redis, default passwords, ephemeral.
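
For scripts that need these hostnames, the endpoint layout can be captured in a small helper. This is a minimal Python sketch: the hostnames and the /mcp/{slug} pattern come from the list above, while the names ENDPOINTS and gateway_url are hypothetical.

```python
from urllib.parse import quote

# Hostnames as listed above; the dict keys are illustrative labels only.
ENDPOINTS = {
    "api": "api.arcade.st.dev",  # MCP/engine
    "coordinator": "coordinator.arcade.st.dev",
    "dashboard": "dashboard.arcade.st.dev",
    "experience": "experience.arcade.st.dev",
}


def gateway_url(slug: str) -> str:
    """Return the MCP gateway URL for a gateway slug."""
    if not slug:
        raise ValueError("slug must be non-empty")
    # Percent-encode the slug so it always stays a single path segment.
    return f"https://{ENDPOINTS['api']}/mcp/{quote(slug, safe='')}"
```

Keeping the hosts in one place makes it easier to re-check them against the live state before a run.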

Observability (cat 5 — confirmed)

  • OTEL (evidence, Kibana 2026-06-18): the arcade-engine pod emits OTLP metrics by default, but the target collector hostname does not resolve. The same error repeats roughly every 60s: failed to upload metrics: Post "http://arcade-otel-collector:4318/v1/metrics": dial tcp: lookup arcade-otel-collector ... no such host. Instrumentation is ON, but the collector Service arcade-otel-collector is not deployed/resolvable in the arcade ns, so every metric is dropped. (Chart lists the collector image but the HelmRelease never enabled/named it.)
  • Logs → ELK: Vector daemonset scrapes pod stdout/stderr cluster-wide → ELK. Engine logs already reach Kibana (that's how the above error is visible). Visible fields incl. Tracing.TraceId, ContextInfo.CorrelationId, NetCore.RequestPath → engine is a .NET app emitting structured logs with trace/correlation IDs (relevant to trace propagation pre-OTEL).
  • Metrics pipeline (metrics ≠ logs): metrics do not go to ELK. Metrics → Grafana, via the Grafana Agent Operator (MetricsInstance main, ns monitoring) which scrapes all ServiceMonitor/PodMonitor CRs cluster-wide (any namespace; excludes ServiceMonitors labeled grafana-agent: external) and remoteWrites to Grafana Mimir (http://mimir-nginx.mimir.observability-wus2/api/v1/push, tenant header X-Scope-OrgID: k8s-backstage-v4). Convention: an app exposes a Prometheus /metrics port + a ServiceMonitor (label release: prometheus-operator) → auto-scraped → Grafana.
  • Two cat-5 gaps for metrics: (a) no collector — arcade-otel-collector:4318 doesn't resolve; (b) no bridge from OTLP-push into the pull-based Prometheus/Mimir pipeline. Fix = an OTEL Collector in arcade ns that ingests the engine's OTLP and EITHER exposes a prometheus exporter /metrics scraped via a ServiceMonitor, OR prometheusremotewrite straight to the Mimir push URL+tenant above. (Chart may bundle a disabled collector subchart — verify first.)
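
The two fix options can be sketched as a collector config. The Python below builds the config as a plain dict and prints it as JSON; this is an illustration under assumptions, not the chart's actual values. The receiver and exporter names (otlp, prometheus, prometheusremotewrite) are standard OpenTelemetry Collector components, with prometheusremotewrite shipping in the contrib distribution. The function name collector_config, the mode labels, and port 8889 are hypothetical.

```python
import json

# Values taken from the notes above.
MIMIR_PUSH_URL = "http://mimir-nginx.mimir.observability-wus2/api/v1/push"
MIMIR_TENANT = "k8s-backstage-v4"


def collector_config(mode: str) -> dict:
    """Build an OTEL Collector config: OTLP/HTTP in, Prometheus-compatible out.

    mode="scrape": expose /metrics for a ServiceMonitor to scrape.
    mode="push":   remote-write straight to Mimir with the tenant header.
    """
    if mode == "scrape":
        # Port 8889 is an assumed choice, not something read from the chart.
        exporters = {"prometheus": {"endpoint": "0.0.0.0:8889"}}
    elif mode == "push":
        exporters = {
            "prometheusremotewrite": {
                "endpoint": MIMIR_PUSH_URL,
                "headers": {"X-Scope-OrgID": MIMIR_TENANT},
            }
        }
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        # Listen on 4318, the port the engine already posts to.
        "receivers": {"otlp": {"protocols": {"http": {"endpoint": "0.0.0.0:4318"}}}},
        "exporters": exporters,
        "service": {
            "pipelines": {
                "metrics": {"receivers": ["otlp"], "exporters": list(exporters)}
            }
        },
    }


if __name__ == "__main__":
    # Printed as JSON, which YAML parsers generally accept as-is.
    print(json.dumps(collector_config("push"), indent=2))
```

The scrape variant still needs a Service plus a ServiceMonitor carrying the release: prometheus-operator label. The push variant skips the scrape path, but the collector then has to carry the tenant header itself.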

Live fixtures (filled in Phase 1)

  • Project: TBD (Task 1.1)
  • API key: label / last-4 only — never the key (Task 1.1)
  • Headless auth header convention (confirmed via Arcade docs 2026-06-18): MCP gateway calls use Authorization: Bearer <ARCADE_API_KEY> + Arcade-User-ID: <user_id>. The user_id is any stable string (an email works); this mode is for clients without browser auth / token refresh. Self-hosted gateway URL: https://api.arcade.st.dev/mcp/<slug>. (Source: docs.arcade.dev call-tool-client.)
  • Baseline gateway: zeb-gateway-test — auth mode Arcade Headers (API key + Arcade-User-ID); 7 main-catalog tools (Slack ×2, GoogleDocs ×4, Brightdata ×1). See config/targets.yaml. Confirmed live 2026-06-18: tool list is gateway-wide (same for all Arcade-User-IDs).
  • Shared reference server: arcade-eval-ref (dashboard id military-healthy-posted-rats), toolkit ArcadeEvalRef, tools Echo/Add/Whoami — self-hosted at lib/mcp_server, registered via a Cloudflare quick tunnel (ephemeral URL in results/tunnel_url.txt; re-register on restart). whoami exec-proof verified (A→user-a, B→user-b).
  • whoami identity field: server reads context.user_id (arcade_mcp_server Context), populated by the Engine from the calling user (Arcade-User-ID / auth sub).
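
The headless header convention can be sketched with the standard library alone. The header names come from the fixture notes above; the function names headless_headers and build_mcp_request are hypothetical, and the JSON payload is left to the caller because the exact MCP handshake is normally handled by an MCP client library. The request is built but never sent.

```python
import json
import urllib.request


def headless_headers(api_key: str, user_id: str) -> dict:
    """Headers for headless gateway calls: API key plus a stable user id."""
    if not api_key or not user_id:
        raise ValueError("api_key and user_id are both required")
    return {
        "Authorization": f"Bearer {api_key}",
        "Arcade-User-ID": user_id,
    }


def build_mcp_request(
    url: str, api_key: str, user_id: str, payload: dict
) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST to a gateway URL."""
    headers = headless_headers(api_key, user_id)
    headers["Content-Type"] = "application/json"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

In practice the key would be read from the environment (for example os.environ["ARCADE_API_KEY"]) rather than written into a script, in line with the rule of recording only the label and last 4 characters.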

Known behaviors (findings)

  • arcade deploy is cloud-only. Local validation of the server passes (health, tool + secret discovery; our ref server: 3 tools, 0 secrets), but the command then POSTs the deployment to api.arcade.dev (PROD_ENGINE_HOST), ignoring the coordinator set via arcade login --host, so against our self-hosted instance it returns 401. deploy exposes no --host. Implication: self-hosted custom servers must be registered manually (run the server + dashboard "Add Server", type Arcade, URL + worker secret), using the tunnel pattern for local dev or an in-cluster deploy for prod, not arcade deploy. Relevant to cat-4 (SDK/deploy), cat-8 (deployment), cat-9 (DX).
  • Per-user Google OAuth: consent URL works, but token does NOT vault for the headless Arcade-User-ID (verified 2026-06-18, cat-2). tools.authorize("GoogleDocs_CreateDocumentFromText", user_id) stays status=pending for both a real id (ztaylor@servicetitan.com) and a fresh id (gdoc-test-user) even after completing the exact consent link in-browser (Google approval 200 → coordinator callback 303 → dashboard 200, no visible error). Provider google-docs-provider is configured (mints consent URLs; scopes userinfo.email/profile + drive.file; redirect via coordinator.arcade.st.dev). Root cause TBD: (A) token exchange/storage fails server-side (Google client secret / redirect-uri misconfig), or (B) browser consent in a dashboard-logged-in session rebinds the token to the dashboard/account identity, not the headless user_id. Next: check arcade-coordinator logs for the callback/token-exchange. Blocks headless per-user execution for OAuth tools. (cat-1 whoami exec-proof uses no external OAuth, so it's unaffected.)
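
To reproduce the stuck-pending observation in a repeatable way, a bounded poll is enough. The sketch below is generic Python and deliberately avoids naming any SDK call: fetch_status is a hypothetical stand-in for whatever returns the current authorization status (for example, a small wrapper around the tools.authorize call noted above), and wait_for_status is likewise an illustrative name.

```python
import time
from typing import Callable


def wait_for_status(
    fetch_status: Callable[[], str],
    want: str = "completed",
    attempts: int = 10,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll fetch_status until it returns `want`.

    Returns True on success and False once the attempts are exhausted,
    so a stuck "pending" status shows up as a clean False instead of a hang.
    """
    for attempt in range(attempts):
        if fetch_status() == want:
            return True
        # Skip the sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Running it once per user id after completing the consent link gives a yes/no result to log alongside the arcade-coordinator callback entries.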