# Live POC — frozen facts
Self-hosted on `backstage-wus2-v4` via Flux; vendor Helm chart **1.8.8**
(`apps/arcade/` in `k8s-backstage-v2`, `origin/master`). **Run the live-state check
(GROUND-RULES) before trusting any of this — it ages.**
## Deployment
- **Endpoints:** `api.arcade.st.dev` (MCP/engine), `coordinator.`, `dashboard.`,
`experience.arcade.st.dev`. Gateway URLs: `https://api.arcade.st.dev/mcp/{slug}`.
- **Upstream IdP:** ServiceTitan **Entra ID** app registration (iac PR #4012). Not Okta yet
(Okta is the criteria doc's eventual target — note the gap when scoring identity / cat 2).
- **Chat/playground:** **disabled** (`features.chatEnabled: false`); engine LLM + embeddings
routed through in-cluster **LiteLLM**, not api.openai.com.
- **Datastores:** bundled in-cluster Postgres + Redis, default passwords, ephemeral.
## Observability (cat 5 — confirmed)
- **OTEL (evidence, Kibana 2026-06-18):** the `arcade-engine` pod emits OTLP metrics by
default but the target collector **does not resolve**; this error repeats about every 60s:
`failed to upload metrics: Post "http://arcade-otel-collector:4318/v1/metrics": dial tcp:
lookup arcade-otel-collector ... no such host`. Instrumentation is ON; the collector Service
`arcade-otel-collector` is **not deployed/resolvable** in the `arcade` ns; every metric is
dropped. (Chart lists the collector image but the HelmRelease never enabled/named it.)
- **Logs → ELK:** Vector daemonset scrapes pod stdout/stderr cluster-wide → ELK. Engine logs
already reach Kibana (that's how the above error is visible). Visible fields include
`Tracing.TraceId`, `ContextInfo.CorrelationId`, `NetCore.RequestPath` → engine is a .NET app
emitting structured logs with trace/correlation IDs (relevant to trace propagation pre-OTEL).
- **Metrics pipeline (metrics ≠ logs):** metrics do **not** go to ELK. **Metrics → Grafana**,
via the **Grafana Agent Operator** (`MetricsInstance` `main`, ns `monitoring`) which scrapes
**all `ServiceMonitor`/`PodMonitor` CRs cluster-wide** (any namespace; excludes ServiceMonitors
labeled `grafana-agent: external`) and `remoteWrite`s to **Grafana Mimir**
(`http://mimir-nginx.mimir.observability-wus2/api/v1/push`, tenant header
`X-Scope-OrgID: k8s-backstage-v4`). Convention: an app exposes a Prometheus `/metrics` port +
a `ServiceMonitor` (label `release: prometheus-operator`) → auto-scraped → Grafana.
- **Two cat-5 gaps for metrics:** (a) no collector — `arcade-otel-collector:4318` doesn't
resolve; (b) no bridge from OTLP-push into the pull-based Prometheus/Mimir pipeline. Fix = an
OTEL Collector in `arcade` ns that ingests the engine's OTLP and EITHER exposes a `prometheus`
exporter `/metrics` scraped via a `ServiceMonitor`, OR `prometheusremotewrite` straight to the
Mimir push URL+tenant above. (Chart may bundle a disabled collector subchart — verify first.)
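
Gap (a) above can be re-checked as part of the live-state check with a DNS lookup. A minimal sketch, assuming it runs from a pod inside the `arcade` namespace (the short Service name only resolves there); the function name `collector_resolves` and the injectable `resolver` parameter are illustrative, not part of any Arcade tooling:

```python
import socket
from typing import Callable

# Host and port are taken from the logged error above.
COLLECTOR_HOST = "arcade-otel-collector"
COLLECTOR_PORT = 4318


def collector_resolves(
    host: str = COLLECTOR_HOST,
    port: int = COLLECTOR_PORT,
    resolver: Callable[..., list] = socket.getaddrinfo,
) -> bool:
    """Return True if `host` resolves via DNS, False on a lookup failure.

    `resolver` is a hypothetical seam so the check can be exercised
    without a cluster; by default it is the real socket.getaddrinfo.
    """
    try:
        return len(resolver(host, port)) > 0
    except socket.gaierror:
        # Same failure class the engine reports as "no such host".
        return False
```

Calling `collector_resolves()` with no arguments performs the real lookup. A `True` result only means the Service name resolves; it does not prove the collector accepts OTLP on 4318 or that metrics reach Mimir.
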
## Live fixtures (filled in Phase 1)
- **Project:** _TBD (Task 1.1)_
- **API key:** _label / last-4 only — never the key (Task 1.1)_
- **Headless auth header convention (confirmed via Arcade docs 2026-06-18):** MCP gateway calls use
`Authorization: Bearer <ARCADE_API_KEY>` + `Arcade-User-ID: <user_id>`. The user_id is any stable
string (an email works); this mode is for clients without browser auth / token refresh. Self-hosted
gateway URL: `https://api.arcade.st.dev/mcp/<slug>`. (Source: docs.arcade.dev call-tool-client.)
- **Baseline gateway:** `zeb-gateway-test` — auth mode **Arcade Headers** (API key + `Arcade-User-ID`);
7 main-catalog tools (Slack ×2, GoogleDocs ×4, Brightdata ×1). See `config/targets.yaml`.
Confirmed live 2026-06-18: tool list is gateway-wide (same for all `Arcade-User-ID`s).
- **Shared reference server:** `arcade-eval-ref` (dashboard id `military-healthy-posted-rats`), toolkit
`ArcadeEvalRef`, tools Echo/Add/Whoami — self-hosted at `lib/mcp_server`, registered via a Cloudflare
**quick** tunnel (ephemeral URL in `results/tunnel_url.txt`; re-register on restart). `whoami`
execution proof verified (caller A→user-a, caller B→user-b).
- **`whoami` identity field:** server reads `context.user_id` (arcade_mcp_server `Context`), populated by the Engine from the calling user (`Arcade-User-ID` / auth `sub`).
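
The headless auth header convention above can be sketched as two small helpers. This is an illustration, not client code from the repo: the function names `gateway_url` and `headless_headers` are hypothetical, and only the URL shape and the two header names come from the notes above.

```python
# Self-hosted gateway base, from the Deployment section.
GATEWAY_BASE = "https://api.arcade.st.dev/mcp"


def gateway_url(slug: str) -> str:
    """Build the MCP gateway URL for a gateway slug."""
    return f"{GATEWAY_BASE}/{slug}"


def headless_headers(api_key: str, user_id: str) -> dict[str, str]:
    """Build the two headers used for headless MCP gateway calls.

    `user_id` is any stable string (an email works). The key is passed
    in by the caller; never hard-code or log it.
    """
    if not api_key or not user_id:
        raise ValueError("api_key and user_id are both required")
    return {
        "Authorization": f"Bearer {api_key}",
        "Arcade-User-ID": user_id,
    }
```

For the baseline gateway, `gateway_url("zeb-gateway-test")` gives the URL and `headless_headers(key, "user-a@example.com")` gives the headers to attach to each request; the example user ID is a placeholder.
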
## Known behaviors (findings)
- **`arcade deploy` is cloud-only.** Local validation succeeds (health, tool + secret discovery;
our ref server: 3 tools, 0 secrets), but the CLI then POSTs the deployment to `api.arcade.dev`
(`PROD_ENGINE_HOST`), ignoring the coordinator set by `arcade login --host`, so against our
self-hosted instance it returns **401**. `deploy` exposes no `--host` flag. **Implication:**
self-hosted custom servers must be **registered**, not deployed with `arcade deploy`: run the
server, then use dashboard "Add Server" (type Arcade, URL + worker secret). Use the tunnel pattern
for local dev or an in-cluster deploy for prod.
Relevant to cat-4 (SDK/deploy), cat-8 (deployment), cat-9 (DX).