# Live POC — frozen facts
Self-hosted on `backstage-wus2-v4` via Flux; vendor Helm chart **1.8.8**
(`apps/arcade/` in `k8s-backstage-v2`, `origin/master`). **Run the live-state check
(GROUND-RULES) before trusting any of this — it ages.**
## Deployment
- **Endpoints:** `api.arcade.st.dev` (MCP/engine), `coordinator.arcade.st.dev`, `dashboard.arcade.st.dev`,
`experience.arcade.st.dev`. Gateway URLs: `https://api.arcade.st.dev/mcp/{slug}`.
- **Upstream IdP:** ServiceTitan **Entra ID** app registration (iac PR #4012). Not Okta yet
(Okta is the criteria doc's eventual target — note the gap when scoring identity / cat 2).
- **Chat/playground:** **disabled** (`features.chatEnabled: false`); engine LLM + embeddings
routed through in-cluster **LiteLLM**, not api.openai.com.
- **Datastores:** bundled in-cluster Postgres + Redis, default passwords, ephemeral.
## Observability (cat 5 — confirmed)
- **OTEL (evidence, Kibana 2026-06-18):** the `arcade-engine` pod emits OTLP metrics by
default but the target collector **does not resolve** — repeating ~60s:
`failed to upload metrics: Post "http://arcade-otel-collector:4318/v1/metrics": dial tcp:
lookup arcade-otel-collector ... no such host`. Instrumentation is ON; the collector Service
`arcade-otel-collector` is **not deployed/resolvable** in the `arcade` ns; every metric is
dropped. (Chart lists the collector image but the HelmRelease never enabled/named it.)
- **Logs → ELK:** Vector daemonset scrapes pod stdout/stderr cluster-wide → ELK. Engine logs
already reach Kibana (that's how the above error is visible). Visible fields incl.
`Tracing.TraceId`, `ContextInfo.CorrelationId`, `NetCore.RequestPath` → engine is a .NET app
emitting structured logs with trace/correlation IDs (relevant to trace propagation pre-OTEL).
- **Metrics pipeline (metrics ≠ logs):** metrics do **not** go to ELK. **Metrics → Grafana**,
via the **Grafana Agent Operator** (`MetricsInstance` `main`, ns `monitoring`) which scrapes
**all `ServiceMonitor`/`PodMonitor` CRs cluster-wide** (any namespace; excludes ServiceMonitors
labeled `grafana-agent: external`) and `remoteWrite`s to **Grafana Mimir**
(`http://mimir-nginx.mimir.observability-wus2/api/v1/push`, tenant header
`X-Scope-OrgID: k8s-backstage-v4`). Convention: an app exposes a Prometheus `/metrics` port +
a `ServiceMonitor` (label `release: prometheus-operator`) → auto-scraped → Grafana.
- **Two cat-5 gaps for metrics:** (a) no collector — `arcade-otel-collector:4318` doesn't
resolve; (b) no bridge from OTLP-push into the pull-based Prometheus/Mimir pipeline. Fix = an
OTEL Collector in `arcade` ns that ingests the engine's OTLP and EITHER exposes a `prometheus`
exporter `/metrics` scraped via a `ServiceMonitor`, OR `prometheusremotewrite` straight to the
Mimir push URL+tenant above. (Chart may bundle a disabled collector subchart — verify first.)
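
Gap (a) can be rechecked quickly as part of the live-state check. The following is a minimal Python sketch, assuming it runs from a pod inside the cluster so that the short Service name can resolve through cluster DNS. The helper name `resolves` is hypothetical; only the host and port come from the engine's error line.

```python
import socket


def resolves(host: str, port: int = 4318) -> bool:
    """Return True if `host` resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror:
        # Same class of failure as the engine's "no such host" error.
        return False


if __name__ == "__main__":
    # Host and port taken from the engine's OTLP upload error.
    print("collector resolvable:", resolves("arcade-otel-collector"))
```

A `False` result only confirms the DNS half of the problem; a `True` result still says nothing about whether the collector forwards metrics to Mimir.
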
## Live fixtures (filled in Phase 1)
- **Project:** _TBD (Task 1.1)_
- **API key:** _label / last-4 only — never the key (Task 1.1)_
- **Headless auth header convention (confirmed via Arcade docs 2026-06-18):** MCP gateway calls use
`Authorization: Bearer <ARCADE_API_KEY>` + `Arcade-User-ID: <user_id>`. The user_id is any stable
string (an email works); this mode is for clients without browser auth / token refresh. Self-hosted
gateway URL: `https://api.arcade.st.dev/mcp/<slug>`. (Source: docs.arcade.dev call-tool-client.)
- **Baseline gateway:** `zeb-gateway-test` — auth mode **Arcade Headers** (API key + `Arcade-User-ID`);
7 main-catalog tools (Slack ×2, GoogleDocs ×4, Brightdata ×1). See `config/targets.yaml`.
Confirmed live 2026-06-18: tool list is gateway-wide (same for all `Arcade-User-ID`s).
- **Shared reference server:** `arcade-eval-ref` (dashboard id `military-healthy-posted-rats`), toolkit
`ArcadeEvalRef`, tools Echo/Add/Whoami — self-hosted at `lib/mcp_server`, registered via a Cloudflare
**quick** tunnel (ephemeral URL in `results/tunnel_url.txt`; re-register on restart). whoami exec-proof
verified (A→user-a, B→user-b).
- **`whoami` identity field:** server reads `context.user_id` (arcade_mcp_server `Context`), populated by the Engine from the calling user (`Arcade-User-ID` / auth `sub`).
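
The headless header convention above can be sketched as follows. This is a minimal illustration, not the Arcade client: the helper names `gateway_url` and `headless_headers` are hypothetical, the key is read from an assumed `ARCADE_API_KEY` environment variable, and the MCP request body and transport are left out.

```python
import os

GATEWAY_BASE = "https://api.arcade.st.dev/mcp"  # self-hosted gateway base URL


def gateway_url(slug: str) -> str:
    """Build the gateway URL for a given gateway slug."""
    return f"{GATEWAY_BASE}/{slug}"


def headless_headers(api_key: str, user_id: str) -> dict[str, str]:
    """Headers for Arcade Headers (headless) mode."""
    if not api_key or not user_id:
        raise ValueError("both api_key and user_id are required")
    return {
        "Authorization": f"Bearer {api_key}",
        "Arcade-User-ID": user_id,  # any stable string; an email works
    }


if __name__ == "__main__":
    key = os.environ.get("ARCADE_API_KEY", "")
    if key:
        headers = headless_headers(key, "user-a")
        # Print header names only; never log the key itself.
        print(gateway_url("zeb-gateway-test"), sorted(headers))
    else:
        print("ARCADE_API_KEY not set; nothing to build")
```

Keeping the key out of the source and printing only header names matches the "label / last-4 only" rule for the API key above.
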
## Known behaviors (findings)
- **`arcade deploy` is cloud-only.** It validates the server locally fine (health, tool + secret
discovery — our ref server: 3 tools, 0 secrets), but POSTs the deployment to `api.arcade.dev`
(`PROD_ENGINE_HOST`), ignoring the `arcade login --host` coordinator — so against our self-hosted
instance it returns **401**. `deploy` exposes no `--host`. **Implication:** self-hosted custom
servers must be **registered** (run the server + dashboard "Add Server", type Arcade, URL + worker
secret) — the tunnel pattern for local dev, or an in-cluster deploy for prod — not `arcade deploy`.
Relevant to cat-4 (SDK/deploy), cat-8 (deployment), cat-9 (DX).
- **Per-user Google OAuth — two distinct issues, both cat-2 (the load-bearing category):**
1. **Google provider redirect-URI / secret mismatch (RESOLVED 2026-06-22 by user).** Initially the
consent URL was minted but no token vaulted (`tools.authorize(...)` stayed `pending`). Cause: the
Google client's Authorized redirect URI / client secret didn't match the Arcade `google-docs-provider`
connection (Arcade re-mints a new connection id → new redirect URI on reconfigure). Fixed by matching
the redirect URI + re-pasting the secret in both consoles.
2. **Identity-namespace mismatch blocks consent binding under Entra User Source (OPEN, important).**
With the gateway in **User Source (Entra OIDC)** mode, a Claude Code session resolves to the **opaque
Entra `sub`** (`ArcadeEvalRef_Whoami` → `GvgRofe5xGzPoeS0w__hSMmBY1JkU7F6pR4yLKOP-Qk`). When the user
completes the downstream Google consent in a browser signed into the Arcade dashboard as
`ztaylor@servicetitan.com`, Arcade's callback **refuses to bind**: *"Your code provided the user ID
GvgRofe5… but the currently signed-in Arcade account is ztaylor@servicetitan.com."* Correct safety
guardrail (no cross-user token grants), but it means the **gateway User Source keys user_id on the raw
`sub`, while the dashboard/coordinator login resolves the same Entra person to `email`** — so agent
identity ≠ consent-completer identity. **Likely fix:** configure the Entra User Source to map user_id
to the `email`/`preferred_username` claim (so `whoami` = `ztaylor@servicetitan.com`, matching the
dashboard). Until aligned, downstream OAuth consent can't complete for a User-Source agent session.
**This is a key cat-2 / identity-mapping finding** and also bears on cat-10 (what string the vault is
keyed on for multi-tenancy). Headless **Arcade-Headers** mode is unaffected (you pass the email
directly as `Arcade-User-ID`, which matches).
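
The identity-namespace mismatch in item 2 can be illustrated with a small sketch. It models the observed behavior and is not Arcade's implementation: the functions `resolve_user_id` and `consent_can_bind` and the claims dictionary are hypothetical, and the `sub` value is abbreviated as in the error message.

```python
def resolve_user_id(claims: dict[str, str], claim: str) -> str:
    """Pick the user_id from OIDC claims using the configured claim name."""
    return claims[claim]


def consent_can_bind(agent_user_id: str, dashboard_user_id: str) -> bool:
    """The guardrail: consent binds only when both identities match."""
    return agent_user_id == dashboard_user_id


# Hypothetical claims for one Entra person (sub abbreviated as in the error).
claims = {"sub": "GvgRofe5…", "email": "ztaylor@servicetitan.com"}
dashboard_id = resolve_user_id(claims, "email")

# Observed: the gateway User Source keys on the raw `sub`, so binding is refused.
print(consent_can_bind(resolve_user_id(claims, "sub"), dashboard_id))  # False

# Proposed fix: map user_id to `email`, so both sides agree.
print(consent_can_bind(resolve_user_id(claims, "email"), dashboard_id))  # True
```

The same comparison explains why Arcade-Headers mode is unaffected: the caller supplies the email directly, so both sides already hold the same string.
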