# Live POC — frozen facts

Self-hosted on `backstage-wus2-v4` via Flux; vendor Helm chart **1.8.8**
(`apps/arcade/` in `k8s-backstage-v2`, `origin/master`). **Run the live-state check
(GROUND-RULES) before trusting any of this; it ages.**

## Deployment

- **Endpoints:** `api.arcade.st.dev` (MCP/engine), `coordinator.`, `dashboard.`,
  `experience.arcade.st.dev`. Gateway URLs: `https://api.arcade.st.dev/mcp/{slug}`.
- **Upstream IdP:** ServiceTitan **Entra ID** app registration (iac PR #4012). Not Okta yet
  (Okta is the criteria doc's eventual target; note the gap when scoring identity / cat 2).
- **Chat/playground:** **disabled** (`features.chatEnabled: false`); engine LLM + embeddings
  routed through in-cluster **LiteLLM**, not api.openai.com.
- **Datastores:** bundled in-cluster Postgres + Redis, default passwords, ephemeral.

## Observability (cat 5 — confirmed)

- **OTEL (evidence, Kibana 2026-06-18):** the `arcade-engine` pod emits OTLP metrics by
  default, but the target collector **does not resolve**. The error repeats roughly every 60s:
  `failed to upload metrics: Post "http://arcade-otel-collector:4318/v1/metrics": dial tcp:
  lookup arcade-otel-collector ... no such host`. Instrumentation is ON; the collector Service
  `arcade-otel-collector` is **not deployed/resolvable** in the `arcade` ns, so every metric is
  dropped. (Chart lists the collector image but the HelmRelease never enabled/named it.)
- **Logs → ELK:** Vector daemonset scrapes pod stdout/stderr cluster-wide → ELK. Engine logs
  already reach Kibana (that's how the above error is visible). Visible fields include
  `Tracing.TraceId`, `ContextInfo.CorrelationId`, `NetCore.RequestPath` → engine is a .NET app
  emitting structured logs with trace/correlation IDs (relevant to trace propagation pre-OTEL).
- **Metrics pipeline (metrics ≠ logs):** metrics do **not** go to ELK. **Metrics → Grafana**,
  via the **Grafana Agent Operator** (`MetricsInstance` `main`, ns `monitoring`) which scrapes
  **all `ServiceMonitor`/`PodMonitor` CRs cluster-wide** (any namespace; excludes ServiceMonitors
  labeled `grafana-agent: external`) and `remoteWrite`s to **Grafana Mimir**
  (`http://mimir-nginx.mimir.observability-wus2/api/v1/push`, tenant header
  `X-Scope-OrgID: k8s-backstage-v4`). Convention: an app exposes a Prometheus `/metrics` port +
  a `ServiceMonitor` (label `release: prometheus-operator`) → auto-scraped → Grafana.
- **Two cat-5 gaps for metrics:** (a) no collector: `arcade-otel-collector:4318` doesn't
  resolve; (b) no bridge from OTLP-push into the pull-based Prometheus/Mimir pipeline. Fix = an
  OTEL Collector in `arcade` ns that ingests the engine's OTLP and EITHER exposes a `prometheus`
  exporter `/metrics` endpoint scraped via a `ServiceMonitor`, OR uses `prometheusremotewrite`
  straight to the Mimir push URL + tenant above. (Chart may bundle a disabled collector
  subchart; verify first.)

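The two fix options above differ only in the collector's exporter. A minimal sketch of both configurations follows, built as Python dicts and printed as JSON (which a YAML parser accepts). The Mimir URL and tenant header are taken from the notes above; the `0.0.0.0:4318` listen address, the `8889` scrape port, and the `collector_config` helper name are assumptions for illustration. Both exporters ship in the OpenTelemetry Collector contrib distribution, and the vendor chart may already carry an equivalent config, so treat this as a starting point to compare against, not the chart's actual values.

```python
import json

# Values taken from the notes above.
MIMIR_PUSH_URL = "http://mimir-nginx.mimir.observability-wus2/api/v1/push"
MIMIR_TENANT = "k8s-backstage-v4"


def collector_config(mode: str) -> dict:
    """Build an OpenTelemetry Collector config for one of the two bridge options.

    mode="scrape": expose a Prometheus /metrics endpoint for a ServiceMonitor (pull).
    mode="push":   remote-write straight to Mimir with the tenant header.
    """
    if mode == "scrape":
        # Assumption: 8889 is an arbitrary port choice for the scrape endpoint.
        exporters = {"prometheus": {"endpoint": "0.0.0.0:8889"}}
    elif mode == "push":
        exporters = {
            "prometheusremotewrite": {
                "endpoint": MIMIR_PUSH_URL,
                "headers": {"X-Scope-OrgID": MIMIR_TENANT},
            }
        }
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    return {
        # The engine pushes OTLP over HTTP to port 4318 (see the error log above).
        # Assumption: listening on all interfaces inside the pod.
        "receivers": {"otlp": {"protocols": {"http": {"endpoint": "0.0.0.0:4318"}}}},
        "exporters": exporters,
        "service": {
            "pipelines": {
                "metrics": {"receivers": ["otlp"], "exporters": list(exporters)}
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(collector_config("push"), indent=2))
```

With the scrape option, the collector's `ServiceMonitor` would still need the `release: prometheus-operator` label noted above to be picked up. With the push option, no `ServiceMonitor` is involved, but the collector writes to Mimir directly and bypasses the Grafana Agent.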
## Live fixtures (filled in Phase 1)

- **Project:** _TBD (Task 1.1)_
- **API key:** _label / last-4 only, never the key (Task 1.1)_
- **Headless auth header convention (confirmed via Arcade docs 2026-06-18):** MCP gateway calls use
  `Authorization: Bearer <ARCADE_API_KEY>` + `Arcade-User-ID: <user_id>`. The user_id is any stable
  string (an email works); this mode is for clients without browser auth / token refresh. Self-hosted
  gateway URL: `https://api.arcade.st.dev/mcp/<slug>`. (Source: docs.arcade.dev call-tool-client.)
- **Baseline gateway:** `zeb-gateway-test`, auth mode **Arcade Headers** (API key + `Arcade-User-ID`);
  7 main-catalog tools (Slack ×2, GoogleDocs ×4, Brightdata ×1). See `config/targets.yaml`.
  Confirmed live 2026-06-18: tool list is gateway-wide (same for all `Arcade-User-ID`s).
- **Shared reference server:** `arcade-eval-ref` (dashboard id `military-healthy-posted-rats`), toolkit
  `ArcadeEvalRef`, tools Echo/Add/Whoami; self-hosted at `lib/mcp_server`, registered via a Cloudflare
  **quick** tunnel (ephemeral URL in `results/tunnel_url.txt`; re-register on restart). The `whoami`
  exec-proof is verified (A→user-a, B→user-b).
- **`whoami` identity field:** server reads `context.user_id` (arcade_mcp_server `Context`), populated by the Engine from the calling user (`Arcade-User-ID` / auth `sub`).

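The headless header convention above can be sketched as two small helpers. The gateway base URL, header names, and the `zeb-gateway-test` slug come from the notes; the helper names and the choice to read the key from an `ARCADE_API_KEY` environment variable are assumptions for illustration. The sketch only builds the URL and headers; the MCP handshake and transport are out of scope here.

```python
import os
from typing import Dict, Optional

# Self-hosted gateway base, from the notes above.
GATEWAY_BASE = "https://api.arcade.st.dev/mcp"


def gateway_url(slug: str) -> str:
    """Return the MCP gateway URL for a given gateway slug."""
    return f"{GATEWAY_BASE}/{slug}"


def headless_headers(user_id: str, api_key: Optional[str] = None) -> Dict[str, str]:
    """Build the Arcade Headers auth pair for a headless client.

    Assumption: when api_key is not passed, it is read from the
    ARCADE_API_KEY environment variable (a hypothetical convention).
    """
    key = api_key or os.environ.get("ARCADE_API_KEY")
    if not key:
        raise RuntimeError("no API key given and ARCADE_API_KEY is not set")
    return {
        "Authorization": f"Bearer {key}",
        # Any stable string works as the user id; an email is typical.
        "Arcade-User-ID": user_id,
    }


if __name__ == "__main__":
    print(gateway_url("zeb-gateway-test"))
    # Placeholder key for display only; never log a real key.
    print(headless_headers("user-a", api_key="<redacted>"))
```

Keeping the key out of the function signature by default fits the "label / last-4 only" rule above, since the value never needs to appear in scripts or results files.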
## Known behaviors (findings)

- **`arcade deploy` is cloud-only.** It validates the server locally fine (health, tool + secret
  discovery; our ref server: 3 tools, 0 secrets), but POSTs the deployment to `api.arcade.dev`
  (`PROD_ENGINE_HOST`), ignoring the `arcade login --host` coordinator, so against our self-hosted
  instance it returns **401**. `deploy` exposes no `--host`. **Implication:** self-hosted custom
  servers must be **registered** (run the server + dashboard "Add Server", type Arcade, URL + worker
  secret) rather than pushed with `arcade deploy`: the tunnel pattern for local dev, or an
  in-cluster deploy for prod.
  Relevant to cat-4 (SDK/deploy), cat-8 (deployment), cat-9 (DX).
- **Per-user Google OAuth has two distinct issues, both cat-2 (the load-bearing category):**
  1. **Google provider redirect-URI / secret mismatch (RESOLVED 2026-06-22 by user).** Initially the
     consent URL was minted but no token vaulted (`tools.authorize(...)` stayed `pending`). Cause: the
     Google client's Authorized redirect URI / client secret didn't match the Arcade `google-docs-provider`
     connection (Arcade re-mints a new connection id → new redirect URI on reconfigure). Fixed by matching
     the redirect URI + re-pasting the secret in both consoles.
  2. **Identity-namespace mismatch blocks consent binding under Entra User Source (OPEN, important).**
     With the gateway in **User Source (Entra OIDC)** mode, a Claude Code session resolves to the **opaque
     Entra `sub`** (`ArcadeEvalRef_Whoami` → `GvgRofe5xGzPoeS0w__hSMmBY1JkU7F6pR4yLKOP-Qk`). When the user
     completes the downstream Google consent in a browser signed into the Arcade dashboard as
     `ztaylor@servicetitan.com`, Arcade's callback **refuses to bind**: *"Your code provided the user ID
     GvgRofe5… but the currently signed-in Arcade account is ztaylor@servicetitan.com."* This is a correct
     safety guardrail (no cross-user token grants), but it means the **gateway User Source keys user_id on
     the raw `sub`, while the dashboard/coordinator login resolves the same Entra person to `email`**, so
     agent identity ≠ consent-completer identity. **Likely fix:** configure the Entra User Source to map
     user_id to the `email`/`preferred_username` claim (so `whoami` = `ztaylor@servicetitan.com`, matching
     the dashboard). Until aligned, downstream OAuth consent can't complete for a User-Source agent session.
     **This is a key cat-2 / identity-mapping finding** and also bears on cat-10 (what string the vault is
     keyed on for multi-tenancy). Headless **Arcade-Headers** mode is unaffected (you pass the email
     directly as `Arcade-User-ID`, which matches).
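The identity-namespace mismatch in item 2 can be illustrated with a toy model. This is not Arcade's implementation: `resolve_user_id` and `can_bind_consent` are hypothetical names, and the real callback may normalize or compare identities differently. The claim values are the ones observed above. The sketch shows why keying on `sub` fails the binding check, while keying on `email` (the likely fix) would pass it.

```python
def resolve_user_id(claims: dict, claim: str = "sub") -> str:
    """Pick the user_id from OIDC claims.

    `claim` stands in for the User Source mapping; "sub" models the
    behavior observed today, "email" models the proposed fix.
    """
    return claims[claim]


def can_bind_consent(agent_user_id: str, dashboard_account: str) -> bool:
    """Toy version of the callback guardrail: bind only on an exact match."""
    return agent_user_id == dashboard_account


# Claim values observed in the finding above.
claims = {
    "sub": "GvgRofe5xGzPoeS0w__hSMmBY1JkU7F6pR4yLKOP-Qk",
    "email": "ztaylor@servicetitan.com",
}
dashboard_account = "ztaylor@servicetitan.com"

if __name__ == "__main__":
    # Today: the gateway keys on the opaque sub, so the guardrail refuses.
    print(can_bind_consent(resolve_user_id(claims), dashboard_account))
    # Proposed: map user_id to the email claim, so both sides agree.
    print(can_bind_consent(resolve_user_id(claims, "email"), dashboard_account))
```

The same model explains why Arcade-Headers mode is unaffected: the caller supplies the email directly as `Arcade-User-ID`, so both sides of the comparison already use the same string.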