docs: ground rules + frozen live-POC facts (incl. metrics pipeline)
# Live POC: frozen facts

Self-hosted on `backstage-wus2-v4` via Flux; vendor Helm chart **1.8.8**
(`apps/arcade/` in `k8s-backstage-v2`, `origin/master`). **Run the live-state check
(GROUND-RULES) before trusting any of this; it ages.**

## Deployment
- **Endpoints:** `api.arcade.st.dev` (MCP/engine), `coordinator.`, `dashboard.`,
  `experience.arcade.st.dev`. Gateway URLs: `https://api.arcade.st.dev/mcp/{slug}`.
- **Upstream IdP:** ServiceTitan **Entra ID** app registration (iac PR #4012). Not Okta yet
  (Okta is the criteria doc's eventual target; note the gap when scoring identity / cat 2).
- **Chat/playground:** **disabled** (`features.chatEnabled: false`); engine LLM + embeddings
  are routed through in-cluster **LiteLLM**, not `api.openai.com`.
- **Datastores:** bundled in-cluster Postgres + Redis, default passwords, ephemeral.

## Observability (cat 5, confirmed)
- **OTEL (evidence, Kibana 2026-06-18):** the `arcade-engine` pod emits OTLP metrics by
  default, but the target collector **does not resolve**. This error repeats roughly every 60s:
  `failed to upload metrics: Post "http://arcade-otel-collector:4318/v1/metrics": dial tcp:
  lookup arcade-otel-collector ... no such host`. Instrumentation is ON; the collector Service
  `arcade-otel-collector` is **not deployed/resolvable** in the `arcade` ns, so every metric is
  dropped. (Chart lists the collector image but the HelmRelease never enabled/named it.)
- **Logs → ELK:** Vector daemonset scrapes pod stdout/stderr cluster-wide → ELK. Engine logs
  already reach Kibana (that's how the above error is visible). Visible fields include
  `Tracing.TraceId`, `ContextInfo.CorrelationId`, `NetCore.RequestPath` → engine is a .NET app
  emitting structured logs with trace/correlation IDs (relevant to trace propagation pre-OTEL).
- **Metrics pipeline (metrics ≠ logs):** metrics do **not** go to ELK. **Metrics → Grafana**,
  via the **Grafana Agent Operator** (`MetricsInstance` `main`, ns `monitoring`), which scrapes
  **all `ServiceMonitor`/`PodMonitor` CRs cluster-wide** (any namespace; excludes ServiceMonitors
  labeled `grafana-agent: external`) and `remoteWrite`s to **Grafana Mimir**
  (`http://mimir-nginx.mimir.observability-wus2/api/v1/push`, tenant header
  `X-Scope-OrgID: k8s-backstage-v4`). Convention: an app exposes a Prometheus `/metrics` port +
  a `ServiceMonitor` (label `release: prometheus-operator`) → auto-scraped → Grafana.
- **Two cat-5 gaps for metrics:** (a) no collector: `arcade-otel-collector:4318` doesn't
  resolve; (b) no bridge from OTLP-push into the pull-based Prometheus/Mimir pipeline. Fix = an
  OTEL Collector in `arcade` ns that ingests the engine's OTLP and EITHER exposes a `prometheus`
  exporter `/metrics` scraped via a `ServiceMonitor`, OR uses `prometheusremotewrite` straight to
  the Mimir push URL+tenant above. (Chart may bundle a disabled collector subchart; verify first.)
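
Sketch of the fix described above, not a verified config: one Collector pipeline that takes the engine's OTLP on `:4318` and shows both export options. It assumes the contrib Collector distribution (needed for `prometheusremotewrite`). Port `8889`, the Service label, and the port name `prometheus` are placeholders, not values taken from the chart. If the chart does bundle a collector subchart, prefer its values over this.

```yaml
# Collector config (assumed: otelcol-contrib). Keep only the exporter you use.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318   # same port the engine already posts to

processors:
  batch: {}

exporters:
  # Option A (pull): expose /metrics for a ServiceMonitor to scrape.
  prometheus:
    endpoint: 0.0.0.0:8889       # assumed port, not from the chart
  # Option B (push): write straight to Mimir with the tenant header.
  prometheusremotewrite:
    endpoint: http://mimir-nginx.mimir.observability-wus2/api/v1/push
    headers:
      X-Scope-OrgID: k8s-backstage-v4

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]    # or [prometheusremotewrite]
```

For option A only, a `ServiceMonitor` following the label convention noted above might look like this (selector label and port name are hypothetical and must match the real Service):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arcade-otel-collector
  namespace: arcade
  labels:
    release: prometheus-operator   # convention from the metrics-pipeline note
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: arcade-otel-collector   # hypothetical Service label
  endpoints:
    - port: prometheus   # hypothetical named Service port (8889 in the sketch)
      path: /metrics
      interval: 60s
```

Either way the Service must be named `arcade-otel-collector` in the `arcade` ns, since that is the host the engine is already trying to resolve.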

## Live fixtures (filled in Phase 1)
- **Project:** _TBD (Task 1.1)_
- **API key:** _label / last-4 only, never the key (Task 1.1)_
- **Headless auth header convention:** _confirmed in Task 1.1_
- **Baseline gateway:** _slug + tool allow-list (Task 1.2)_
- **Shared reference server:** _name + tools echo/whoami/add (Task 1.4)_
- **`whoami` identity field:** _exact field the server reads (Task 1.4 / 2.4)_