docs: ground rules + frozen live-POC facts (incl. metrics pipeline)

2026-06-18 10:07:02 -04:00
parent bb5c5779d2
commit 34a10be5ef
2 changed files with 89 additions and 0 deletions
@@ -0,0 +1,43 @@
+# Ground Rules (binding)
+
+These apply to every lane and every session. Read before doing anything.
+
+## Credentials
+- Credentials live **only** in the git-ignored `.env`. Never print, commit, or persist keys
+  elsewhere (not in docs, not in `config/targets.yaml`, not in commit messages).
+- Load with `set -a && . ./.env && set +a`.
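+
+A minimal sketch of the load step with a guard that fails fast when a required variable is
+missing, without echoing any value. The helper names `load_env` and `require_env`, and the
+variable name `ARCADE_API_KEY` in the usage line, are hypothetical; they are not part of
+this repo.
+
+```shell
+# Source an env file with allexport on, then switch allexport off again
+# even when sourcing returns an error.
+load_env() {
+  env_file="${1:-./.env}"
+  # A bare filename would make `.` search PATH, so force a path with a slash.
+  case "$env_file" in */*) ;; *) env_file="./$env_file" ;; esac
+  [ -f "$env_file" ] || { echo "missing env file: $env_file" >&2; return 1; }
+  set -a            # export every variable assigned while sourcing
+  . "$env_file"
+  status=$?
+  set +a
+  return "$status"
+}
+
+# Check that each named variable is non-empty. Only names are reported,
+# never values, so nothing secret reaches the terminal or a log.
+require_env() {
+  missing=0
+  for name in "$@"; do
+    eval "value=\${$name:-}"
+    if [ -z "$value" ]; then
+      echo "unset: $name" >&2
+      missing=1
+    fi
+  done
+  unset value
+  return "$missing"
+}
+```
+
+Possible usage: `load_env ./.env && require_env ARCADE_API_KEY`. Only variable names reach
+stderr, which keeps the check consistent with the never-print rule above.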
+
+## The criteria Google Doc
+- **Never write the criteria Google Doc from a session.** Concurrent writes spliced tables
+  mid-word in the prior eval. Compose `criteria-section-N.md` locally; **the human pastes.**
+- Criterion / gate / benchmark-question wording is **verbatim** from the criteria doc —
+  never paraphrase. Re-read the doc if unsure.
+
+## Live-state check (REQUIRED before any conclusion)
+The deployment is actively changing; status docs age within a day. Before drawing any
+conclusion from the live instance:
+```
+git -C ~/repos/k8s-backstage-v2 log --oneline -8 origin/master -- apps/arcade
+```
+plus a dashboard/gateway health probe (e.g. `curl -sS -o /dev/null -w '%{http_code}\n' https://dashboard.arcade.st.dev`).
+Any not-yet-reverted in-flight "TEMPORARY"/teardown commit means the bench is NOT in a
+validated steady state — don't draw conclusions from it.
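+
+The two probes can be wrapped in one helper, sketched below under stated assumptions: the
+function names are hypothetical, treating HTTP 2xx/3xx as healthy is an assumption (the
+dashboard may redirect to a login page), and a subject line matching `TEMPORARY` or
+`teardown` is only flagged so a human can confirm whether it has been reverted.
+
+```shell
+# Read `git log --oneline` output on stdin and print the commits whose
+# subject mentions TEMPORARY or teardown (case-insensitive).
+# Exit status is 0 when at least one line matches.
+flag_inflight() {
+  grep -iE 'temporary|teardown'
+}
+
+# Run both probes. Returns 0 only when no commit was flagged and the
+# dashboard answered 2xx/3xx; 1 on a flagged commit or bad status;
+# 2 when git or curl itself failed.
+live_state_check() {
+  repo="${1:-$HOME/repos/k8s-backstage-v2}"
+  url="${2:-https://dashboard.arcade.st.dev}"
+  log=$(git -C "$repo" log --oneline -8 origin/master -- apps/arcade) || return 2
+  if printf '%s\n' "$log" | flag_inflight; then
+    echo "flagged commit(s) above: confirm they were reverted" >&2
+    return 1
+  fi
+  code=$(curl -sS -o /dev/null -w '%{http_code}' "$url") || return 2
+  echo "dashboard HTTP $code"
+  case "$code" in
+    2??|3??) return 0 ;;
+    *) return 1 ;;
+  esac
+}
+```
+
+`origin/master` is a remote-tracking ref, so run `git -C ~/repos/k8s-backstage-v2 fetch origin`
+first if the local clone may be stale.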
+
+## File ownership (parallel-session safety)
+| You may write | You may NOT write |
+|---|---|
+| your `categories/catN-*/` subtree (criteria-section-N.md, tests/, NOTES.md) | another lane's `categories/` subtree |
+| your own section of `STATUS.md` | another lane's STATUS section |
+| `config/targets.yaml`, `lib/`, top-level docs — **append-mostly**, coordinate | — |
+| `results/` (git-ignored) | the criteria Google Doc (see above) |
+
+Run `git pull --rebase` before starting and again before pushing; if the push is rejected,
+run `git pull --rebase` once more and retry the push.
+
+## Deployment changes
+- `~/repos/k8s-backstage-v2/apps/arcade/**` is read freely but changed **only deliberately,
+  with the operator** (infra owns this cluster/POC). Expected case: the cat-5 collector+exporter
+  remediation — propose first, execute together, document before/after.
+
+## Scoring
+- Single candidate (Arcade only): 1–5 scale, anchors at 1/3/5. Scores drafted locally;
+  nothing lands in the Google Doc/spreadsheet without the human pasting.
@@ -0,0 +1,46 @@
+# Live POC — frozen facts
+
+Self-hosted on `backstage-wus2-v4` via Flux; vendor Helm chart **1.8.8**
+(`apps/arcade/` in `k8s-backstage-v2`, `origin/master`). **Run the live-state check
+(GROUND-RULES) before trusting any of this — it ages.**
+
+## Deployment
+- **Endpoints:** `api.arcade.st.dev` (MCP/engine), `coordinator.arcade.st.dev`,
+  `dashboard.arcade.st.dev`, `experience.arcade.st.dev`. Gateway URLs:
+  `https://api.arcade.st.dev/mcp/{slug}`.
+- **Upstream IdP:** ServiceTitan **Entra ID** app registration (iac PR #4012). Not Okta yet
+  (Okta is the criteria doc's eventual target — note the gap when scoring identity / cat 2).
+- **Chat/playground:** **disabled** (`features.chatEnabled: false`); engine LLM + embeddings
+  routed through in-cluster **LiteLLM**, not api.openai.com.
+- **Datastores:** bundled in-cluster Postgres + Redis, default passwords, ephemeral.
+
+## Observability (cat 5 — confirmed)
+- **OTEL (evidence, Kibana 2026-06-18):** the `arcade-engine` pod emits OTLP metrics by
+  default but the target collector **does not resolve**; this error repeats about every 60s:
+  `failed to upload metrics: Post "http://arcade-otel-collector:4318/v1/metrics": dial tcp:
+  lookup arcade-otel-collector ... no such host`. Instrumentation is ON; the collector Service
+  `arcade-otel-collector` is **not deployed/resolvable** in the `arcade` ns; every metric is
+  dropped. (Chart lists the collector image but the HelmRelease never enabled/named it.)
+- **Logs → ELK:** Vector daemonset scrapes pod stdout/stderr cluster-wide → ELK. Engine logs
+  already reach Kibana (that's how the above error is visible). Visible fields incl.
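+  `Tracing.TraceId`, `ContextInfo.CorrelationId`, `NetCore.RequestPath`, which suggests the
+  engine is a .NET app emitting structured logs with trace/correlation IDs (relevant to trace
+  propagation pre-OTEL). Treat that as an inference to verify: the
+  `dial tcp: lookup ... no such host` wording quoted above is the error format of Go's `net`
+  package.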
+- **Metrics pipeline (metrics ≠ logs):** metrics do **not** go to ELK. **Metrics → Grafana**,
+  via the **Grafana Agent Operator** (`MetricsInstance` `main`, ns `monitoring`) which scrapes
+  **all `ServiceMonitor`/`PodMonitor` CRs cluster-wide** (any namespace; excludes ServiceMonitors
+  labeled `grafana-agent: external`) and `remoteWrite`s to **Grafana Mimir**
+  (`http://mimir-nginx.mimir.observability-wus2/api/v1/push`, tenant header
+  `X-Scope-OrgID: k8s-backstage-v4`). Convention: an app exposes a Prometheus `/metrics` port +
+  a `ServiceMonitor` (label `release: prometheus-operator`) → auto-scraped → Grafana.
+- **Two cat-5 gaps for metrics:** (a) no collector — `arcade-otel-collector:4318` doesn't
+  resolve; (b) no bridge from OTLP-push into the pull-based Prometheus/Mimir pipeline. Fix = an
+  OTEL Collector in `arcade` ns that ingests the engine's OTLP and EITHER exposes a `prometheus`
+  exporter `/metrics` scraped via a `ServiceMonitor`, OR `prometheusremotewrite` straight to the
+  Mimir push URL+tenant above. (Chart may bundle a disabled collector subchart — verify first.)
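+
+Before proposing the fix, gap (a) can be re-confirmed from a shell. The sketch below assumes
+`kubectl` access to the cluster; the function names and the `KUBECTL` override variable are
+hypothetical. A failing `kubectl get` is reported as `absent-or-unreachable` because a
+missing Service and a failed API call look the same at this level.
+
+```shell
+# KUBECTL may point at a wrapper or a stub; it defaults to the real binary.
+KUBECTL="${KUBECTL:-kubectl}"
+
+# Report whether the collector Service exists in the namespace.
+collector_service_status() {
+  ns="${1:-arcade}"
+  svc="${2:-arcade-otel-collector}"
+  if "$KUBECTL" -n "$ns" get service "$svc" >/dev/null 2>&1; then
+    echo "present"
+  else
+    echo "absent-or-unreachable"
+  fi
+}
+
+# List ServiceMonitors that carry the label the scrape convention expects.
+list_service_monitors() {
+  ns="${1:-arcade}"
+  "$KUBECTL" -n "$ns" get servicemonitors -l release=prometheus-operator
+}
+```
+
+Both helpers are read-only, so they stay within the read-freely rule for the deployment.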
+
+## Live fixtures (filled in Phase 1)
+- **Project:** _TBD (Task 1.1)_
+- **API key:** _label / last-4 only — never the key (Task 1.1)_
+- **Headless auth header convention:** _confirmed in Task 1.1_
+- **Baseline gateway:** _slug + tool allow-list (Task 1.2)_
+- **Shared reference server:** _name + tools echo/whoami/add (Task 1.4)_
+- **`whoami` identity field:** _exact field the server reads (Task 1.4 / 2.4)_