diff --git a/GROUND-RULES.md b/GROUND-RULES.md
new file mode 100644
index 0000000..0627a89
--- /dev/null
+++ b/GROUND-RULES.md
@@ -0,0 +1,43 @@
+# Ground Rules (binding)
+
+These apply to every lane and every session. Read before doing anything.
+
+## Credentials
+- Credentials live **only** in the git-ignored `.env`. Never print, commit, or persist keys
+  elsewhere (not in docs, not in `config/targets.yaml`, not in commit messages).
+- Load with `set -a && . ./.env && set +a`.
+
+## The criteria Google Doc
+- **Never write the criteria Google Doc from a session.** Concurrent writes spliced tables
+  mid-word in the prior eval. Compose `criteria-section-N.md` locally; **the human pastes.**
+- Criterion / gate / benchmark-question wording is **verbatim** from the criteria doc —
+  never paraphrase. Re-read the doc if unsure.
+
+## Live-state check (REQUIRED before any conclusion)
+The deployment is actively changing; status docs age within a day. Before drawing any
+conclusion from the live instance:
+```
+git -C ~/repos/k8s-backstage-v2 log --oneline -8 origin/master -- apps/arcade
+```
+plus a dashboard/gateway health probe (e.g. `curl -sS -o /dev/null -w '%{http_code}\n' https://dashboard.arcade.st.dev`).
+Any not-yet-reverted in-flight "TEMPORARY"/teardown commit means the bench is NOT in a
+validated steady state — don't draw conclusions from it.
+
+## File ownership (parallel-session safety)
+| You may write | You may NOT write |
+|---|---|
+| your `categories/catN-*/` subtree (criteria-section-N.md, tests/, NOTES.md) | another lane's `categories/` subtree |
+| your own section of `STATUS.md` | another lane's STATUS section |
+| `config/targets.yaml`, `lib/`, top-level docs — **append-mostly**, coordinate | — |
+| `results/` (git-ignored) | the criteria Google Doc (see above) |
+
+`git pull --rebase` before starting and again before pushing; on rejection, `git pull --rebase`.
+
+## Deployment changes
+- `~/repos/k8s-backstage-v2/apps/arcade/**` is read freely but changed **only deliberately,
+  with the operator** (infra owns this cluster/POC). Expected case: the cat-5 collector+exporter
+  remediation — propose first, execute together, document before/after.
+
+## Scoring
+- Single candidate (Arcade only): 1–5 scale, anchors at 1/3/5. Scores drafted locally;
+  nothing lands in the Google Doc/spreadsheet without the human pasting.
diff --git a/LIVE-POC.md b/LIVE-POC.md
new file mode 100644
index 0000000..cf08130
--- /dev/null
+++ b/LIVE-POC.md
@@ -0,0 +1,46 @@
+# Live POC — frozen facts
+
+Self-hosted on `backstage-wus2-v4` via Flux; vendor Helm chart **1.8.8**
+(`apps/arcade/` in `k8s-backstage-v2`, `origin/master`). **Run the live-state check
+(GROUND-RULES) before trusting any of this — it ages.**
+
+## Deployment
+- **Endpoints:** `api.arcade.st.dev` (MCP/engine), `coordinator.`, `dashboard.`,
+  `experience.arcade.st.dev`. Gateway URLs: `https://api.arcade.st.dev/mcp/{slug}`.
+- **Upstream IdP:** ServiceTitan **Entra ID** app registration (iac PR #4012). Not Okta yet
+  (Okta is the criteria doc's eventual target — note the gap when scoring identity / cat 2).
+- **Chat/playground:** **disabled** (`features.chatEnabled: false`); engine LLM + embeddings
+  routed through in-cluster **LiteLLM**, not api.openai.com.
+- **Datastores:** bundled in-cluster Postgres + Redis, default passwords, ephemeral.
+
+## Observability (cat 5 — confirmed)
+- **OTEL (evidence, Kibana 2026-06-18):** the `arcade-engine` pod emits OTLP metrics by
+  default but the target collector **does not resolve** — repeating ~60s:
+  `failed to upload metrics: Post "http://arcade-otel-collector:4318/v1/metrics": dial tcp:
+  lookup arcade-otel-collector ... no such host`. Instrumentation is ON; the collector Service
+  `arcade-otel-collector` is **not deployed/resolvable** in the `arcade` ns; every metric is
+  dropped. (Chart lists the collector image but the HelmRelease never enabled/named it.)
+- **Logs → ELK:** Vector daemonset scrapes pod stdout/stderr cluster-wide → ELK. Engine logs
+  already reach Kibana (that's how the above error is visible). Visible fields incl.
+  `Tracing.TraceId`, `ContextInfo.CorrelationId`, `NetCore.RequestPath` → engine is a .NET app
+  emitting structured logs with trace/correlation IDs (relevant to trace propagation pre-OTEL).
+- **Metrics pipeline (metrics ≠ logs):** metrics do **not** go to ELK. **Metrics → Grafana**,
+  via the **Grafana Agent Operator** (`MetricsInstance` `main`, ns `monitoring`) which scrapes
+  **all `ServiceMonitor`/`PodMonitor` CRs cluster-wide** (any namespace; excludes ServiceMonitors
+  labeled `grafana-agent: external`) and `remoteWrite`s to **Grafana Mimir**
+  (`http://mimir-nginx.mimir.observability-wus2/api/v1/push`, tenant header
+  `X-Scope-OrgID: k8s-backstage-v4`). Convention: an app exposes a Prometheus `/metrics` port +
+  a `ServiceMonitor` (label `release: prometheus-operator`) → auto-scraped → Grafana.
+- **Two cat-5 gaps for metrics:** (a) no collector — `arcade-otel-collector:4318` doesn't
+  resolve; (b) no bridge from OTLP-push into the pull-based Prometheus/Mimir pipeline. Fix = an
+  OTEL Collector in `arcade` ns that ingests the engine's OTLP and EITHER exposes a `prometheus`
+  exporter `/metrics` scraped via a `ServiceMonitor`, OR `prometheusremotewrite` straight to the
+  Mimir push URL+tenant above. (Chart may bundle a disabled collector subchart — verify first.)
+
+## Live fixtures (filled in Phase 1)
+- **Project:** _TBD (Task 1.1)_
+- **API key:** _label / last-4 only — never the key (Task 1.1)_
+- **Headless auth header convention:** _confirmed in Task 1.1_
+- **Baseline gateway:** _slug + tool allow-list (Task 1.2)_
+- **Shared reference server:** _name + tools echo/whoami/add (Task 1.4)_
+- **`whoami` identity field:** _exact field the server reads (Task 1.4 / 2.4)_
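
Review note: LIVE-POC.md defers to the live-state check in GROUND-RULES.md, which is a two-step manual procedure (recent commits under `apps/arcade`, then a dashboard probe). A minimal sketch of how the two steps might be wrapped follows. The function names `flag_inflight` and `live_state_check` are hypothetical and appear in neither file. The sketch only flags commit subjects that mention a marker; it cannot tell whether a flagged commit has since been reverted, and it prints the HTTP status without judging it, so a human still makes both calls.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of the repo.

# Reads `git log --oneline` output on stdin and prints every line whose
# subject mentions TEMPORARY or teardown (case-insensitive).
# Returns 1 if at least one such line was found, 0 otherwise.
flag_inflight() {
  if grep -Ei 'TEMPORARY|teardown'; then
    return 1
  fi
  return 0
}

# Runs both steps from GROUND-RULES.md. Assumes origin/master in the local
# clone is recent enough to reflect the live deployment.
live_state_check() {
  # Step 1: last 8 commits touching apps/arcade; stop if any carries a marker.
  if ! git -C ~/repos/k8s-backstage-v2 log --oneline -8 origin/master -- apps/arcade \
      | flag_inflight; then
    echo "in-flight marker found above: check whether it was reverted" >&2
    return 1
  fi
  # Step 2: dashboard health probe; prints the HTTP status code for the reader.
  curl -sS -o /dev/null -w '%{http_code}\n' https://dashboard.arcade.st.dev
}
```

Sourcing the file and calling `live_state_check` at the start of a session would run both steps. A non-zero return from the first step means the flagged commits need a manual look before drawing any conclusion from the live instance.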