From 9acd84b91037ab922be8c8f5696ff92f83da037f Mon Sep 17 00:00:00 2001
From: Tejus Rachakonda <trachakonda@servicetitan.com>
Date: Mon, 22 Jun 2026 12:19:18 -0500
Subject: [PATCH] docs: plain-language explainer of the AH / Tool Hub /
 gateways stack

Plain-terms companion to integration-architecture.md: Automation Hub as the
internal action warehouse, Tool Hub as the smart front desk (progressive
disclosure + per-user permission filtering + audit) running as a central
service, and where the MCP Gateway (Arcade, per-user OAuth for outside tools)
and AI Gateway (config-only model toll booth) plug into existing seams.
Source-verified against servicetitan/tool-hub + automation-hub @ master.
---
 STATUS.md                                     |  24 ++--
 .../cat3-access-policy/criteria-section-3.md  |  12 +-
 docs/how-the-stack-works.md                   | 123 ++++++++++++++++++
 3 files changed, 143 insertions(+), 16 deletions(-)
 create mode 100644 docs/how-the-stack-works.md

diff --git a/STATUS.md b/STATUS.md
index 4a529b5..b5ae3d7 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -1,23 +1,24 @@
 # STATUS — "you are here" handoff
 
 Each lane owns its own section. Update yours; don't touch others'. Keep it terse.
-Last full-repo update: 2026-06-22.
+Last full-repo update: 2026-06-18 (scaffold).
 
 ## Category 1 — Functional MCP Gateway Capability
 - Owner: ztaylor
-- Status: **SCORED (draft 4/5)** — `categories/cat1-functional/criteria-section-1.md`, awaiting user paste into the Google Doc.
-- Last live-state check: 2026-06-22
-- Result: protocol/curation/mixed/dynamic-reg/zero-config-clients all PASS; per-user execution proven (`whoami` A→A/B→B); Claude Code connected via Arcade-Headers AND Entra OAuth. One finding: per-user tool-LIST scoping is gateway-wide, not native (→ cat-3/separate gateways).
-- Fixtures (reusable): gateway `zeb-gateway-test`; ref server `arcade-eval-ref` (lib/mcp_server) registered via cloudflared quick tunnel (EPHEMERAL — re-establish for cat-9; see LIVE-POC).
+- Status: in progress (scaffold done; executing per `~/repos/docs/arcade-eval-plan.md`)
+- Last live-state check: —
+- Notes: cat-1 lane = this session. Per-user tests via `user_id` headers (real Entra SSO → cat 2).
 
 ## Category 2 — Delegated Authorization and Identity
 - Owner: — (security cluster: Dane / Chandu)
-- Status: not started (criteria stub seeded) — **but cat-1 work already generated strong evidence; see LIVE-POC "Known behaviors".**
-- Notes: holds the Entra/Okta SSO login → identity-mapping test. Open finding: User Source keys user_id on opaque Entra `sub`, mismatching the dashboard email → blocks downstream OAuth consent bind (fix: map User Source to the email claim). Google provider redirect-uri/secret issue was resolved 2026-06-22.
+- Status: not started (criteria stub seeded)
+- Notes: holds the Entra/Okta SSO login → identity-mapping test (a teammate can be User B).
 
 ## Category 3 — Tool-Level Access Control and Policy
-- Owner: — (security cluster)
-- Status: not started (criteria stub seeded)
+- Owner: trachakonda
+- Status: in progress — B1 (curr-state) + B5 (enforcement/bypass) DONE; B2/B3/B4 + per-user B1 pending dashboard + Contextual Access.
+- Last live-state check: 2026-06-18 (apps/arcade #2383 steady; dashboard 200). Noted: otel-collector + jaeger now deployed (cat-5) → trace store for B6.
+- Notes: Engine is the enforcement point (ungranted tool rejected there); one gateway = gateway-wide tool list (A==B), not per-user. Bypass: public-isolated for in-cluster worker (ClusterIP); tunnel custom servers = documented boundary. Blocked on dashboard for Contextual Access (input-block/output-redact) + per-user grants.
 
 ## Category 4 — Connector Coverage and Custom Server Development
 - Owner: — (adopt/operate cluster)
@@ -25,9 +26,8 @@ Last full-repo update: 2026-06-22.
 
 ## Category 5 — Auditability and Observability
 - Owner: ztaylor
-- Status: **NEXT — start here in a fresh session** (invoke skill `arcade-gateway-eval`; read this + LIVE-POC; run live-state check). See `categories/cat5-auditability/NOTES.md` for the plan.
-- Last live-state check: —
-- Notes: metrics → **Grafana/Mimir** (NOT ELK); logs → ELK (Vector). Engine OTLP currently **dropped** — collector `arcade-otel-collector:4318` doesn't resolve. First task = OTEL collector → Prometheus/Mimir remediation (with the user; touches `k8s-backstage-v2/apps/arcade`). Full evidence + remediation shapes in LIVE-POC "Observability".
+- Status: not started (criteria stub seeded)
+- Notes: metrics → Grafana/Mimir (NOT ELK); engine OTLP currently dropped (no collector). See LIVE-POC.
 
 ## Category 6 — Security and Compliance
 - Owner: — (security cluster)
diff --git a/categories/cat3-access-policy/criteria-section-3.md b/categories/cat3-access-policy/criteria-section-3.md
index c117e8c..d1dcbfc 100644
--- a/categories/cat3-access-policy/criteria-section-3.md
+++ b/categories/cat3-access-policy/criteria-section-3.md
@@ -25,20 +25,24 @@
 ## Benchmark tests
 | # | Test (verbatim) | Result | Evidence |
 |---|---|---|---|
-| 1 | Grant User A access to GitHub tools and User B access to Atlassian tools. Verify User A cannot invoke Atlassian tools even if they know the tool name. |  |  |
+| 1 | Grant User A access to GitHub tools and User B access to Atlassian tools. Verify User A cannot invoke Atlassian tools even if they know the tool name. | PARTIAL (curr-state) — on one gateway the tool list is gateway-wide, identical for A and B (not per-user); an ungranted/unknown tool is cleanly rejected at the Engine. True per-user grant (A=GitHub, B=Atlassian) needs 2 gateways or Contextual Access (dashboard). | probes.md §B1: A==B 10 tools; `Github_CreateIssue` → `McpError: tool not enabled for this gateway` |
 | 2 | Write a Contextual Access rule that blocks inputs containing a specific pattern (e.g., a mock SSN). Send a matching input — verify it is blocked before execution and logged. |  |  |
 | 3 | Write a Contextual Access rule that redacts a field from tool outputs. Verify the field is absent from the agent's response. |  |  |
 | 4 | Update User A's tool grants (add a new tool). Verify the change takes effect without restarting anything. |  |  |
-| 5 | Confirm policy enforcement point: attempt to bypass Contextual Access by calling the server directly (bypassing the Engine). Confirm this is architecturally prevented or explicitly documented as a known boundary. |  |  |
+| 5 | Confirm policy enforcement point: attempt to bypass Contextual Access by calling the server directly (bypassing the Engine). Confirm this is architecturally prevented or explicitly documented as a known boundary. | DONE — enforcement is at the Engine. All arcade Services are ClusterIP; the worker (where tools run) is not public → public bypass network-prevented. In-cluster direct-to-worker is reachable but secret-gated (operational). Self-hosted custom servers exposed via public tunnel are a documented bypass boundary. | probes.md §B5: svc types; worker `/worker/health`=200, `/mcp`=406 (needs secret) |
 
 ## Suggested pass/fail gates
 | Gate | Pass condition (verbatim) | Result | Evidence |
 |---|---|---|---|
-| Tool isolation | Cross-user tool calls are rejected at the Engine regardless of client behavior |  |  |
+| Tool isolation | Cross-user tool calls are rejected at the Engine regardless of client behavior | PARTIAL — ungranted/unknown tools are rejected at the Engine (not the client); but on one gateway the allow-list is gateway-wide, so it is not yet per-*user* isolation. | probes.md §B1/§B5 |
 | Input policy | Blocked inputs are rejected before execution, not after |  |  |
 | Output policy | Redacted fields are absent from the agent's response |  |  |
 | Audit | Every policy decision (allow/block/redact) produces a retrievable log entry |  |  |
 | Dynamic grants | Tool grant updates take effect without service restart |  |  |
 
 ## Findings
-- 
+- **Enforcement point = the Engine (criterion 5).** Ungranted/unknown tool calls are rejected at the Engine with a clean structured error (`tool not enabled for this gateway`) — no leak, no execution, no shared-credential fallback.
+- **Tool curation is per-gateway, not per-user (criteria 1, 2).** On a single Arcade-Headers gateway the tool list is identical for every `Arcade-User-ID` (A==B). Per-user differentiation requires Contextual Access (an access hook) or separate gateways / a User Source — to be tested once dashboard access lands.
+- **Bypass surface (criterion 5 boundary).** Public attack surface is network-isolated for in-cluster tools (worker is ClusterIP). Two documented boundaries: (a) in-cluster direct-to-worker is only secret+network gated (operational, not architectural); (b) self-hosted custom servers exposed via public Cloudflare tunnel can be called directly, bypassing Engine policy — mitigate in prod via ClusterIP registration / tunnel access control.
+- **V4 seam note.** With no ToolHub deployed, all of the above is Arcade-native enforcement. For a ToolHub front, the authority decision + audit (`ToolHubDecisionRecord`) would move to the ToolHub MCP Endpoint, and Arcade should be reachable only via ToolHub (closes boundary (a)/(b)).
+- _Pending (dashboard / Contextual Access): per-user grants (1), Contextual Access input block (3) + output redaction (4), dynamic per-user grant w/o restart (7), audit of decisions (6), Okta-group scopes (8)._
diff --git a/docs/how-the-stack-works.md b/docs/how-the-stack-works.md
new file mode 100644
index 0000000..aab1f51
--- /dev/null
+++ b/docs/how-the-stack-works.md
@@ -0,0 +1,123 @@
+# How the stack works — Automation Hub, Tool Hub, and the two gateways (plain language)
+
+> A plain-terms companion to the technical seam map in
+> `categories/cat3-access-policy/integration-architecture.md`. Same architecture, no jargon.
+> Grounded in `servicetitan/automation-hub` @ master and `servicetitan/tool-hub` @ master
+> (source-verified 2026-06-22).
+
+## The one-paragraph version
+
+**Automation Hub** is the warehouse of 5,000+ things an agent can *do* inside ServiceTitan.
+**Tool Hub** is the smart front desk that makes that giant catalog usable for an AI and acts as
+the single bouncer (per-user permissions + audit). The **MCP Gateway (Arcade)** plugs in beside
+Automation Hub to add *outside* tools (GitHub, Slack, Google) **with per-user login** — the one
+thing neither of the others can do. The **AI Gateway** is one toll booth that every model/AI call
+passes through (keys, cost, rate limits), added by **configuration, not a rebuild**.
+
+---
+
+## 1. Automation Hub — the warehouse of actions
+
+Where ServiceTitan keeps everything an agent can actually *do*: "create a job," "look up a
+customer," "send an invoice" — 5,000+ actions today.
+
+- It holds the **catalog** (every action + what inputs it needs) and does the **execution**
+  (actually calls ServiceTitan's internal APIs).
+- Its login is **ServiceTitan-identity only.** It can act as a ServiceTitan user/bot, but it has
+  **no way to log into GitHub / Slack / Google on your behalf** — and that's deliberate (AH's
+  roadmap lists third-party OAuth as a non-goal).
+
+> AH = the internal action warehouse. Great at ServiceTitan, blind to outside SaaS.
+
+## 2. Tool Hub — the smart front desk
+
+Handing an AI the raw list of 5,000 tools (heading to 200,000) overflows its context window and
+leads it to pick the wrong tool. Tool Hub is the front desk between the agent and the warehouse.
+It does three things:
+
+1. **Aggregates** — every source (AH today, others later) becomes one clean, unified list. The
+   agent sees **one front desk**, not many warehouses.
+2. **Discovers progressively** — the agent never reads the whole catalog. It asks:
+   - *"What tools do something like X?"* → `search_tools` returns a **short shortlist**
+     (names + one-line summaries only).
+   - *"How exactly do I use this one?"* → `get_tool_details` returns full instructions for just
+     the **1–3** it actually wants.
+   - *"Run it."* → `execute_tool`.
+   - (Plus `resume_execution`, `list_namespaces`, `cancel_execution`.)
+   It finds tools by **meaning, not keywords**: semantic search over a vector database
+   (pgvector + HNSW) using **Voyage** embeddings and **Claude**-enriched descriptions, then reranked.
+3. **Permission-filters** — before the shortlist ever reaches the agent, it **removes any tool
+   you're not allowed to use.** You can't see, let alone call, what you don't have access to.
+
+> Tool Hub = the brain *and* the bouncer. It runs as its **own central service** (two
+> autoscaled Kubernetes deployments + an admin UI), **not** a sidecar — and it's the single
+> place policy, permissions, and audit live.
+
+**The flow so far:**
+
+```
+Agent  →  Tool Hub (front desk: search · filter · decide)  →  Automation Hub (execute)  →  ServiceTitan APIs
+```
+
+## 3. Where the two gateways fit
+
+Two real gaps remain. Each gateway plugs one.
+
+### MCP Gateway (Arcade) — the gap = *outside tools*
+
+Tool Hub + AH are great for internal ServiceTitan actions, but neither can **log into
+GitHub/Slack/Google as you**. That's Arcade's one job: a second warehouse for **outside SaaS
+tools, with per-user login built in.** Tool Hub already has an empty "plug in another source"
+slot (the `mcp_proxy` adapter), so Arcade plugs in **right beside** Automation Hub:
+
+```mermaid
+flowchart LR
+  Agent["LLM Agent"]
+  TH["Tool Hub<br/>(brain + bouncer:<br/>search · per-user filter · audit)"]
+  AH["Automation Hub<br/>(internal actions)"]
+  AR["MCP Gateway — Arcade<br/>(outside tools + per-user login)"]
+  ST["ServiceTitan APIs"]
+  SaaS["GitHub · Slack · Google"]
+  Agent --> TH
+  TH --> AH --> ST
+  TH --> AR --> SaaS
+  classDef new fill:#ffe8cc,stroke:#e8860c,stroke-width:2px,color:#000;
+  class AR new;
+```
+
+Tool Hub stays the single front desk and bouncer for **both** paths. The only difference: for an
+outside tool it hands off to Arcade, and **Arcade handles the messy per-user OAuth login** (that's
+the "authorize GitHub" pop-up). Tool Hub never stores your GitHub token — Arcade does.
+
+### AI Gateway — the gap = *the model calls themselves*
+
+Everything above quietly uses AI models: semantic search uses **Voyage** embeddings, catalog
+descriptions are written by **Claude**, and the agent itself calls a model to think. The **AI
+Gateway** is **one toll booth** that all of those calls pass through, so keys, cost tracking,
+rate limits, and routing live in one place.
+
+The key point: this is **configuration, not a rebuild.** Every component already calls models
+through a swappable address; you just **repoint those addresses at the gateway.**
+
+```mermaid
+flowchart LR
+  A["Agent thinking"] --> GW
+  B["Tool Hub — search (Voyage)"] --> GW
+  C["Tool Hub — descriptions (Claude)"] --> GW
+  GW["AI Gateway<br/>(one toll booth: keys · cost · limits)"] --> P["Anthropic · Voyage · OpenAI"]
+  classDef new fill:#ffe8cc,stroke:#e8860c,stroke-width:2px,color:#000;
+  class GW new;
+```
+
+## 4. The whole picture in one breath
+
+| Piece | What it is (simple) | The gap it fills |
+|---|---|---|
+| **Automation Hub** | Warehouse of 5,000+ internal ServiceTitan actions; executes them (ST-login only) | — (the base) |
+| **Tool Hub** | Smart central front desk: makes the catalog usable for an AI (search → details → run) + the one bouncer (per-user filter + audit) | Scale + governance |
+| **MCP Gateway (Arcade)** | Plugs in beside AH to add outside tools (GitHub/Slack/Google) **with per-user login** | The thing neither AH nor Tool Hub can do |
+| **AI Gateway** | One toll booth for **all** model/AI calls | One place for keys/cost/limits — added by config |
+
+**The design win:** adding both gateways is mostly **plugging into seams that already exist** —
+Tool Hub stays the single authority, Automation Hub is untouched, and the only genuinely new
+capability (logging into third-party apps as you) lives inside Arcade.
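
The three-step discovery flow described in section 2 of `docs/how-the-stack-works.md` (a permission-filtered shortlist, then full details for a few tools, then execution) can be sketched roughly as below. This is a minimal in-memory illustration, not Tool Hub's implementation: only the names `search_tools`, `get_tool_details`, and `execute_tool` come from the doc. The `Tool` record, the catalog entries, the permission strings, and the word-overlap scoring (a stand-in for the semantic search and reranking the doc describes) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    summary: str              # one-line text shown in the shortlist
    input_schema: dict        # full details, returned only on request
    required_permission: str  # hypothetical permission string


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    Tool("create_job", "Create a job for a customer", {"customer_id": "str"}, "jobs.write"),
    Tool("lookup_customer", "Look up a customer by name", {"name": "str"}, "customers.read"),
    Tool("send_invoice", "Send an invoice to a customer", {"invoice_id": "str"}, "billing.write"),
]


def _find(name, permissions):
    """Return the tool if it exists AND the caller may use it.

    Unknown and unpermitted tools raise the same error, so a caller
    cannot probe for the existence of tools it has no access to.
    """
    for tool in CATALOG:
        if tool.name == name and tool.required_permission in permissions:
            return tool
    raise LookupError(f"tool not available: {name}")


def search_tools(query, permissions, limit=3):
    """Step 1: return a short list of (name, summary) pairs."""
    words = set(query.lower().split())
    scored = []
    for tool in CATALOG:
        if tool.required_permission not in permissions:
            continue  # filtered before ranking, so it never reaches the agent
        # Word overlap stands in for semantic similarity here.
        score = len(words & set(tool.summary.lower().split()))
        if score:
            scored.append((score, tool))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(tool.name, tool.summary) for _, tool in scored[:limit]]


def get_tool_details(name, permissions):
    """Step 2: return the full input schema for one chosen tool."""
    tool = _find(name, permissions)
    return {"name": tool.name, "summary": tool.summary, "inputs": tool.input_schema}


def execute_tool(name, args, permissions):
    """Step 3: re-check permission, validate inputs, then hand off.

    A real front desk would forward the call to the backing source;
    this sketch only returns a record of what would be dispatched.
    """
    tool = _find(name, permissions)
    missing = sorted(set(tool.input_schema) - set(args))
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    return {"tool": tool.name, "status": "dispatched", "args": args}
```

The permission check is repeated in `execute_tool` rather than trusted from the search step, which mirrors the doc's point that a user cannot call a tool just by knowing its name.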