docs: plain-language explainer of the AH / Tool Hub / gateways stack

Plain-terms companion to integration-architecture.md: Automation Hub as the
internal action warehouse, Tool Hub as the smart front desk (progressive
disclosure + per-user permission filtering + audit) running as a central
service, and where the MCP Gateway (Arcade, per-user OAuth for outside tools)
and AI Gateway (config-only model toll booth) plug into existing seams.
Source-verified against servicetitan/tool-hub + automation-hub @ master.
Tejus Rachakonda
2026-06-22 12:19:18 -05:00
parent 0dfeeb4194
commit 9acd84b910
3 changed files with 143 additions and 16 deletions
+12 -12
@@ -1,23 +1,24 @@
# STATUS — "you are here" handoff
Each lane owns its own section. Update yours; don't touch others'. Keep it terse.
Last full-repo update: 2026-06-22.
Last full-repo update: 2026-06-18 (scaffold).
## Category 1 — Functional MCP Gateway Capability
- Owner: ztaylor
- Status: **SCORED (draft 4/5)**: `categories/cat1-functional/criteria-section-1.md`, awaiting user paste into the Google Doc.
- Last live-state check: 2026-06-22
- Result: protocol/curation/mixed/dynamic-reg/zero-config-clients all PASS; per-user execution proven (`whoami` A→A/B→B); Claude Code connected via Arcade-Headers AND Entra OAuth. One finding: per-user tool-LIST scoping is gateway-wide, not native (→ cat-3/separate gateways).
- Fixtures (reusable): gateway `zeb-gateway-test`; ref server `arcade-eval-ref` (lib/mcp_server) registered via cloudflared quick tunnel (EPHEMERAL — re-establish for cat-9; see LIVE-POC).
- Status: in progress (scaffold done; executing per `~/repos/docs/arcade-eval-plan.md`)
- Last live-state check:
- Notes: cat-1 lane = this session. Per-user tests via `user_id` headers (real Entra SSO → cat 2).
## Category 2 — Delegated Authorization and Identity
- Owner: — (security cluster: Dane / Chandu)
- Status: not started (criteria stub seeded), **but cat-1 work already generated strong evidence; see LIVE-POC "Known behaviors".**
- Notes: holds the Entra/Okta SSO login → identity-mapping test. Open finding: User Source keys user_id on opaque Entra `sub`, mismatching the dashboard email → blocks downstream OAuth consent bind (fix: map User Source to the email claim). Google provider redirect-uri/secret issue was resolved 2026-06-22.
- Status: not started (criteria stub seeded)
- Notes: holds the Entra/Okta SSO login → identity-mapping test (a teammate can be User B).
## Category 3 — Tool-Level Access Control and Policy
- Owner: — (security cluster)
- Status: not started (criteria stub seeded)
- Owner: trachakonda
- Status: in progress — B1 (curr-state) + B5 (enforcement/bypass) DONE; B2/B3/B4 + per-user B1 pending dashboard + Contextual Access.
- Last live-state check: 2026-06-18 (apps/arcade #2383 steady; dashboard 200). Noted: otel-collector + jaeger now deployed (cat-5) → trace store for B6.
- Notes: Engine is the enforcement point (ungranted tool rejected there); one gateway = gateway-wide tool list (A==B), not per-user. Bypass: public-isolated for in-cluster worker (ClusterIP); tunnel custom servers = documented boundary. Blocked on dashboard for Contextual Access (input-block/output-redact) + per-user grants.
## Category 4 — Connector Coverage and Custom Server Development
- Owner: — (adopt/operate cluster)
@@ -25,9 +26,8 @@ Last full-repo update: 2026-06-22.
## Category 5 — Auditability and Observability
- Owner: ztaylor
- Status: **NEXT — start here in a fresh session** (invoke skill `arcade-gateway-eval`; read this + LIVE-POC; run live-state check). See `categories/cat5-auditability/NOTES.md` for the plan.
- Last live-state check: —
- Notes: metrics → **Grafana/Mimir** (NOT ELK); logs → ELK (Vector). Engine OTLP currently **dropped** — collector `arcade-otel-collector:4318` doesn't resolve. First task = OTEL collector → Prometheus/Mimir remediation (with the user; touches `k8s-backstage-v2/apps/arcade`). Full evidence + remediation shapes in LIVE-POC "Observability".
- Status: not started (criteria stub seeded)
- Notes: metrics → Grafana/Mimir (NOT ELK); engine OTLP currently dropped (no collector). See LIVE-POC.
## Category 6 — Security and Compliance
- Owner: — (security cluster)
@@ -25,20 +25,24 @@
## Benchmark tests
| # | Test (verbatim) | Result | Evidence |
|---|---|---|---|
| 1 | Grant User A access to GitHub tools and User B access to Atlassian tools. Verify User A cannot invoke Atlassian tools even if they know the tool name. | | |
| 1 | Grant User A access to GitHub tools and User B access to Atlassian tools. Verify User A cannot invoke Atlassian tools even if they know the tool name. | PARTIAL (curr-state): on one gateway the tool list is gateway-wide, identical for A and B (not per-user); an ungranted/unknown tool is cleanly rejected at the Engine. True per-user grant (A=GitHub, B=Atlassian) needs 2 gateways or Contextual Access (dashboard). | probes.md §B1: A==B 10 tools; `Github_CreateIssue` → `McpError: tool not enabled for this gateway` |
| 2 | Write a Contextual Access rule that blocks inputs containing a specific pattern (e.g., a mock SSN). Send a matching input — verify it is blocked before execution and logged. | | |
| 3 | Write a Contextual Access rule that redacts a field from tool outputs. Verify the field is absent from the agent's response. | | |
| 4 | Update User A's tool grants (add a new tool). Verify the change takes effect without restarting anything. | | |
| 5 | Confirm policy enforcement point: attempt to bypass Contextual Access by calling the server directly (bypassing the Engine). Confirm this is architecturally prevented or explicitly documented as a known boundary. | | |
| 5 | Confirm policy enforcement point: attempt to bypass Contextual Access by calling the server directly (bypassing the Engine). Confirm this is architecturally prevented or explicitly documented as a known boundary. | DONE — enforcement is at the Engine. All arcade Services are ClusterIP; the worker (where tools run) is not public → public bypass network-prevented. In-cluster direct-to-worker is reachable but secret-gated (operational). Self-hosted custom servers exposed via public tunnel are a documented bypass boundary. | probes.md §B5: svc types; worker `/worker/health`=200, `/mcp`=406 (needs secret) |
## Suggested pass/fail gates
| Gate | Pass condition (verbatim) | Result | Evidence |
|---|---|---|---|
| Tool isolation | Cross-user tool calls are rejected at the Engine regardless of client behavior | | |
| Tool isolation | Cross-user tool calls are rejected at the Engine regardless of client behavior | PARTIAL — ungranted/unknown tools are rejected at the Engine (not the client); but on one gateway the allow-list is gateway-wide, so it is not yet per-*user* isolation. | probes.md §B1/§B5 |
| Input policy | Blocked inputs are rejected before execution, not after | | |
| Output policy | Redacted fields are absent from the agent's response | | |
| Audit | Every policy decision (allow/block/redact) produces a retrievable log entry | | |
| Dynamic grants | Tool grant updates take effect without service restart | | |
## Findings
-
- **Enforcement point = the Engine (criterion 5).** Ungranted/unknown tool calls are rejected at the Engine with a clean structured error (`tool not enabled for this gateway`) — no leak, no execution, no shared-credential fallback.
- **Tool curation is per-gateway, not per-user (criteria 1, 2).** On a single Arcade-Headers gateway the tool list is identical for every `Arcade-User-ID` (A==B). Per-user differentiation requires Contextual Access (an access hook) or separate gateways / a User Source — to be tested once dashboard access lands.
- **Bypass surface (criterion 5 boundary).** Public attack surface is network-isolated for in-cluster tools (worker is ClusterIP). Two documented boundaries: (a) in-cluster direct-to-worker is only secret+network gated (operational, not architectural); (b) self-hosted custom servers exposed via public Cloudflare tunnel can be called directly, bypassing Engine policy — mitigate in prod via ClusterIP registration / tunnel access control.
- **V4 seam note.** With no ToolHub deployed, all of the above is Arcade-native enforcement. For a ToolHub front, the authority decision + audit (`ToolHubDecisionRecord`) would move to the ToolHub MCP Endpoint, and Arcade should be reachable only via ToolHub (closes boundary (a)/(b)).
- _Pending (dashboard / Contextual Access): per-user grants (1), Contextual Access input block (3) + output redaction (4), dynamic per-user grant w/o restart (7), audit of decisions (6), Okta-group scopes (8)._
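
The first two findings can be pictured with a toy model. This is a sketch of the observed behavior only, not Arcade's code: the `echo` tool, the function names, and the exception class are hypothetical, while the gateway name, `whoami`, and the error text come from the notes above.

```python
class ToolNotEnabled(Exception):
    """Raised when a call names a tool the gateway does not expose."""


# Hypothetical allow-list; only the gateway name and `whoami` come from the notes.
GATEWAY_TOOLS = {"zeb-gateway-test": {"whoami", "echo"}}


def list_tools(gateway: str, user_id: str) -> list[str]:
    # user_id is accepted but ignored: the list is gateway-wide.
    return sorted(GATEWAY_TOOLS[gateway])


def call_tool(gateway: str, user_id: str, tool: str) -> dict:
    # The check happens server-side, whatever the client claims to know.
    if tool not in GATEWAY_TOOLS[gateway]:
        raise ToolNotEnabled("tool not enabled for this gateway")
    return {"tool": tool, "ran_as": user_id}
```

In this model user A and user B get the same list, and a call to `Github_CreateIssue` fails for both. Per-user isolation would need the allow-list keyed on the user as well, which is what separate gateways or Contextual Access would supply.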
+123
@@ -0,0 +1,123 @@
# How the stack works — Automation Hub, Tool Hub, and the two gateways (plain language)
> A plain-terms companion to the technical seam map in
> `categories/cat3-access-policy/integration-architecture.md`. Same architecture, no jargon.
> Grounded in `servicetitan/automation-hub` @ master and `servicetitan/tool-hub` @ master
> (source-verified 2026-06-22).
## The one-paragraph version
**Automation Hub** is the warehouse of 5,000+ things an agent can *do* inside ServiceTitan.
**Tool Hub** is the smart front desk that makes that giant catalog usable for an AI and acts as
the single bouncer (per-user permissions + audit). The **MCP Gateway (Arcade)** plugs in beside
Automation Hub to add *outside* tools (GitHub, Slack, Google) **with per-user login** — the one
thing neither of the others can do. The **AI Gateway** is one toll booth that every model/AI call
passes through (keys, cost, rate limits), added by **configuration, not a rebuild**.
---
## 1. Automation Hub — the warehouse of actions
Where ServiceTitan keeps everything an agent can actually *do*: "create a job," "look up a
customer," "send an invoice" — 5,000+ actions today.
- It holds the **catalog** (every action + what inputs it needs) and does the **execution**
(actually calls ServiceTitan's internal APIs).
- Its login is **ServiceTitan-identity only.** It can act as a ServiceTitan user/bot, but it has
**no way to log into GitHub / Slack / Google on your behalf** — and that's deliberate (AH's
roadmap lists third-party OAuth as a non-goal).
> AH = the internal action warehouse. Great at ServiceTitan, blind to outside SaaS.
## 2. Tool Hub — the smart front desk
Handing an AI the raw list of 5,000 tools (heading to 200,000) blows its context window and it
picks the wrong tool. Tool Hub is the front desk between the agent and the warehouse. It does
three things:
1. **Aggregates** — every source (AH today, others later) becomes one clean, unified list. The
agent sees **one front desk**, not many warehouses.
2. **Discovers progressively** — the agent never reads the whole catalog. It asks:
- *"What tools do something like X?"* → `search_tools` returns a **short shortlist**
(names + one-line summaries only).
- *"How exactly do I use this one?"* → `get_tool_details` returns full instructions for just
the **1–3** it actually wants.
- *"Run it."* → `execute_tool`.
- (Plus `resume_execution`, `list_namespaces`, `cancel_execution`.)
It finds tools by **meaning, not keywords** — semantic search over a vector database
(pgvector + HNSW), embedded by **Voyage**, descriptions enriched by **Claude**, then reranked.
3. **Permission-filters** — before the shortlist ever reaches the agent, it **removes any tool
you're not allowed to use.** You can't see, let alone call, what you don't have access to.
> Tool Hub = the brain *and* the bouncer. It runs as its **own central service** (two
> autoscaled Kubernetes deployments + an admin UI), **not** a sidecar — and it's the single
> place policy, permissions, and audit live.
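
The search → details → run loop and the permission filter can be sketched in a few lines. This is an illustration under loud assumptions, not Tool Hub's code: the three-tool catalog, the permission strings, and the word-overlap scoring are all hypothetical stand-ins (the real service ranks by semantic search), and only the three function names come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    summary: str        # one-line description shown in the shortlist
    instructions: str   # full usage details, returned only on request
    permission: str     # permission a user needs to see or run the tool


# Hypothetical three-tool catalog standing in for the real one.
CATALOG = {
    t.name: t
    for t in [
        Tool("create_job", "create a job for a customer", "inputs: customer_id, job_type", "jobs.write"),
        Tool("lookup_customer", "look up a customer by name or phone", "inputs: query", "customers.read"),
        Tool("send_invoice", "send an invoice to a customer", "inputs: invoice_id", "billing.write"),
    ]
}


def _allowed(name: str, permissions: set[str]) -> Tool:
    """Fetch a tool, refusing if the caller lacks its permission."""
    tool = CATALOG[name]
    if tool.permission not in permissions:
        raise PermissionError(f"{name} is not available to this user")
    return tool


def search_tools(query: str, permissions: set[str], limit: int = 3) -> list[dict]:
    """Step 1: return a shortlist of names and summaries only."""
    words = set(query.lower().split())
    # Filter first, so tools the user cannot use never reach the agent.
    visible = [t for t in CATALOG.values() if t.permission in permissions]
    # Word overlap is a stand-in for the semantic ranking described above.
    scored = [(len(words & set(t.summary.split())), t) for t in visible]
    hits = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [{"name": t.name, "summary": t.summary} for _, t in hits[:limit]]


def get_tool_details(name: str, permissions: set[str]) -> dict:
    """Step 2: return full instructions for one chosen tool."""
    tool = _allowed(name, permissions)
    return {"name": tool.name, "instructions": tool.instructions}


def execute_tool(name: str, args: dict, permissions: set[str]) -> dict:
    """Step 3: run the tool (here, just echo what would be executed)."""
    tool = _allowed(name, permissions)
    return {"executed": tool.name, "args": args}
```

A user holding only `customers.read` who searches for "look up a customer" gets a one-item shortlist. The other two tools never appear, and asking for their details directly raises `PermissionError`, so the filter applies at every step rather than only at search time.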
**The flow so far:**
```
Agent → Tool Hub (front desk: search · filter · decide) → Automation Hub (execute) → ServiceTitan APIs
```
## 3. Where the two gateways fit
Two real gaps remain. Each gateway plugs one.
### MCP Gateway (Arcade) — the gap = *outside tools*
Tool Hub + AH are great for internal ServiceTitan actions, but neither can **log into
GitHub/Slack/Google as you**. That's Arcade's one job: a second warehouse for **outside SaaS
tools, with per-user login built in.** Tool Hub already has an empty "plug in another source"
slot (the `mcp_proxy` adapter), so Arcade plugs in **right beside** Automation Hub:
```mermaid
flowchart LR
Agent["LLM Agent"]
TH["Tool Hub<br/>(brain + bouncer:<br/>search · per-user filter · audit)"]
AH["Automation Hub<br/>(internal actions)"]
AR["MCP Gateway — Arcade<br/>(outside tools + per-user login)"]
ST["ServiceTitan APIs"]
SaaS["GitHub · Slack · Google"]
Agent --> TH
TH --> AH --> ST
TH --> AR --> SaaS
classDef new fill:#ffe8cc,stroke:#e8860c,stroke-width:2px,color:#000;
class AR new;
```
Tool Hub stays the single front desk and bouncer for **both** paths. The only difference: for an
outside tool it hands off to Arcade, and **Arcade handles the messy per-user OAuth login** (that's
the "authorize GitHub" pop-up). Tool Hub never stores your GitHub token — Arcade does.
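
The hand-off can be sketched as a small router. Everything here is a hypothetical illustration rather than either product's API: the `ArcadeStub` class, the routing table, and the return shapes are invented for the example, and only the `mcp_proxy` adapter name comes from the text.

```python
# Hypothetical routing table: tool name -> (adapter, outside provider or None).
TOOL_SOURCES = {
    "create_job": ("automation_hub", None),
    "github_create_issue": ("mcp_proxy", "github"),
}


class ArcadeStub:
    """Stand-in for the MCP Gateway: it alone holds per-user OAuth tokens."""

    def __init__(self) -> None:
        self._tokens: dict[tuple[str, str], str] = {}

    def complete_login(self, user_id: str, provider: str, token: str) -> None:
        # Called once the user finishes the provider's authorize pop-up.
        self._tokens[(user_id, provider)] = token

    def run(self, user_id: str, provider: str, tool: str, args: dict) -> dict:
        if (user_id, provider) not in self._tokens:
            return {"via": "arcade", "status": "authorization_required", "provider": provider}
        return {"via": "arcade", "status": "ok", "tool": tool, "as_user": user_id}


def route(tool: str, args: dict, user_id: str, arcade: ArcadeStub) -> dict:
    """Tool Hub side: pick the adapter; never touch third-party tokens."""
    adapter, provider = TOOL_SOURCES[tool]
    if adapter == "automation_hub":
        return {"via": "automation_hub", "status": "ok", "tool": tool, "as_user": user_id}
    return arcade.run(user_id, provider, tool, args)
```

The token dictionary lives only inside `ArcadeStub`, which mirrors the split described above: the router decides where a call goes, and the gateway decides whether this particular user has logged in to the outside service yet.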
### AI Gateway — the gap = *the model calls themselves*
Everything above quietly uses AI models: semantic search uses **Voyage** embeddings, catalog
descriptions are written by **Claude**, the agent itself calls a model to think. The **AI
Gateway** is **one toll booth** all of that passes through — so keys, cost tracking, rate limits,
and routing live in one place.
The key point: this is **configuration, not a rebuild.** Every component already calls models
through a swappable address; you just **repoint those addresses at the gateway.**
```mermaid
flowchart LR
A["Agent thinking"] --> GW
B["Tool Hub — search (Voyage)"] --> GW
C["Tool Hub — descriptions (Claude)"] --> GW
GW["AI Gateway<br/>(one toll booth: keys · cost · limits)"] --> P["Anthropic · Voyage · OpenAI"]
classDef new fill:#ffe8cc,stroke:#e8860c,stroke-width:2px,color:#000;
class GW new;
```
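
"Repoint those addresses" might look like the sketch below. The setting names, the `.example` hosts, and the gateway URL scheme are all placeholders assumed for illustration; the real components' configuration keys are not shown in this document.

```python
import copy

# Hypothetical per-component model settings; hosts are placeholders.
MODEL_CONFIG = {
    "agent_llm": {"provider": "anthropic", "base_url": "https://anthropic.example"},
    "tool_search_embeddings": {"provider": "voyage", "base_url": "https://voyage.example"},
    "catalog_descriptions": {"provider": "anthropic", "base_url": "https://anthropic.example"},
}


def repoint_at_gateway(config: dict, gateway_url: str) -> dict:
    """Return a copy of config with every base_url routed via the gateway."""
    updated = copy.deepcopy(config)
    for settings in updated.values():
        # Only the address changes; the provider stays as a path segment
        # so the gateway knows where to forward the call.
        settings["base_url"] = f"{gateway_url.rstrip('/')}/{settings['provider']}"
    return updated
```

Nothing in the calling code changes in this picture: each component still reads its `base_url` from configuration, and the gateway becomes the one address they all share.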
## 4. The whole picture in one breath
| Piece | What it is (simple) | The gap it fills |
|---|---|---|
| **Automation Hub** | Warehouse of 5,000+ internal ServiceTitan actions; executes them (ST-login only) | — (the base) |
| **Tool Hub** | Smart central front desk: makes the catalog usable for an AI (search → details → run) + the one bouncer (per-user filter + audit) | Scale + governance |
| **MCP Gateway (Arcade)** | Plugs in beside AH to add outside tools (GitHub/Slack/Google) **with per-user login** | The thing neither AH nor Tool Hub can do |
| **AI Gateway** | One toll booth for **all** model/AI calls | One place for keys/cost/limits — added by config |
**The design win:** adding both gateways is mostly **plugging into seams that already exist**:
Tool Hub stays the single authority, Automation Hub is untouched, and the only genuinely new
capability (logging into third-party apps as you) lives inside Arcade.