Integrated Guardrails
The Agent Gateway is the centralized security proxy through which all AI traffic flows — prompts to LLMs, tool calls to MCP servers, responses back to agents. Because every request and response passes through a single choke point, guardrails can be applied consistently across every integration without instrumenting individual call sites.
This page covers how guardrail evaluation works at the gateway layer, what is protected, and how enforcement decisions are made.
How it works
When a request arrives at the Agent Gateway, it is evaluated by Shield before the request is forwarded downstream. The gateway intercepts at every lifecycle stage — not just the initial prompt — so guardrails cover the full surface of an agent interaction.
Agent / Application
│
▼
┌───────────────────────────────────────────┐
│ Agent Gateway │
│ │
│ 1. Receive request │
│ 2. Evaluate with Shield ──────────────── │──► Shield (detection + Cedar policies)
│ 3. Enforce decision (allow / block / │
│ redact / alert) │
│ 4. Forward to provider or MCP server │
│ 5. Evaluate response with Shield ─────── │──► Shield
│ 6. Enforce, then return to caller │
└───────────────────────────────────────────┘
│
▼
LLM Provider / MCP ServerThe gateway evaluates every stage synchronously in the request path. If Shield is unreachable, the gateway fails open — it logs a warning and forwards the request — so a Shield outage does not cause downstream availability issues.
Evaluation points
Guardrails run at four points in the agent lifecycle:
User prompt
The incoming message before it reaches the model
Prompt injection, jailbreak, PII, secrets, toxic content
Tool call
Tool name and arguments before the tool executes
Command injection, SQL injection, path traversal, tool poisoning
Tool response
The tool output before the model sees it
Indirect prompt injection, data leakage, malicious payloads in tool results
Model response
The LLM output before it is returned to the caller
PII, secrets, hallucinations, phishing links, toxic content
Evaluating tool responses before they reach the model is what blocks indirect prompt injection attacks — where a malicious payload is embedded in a web page, document, or API response retrieved by a tool, and the model is instructed to act on it.
Detection pipeline
Each evaluation runs through the same three-tier Shield detection pipeline: deterministic fast checks (< 5 ms), ML/NLP semantic checks (10–200 ms), and cloud-API deep checks (50–500 ms). Tiers run in order; the pipeline early-exits as soon as a tier produces a signal that a Cedar policy would act on.
For the full detector reference — including all detectors in each tier and their Cedar context key names — see Securing Model Calls.
Cedar policies
Detection signals feed into Cedar policies, which determine the enforcement action. This separation means you can tune enforcement without changing the detection configuration.
A policy evaluates the signals available in the evaluation context:
Available context keys include injection and jailbreak scores (0–100), content safety scores, PII and secret detection flags, tool risk classification, MCP server metadata, and session history signals from prior turns in the conversation.
Session awareness
The gateway tracks cumulative signals across conversation turns within a session. A single message with a low injection score might be allowed through, but after several turns where each message incrementally pushes the agent toward a sensitive action, the accumulated session context can change the policy decision.
Session context keys available in policies:
session_injection_detected
Injection signal observed in any prior turn
session_pii_detected
PII detected in any prior turn
session_secrets_detected
Secrets detected in any prior turn
session_cumulative_risk_score
Running total of risk contributions across all turns
session_threat_turns
Number of turns with at least one threat signal
session_max_injection_score
Peak injection score seen in any prior turn
Pass a consistent session_id on every request to enable session tracking via the X-Highflame-Session-ID header.
Enforcement actions
When a policy matches, four enforcement actions are available:
Block
Request is rejected with a 403 response. The downstream provider or tool is never called.
Redact
Violating content (PII, secrets) is masked before the request is forwarded. The agent receives a response based on the redacted content.
Alert
Request is allowed through. An alert is emitted in Observatory for review.
Monitor
Request is allowed through with no alert. Detection signals are recorded in traces for analysis only.
Different enforcement actions can be applied at different evaluation points. For example, you might redact PII in prompts while blocking on tool call injection — configured per route in the gateway.
MCP gateway integration
When the Agent Gateway is used as an MCP gateway, all tool calls and tool responses pass through the same guardrail pipeline.
Tool call evaluation checks the tool name and input arguments for injection patterns, command injection, path traversal, and policy violations before the call is forwarded to the MCP server.
Tool response evaluation checks the tool output for indirect prompt injection payloads before returning the result to the model. This is the primary defense against documents, web pages, or API responses that contain adversarial instructions targeting the model.
The MCP Registry provides an additional trust layer — only tools that have been registered and enabled can be called through the gateway. Unregistered tools are rejected before guardrail evaluation even runs.
Configuring guardrails on a route
Guardrails are configured per route in the gateway. Each route specifies which policy set applies and the enforcement mode for that route.
The enforcement mode on a route overrides the default mode in the policy set. This lets you run the same policy set in monitor mode on development routes and enforce mode in production without duplicating policies.
Routing through the gateway
Point your existing OpenAI-compatible client at the gateway endpoint. No other changes are needed.
When a request is blocked, the gateway returns a 403 response with a structured error body:
Observability
Every request evaluated at the gateway generates a trace in Observatory. The trace captures:
Latency breakdown — application, Highflame gateway, downstream provider
Detector results for each evaluation point, including scores and matched signals
Cedar policy decisions and which policy triggered the enforcement action
Tool invocation records with inputs and outputs
Session context accumulated across turns
Blocked requests appear in the Threats view for triage. Session-level patterns surface in the Sessions view. Full distributed traces are available in the Traces view.
Gateway vs. SDK guardrails
Agent Gateway
SDK (client.guard.*)
Integration effort
Point base URL at gateway
Instrument each call site
Coverage
All traffic automatically
Only instrumented call sites
Framework support
Any OpenAI-compatible client
Python and TypeScript SDKs
Session tracking
Header-based, automatic
Pass session_id per call
MCP protection
Built-in via MCP gateway
Manual tool output evaluation
Best for
Consistent enforcement across all agents
Fine-grained control, custom logic
Both approaches use the same detection pipeline, Cedar policies, and Observatory backend. They can also be combined: route model traffic through the gateway and use the SDK to guard custom tool logic that doesn't flow through the gateway.
Last updated