Integrated Guardrails

The Agent Gateway is the centralized security proxy through which all AI traffic flows — prompts to LLMs, tool calls to MCP servers, responses back to agents. Because every request and response passes through a single choke point, guardrails can be applied consistently across every integration without instrumenting individual call sites.

This page covers how guardrail evaluation works at the gateway layer, what is protected, and how enforcement decisions are made.


How it works

When a request arrives at the Agent Gateway, it is evaluated by Shield before the request is forwarded downstream. The gateway intercepts at every lifecycle stage — not just the initial prompt — so guardrails cover the full surface of an agent interaction.

Agent / Application


┌───────────────────────────────────────────┐
│  Agent Gateway                            │
│                                           │
│  1. Receive request                       │
│  2. Evaluate with Shield ──────────────── │──► Shield (detection + Cedar policies)
│  3. Enforce decision (allow / block /     │
│     redact / alert)                       │
│  4. Forward to provider or MCP server     │
│  5. Evaluate response with Shield ─────── │──► Shield
│  6. Enforce, then return to caller        │
└───────────────────────────────────────────┘


   LLM Provider / MCP Server

The gateway evaluates every stage synchronously in the request path. If Shield is unreachable, the gateway fails open — it logs a warning and forwards the request — so a Shield outage does not cause downstream availability issues.


Evaluation points

Guardrails run at four points in the agent lifecycle:

Stage
What is evaluated
Threats targeted

User prompt

The incoming message before it reaches the model

Prompt injection, jailbreak, PII, secrets, toxic content

Tool call

Tool name and arguments before the tool executes

Command injection, SQL injection, path traversal, tool poisoning

Tool response

The tool output before the model sees it

Indirect prompt injection, data leakage, malicious payloads in tool results

Model response

The LLM output before it is returned to the caller

PII, secrets, hallucinations, phishing links, toxic content

Evaluating tool responses before they reach the model is what blocks indirect prompt injection attacks — where a malicious payload is embedded in a web page, document, or API response retrieved by a tool, and the model is instructed to act on it.


Detection pipeline

Each evaluation runs through the same three-tier Shield detection pipeline: deterministic fast checks (< 5 ms), ML/NLP semantic checks (10–200 ms), and cloud-API deep checks (50–500 ms). Tiers run in order; the pipeline early-exits as soon as a tier produces a signal that a Cedar policy would act on.

For the full detector reference — including all detectors in each tier and their Cedar context key names — see Securing Model Calls.


Cedar policies

Detection signals feed into Cedar policies, which determine the enforcement action. This separation means you can tune enforcement without changing the detection configuration.

A policy evaluates the signals available in the evaluation context:

Available context keys include injection and jailbreak scores (0–100), content safety scores, PII and secret detection flags, tool risk classification, MCP server metadata, and session history signals from prior turns in the conversation.


Session awareness

The gateway tracks cumulative signals across conversation turns within a session. A single message with a low injection score might be allowed through, but after several turns where each message incrementally pushes the agent toward a sensitive action, the accumulated session context can change the policy decision.

Session context keys available in policies:

Key
Description

session_injection_detected

Injection signal observed in any prior turn

session_pii_detected

PII detected in any prior turn

session_secrets_detected

Secrets detected in any prior turn

session_cumulative_risk_score

Running total of risk contributions across all turns

session_threat_turns

Number of turns with at least one threat signal

session_max_injection_score

Peak injection score seen in any prior turn

Pass a consistent session_id on every request to enable session tracking via the X-Highflame-Session-ID header.


Enforcement actions

When a policy matches, four enforcement actions are available:

Action
Behavior

Block

Request is rejected with a 403 response. The downstream provider or tool is never called.

Redact

Violating content (PII, secrets) is masked before the request is forwarded. The agent receives a response based on the redacted content.

Alert

Request is allowed through. An alert is emitted in Observatory for review.

Monitor

Request is allowed through with no alert. Detection signals are recorded in traces for analysis only.

Different enforcement actions can be applied at different evaluation points. For example, you might redact PII in prompts while blocking on tool call injection — configured per route in the gateway.


MCP gateway integration

When the Agent Gateway is used as an MCP gateway, all tool calls and tool responses pass through the same guardrail pipeline.

Tool call evaluation checks the tool name and input arguments for injection patterns, command injection, path traversal, and policy violations before the call is forwarded to the MCP server.

Tool response evaluation checks the tool output for indirect prompt injection payloads before returning the result to the model. This is the primary defense against documents, web pages, or API responses that contain adversarial instructions targeting the model.

The MCP Registry provides an additional trust layer — only tools that have been registered and enabled can be called through the gateway. Unregistered tools are rejected before guardrail evaluation even runs.


Configuring guardrails on a route

Guardrails are configured per route in the gateway. Each route specifies which policy set applies and the enforcement mode for that route.

The enforcement mode on a route overrides the default mode in the policy set. This lets you run the same policy set in monitor mode on development routes and enforce mode in production without duplicating policies.


Routing through the gateway

Point your existing OpenAI-compatible client at the gateway endpoint. No other changes are needed.

When a request is blocked, the gateway returns a 403 response with a structured error body:


Observability

Every request evaluated at the gateway generates a trace in Observatory. The trace captures:

  • Latency breakdown — application, Highflame gateway, downstream provider

  • Detector results for each evaluation point, including scores and matched signals

  • Cedar policy decisions and which policy triggered the enforcement action

  • Tool invocation records with inputs and outputs

  • Session context accumulated across turns

Blocked requests appear in the Threats view for triage. Session-level patterns surface in the Sessions view. Full distributed traces are available in the Traces view.


Gateway vs. SDK guardrails

Agent Gateway

SDK (client.guard.*)

Integration effort

Point base URL at gateway

Instrument each call site

Coverage

All traffic automatically

Only instrumented call sites

Framework support

Any OpenAI-compatible client

Python and TypeScript SDKs

Session tracking

Header-based, automatic

Pass session_id per call

MCP protection

Built-in via MCP gateway

Manual tool output evaluation

Best for

Consistent enforcement across all agents

Fine-grained control, custom logic

Both approaches use the same detection pipeline, Cedar policies, and Observatory backend. They can also be combined: route model traffic through the gateway and use the SDK to guard custom tool logic that doesn't flow through the gateway.

Last updated