# Integrated Guardrails

The Agent Gateway is the centralized security proxy through which all AI traffic flows — prompts to LLMs, tool calls to MCP servers, responses back to agents. Because every request and response passes through a single choke point, guardrails can be applied consistently across every integration without instrumenting individual call sites.

This page covers how guardrail evaluation works at the gateway layer, what is protected, and how enforcement decisions are made.

***

## How it works

When a request arrives at the Agent Gateway, it is evaluated by Shield before the request is forwarded downstream. The gateway intercepts at every lifecycle stage — not just the initial prompt — so guardrails cover the full surface of an agent interaction.

```
Agent / Application
        │
        ▼
┌───────────────────────────────────────────┐
│  Agent Gateway                            │
│                                           │
│  1. Receive request                       │
│  2. Evaluate with Shield ──────────────── │──► Shield (detection + Cedar policies)
│  3. Enforce decision (allow / block /     │
│     redact / alert)                       │
│  4. Forward to provider or MCP server     │
│  5. Evaluate response with Shield ─────── │──► Shield
│  6. Enforce, then return to caller        │
└───────────────────────────────────────────┘
        │
        ▼
   LLM Provider / MCP Server
```

The gateway evaluates every stage synchronously in the request path. If Shield is unreachable, the gateway fails open: it logs a warning and forwards the request without evaluation, so a Shield outage does not cause downstream availability issues.
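
The request path and the fail-open behavior described above can be sketched roughly as follows. This is an illustrative model, not the gateway's implementation: `handle_request`, `evaluate_or_fail_open`, `ShieldUnavailable`, and the callables passed in are hypothetical names, and decisions are reduced to allow/block for brevity.

```python
import logging
from typing import Callable

logger = logging.getLogger("gateway-sketch")


class ShieldUnavailable(Exception):
    """Hypothetical error raised when Shield cannot be reached."""


def evaluate_or_fail_open(evaluate: Callable[[str], str], payload: str) -> str:
    """Return Shield's decision, or "allow" when Shield is unreachable."""
    try:
        return evaluate(payload)
    except ShieldUnavailable:
        # Fail open: log a warning and let the traffic through unevaluated.
        logger.warning("Shield unreachable; forwarding without evaluation")
        return "allow"


def handle_request(
    prompt: str,
    evaluate: Callable[[str], str],
    forward: Callable[[str], str],
) -> str:
    """Evaluate the request, forward it, then evaluate the response."""
    if evaluate_or_fail_open(evaluate, prompt) == "block":
        return "blocked: request"
    response = forward(prompt)
    if evaluate_or_fail_open(evaluate, response) == "block":
        return "blocked: response"
    return response


def shield_down(_payload: str) -> str:
    """Stand-in evaluator that simulates a Shield outage."""
    raise ShieldUnavailable()


# During the simulated outage the request still reaches the provider.
result = handle_request("hello", shield_down, lambda p: f"echo: {p}")
```

In this sketch a blocked request never reaches `forward`, while an outage is treated the same as an allow decision.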

***

## Evaluation points

Guardrails run at four points in the agent lifecycle:

| Stage              | What is evaluated                                  | Threats targeted                                                            |
| ------------------ | -------------------------------------------------- | --------------------------------------------------------------------------- |
| **User prompt**    | The incoming message before it reaches the model   | Prompt injection, jailbreak, PII, secrets, toxic content                    |
| **Tool call**      | Tool name and arguments before the tool executes   | Command injection, SQL injection, path traversal, tool poisoning            |
| **Tool response**  | The tool output before the model sees it           | Indirect prompt injection, data leakage, malicious payloads in tool results |
| **Model response** | The LLM output before it is returned to the caller | PII, secrets, hallucinations, phishing links, toxic content                 |

Evaluating tool responses before they reach the model is what blocks indirect prompt injection attacks, in which a malicious payload embedded in a web page, document, or API response retrieved by a tool instructs the model to act on the attacker's behalf.

***

## Detection pipeline

Each evaluation runs through the same three-tier Shield detection pipeline: deterministic fast checks (< 5 ms), ML/NLP semantic checks (10–200 ms), and cloud-API deep checks (50–500 ms). Tiers run in order; the pipeline early-exits as soon as a tier produces a signal that a Cedar policy would act on.
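
The ordering and early exit can be pictured with a small sketch. The detectors here are trivial stand-ins, and `run_pipeline`, `would_act`, and the `secrets_detected` and `content_safety_score` signal names are assumptions made for illustration. The threshold of 70 is borrowed from the Cedar example on this page.

```python
from typing import Callable, Dict, List, Tuple

Signals = Dict[str, int]
Tier = Tuple[str, Callable[[str], Signals]]


def run_pipeline(
    text: str,
    tiers: List[Tier],
    actionable: Callable[[Signals], bool],
) -> Tuple[Signals, List[str]]:
    """Run tiers in order and stop once a policy would act on the signals."""
    signals: Signals = {}
    tiers_run: List[str] = []
    for name, detect in tiers:
        signals.update(detect(text))
        tiers_run.append(name)
        if actionable(signals):
            break  # early exit: the slower tiers are skipped
    return signals, tiers_run


# Hypothetical stand-in detectors, one per tier.
def fast_checks(text: str) -> Signals:
    return {"secrets_detected": int("AKIA" in text)}


def semantic_checks(text: str) -> Signals:
    hit = "ignore previous instructions" in text.lower()
    return {"injection_score": 90 if hit else 5}


def deep_checks(text: str) -> Signals:
    return {"content_safety_score": 0}


TIERS: List[Tier] = [
    ("fast", fast_checks),
    ("semantic", semantic_checks),
    ("deep", deep_checks),
]


def would_act(signals: Signals) -> bool:
    """Stand-in for "a Cedar policy would act on these signals"."""
    return (
        signals.get("secrets_detected", 0) == 1
        or signals.get("injection_score", 0) >= 70
    )


_, ran_for_secret = run_pipeline("key AKIA123", TIERS, would_act)
_, ran_for_clean = run_pipeline("hello", TIERS, would_act)
```

Here a secret caught by the fast tier ends the evaluation after one tier, while clean text runs through all three.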

For the full detector reference — including all detectors in each tier and their Cedar context key names — see [Securing Model Calls](/agent-gateway/securing-model-calls.md#input-evaluation).

***

## Cedar policies

Detection signals feed into Cedar policies, which determine the enforcement action. This separation means you can tune enforcement without changing the detection configuration.

A policy evaluates the signals available in the evaluation context:

```cedar
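// Permit prompt processing when the injection score is below 70
// or the agent has first_party trust. Cedar denies by default, so
// high-scoring prompts from agents below first_party are blocked
// unless another policy permits them.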
permit (
  principal,
  action == Action::"process_prompt",
  resource
)
when {
  context.injection_score < 70 ||
  context.trust_level == "first_party"
};
```

Available context keys include injection and jailbreak scores (0–100), content safety scores, PII and secret detection flags, tool risk classification, MCP server metadata, and session history signals from prior turns in the conversation.
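
To make the condition in the example policy concrete, the sketch below restates its `when` clause in Python and evaluates it for a few contexts. It only mirrors the boolean logic and is not how Cedar evaluates policies. `permits_prompt` is a hypothetical helper, and `"third_party"` is an assumed trust level value used for contrast.

```python
def permits_prompt(context: dict) -> bool:
    """Mirror of the example policy's `when` clause."""
    return (
        context["injection_score"] < 70
        or context["trust_level"] == "first_party"
    )


# Low score from a less trusted agent: permitted.
low_score = permits_prompt({"injection_score": 10, "trust_level": "third_party"})

# High score from a less trusted agent: the permit does not apply.
high_score = permits_prompt({"injection_score": 85, "trust_level": "third_party"})

# High score from a first_party agent: permitted by the second clause.
trusted = permits_prompt({"injection_score": 85, "trust_level": "first_party"})
```

Because Cedar denies any request that no `permit` policy matches, the second case is denied when this is the only applicable policy.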

***

## Session awareness

The gateway tracks cumulative signals across conversation turns within a session. A single message with a low injection score might be allowed through, but after several turns where each message incrementally pushes the agent toward a sensitive action, the accumulated session context can change the policy decision.

Session context keys available in policies:

| Key                             | Description                                          |
| ------------------------------- | ---------------------------------------------------- |
| `session_injection_detected`    | Injection signal observed in any prior turn          |
| `session_pii_detected`          | PII detected in any prior turn                       |
| `session_secrets_detected`      | Secrets detected in any prior turn                   |
| `session_cumulative_risk_score` | Running total of risk contributions across all turns |
| `session_threat_turns`          | Number of turns with at least one threat signal      |
| `session_max_injection_score`   | Peak injection score seen in any prior turn          |

To enable session tracking, pass the same session ID on every request in a conversation via the `X-Highflame-Session-ID` header.
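
The sketch below shows one way signals could accumulate across turns into the keys listed above. It is an assumption-laden model rather than the gateway's logic: `SessionState` is hypothetical, the per-turn threshold of 70 is borrowed from the Cedar example, and each turn's injection score is assumed to be its entire risk contribution.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    """Hypothetical per-session accumulator for the documented keys."""

    injection_detected: bool = False
    pii_detected: bool = False
    secrets_detected: bool = False
    cumulative_risk_score: int = 0
    threat_turns: int = 0
    max_injection_score: int = 0

    def record_turn(self, injection_score: int, pii: bool = False,
                    secrets: bool = False, threshold: int = 70) -> None:
        """Fold one turn's signals into the running session state."""
        injected = injection_score >= threshold
        self.injection_detected = self.injection_detected or injected
        self.pii_detected = self.pii_detected or pii
        self.secrets_detected = self.secrets_detected or secrets
        # Assumption: the injection score is the turn's whole risk contribution.
        self.cumulative_risk_score += injection_score
        if injected or pii or secrets:
            self.threat_turns += 1
        self.max_injection_score = max(self.max_injection_score, injection_score)

    def as_context(self) -> dict:
        """Expose the state under the session context key names."""
        return {
            "session_injection_detected": self.injection_detected,
            "session_pii_detected": self.pii_detected,
            "session_secrets_detected": self.secrets_detected,
            "session_cumulative_risk_score": self.cumulative_risk_score,
            "session_threat_turns": self.threat_turns,
            "session_max_injection_score": self.max_injection_score,
        }


session = SessionState()
for score in (20, 30, 40):  # three turns, each below the per-turn threshold
    session.record_turn(score)
context = session.as_context()
```

No single turn in this example crosses the per-turn threshold, yet the cumulative score rises with each turn, which is the kind of signal a session-aware policy can act on.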

***

## Enforcement actions

When a policy matches, four enforcement actions are available:

| Action      | Behavior                                                                                                                                 |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| **Block**   | Request is rejected with a `403` response. The downstream provider or tool is never called.                                              |
| **Redact**  | Violating content (PII, secrets) is masked before the request is forwarded. The agent receives a response based on the redacted content. |
| **Alert**   | Request is allowed through. An alert is emitted in Observatory for review.                                                               |
| **Monitor** | Request is allowed through with no alert. Detection signals are recorded in traces for analysis only.                                    |

Different enforcement actions can be applied at different evaluation points. For example, you might redact PII in prompts while blocking on tool call injection — configured per route in the gateway.
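
As a rough illustration of how the four actions differ in effect, the sketch below applies each one to the same content. `Outcome` and `enforce` are hypothetical names, and the email regex is a deliberately simple stand-in for real PII detection.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Simplistic email pattern, used only to demonstrate masking.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Outcome:
    forwarded: Optional[str]  # content sent downstream; None when blocked
    blocked: bool             # True means the caller receives a 403
    alert: bool               # True means an alert is emitted for review


def enforce(action: str, content: str) -> Outcome:
    """Apply one enforcement action to a piece of content."""
    if action == "block":
        return Outcome(forwarded=None, blocked=True, alert=False)
    if action == "redact":
        masked = EMAIL.sub("[REDACTED]", content)
        return Outcome(forwarded=masked, blocked=False, alert=False)
    if action == "alert":
        return Outcome(forwarded=content, blocked=False, alert=True)
    if action == "monitor":
        return Outcome(forwarded=content, blocked=False, alert=False)
    raise ValueError(f"unknown enforcement action: {action}")


text = "mail me at a.b@example.com now"
redacted = enforce("redact", text)
blocked = enforce("block", text)
```

Only `block` stops the request. The other three forward it, differing in whether the content is masked and whether an alert is raised.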

***

## MCP gateway integration

When the Agent Gateway is used as an MCP gateway, all tool calls and tool responses pass through the same guardrail pipeline.

**Tool call evaluation** checks the tool name and input arguments for injection patterns, command injection, path traversal, and policy violations before the call is forwarded to the MCP server.

**Tool response evaluation** checks the tool output for indirect prompt injection payloads before returning the result to the model. This is the primary defense against documents, web pages, or API responses that contain adversarial instructions targeting the model.

The MCP Registry provides an additional trust layer — only tools that have been registered and enabled can be called through the gateway. Unregistered tools are rejected before guardrail evaluation even runs.
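
The ordering of the registry check relative to guardrail evaluation can be sketched as below. The registry shape (tool name mapped to an enabled flag), `handle_tool_call`, and `fake_evaluate` are assumptions for illustration only.

```python
from typing import Callable, Dict, List


def handle_tool_call(
    tool_name: str,
    registry: Dict[str, bool],
    evaluate: Callable[[str], str],
) -> str:
    """Reject unregistered or disabled tools before any guardrail runs."""
    if not registry.get(tool_name, False):
        return "rejected"
    return evaluate(tool_name)


# Hypothetical registry: tool name -> enabled flag.
REGISTRY = {"search_docs": True, "delete_records": False}

evaluated: List[str] = []  # tools that actually reached guardrail evaluation


def fake_evaluate(tool_name: str) -> str:
    """Stand-in for guardrail evaluation that always allows."""
    evaluated.append(tool_name)
    return "allow"


results = [
    handle_tool_call(name, REGISTRY, fake_evaluate)
    for name in ("search_docs", "delete_records", "unknown_tool")
]
```

In this sketch only `search_docs` reaches the evaluator. The disabled and unknown tools are rejected first.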

***

## Configuring guardrails on a route

Guardrails are configured per route in the gateway. Each route specifies which policy set applies and the enforcement mode for that route.

```yaml
routes:
  - name: customer-support-agent
    model: anthropic/claude-sonnet-4-6
    policy_set: customer-support-policies
    guardrails:
      prompt: enforce
      tool_call: enforce
      tool_response: enforce
      response: alert   # alert on response violations, don't block
    session_tracking: true
```

The enforcement mode on a route overrides the default mode in the policy set. This lets you run the same policy set in monitor mode on development routes and enforce mode in production without duplicating policies.
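
The override rule amounts to a lookup with a fallback, as the sketch below shows. `effective_mode` is a hypothetical helper, and falling back to the policy set default when a route leaves a stage unset is an assumption rather than documented behavior.

```python
from typing import Dict


def effective_mode(
    stage: str,
    route_guardrails: Dict[str, str],
    policy_set_default: str,
) -> str:
    """Use the route's mode for a stage, else the policy set default."""
    return route_guardrails.get(stage, policy_set_default)


# Same policy set, different routes.
PROD_ROUTE = {
    "prompt": "enforce",
    "tool_call": "enforce",
    "tool_response": "enforce",
    "response": "alert",
}
DEV_ROUTE = {stage: "monitor" for stage in PROD_ROUTE}

prod_response = effective_mode("response", PROD_ROUTE, "enforce")
dev_prompt = effective_mode("prompt", DEV_ROUTE, "enforce")
unset_stage = effective_mode("tool_call", {}, "enforce")
```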

***

## Routing through the gateway

Point your existing OpenAI-compatible client at the gateway endpoint and add the Highflame headers shown below. No other code changes are needed.

{% tabs %}
{% tab title="Python" %}

```python
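import os
import uuid

from openai import OpenAI

# Placeholder values for illustration. A UUID is assumed as the session ID;
# reuse the same value for every turn in a conversation.
session_id = str(uuid.uuid4())
user_input = "Hello"

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    # Send traffic to the gateway instead of directly to the provider.
    base_url=f"{os.environ['HIGHFLAME_BASE_URL']}/v1",
    default_headers={
        "X-Highflame-API-Key": os.environ["HIGHFLAME_API_KEY"],
        "X-Highflame-Session-ID": session_id,
    },
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": user_input}],
)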
```

{% endtab %}

{% tab title="TypeScript" %}

```typescript
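import { randomUUID } from "node:crypto";
import OpenAI from "openai";

// Placeholder values for illustration. A UUID is assumed as the session ID;
// reuse the same value for every turn in a conversation.
const sessionId = randomUUID();
const userInput = "Hello";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  // Send traffic to the gateway instead of directly to the provider.
  baseURL: `${process.env.HIGHFLAME_BASE_URL}/v1`,
  defaultHeaders: {
    "X-Highflame-API-Key": process.env.HIGHFLAME_API_KEY,
    "X-Highflame-Session-ID": sessionId,
  },
});

const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-6",
  messages: [{ role: "user", content: userInput }],
});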
```

{% endtab %}

{% tab title="curl" %}

```bash
curl "$HIGHFLAME_BASE_URL/v1/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "X-Highflame-API-Key: $HIGHFLAME_API_KEY" \
  -H "X-Highflame-Session-ID: $SESSION_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

{% endtab %}
{% endtabs %}

When a request is blocked, the gateway returns a `403` response with a structured error body:

```json
{
  "error": {
    "type": "request_blocked",
    "message": "Request blocked by policy: prompt_injection_detected",
    "policy_reason": "Injection score exceeded threshold for this route",
    "decision": "deny"
  }
}
```
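
A caller can tell a guardrail block apart from other failures by inspecting this body. The sketch below parses it using only the standard library. `RequestBlocked` and `raise_if_blocked` are hypothetical names, and how you obtain the status code and raw body depends on your HTTP client.

```python
import json


class RequestBlocked(Exception):
    """Hypothetical exception carrying the gateway's block details."""

    def __init__(self, message: str, policy_reason: str, decision: str):
        super().__init__(message)
        self.policy_reason = policy_reason
        self.decision = decision


def raise_if_blocked(status_code: int, body: str) -> None:
    """Raise RequestBlocked when the response is a guardrail block."""
    if status_code != 403:
        return
    try:
        error = json.loads(body).get("error", {})
    except (json.JSONDecodeError, AttributeError):
        return  # a 403 without the structured body shown above
    if isinstance(error, dict) and error.get("type") == "request_blocked":
        raise RequestBlocked(
            error.get("message", ""),
            error.get("policy_reason", ""),
            error.get("decision", ""),
        )


EXAMPLE_BODY = json.dumps({
    "error": {
        "type": "request_blocked",
        "message": "Request blocked by policy: prompt_injection_detected",
        "policy_reason": "Injection score exceeded threshold for this route",
        "decision": "deny",
    }
})

try:
    raise_if_blocked(403, EXAMPLE_BODY)
    caught = None
except RequestBlocked as exc:
    caught = exc
```

Surfacing `policy_reason` to operators rather than end users is one reasonable design choice, since it describes why the policy fired.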

***

## Observability

Every request evaluated at the gateway generates a trace in Observatory. The trace captures:

* Latency breakdown — application, Highflame gateway, downstream provider
* Detector results for each evaluation point, including scores and matched signals
* Cedar policy decisions and which policy triggered the enforcement action
* Tool invocation records with inputs and outputs
* Session context accumulated across turns

Blocked requests appear in the **Threats** view for triage. Session-level patterns surface in the **Sessions** view. Full distributed traces are available in the **Traces** view.

***

## Gateway vs. SDK guardrails

|                        | Agent Gateway                            | SDK (`client.guard.*`)             |
| ---------------------- | ---------------------------------------- | ---------------------------------- |
| **Integration effort** | Point base URL at gateway                | Instrument each call site          |
| **Coverage**           | All traffic automatically                | Only instrumented call sites       |
| **Framework support**  | Any OpenAI-compatible client             | Python and TypeScript SDKs         |
| **Session tracking**   | Header-based, automatic                  | Pass `session_id` per call         |
| **MCP protection**     | Built-in via MCP gateway                 | Manual tool output evaluation      |
| **Best for**           | Consistent enforcement across all agents | Fine-grained control, custom logic |

Both approaches use the same detection pipeline, Cedar policies, and Observatory backend. They can also be combined: route model traffic through the gateway and use the SDK to guard custom tool logic that doesn't flow through the gateway.

