# Guardrails & Policies

Highflame Guardrails protect your AI systems at every interaction — from user prompts to tool calls to model responses. They detect threats, enforce policies, and take action in real time, without requiring you to build or maintain your own detection logic.

Guardrails are session-aware. Rather than evaluating each message in isolation, they track activity across the full conversation and execution context: previous turns, tool call history, detected signals, and agent behavior over time. This allows them to catch sophisticated attacks that unfold gradually across multiple steps.

Guardrails can be:

* **Embedded in the Agent Gateway** for automatic inline protection across all agent traffic
* **Called directly via API** for explicit, per-request enforcement from any application

***

### What Guardrails Protect Against

#### Prompt & Agent Threats

Attacks against the agent's reasoning and instruction-following behavior, including attempts that span multiple conversation turns.

| Threat                          | Description                                                                                                          |
| ------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| **Prompt Injection**            | Direct attempts to override system instructions or hijack agent behavior                                             |
| **Indirect Prompt Injection**   | Hidden instructions embedded in tool outputs, documents, or retrieved content                                        |
| **Jailbreak**                   | Attempts to bypass safety constraints through prompt manipulation                                                    |
| **Multi-Turn Jailbreak**        | Jailbreak attempts that unfold gradually across several conversation turns                                           |
| **Tool Poisoning**              | Malicious instructions embedded in tool descriptions to redirect agent behavior                                      |
| **Agent Goal Switching**        | Mid-execution attempts to steer the agent toward an unintended objective                                             |
| **Suspicious Action Sequences** | Behavioral patterns that match known attack trajectories: data exfiltration, credential theft, destructive sequences |
| **Agent Loop Detection**        | Repeated invocation of the same tool, indicating stuck or manipulated execution                                      |
| **Token Budget Overrun**        | Detection of runaway sessions consuming excessive resources                                                          |

#### Sensitive Data Protection

Prevention of regulated or confidential information flowing in or out of your AI system.

| Threat                    | Description                                                                                          |
| ------------------------- | ---------------------------------------------------------------------------------------------------- |
| **PII Detection**         | Names, email addresses, phone numbers, SSNs, credit card numbers, and other personal identifiers     |
| **Secrets & Credentials** | API keys, tokens, passwords, private keys, and other credentials — 16+ secret formats                |
| **Keyword Matching**      | Exact and fuzzy matching against custom keyword libraries for topic restrictions or brand protection |
| **Custom Regex Patterns** | Deterministic, high-performance matching for known internal formats or compliance-sensitive strings  |
| **Enterprise DLP**        | Deep PII detection with fuzzy matching via Google Cloud DLP for regulated environments               |

#### Tool & Code Security

Protection against malicious inputs targeting tool execution and system calls.

| Threat                      | Description                                                          |
| --------------------------- | -------------------------------------------------------------------- |
| **Command Injection**       | Attempts to execute arbitrary system commands through tool arguments |
| **SQL Injection**           | Database manipulation payloads in tool inputs or model outputs       |
| **Path Traversal**          | Directory traversal attempts to access files outside intended scope  |
| **Script Injection**        | Malicious scripts embedded in content passed to tools                |
| **Encoded Injection**       | Base64 or URL-encoded payloads designed to bypass text-based filters |
| **Cross-Origin Escalation** | Attempts to access resources across trust boundaries                 |
| **MCP-Specific Risks**      | Attacks targeting MCP tool protocols and server interactions         |

#### Content Safety

Ensuring that both user inputs and model outputs meet your organization's standards.

| Threat                      | Description                                                          |
| --------------------------- | -------------------------------------------------------------------- |
| **Toxicity**                | Violence, hate speech, sexual content, weapons, crime, and profanity |
| **Phishing Links**          | Malicious URLs in prompts or model-generated content                 |
| **File Content Safety**     | Safety analysis of uploaded files and documents                      |
| **Hallucination Detection** | Factual inconsistency in model responses                             |
| **Language Detection**      | Identification of the language of incoming content (75 languages)    |

***

### Enforcement Actions

When a Guardrail detects a violation, it takes the action configured in your policy:

* **Block** — reject the request or response and return an error to the caller
* **Redact** — mask or remove the violating content and allow the rest through
* **Allow + Alert** — let the request through but emit a structured alert for review
* **Monitor** — observe and log without any enforcement, for shadow testing

***

### Policy-Driven Enforcement

Guardrails separate detection from enforcement. Detectors produce signals — injection scores, PII presence, tool risk levels, behavioral patterns. Cedar policies translate those signals into decisions. This means you can tune enforcement thresholds, combine signals, and scope rules to specific agents, environments, or trust levels without changing detection logic.

For policy authoring patterns and examples, see the [Cedar Cookbook](https://docs.highflame.ai/agent-authorization-and-control-shield/cedar-cookbook).

***

### Composing Guardrails

Guardrails evaluate at multiple points in the request lifecycle — before content reaches the model or tool (input phase) and after responses are generated (output phase). Within each phase, checks are layered: fast deterministic checks run first, followed by deeper semantic analysis. This allows Highflame to enforce strict controls on latency-sensitive paths while still running richer analysis where needed.

See [How Guardrails Evaluate](https://docs.highflame.ai/agent-authorization-and-control-shield/guardrails-policies/bounded-functional-units) for details on the evaluation lifecycle.

***

### Custom Detectors

In addition to built-in detection, Highflame supports custom detectors via webhooks. Register your own detection endpoint and declare the signal keys it produces. Those signals become available in Cedar policies alongside built-in detector output.

***

### Guardrail Coverage at a Glance

| What you're protecting | When evaluated                    |
| ---------------------- | --------------------------------- |
| User prompts           | Before reaching the model or tool |
| Tool call arguments    | Before the tool executes          |
| Tool responses         | Before returned to the model      |
| Model outputs          | Before returned to the user       |
| Uploaded files         | Before content is processed       |
| Conversation history   | Continuously across turns         |
