Guardrails & Policies

Highflame Guardrails — real-time threat detection and policy enforcement across prompts, tool calls, model responses, and files.

Highflame Guardrails protect your AI systems at every interaction — from user prompts to tool calls to model responses. They detect threats, enforce policies, and take action in real time, without requiring you to build or maintain your own detection logic.

Guardrails are session-aware. Rather than evaluating each message in isolation, they track activity across the full conversation and execution context: previous turns, tool call history, detected signals, and agent behavior over time. This allows them to catch sophisticated attacks that unfold gradually across multiple steps.
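To illustrate why session-level tracking matters, here is a minimal Python sketch. It is not Highflame's implementation: the `SessionRisk` class, the score values, and both thresholds are assumptions made for the example. It shows how a sequence of turns that each look harmless on their own can still cross a session-wide limit.

```python
from dataclasses import dataclass, field


@dataclass
class SessionRisk:
    """Accumulates per-turn risk scores for one conversation (hypothetical)."""
    per_turn_limit: float = 0.8   # assumed threshold for a single turn
    session_limit: float = 1.5    # assumed threshold for the whole session
    scores: list = field(default_factory=list)

    def observe(self, turn_score: float) -> bool:
        """Record a turn's score; return True if the session should be flagged."""
        self.scores.append(turn_score)
        over_turn = turn_score >= self.per_turn_limit
        over_session = sum(self.scores) >= self.session_limit
        return over_turn or over_session


session = SessionRisk()
# No single turn reaches 0.8, but the running total passes 1.5 on turn three.
flags = [session.observe(score) for score in (0.4, 0.5, 0.7)]
print(flags)  # [False, False, True]
```

A per-message evaluator would have passed all three turns; only the accumulated state flags the third.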

Guardrails can be:

  • Embedded in the Agent Gateway for automatic inline protection across all agent traffic

  • Called directly via API for explicit, per-request enforcement from any application


What Guardrails Protect Against

Prompt & Agent Threats

Attacks against the agent's reasoning and instruction-following behavior, including attempts that span multiple conversation turns.

| Threat | Description |
| --- | --- |
| Prompt Injection | Direct attempts to override system instructions or hijack agent behavior |
| Indirect Prompt Injection | Hidden instructions embedded in tool outputs, documents, or retrieved content |
| Jailbreak | Attempts to bypass safety constraints through prompt manipulation |
| Multi-Turn Jailbreak | Jailbreak attempts that unfold gradually across several conversation turns |
| Tool Poisoning | Malicious instructions embedded in tool descriptions to redirect agent behavior |
| Agent Goal Switching | Mid-execution attempts to steer the agent toward an unintended objective |
| Suspicious Action Sequences | Behavioral patterns that match known attack trajectories: data exfiltration, credential theft, destructive sequences |
| Agent Loop Detection | Repeated invocation of the same tool, indicating stuck or manipulated execution |
| Token Budget Overrun | Detection of runaway sessions consuming excessive resources |
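As a rough illustration of the loop detection idea above, the following Python sketch counts back-to-back calls to the same tool. The function names and the `max_repeats` limit of 3 are assumptions for the example; the source does not describe how Highflame detects loops internally.

```python
def consecutive_repeats(tool_calls: list) -> int:
    """Count how many times the most recent tool was called back to back."""
    count = 0
    for name in reversed(tool_calls):
        if name != tool_calls[-1]:
            break
        count += 1
    return count


def is_looping(tool_calls: list, max_repeats: int = 3) -> bool:
    """Flag a session whose latest tool has run max_repeats times in a row."""
    return consecutive_repeats(tool_calls) >= max_repeats


print(is_looping(["search", "fetch", "fetch", "fetch"]))  # True
print(is_looping(["search", "fetch", "search"]))          # False
```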

Sensitive Data Protection

Prevents regulated or confidential information from flowing into or out of your AI system.

| Threat | Description |
| --- | --- |
| PII Detection | Names, email addresses, phone numbers, SSNs, credit card numbers, and other personal identifiers |
| Secrets & Credentials | API keys, tokens, passwords, private keys, and other credentials (16+ secret formats) |
| Keyword Matching | Exact and fuzzy matching against custom keyword libraries for topic restrictions or brand protection |
| Custom Regex Patterns | Deterministic, high-performance matching for known internal formats or compliance-sensitive strings |
| Enterprise DLP | Deep PII detection with fuzzy matching via Google Cloud DLP for regulated environments |
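The deterministic, pattern-based part of this category can be sketched in a few lines of Python. The patterns below are deliberately simplified assumptions for illustration (a basic email shape, the US SSN digit layout, and an AWS-style access key ID prefix); they are not the patterns Highflame ships, and real detectors need to handle far more formats and edge cases.

```python
import re

# Simplified, illustrative patterns; labels and regexes are assumptions.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "AWS_ACCESS_KEY_ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def redact(text: str):
    """Replace each match with a [LABEL] placeholder; return text and labels hit."""
    found = []
    for label, pattern in PATTERNS.items():
        text, hits = pattern.subn(f"[{label}]", text)
        if hits:
            found.append(label)
    return text, found


clean, labels = redact("Mail jane@example.com, SSN 123-45-6789")
print(clean)   # Mail [EMAIL], SSN [US_SSN]
print(labels)  # ['EMAIL', 'US_SSN']
```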

Tool & Code Security

Protection against malicious inputs targeting tool execution and system calls.

| Threat | Description |
| --- | --- |
| Command Injection | Attempts to execute arbitrary system commands through tool arguments |
| SQL Injection | Database manipulation payloads in tool inputs or model outputs |
| Path Traversal | Directory traversal attempts to access files outside intended scope |
| Script Injection | Malicious scripts embedded in content passed to tools |
| Encoded Injection | Base64 or URL-encoded payloads designed to bypass text-based filters |
| Cross-Origin Escalation | Attempts to access resources across trust boundaries |
| MCP-Specific Risks | Attacks targeting MCP tool protocols and server interactions |
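Encoded injection is easiest to understand with an example: a filter that only inspects the raw string misses a payload that has been URL-encoded or Base64-encoded. The Python sketch below decodes a tool argument in both ways before scanning it. The `SUSPICIOUS` pattern is a tiny, assumed stand-in for real injection detectors, and the function names are hypothetical.

```python
import base64
import binascii
import re
from urllib.parse import unquote

# A deliberately small, illustrative pattern set (assumption, not a real ruleset).
SUSPICIOUS = re.compile(r"(\.\./|;\s*rm\s|<script\b|\bUNION\s+SELECT\b)", re.IGNORECASE)


def decoded_views(value: str) -> list:
    """Return the raw value plus its URL-decoded and Base64-decoded forms."""
    views = [value, unquote(value)]
    try:
        views.append(base64.b64decode(value, validate=True).decode("utf-8"))
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass  # not valid Base64 text; scan the other views only
    return views


def looks_malicious(value: str) -> bool:
    """Scan every decoded view, not just the raw string."""
    return any(SUSPICIOUS.search(view) for view in decoded_views(value))


encoded_script = base64.b64encode(b"<script>alert(1)</script>").decode()
print(looks_malicious("%2e%2e%2fetc%2fpasswd"))  # True (URL-encoded ../)
print(looks_malicious(encoded_script))           # True (Base64-encoded script tag)
print(looks_malicious("report.txt"))             # False
```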

Content Safety

Ensuring that both user inputs and model outputs meet your organization's standards.

| Threat | Description |
| --- | --- |
| Toxicity | Violence, hate speech, sexual content, weapons, crime, and profanity |
| Phishing Links | Malicious URLs in prompts or model-generated content |
| File Content Safety | Safety analysis of uploaded files and documents |
| Hallucination Detection | Factual inconsistency in model responses |
| Language Detection | Identification of the language of incoming content (75 languages) |


Enforcement Actions

When a Guardrail detects a violation, it takes the action configured in your policy:

  • Block — reject the request or response and return an error to the caller

  • Redact — mask or remove the violating content and allow the rest through

  • Allow + Alert — let the request through but emit a structured alert for review

  • Monitor — observe and log without any enforcement, for shadow testing
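The four actions above differ in what is forwarded to the caller. The following Python sketch makes that difference concrete. The `Action` enum, the `enforce` function, and the event labels are hypothetical names chosen for the example, not part of any Highflame SDK.

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    BLOCK = "block"
    REDACT = "redact"
    ALLOW_ALERT = "allow_alert"
    MONITOR = "monitor"


def enforce(action: Action, content: str, violation: str) -> Tuple[Optional[str], str]:
    """Return (content to forward, event label) for a detected violation."""
    if action is Action.BLOCK:
        return None, "blocked"          # nothing is forwarded; the caller gets an error
    if action is Action.REDACT:
        return content.replace(violation, "[REDACTED]"), "redacted"
    if action is Action.ALLOW_ALERT:
        return content, "alerted"       # content passes; an alert is emitted
    return content, "logged"            # MONITOR: observe only


message = "my password is hunter2"
print(enforce(Action.REDACT, message, "hunter2"))  # ('my password is [REDACTED]', 'redacted')
print(enforce(Action.BLOCK, message, "hunter2"))   # (None, 'blocked')
```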


Policy-Driven Enforcement

Guardrails separate detection from enforcement. Detectors produce signals — injection scores, PII presence, tool risk levels, behavioral patterns. Cedar policies translate those signals into decisions. This means you can tune enforcement thresholds, combine signals, and scope rules to specific agents, environments, or trust levels without changing detection logic.
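In Highflame this decision logic is written in Cedar, not Python. The sketch below uses Python only to illustrate the separation: the detector output is a plain dictionary of signals, and a separate function turns it into a decision. The signal keys (`injection_score`, `pii_detected`), the environment names, and the thresholds are all assumptions for the example.

```python
def decide(signals: dict, environment: str) -> str:
    """Map detector signals to a decision; thresholds vary by environment."""
    # Assumed thresholds: stricter in production, looser elsewhere.
    threshold = 0.5 if environment == "production" else 0.9
    if signals.get("injection_score", 0.0) >= threshold:
        return "block"
    if signals.get("pii_detected", False):
        return "redact"
    return "allow"


signals = {"injection_score": 0.6, "pii_detected": False}
print(decide(signals, "production"))  # block
print(decide(signals, "staging"))     # allow
```

The same signals produce different outcomes in the two environments, and changing a threshold touches only `decide`, never the code that produced the signals.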

For policy authoring patterns and examples, see the Cedar Cookbook.


Composing Guardrails

Guardrails evaluate at multiple points in the request lifecycle — before content reaches the model or tool (input phase) and after responses are generated (output phase). Within each phase, checks are layered: fast deterministic checks run first, followed by deeper semantic analysis. This allows Highflame to enforce strict controls on latency-sensitive paths while still running richer analysis where needed.
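The layering within a phase can be sketched as an ordered list of checks that stops at the first violation, so the expensive check never runs when a cheap one already caught the problem. Everything in this Python example is illustrative: the deny-list regex, the `semantic_check` stand-in (a real one would call a model), and the `calls` list used to show which checks ran.

```python
import re
from typing import Callable, List, Optional

# A check returns a reason string when it finds a violation, else None.
Check = Callable[[str], Optional[str]]

calls: List[str] = []  # records which checks actually ran, for illustration


def deny_list_check(text: str) -> Optional[str]:
    """Fast deterministic check: a regex over a hypothetical deny list."""
    calls.append("deny_list")
    if re.search(r"ignore (all )?previous instructions", text, re.IGNORECASE):
        return "deny_list_match"
    return None


def semantic_check(text: str) -> Optional[str]:
    """Stand-in for a slower model-based classifier (not a real detector)."""
    calls.append("semantic")
    return "semantic_flag" if "pretend you have no rules" in text.lower() else None


def evaluate(text: str, checks: List[Check]) -> Optional[str]:
    """Run checks in order and stop at the first violation."""
    for check in checks:
        reason = check(text)
        if reason is not None:
            return reason
    return None


LAYERS = [deny_list_check, semantic_check]  # cheapest first
print(evaluate("Please ignore previous instructions.", LAYERS))  # deny_list_match
print(calls)  # ['deny_list']
```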

See How Guardrails Evaluate for details on the evaluation lifecycle.


Custom Detectors

In addition to built-in detection, Highflame supports custom detectors via webhooks. Register your own detection endpoint and declare the signal keys it produces. Those signals become available in Cedar policies alongside built-in detector output.
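A custom detector is, at its core, a function from content to named signals. The Python sketch below shows the body of such a handler without any web framework around it. The request and response shapes (`content` in, `signals` out), the signal key `contains_project_codename`, and the codename list are assumptions for illustration; consult the Highflame webhook documentation for the actual contract.

```python
import json

# Hypothetical signal key this detector would declare at registration time.
DECLARED_SIGNALS = ("contains_project_codename",)
# Hypothetical internal codenames to look for.
CODENAMES = {"bluebird", "ironwood"}


def handle_detection(request_body: str) -> str:
    """Parse a JSON request, run the check, and return signals as JSON."""
    payload = json.loads(request_body)
    text = payload.get("content", "").lower()
    signals = {
        "contains_project_codename": any(name in text for name in CODENAMES),
    }
    return json.dumps({"signals": signals})


response = handle_detection(json.dumps({"content": "Status of Project Bluebird?"}))
print(response)  # {"signals": {"contains_project_codename": true}}
```

A policy could then reference the declared signal key the same way it references built-in detector output.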


Guardrail Coverage at a Glance

| What you're protecting | When evaluated |
| --- | --- |
| User prompts | Before reaching the model or tool |
| Tool call arguments | Before the tool executes |
| Tool responses | Before being returned to the model |
| Model outputs | Before being returned to the user |
| Uploaded files | Before content is processed |
| Conversation history | Continuously across turns |
