Guardrails & Policies
Highflame Guardrails — real-time threat detection and policy enforcement across prompts, tool calls, model responses, and files.
Highflame Guardrails protect your AI systems at every interaction — from user prompts to tool calls to model responses. They detect threats, enforce policies, and take action in real time, without requiring you to build or maintain your own detection logic.
Guardrails are session-aware. Rather than evaluating each message in isolation, they track activity across the full conversation and execution context: previous turns, tool call history, detected signals, and agent behavior over time. This allows them to catch sophisticated attacks that unfold gradually across multiple steps.
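To make session awareness concrete, here is a minimal Python sketch (not Highflame's implementation). A rolling window of per-turn risk scores lets several individually mild turns add up to a flag. The SessionState class, the evaluate_turn function, the window size, and both thresholds are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Hypothetical per-session state: a rolling window of per-turn risk scores."""
    window: int = 5
    scores: deque = field(default_factory=deque)

    def add_turn(self, score: float) -> float:
        """Record one turn's score and return the total risk inside the window."""
        self.scores.append(score)
        if len(self.scores) > self.window:
            self.scores.popleft()  # forget turns that fell out of the window
        return sum(self.scores)


def evaluate_turn(state: SessionState, turn_score: float,
                  per_turn_limit: float = 0.8, session_limit: float = 1.5) -> str:
    """Flag a turn that is risky on its own, or one that pushes accumulated risk too high."""
    cumulative = state.add_turn(turn_score)
    if turn_score >= per_turn_limit:
        return "flag: single-turn"
    if cumulative >= session_limit:
        return "flag: multi-turn"
    return "ok"
```

Three turns scoring 0.5 each all pass the per-turn check, but the third brings the window total to 1.5 and is flagged as a multi-turn pattern. A per-message check alone would have allowed all three.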
Guardrails can be:
Embedded in the Agent Gateway for automatic inline protection across all agent traffic
Called directly via API for explicit, per-request enforcement from any application
What Guardrails Protect Against
Prompt & Agent Threats
Attacks against the agent's reasoning and instruction-following behavior, including attempts that span multiple conversation turns.
Prompt Injection: Direct attempts to override system instructions or hijack agent behavior
Indirect Prompt Injection: Hidden instructions embedded in tool outputs, documents, or retrieved content
Jailbreak: Attempts to bypass safety constraints through prompt manipulation
Multi-Turn Jailbreak: Jailbreak attempts that unfold gradually across several conversation turns
Tool Poisoning: Malicious instructions embedded in tool descriptions to redirect agent behavior
Agent Goal Switching: Mid-execution attempts to steer the agent toward an unintended objective
Suspicious Action Sequences: Behavioral patterns that match known attack trajectories, such as data exfiltration, credential theft, and destructive sequences
Agent Loop Detection: Repeated invocation of the same tool, indicating stuck or manipulated execution
Token Budget Overrun: Detection of runaway sessions consuming excessive resources
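Loop detection and token budgets are both stateful checks over a session. The sketch below assumes a monitor that counts identical (tool, arguments) pairs in a sliding window and keeps a running token total. The ExecutionMonitor class, its default limits, and the signal names agent_loop and token_budget_overrun are hypothetical. It needs Python 3.9+ for the list[str] annotations.

```python
from collections import Counter, deque


class ExecutionMonitor:
    """Hypothetical monitor for agent loops and token budget overruns."""

    def __init__(self, loop_window: int = 6, max_repeats: int = 3,
                 token_budget: int = 10_000) -> None:
        self.recent_calls = deque(maxlen=loop_window)  # last N (tool, args) pairs
        self.max_repeats = max_repeats
        self.token_budget = token_budget
        self.tokens_used = 0

    def record_tool_call(self, tool: str, args: str) -> list[str]:
        """Record a tool call; emit agent_loop if the same call repeats too often."""
        self.recent_calls.append((tool, args))
        counts = Counter(self.recent_calls)
        if counts[(tool, args)] >= self.max_repeats:
            return ["agent_loop"]
        return []

    def record_tokens(self, n: int) -> list[str]:
        """Add consumed tokens; emit token_budget_overrun once the total exceeds the budget."""
        self.tokens_used += n
        if self.tokens_used > self.token_budget:
            return ["token_budget_overrun"]
        return []
```

Keying on identical arguments, rather than on the tool name alone, is a design choice in this sketch: it avoids flagging an agent that legitimately calls the same tool with different inputs.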
Sensitive Data Protection
Prevention of regulated or confidential information flowing in or out of your AI system.
PII Detection: Names, email addresses, phone numbers, SSNs, credit card numbers, and other personal identifiers
Secrets & Credentials: API keys, tokens, passwords, private keys, and other credentials, covering 16+ secret formats
Keyword Matching: Exact and fuzzy matching against custom keyword libraries for topic restrictions or brand protection
Custom Regex Patterns: Deterministic, high-performance matching for known internal formats or compliance-sensitive strings
Enterprise DLP: Deep PII detection with fuzzy matching via Google Cloud DLP for regulated environments
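Custom regex detection and redaction can be sketched with Python's re module. The patterns below are illustrative assumptions (a simple email shape, a US SSN shape, and the AKIA prefix used by AWS access key IDs), not Highflame's built-in detectors. The [REDACTED:name] placeholder format is also hypothetical.

```python
import re

# Illustrative patterns only; production detectors cover many more formats and add validation.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def detect(text: str) -> dict[str, list[str]]:
    """Return all matches grouped by pattern name, omitting patterns that found nothing."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


def redact(text: str) -> str:
    """Replace every match with a placeholder naming the pattern that fired."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text
```

Because these checks are deterministic, they are cheap enough to run on every request, which is why regex and keyword matching suit known internal formats better than model-based detection.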
Tool & Code Security
Protection against malicious inputs targeting tool execution and system calls.
Command Injection: Attempts to execute arbitrary system commands through tool arguments
SQL Injection: Database manipulation payloads in tool inputs or model outputs
Path Traversal: Directory traversal attempts to access files outside the intended scope
Script Injection: Malicious scripts embedded in content passed to tools
Encoded Injection: Base64 or URL-encoded payloads designed to bypass text-based filters
Cross-Origin Escalation: Attempts to access resources across trust boundaries
MCP-Specific Risks: Attacks targeting MCP tool protocols and server interactions
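Encoded-injection and path-traversal checks rest on the same idea: normalize input before judging it. This hypothetical sketch URL-decodes and base64-decodes content before matching a small set of suspicious phrases, and it resolves a user-supplied path against an allowed root. The phrase list, the /srv/data root, and all function names are assumptions for illustration.

```python
import base64
import binascii
import posixpath
import re
from urllib.parse import unquote

# Hypothetical signatures; a real detector would be far broader.
SUSPICIOUS = re.compile(r"ignore (all )?previous instructions|;\s*rm\s+-rf|\bDROP\s+TABLE\b", re.I)
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_views(text: str) -> list[str]:
    """Return the raw text plus its URL-decoded and base64-decoded variants."""
    views = [text, unquote(text)]
    for token in B64_TOKEN.findall(text):
        try:
            views.append(base64.b64decode(token, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or decodes to non-text bytes
    return views


def is_encoded_injection(text: str) -> bool:
    """True if any decoded view of the text matches a suspicious signature."""
    return any(SUSPICIOUS.search(view) for view in decoded_views(text))


def escapes_root(user_path: str, root: str = "/srv/data") -> bool:
    """Path traversal check: does the normalized path resolve outside the allowed root?"""
    full = posixpath.normpath(posixpath.join(root, unquote(user_path)))
    return full != root and not full.startswith(root + "/")
```

A filter that only scans the raw string would miss both "ignore%20previous%20instructions" and its base64 form. Decoding first lets the same signature catch all three spellings.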
Content Safety
Ensuring that both user inputs and model outputs meet your organization's standards.
Toxicity: Violence, hate speech, sexual content, weapons, crime, and profanity
Phishing Links: Malicious URLs in prompts or model-generated content
File Content Safety: Safety analysis of uploaded files and documents
Hallucination Detection: Factual inconsistency in model responses
Language Detection: Identification of the language of incoming content (75 languages)
Enforcement Actions
When a Guardrail detects a violation, it takes the action configured in your policy:
Block — reject the request or response and return an error to the caller
Redact — mask or remove the violating content and allow the rest through
Allow + Alert — let the request through but emit a structured alert for review
Monitor — observe and log without any enforcement, for shadow testing
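The four actions differ in what the caller sees. The following minimal sketch assumes a single known violating substring. The Action enum, BlockedError, and enforce function are hypothetical names, and real redaction would work on detector-reported spans rather than on a literal string.

```python
from enum import Enum
from typing import List


class Action(Enum):
    BLOCK = "block"
    REDACT = "redact"
    ALLOW_ALERT = "allow_alert"
    MONITOR = "monitor"


class BlockedError(Exception):
    """Raised so the caller receives an error instead of content."""


def enforce(action: Action, content: str, violation: str,
            alerts: List[str], log: List[str]) -> str:
    """Apply one configured action to content that contains `violation`."""
    if action is Action.BLOCK:
        raise BlockedError("blocked: policy violation detected")
    if action is Action.REDACT:
        return content.replace(violation, "[REDACTED]")  # mask only the violating part
    if action is Action.ALLOW_ALERT:
        alerts.append(f"alert: {violation!r}")  # content passes, alert goes to review
        return content
    # MONITOR: record the observation, never modify or block (shadow testing)
    log.append(f"observed: {violation!r}")
    return content
```

Monitor returns content unchanged, which is what makes it safe for shadow testing a new rule before switching it to Block or Redact.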
Policy-Driven Enforcement
Guardrails separate detection from enforcement. Detectors produce signals — injection scores, PII presence, tool risk levels, behavioral patterns. Cedar policies translate those signals into decisions. This means you can tune enforcement thresholds, combine signals, and scope rules to specific agents, environments, or trust levels without changing detection logic.
For policy authoring patterns and examples, see the Cedar Cookbook.
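The split between detectors and policy can be illustrated in plain Python. This is not Cedar. The signal names (injection_score, pii_present), the keyword-based run_detectors stub, and the environment context key are hypothetical. First-match rule ordering is also a simplification: Cedar denies by default and lets forbid policies override permit policies. What the sketch does show is that changing a threshold or scoping a rule touches only POLICY, never the detection code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Signals = Dict[str, float]


def run_detectors(text: str) -> Signals:
    """Hypothetical detectors: each emits a named signal and makes no decision."""
    lowered = text.lower()
    return {
        "injection_score": 0.9 if "ignore previous instructions" in lowered else 0.1,
        "pii_present": 1.0 if "@" in text else 0.0,
    }


@dataclass
class Rule:
    action: str                                         # e.g. "block" or "redact"
    applies: Callable[[Signals, Dict[str, str]], bool]  # condition over signals + context


# Policy lives apart from detection, so thresholds and scoping change without touching detectors.
POLICY: List[Rule] = [
    Rule("block", lambda s, ctx: s["injection_score"] >= 0.8),
    Rule("redact", lambda s, ctx: s["pii_present"] > 0 and ctx["environment"] == "production"),
]


def decide(text: str, ctx: Dict[str, str]) -> str:
    """Combine detector signals with request context; the first matching rule wins here."""
    signals = run_detectors(text)
    for rule in POLICY:
        if rule.applies(signals, ctx):
            return rule.action
    return "allow"
```

The same PII signal produces "redact" in production and "allow" in staging. Only the policy's scoping differs.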
Composing Guardrails
Guardrails evaluate at multiple points in the request lifecycle — before content reaches the model or tool (input phase) and after responses are generated (output phase). Within each phase, checks are layered: fast deterministic checks run first, followed by deeper semantic analysis. This allows Highflame to enforce strict controls on latency-sensitive paths while still running richer analysis where needed.
See How Guardrails Evaluate for details on the evaluation lifecycle.
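Here is a sketch of layered evaluation, assuming each check returns a violation label or None. Cheap deterministic checks form the first layer, a stand-in for semantic analysis forms the second, and a phase stops at the first layer that fires. The check functions, layer lists, and guarded_call wrapper are hypothetical. slow_semantic_check is only a keyword test standing in for a model-based classifier.

```python
import re
from typing import Callable, List, Optional, Tuple

Check = Callable[[str], Optional[str]]  # returns a violation label, or None if clean


def deterministic_secret_check(text: str) -> Optional[str]:
    return "secret" if re.search(r"\bAKIA[0-9A-Z]{16}\b", text) else None


def slow_semantic_check(text: str) -> Optional[str]:
    # Stand-in for a model-based classifier (hypothetical).
    return "injection" if "ignore previous instructions" in text.lower() else None


def run_phase(text: str, layers: List[List[Check]]) -> Tuple[Optional[str], int]:
    """Run layers cheapest first; return (label, layer reached), stopping at the first hit."""
    for depth, layer in enumerate(layers, start=1):
        for check in layer:
            label = check(text)
            if label:
                return label, depth
    return None, len(layers)


INPUT_LAYERS: List[List[Check]] = [[deterministic_secret_check], [slow_semantic_check]]
OUTPUT_LAYERS: List[List[Check]] = [[deterministic_secret_check]]


def guarded_call(prompt: str, model: Callable[[str], str]) -> str:
    """Input phase before the model is called, output phase before returning to the user."""
    label, _ = run_phase(prompt, INPUT_LAYERS)
    if label:
        return f"blocked input: {label}"
    response = model(prompt)
    label, _ = run_phase(response, OUTPUT_LAYERS)
    if label:
        return f"blocked output: {label}"
    return response
```

A prompt containing a credential is stopped at layer 1 and never reaches the slower check. That ordering keeps latency low on the common path.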
Custom Detectors
In addition to built-in detection, Highflame supports custom detectors via webhooks. Register your own detection endpoint and declare the signal keys it produces. Those signals become available in Cedar policies alongside built-in detector output.
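A custom detector could be as small as an HTTP endpoint that returns the signal keys it declared at registration. This sketch uses Python's standard http.server. The request shape ({"content": ...}), the response shape ({"signals": {...}}), the signal names, and the codename list are all assumptions, not Highflame's webhook contract.

```python
import json
import re
from http.server import BaseHTTPRequestHandler

# Signal keys this detector would declare when registered (names are hypothetical).
DECLARED_SIGNALS = ["internal_project_codename", "ticket_id_present"]
CODENAMES = {"bluejay", "nightowl"}
TICKET_RE = re.compile(r"\bOPS-\d{3,6}\b")


def evaluate(payload: dict) -> dict:
    """Map a request payload to the declared signals (payload schema is an assumption)."""
    text = str(payload.get("content", ""))
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {
        "internal_project_codename": bool(words & CODENAMES),
        "ticket_id_present": bool(TICKET_RE.search(text)),
    }


class DetectorHandler(BaseHTTPRequestHandler):
    """Answers POST requests with {"signals": {...}} computed by evaluate()."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"signals": evaluate(payload)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve locally (not run here):
#   from http.server import HTTPServer
#   HTTPServer(("127.0.0.1", 8080), DetectorHandler).serve_forever()
```

Keeping the detection logic in a plain evaluate function, separate from the HTTP handler, makes the declared signals easy to unit-test before the endpoint is registered.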
Guardrail Coverage at a Glance
Surface: when it is evaluated
User prompts: before reaching the model or tool
Tool call arguments: before the tool executes
Tool responses: before being returned to the model
Model outputs: before being returned to the user
Uploaded files: before their content is processed
Conversation history: continuously across turns