Guardrail Evaluations
How Highflame Guardrails evaluate requests — evaluation points, session awareness, enforcement modes, and layered protection.
Guardrails evaluate content at multiple points in the request lifecycle, in both blocking and non-blocking modes, and with awareness of the full session context. This page explains when evaluation happens, how checks are layered, and how enforcement mode affects behavior.
Evaluation Points
Every request flowing through Highflame can be evaluated at up to four points:
Input: the user prompt, before it reaches the model. Used to block injection and validate content.
Tool call: the tool name and arguments, before execution. Used to block dangerous or unauthorized tool use.
Tool response: the output returned from a tool, before the model sees it. Used to block indirect injection and data leakage.
Output: the model response, before it reaches the user. Used to block sensitive data and unsafe content.
You can enable guardrails at any combination of these points independently.
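As a rough illustration of enabling any combination of points, the four evaluation points can be modeled as independent flags. This is a minimal sketch, not the Highflame API: `EvaluationPoint`, `is_enabled`, and `enabled_points` are hypothetical names chosen for the example.

```python
from enum import Flag, auto


class EvaluationPoint(Flag):
    """The four evaluation points described above (hypothetical names)."""
    INPUT = auto()
    TOOL_CALL = auto()
    TOOL_RESPONSE = auto()
    OUTPUT = auto()


def is_enabled(enabled: EvaluationPoint, point: EvaluationPoint) -> bool:
    """Return True when `point` is part of the enabled combination."""
    return bool(enabled & point)


# Example: guard only the user prompt and tool responses.
enabled_points = EvaluationPoint.INPUT | EvaluationPoint.TOOL_RESPONSE
```

Because each point is its own flag, turning one on or off does not affect the others.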
Layered Detection
Within each evaluation point, checks are layered by speed and depth:
Fast checks run first — deterministic, sub-millisecond pattern matching for secrets, PII, known injection patterns, command injection, SQL injection, and other high-confidence signals. These run with minimal latency impact.
Semantic checks follow — ML-based analysis for prompt injection confidence, jailbreak likelihood, toxicity scoring, and multi-turn behavioral context. These run when fast checks haven't already produced a definitive decision, or when policy requires deeper analysis.
Deep checks run last — cloud-based or computationally intensive analysis such as enterprise DLP or file content safety. These are opt-in and run only when configured.
Each layer has circuit breakers. If a deeper check is unavailable, evaluation continues without it; a request is never blocked solely because a check could not run.
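The layering and circuit-breaker behavior can be sketched as a loop over checks ordered from fastest to deepest. This is a simplified model under stated assumptions: `Finding`, `CheckUnavailable`, `evaluate_layers`, and the three example checks are hypothetical, and the sketch stops at the first definitive finding without modeling policies that force deeper analysis.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class CheckUnavailable(Exception):
    """Raised by a (hypothetical) check whose backend cannot be reached."""


@dataclass
class Finding:
    layer: str
    definitive: bool  # True when no deeper analysis is needed


Check = Callable[[str], Optional[Finding]]


def evaluate_layers(content: str, layers: list[tuple[str, Check]]) -> list[Finding]:
    """Run checks in order; skip unavailable ones instead of blocking."""
    findings: list[Finding] = []
    for _name, check in layers:
        try:
            finding = check(content)
        except CheckUnavailable:
            continue  # circuit breaker: carry on without this check
        if finding is not None:
            findings.append(finding)
            if finding.definitive:
                break  # a definitive result makes deeper layers unnecessary
    return findings


def fast_secret_check(content: str) -> Optional[Finding]:
    # Toy pattern match standing in for deterministic secret detection.
    return Finding("fast", True) if "AKIA" in content else None


def semantic_check(content: str) -> Optional[Finding]:
    # Toy stand-in for an ML-based injection score.
    return Finding("semantic", False) if "ignore previous" in content.lower() else None


def deep_dlp_check(content: str) -> Optional[Finding]:
    # Simulates an opt-in deep check whose backend is offline.
    raise CheckUnavailable("DLP backend offline")


layers = [("fast", fast_secret_check), ("semantic", semantic_check), ("deep", deep_dlp_check)]
```

In this sketch a benign message passes through all three layers with no findings, even though the deep check is down.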
Session Awareness
Guardrails maintain state across conversation turns. For each session, Highflame tracks:
Cumulative signals from previous turns (injection scores, detected patterns, tool history)
Behavioral sequences — whether the agent's tool call patterns match known attack trajectories
Token consumption across the session
Repeated tool invocations that may indicate a loop
This means a message that appears benign in isolation can still be caught if it's part of a pattern that has been building across the conversation.
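To make the idea concrete, here is a minimal sketch of per-session state that accumulates signals across turns. `SessionState`, its fields, and the thresholds used in the example are assumptions for illustration; they do not reflect how Highflame stores or scores session data.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Hypothetical per-session accumulator for the signals listed above."""
    injection_scores: list[float] = field(default_factory=list)
    tool_history: list[str] = field(default_factory=list)
    tokens_used: int = 0

    def record_turn(self, injection_score: float, tool_calls: list[str], tokens: int) -> None:
        self.injection_scores.append(injection_score)
        self.tool_history.extend(tool_calls)
        self.tokens_used += tokens

    def cumulative_injection_score(self) -> float:
        return sum(self.injection_scores)

    def looks_like_loop(self, max_repeats: int = 3) -> bool:
        # True when the last `max_repeats` tool calls all used the same tool.
        recent = self.tool_history[-max_repeats:]
        return len(recent) == max_repeats and len(set(recent)) == 1


# Three turns that each score below an assumed per-turn threshold of 0.5.
state = SessionState()
for _ in range(3):
    state.record_turn(injection_score=0.25, tool_calls=["search"], tokens=100)
```

No single turn crosses the assumed per-turn threshold, yet the cumulative score and the repeated tool call are both visible at the session level.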
Enforcement Modes
Guardrails operate in one of three modes, set per request or as a default for the project:
Enforce: Violations are blocked. The request or response does not proceed.
Alert: Violations generate alerts, but the request proceeds. Useful for monitoring live traffic.
Monitor: Decisions are logged with no action taken. Useful for validating policy before enforcement.
The recommended rollout sequence is Monitor → Alert → Enforce. This lets you observe how policies behave against real traffic before enabling blocking.
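The three modes and the rollout order can be summarized in a few lines. This is an illustrative sketch only: `Mode`, `action_for`, `next_mode`, and the action strings are hypothetical and not part of any documented Highflame interface.

```python
from enum import Enum


class Mode(Enum):
    MONITOR = "monitor"
    ALERT = "alert"
    ENFORCE = "enforce"


# Recommended rollout sequence from the text above.
ROLLOUT_ORDER = [Mode.MONITOR, Mode.ALERT, Mode.ENFORCE]


def action_for(mode: Mode, violation: bool) -> str:
    """Map a mode and a violation flag to what happens to the request."""
    if not violation:
        return "allow"
    if mode is Mode.ENFORCE:
        return "block"
    if mode is Mode.ALERT:
        return "alert_and_allow"
    return "log_and_allow"


def next_mode(mode: Mode) -> Mode:
    """Return the next stage in the rollout, staying at ENFORCE once reached."""
    i = ROLLOUT_ORDER.index(mode)
    return ROLLOUT_ORDER[min(i + 1, len(ROLLOUT_ORDER) - 1)]
```

Only `Mode.ENFORCE` stops a violating request; the other two modes let it proceed while recording what would have happened.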
Inline vs. Background Evaluation
Guardrails that must produce a decision before the request continues, such as prompt inspection and tool argument validation, run inline: the request waits until a decision is reached. These are optimized for low latency.
Guardrails used for richer analysis, alerting, or post-hoc review can run in the background without blocking the request path. Even when run in the background, they contribute to session state and can influence future decisions within the same conversation.
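The split between the two paths can be sketched with a thread pool: the inline check returns before the request continues, while the background analysis only records a signal for later turns. All names here (`inline_prompt_check`, `background_analysis`, `handle`, `session_signals`) and the toy detection rules are assumptions made for the example.

```python
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)

# Hypothetical shared session state: session id -> recorded signals.
session_signals: dict[str, list[str]] = {}


def inline_prompt_check(prompt: str) -> bool:
    """Toy inline check: returns True when the prompt may proceed."""
    return "rm -rf" not in prompt


def background_analysis(session_id: str, prompt: str) -> None:
    """Toy slower analysis: records a signal that later turns can consult."""
    if "password" in prompt.lower():
        session_signals.setdefault(session_id, []).append("credential_probe")


def handle(session_id: str, prompt: str) -> tuple[bool, Future]:
    allowed = inline_prompt_check(prompt)  # request waits for this decision
    pending = executor.submit(background_analysis, session_id, prompt)  # does not delay the request
    return allowed, pending


allowed, pending = handle("s1", "What is the admin password?")
pending.result(timeout=5)  # waited on here only so the example is deterministic
```

The first prompt is allowed through, but once the background task finishes, the session carries a `credential_probe` signal that a later decision could take into account.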
Per-Agent and Per-Environment Scoping
Guardrail policies are scoped to your project and can be further scoped to specific agents, environments, or trust levels. This means:
External-facing agents can have stricter controls than internal ones
Development environments can run in monitor mode while production enforces
High-trust agents (first-party, fully delegated) can be granted wider permissions without changing the underlying detection logic
Scoping is expressed in Cedar policies using the context keys produced by detectors. See the Cedar Cookbook for examples.
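As a conceptual model only (actual scoping is written in Cedar, not Python), environment-level scoping amounts to a lookup with a project-wide fallback. The mapping, the default value, and `resolve_mode` below are hypothetical.

```python
# Assumed project-wide default, used when no environment-specific mode is set.
PROJECT_DEFAULT_MODE = "alert"

# Assumed per-environment overrides, mirroring the example in the list above.
MODE_BY_ENVIRONMENT = {
    "development": "monitor",
    "production": "enforce",
}


def resolve_mode(environment: str) -> str:
    """Return the environment's mode, falling back to the project default."""
    return MODE_BY_ENVIRONMENT.get(environment, PROJECT_DEFAULT_MODE)
```

The same detection logic runs everywhere; only the resolved mode differs between environments.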