Guardrail Evaluations

How Highflame Guardrails evaluate requests — evaluation points, session awareness, enforcement modes, and layered protection.

Guardrails evaluate content at multiple points in the request lifecycle, in both blocking and non-blocking modes, and with awareness of the full session context. This page explains when evaluation happens, how checks are layered, and how enforcement mode affects behavior.


Evaluation Points

Every request flowing through Highflame can be evaluated at up to four points:

Phase
What is evaluated
Typical purpose

Input

User prompt before reaching the model

Block injection, validate content

Tool call

Tool name and arguments before execution

Block dangerous or unauthorized tool use

Tool response

Output returned from a tool before the model sees it

Block indirect injection, data leakage

Output

Model response before it reaches the user

Block sensitive data, unsafe content

You can enable guardrails at any combination of these points independently.


Layered Detection

Within each evaluation point, checks are layered by speed and depth:

Fast checks run first — deterministic, sub-millisecond pattern matching for secrets, PII, known injection patterns, command injection, SQL injection, and other high-confidence signals. These run with minimal latency impact.

Semantic checks follow — ML-based analysis for prompt injection confidence, jailbreak likelihood, toxicity scoring, and multi-turn behavioral context. These run when fast checks haven't already produced a definitive decision, or when policy requires deeper analysis.

Deep checks run last — cloud-based or computationally intensive analysis such as enterprise DLP or file content safety. These are opt-in and run only when configured.

Each layer has circuit breakers. If a deeper check is unavailable, evaluation continues without it and the request is not blocked solely due to check unavailability.


Session Awareness

Guardrails maintain state across conversation turns. For each session, Highflame tracks:

  • Cumulative signals from previous turns (injection scores, detected patterns, tool history)

  • Behavioral sequences — whether the agent's tool call patterns match known attack trajectories

  • Token consumption across the session

  • Repeated tool invocations that may indicate a loop

This means a message that appears benign in isolation can still be caught if it's part of a pattern that has been building across the conversation.


Enforcement Modes

Guardrails operate in one of three modes, set per-request or as a default for the project:

Mode
Behavior

Enforce

Violations are blocked. The request or response does not proceed.

Alert

Violations generate alerts but the request proceeds. Useful for monitoring live traffic.

Monitor

Decisions are logged with no action taken. Useful for validating policy before enforcement.

The recommended rollout sequence is Monitor → Alert → Enforce. This lets you observe how policies behave against real traffic before enabling blocking.


Inline vs. Background Evaluation

Guardrails that must produce a decision before the request continues — prompt inspection, tool argument validation — run inline and block until a decision is reached. These are optimized for low latency.

Guardrails used for richer analysis, alerting, or post-hoc review can run in the background without blocking the request path. Even when run in the background, they contribute to session state and can influence future decisions within the same conversation.


Per-Agent and Per-Environment Scoping

Guardrail policies are scoped to your project and can be further scoped to specific agents, environments, or trust levels. This means:

  • External-facing agents can have stricter controls than internal ones

  • Development environments can run in monitor mode while production enforces

  • High-trust agents (first-party, fully delegated) can be granted wider permissions without changing the underlying detection logic

Scoping is expressed in Cedar policies using the context keys produced by detectors. See the Cedar Cookbook for examples.

Last updated