Guardrails

Highflame Guardrails are powered by Shield, the runtime service that evaluates prompts, model outputs, tool calls, and session context before an agent action is allowed to proceed. You can call Shield directly through the Shield API, or use it inline through products such as MCP Gateway and Code Agent Security.

Guardrails are designed for agentic systems rather than single-message moderation. Shield tracks tool usage, session history, cumulative risk, and action sequences, so decisions can incorporate what happened earlier in the session instead of only inspecting the current payload.

What Shield Evaluates

Shield is built to catch both content risks and execution risks. The current runtime capabilities include:

  • prompt injection and jailbreak attempts

  • sensitive data leakage, including secrets and PII

  • dangerous tool usage such as destructive shell or file operations

  • infinite loops and repeated tool-call patterns

  • token budget overruns and agent cost controls

  • suspicious action sequences such as read-to-exfiltration chains

  • cross-turn data exposure, where prior sensitive context blocks later writes

  • MCP-specific risks such as tool poisoning and token misuse

These detections are surfaced as structured signals, and Cedar policies decide whether the final runtime action should be allowed, denied, or logged.

How Guard Evaluation Works

Shield follows a consistent evaluation pipeline:

  1. Detectors inspect the request and produce a raw security context.

  2. A projection layer normalizes that context into stable semantic keys.

  3. Cedar policies evaluate the projected context.

  4. Shield returns a runtime decision together with signals, policy metadata, and optional debugging context.

This split matters because it lets you evolve detection logic without rewriting all of your policies. Policies target stable semantic context, such as injection confidence, secret presence, or tool risk, rather than service-internal implementation details.

Runtime Modes

Shield supports a few different ways to run guardrails depending on where you are in the rollout cycle:

  • enforce blocks actions when policies resolve to deny

  • log_only returns the decision but does not block, which is useful for shadow evaluation

  • detect_only skips Cedar enforcement and returns detector context for debugging and tuning

For deeper inspection, Shield also supports richer response tiers:

  • the default response is optimized for application integration

  • explain=true adds projected context and root-cause information

  • debug=true adds per-detector execution details for profiling and troubleshooting

Policy-Driven Enforcement

Guardrails are not just a bundle of detectors. Runtime behavior is controlled by Cedar policies managed through Highflame's shared policy system. That lets you define organization-specific rules such as:

  • blocking high-confidence injection attempts in production

  • preventing tools with destructive side effects from running outside approved environments

  • denying external writes when earlier turns contained secrets or PII

  • requiring stricter thresholds for public-facing agents than internal workflows

Because the same policy framework is used across products, teams can keep runtime controls auditable and consistent instead of hardcoding enforcement into each application.

Managing Guardrails In Studio

Studio is the operational UI for guardrails management. Teams typically use it to:

  • configure guardrail profiles and baseline protections

  • manage project-level and agent-level policies

  • inspect detector metadata and policy outcomes

  • generate service keys for SDK and API usage

  • validate behavior in the playground before shipping policy changes

The goal is to make Shield usable as a production control surface, not just a backend API. Developers can integrate Shield directly, while security and platform teams can tune enforcement centrally without having to redeploy every client.

Last updated