Threat Alerts

Whenever Shield evaluates a request and a policy fires, it generates a structured threat alert alongside any enforcement action (block, redact, or allow). Threat alerts are the primary record of what Shield detected, which policy drove the decision, and what context was present at the time.

Alerts are collected in Observatory → Threats, where they can be filtered, investigated, and exported. This page covers what Shield puts into each alert and how guardrail health failures are surfaced separately.


Alert structure

Every Shield alert includes:

  • Determining policies — the specific Cedar policy IDs that drove the decision, with their @id, @severity, and @tags annotations

  • Projected context — the detector outputs Cedar evaluated (injection score, PII flags, secret types, tool risk score, etc.)

  • Enforcement action — Block, Redact, Alert, or Monitor

  • Policy reason — the human-readable explanation from the Cedar policy's @reject_message annotation

  • Evaluation point — which phase fired (user prompt, tool call, tool response, assistant response)

  • Session context — session ID, turn number, and any session-level signals that were active
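
To make the structure above concrete, here is a minimal sketch of how an exported alert could be modeled on the consuming side. Every class and field name (ThreatAlert, DeterminingPolicy, projected_context, and so on) is an assumption chosen for illustration, as are the sample values; check an actual export from Observatory for the real field names.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical model of a Shield alert. All names below are illustrative
# assumptions, not the documented export schema.


@dataclass
class DeterminingPolicy:
    id: str          # value of the policy's @id annotation
    severity: str    # value of the policy's @severity annotation
    tags: list[str]  # values of the policy's @tags annotation


@dataclass
class ThreatAlert:
    determining_policies: list[DeterminingPolicy]
    projected_context: dict[str, Any]  # detector outputs Cedar evaluated
    enforcement_action: str            # "Block", "Redact", "Alert", or "Monitor"
    policy_reason: str                 # text of the @reject_message annotation
    evaluation_point: str              # e.g. "user_prompt" or "tool_call"
    session_id: str
    turn_number: int
    session_signals: list[str] = field(default_factory=list)


def summarize(alert: ThreatAlert) -> str:
    """Build a one-line summary suitable for a log line or ticket title."""
    policy_ids = ", ".join(p.id for p in alert.determining_policies)
    return (
        f"{alert.enforcement_action} at {alert.evaluation_point} "
        f"(turn {alert.turn_number}) by [{policy_ids}]: {alert.policy_reason}"
    )


# Sample values are invented for the example.
example = ThreatAlert(
    determining_policies=[
        DeterminingPolicy(id="block-prompt-injection", severity="high", tags=["injection"])
    ],
    projected_context={"injection_score": 0.97, "pii_detected": False},
    enforcement_action="Block",
    policy_reason="Prompt injection detected",
    evaluation_point="user_prompt",
    session_id="sess-123",
    turn_number=4,
)
print(summarize(example))
```

Keeping the determining policies as a list reflects that more than one policy can drive a single decision.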


Alert categories

Shield categorizes alerts by detected risk type. Each alert carries a category used for filtering and trending in Observatory.

Category                          What triggers it
--------------------------------  ------------------------------------------------------------
Prompt Injection                  ML injection confidence above threshold
Jailbreak Attempts                Jailbreak classifier above threshold
Indirect Injection                Injection in tool outputs or retrieved content
Sensitive Data                    PII detected; action was Reject, Masked, Replaced, or Redacted
Secrets Leakage                   API keys, tokens, credentials, private keys
Restricted Keywords               Keyword blocklist match
Command Injection                 Command injection pattern in tool arguments
SQL Injection                     SQL injection payload in inputs or outputs
Path Traversal                    Directory traversal attempt
Encoded Injection                 Base64 or invisible-character obfuscation in inputs
Phishing URLs                     Malicious URL detected in prompt or model output
Sexual Content                    Toxicity classifier: sexual content
Violence                          Toxicity classifier: violence
Hate Speech                       Toxicity classifier: hate speech
Profanity                         Toxicity classifier: profanity
Weapons / Crime                   Toxicity classifier: weapons or criminal activity
Non-English Language              Language detection for restricted-language policies
Non-ASCII / Invisible Characters  Non-ASCII or invisible Unicode characters
High Entropy                      High-entropy string patterns (potential encoded content)
Markdown / Code                   Markdown or code block patterns triggering content rules
Custom Guardrails                 Policy-defined custom rules not covered by a built-in category
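
As an illustration of category-based filtering and trending outside Observatory, the sketch below counts exported alerts per category and per day. The record keys ("category", "timestamp") and the sample records are assumptions about an export, not a documented schema.

```python
from collections import Counter

# Hypothetical exported alert records. The "category" and "timestamp" keys
# are assumptions about the export format; the values are invented.
alerts = [
    {"category": "Prompt Injection", "timestamp": "2024-05-01T09:12:00Z"},
    {"category": "Secrets Leakage", "timestamp": "2024-05-01T10:40:00Z"},
    {"category": "Prompt Injection", "timestamp": "2024-05-02T08:03:00Z"},
]


def count_by_category(records):
    """Count alerts per category."""
    return Counter(r["category"] for r in records)


def daily_trend(records, category):
    """Count alerts of one category per day, keyed by the YYYY-MM-DD prefix."""
    return Counter(
        r["timestamp"][:10] for r in records if r["category"] == category
    )


print(count_by_category(alerts).most_common())
print(sorted(daily_trend(alerts, "Prompt Injection").items()))
```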


Guardrail failure alerts

A separate category — Requests With Guardrail Failure — tracks operational issues that prevented a policy from evaluating correctly. These are not security events; they are health signals for your guardrail configuration.

A failure can occur when:

  • A policy references a detector signal that could not be computed (missing dependency, configuration error)

  • A custom webhook detector timed out or returned an error

  • An internal processing error prevented evaluation from completing

Guardrail failures are surfaced in Observatory → Threats with a distinct badge. Each failure record includes the failing policy ID, the error code, and the evaluation point where it occurred.

Failures represent potential blind spots. A request that fails evaluation is not blocked: the default behavior on evaluation failure is to allow the request through (fail-open), so the affected policy is effectively not enforced for that request. Monitor the failure rate of every policy in production and treat a spike as a configuration incident.
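
One way to watch for such spikes is to compute a per-policy failure rate from evaluation records and flag any policy above a threshold. The sketch below assumes you can obtain one record per policy evaluation with a "policy_id" and a boolean "failed" flag; those keys, the sample policy IDs, and the 5% threshold are all hypothetical choices, not Shield defaults.

```python
from collections import defaultdict


def failure_rates(records):
    """Return {policy_id: failed / total} over the given evaluation records."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        totals[record["policy_id"]] += 1
        if record["failed"]:
            failures[record["policy_id"]] += 1
    return {pid: failures[pid] / totals[pid] for pid in totals}


def spiking_policies(records, threshold=0.05):
    """List policy IDs whose failure rate exceeds the threshold, sorted by ID."""
    rates = failure_rates(records)
    return sorted(pid for pid, rate in rates.items() if rate > threshold)


# Invented sample: one policy fails 2 of 10 evaluations, the other never fails.
records = [{"policy_id": "pii-webhook", "failed": i < 2} for i in range(10)]
records += [{"policy_id": "block-injection", "failed": False} for _ in range(10)]

print(spiking_policies(records))
```

Because failed evaluations let requests through by default, a check like this is most useful when it runs on a schedule and pages the owner of the affected policy.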


Proactive alerting

Shield alerts can be forwarded in real time to your existing incident response and monitoring tooling. See Integrations → Alerts for configuration:

  • Slack — route specific alert categories or severities to channels

  • Splunk HEC — stream events to your SIEM

  • Webhooks — deliver structured JSON payloads to any endpoint
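
For the webhook option, a receiver typically parses the JSON body and decides where the alert should go. The sketch below shows only that routing step. The payload keys ("severity", "category", "policy_reason"), the channel names, and the routing table are assumptions for illustration; the actual payload format is described under Integrations → Alerts.

```python
import json

# Hypothetical routing table from severity to a destination channel.
ROUTES = {
    "critical": "#sec-incidents",
    "high": "#sec-incidents",
    "medium": "#sec-triage",
}
DEFAULT_CHANNEL = "#sec-log"


def route_alert(raw_body: str) -> tuple[str, str]:
    """Parse a webhook body and return (channel, message text).

    The payload keys used here are assumed, not taken from a documented schema.
    """
    alert = json.loads(raw_body)
    severity = str(alert.get("severity", "")).lower()
    channel = ROUTES.get(severity, DEFAULT_CHANNEL)
    category = alert.get("category", "Uncategorized")
    reason = alert.get("policy_reason", "no reason provided")
    text = f"[{severity or 'unknown'}] {category}: {reason}"
    return channel, text


sample = json.dumps(
    {
        "severity": "High",
        "category": "Prompt Injection",
        "policy_reason": "Prompt injection detected",
    }
)
print(route_alert(sample))
```

Falling back to a default channel for unknown or missing severities keeps unexpected payloads visible instead of dropping them.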

