Chat Assistant

The chat_assistant profile is designed for customer-facing AI products — support bots, consumer chatbots, and any deployment where the user population is untrusted and content safety requirements are strict. It lowers detection thresholds below the defaults and adds bidirectional content controls.


Why public-facing deployments need stricter thresholds

The default policies are calibrated for internal or semi-trusted deployments where the baseline user is a known employee or developer. Public-facing chat applications face a different threat profile:

  • Higher jailbreak and injection attempt rates — consumer chatbots are frequent targets for prompt injection and jailbreak attempts from the general public

  • Content safety requirements — outputs that would be acceptable in an internal tool may be inappropriate or harmful in a consumer product

  • Bidirectional PII risk — users may inadvertently send personal information, and models may generate it in responses

The chat_assistant profile lowers injection and jailbreak thresholds and adds output-side controls that the defaults do not include.


Profile files


security.cedar — Tighter injection and jailbreak thresholds

Lowers the confidence score required to block prompt injection and jailbreak attempts. The defaults block at a score of approximately 80; this profile blocks at 70 for injection and 65 for jailbreaks.

Rule                     Threshold                    Default threshold
Prompt injection block   injection_confidence >= 70   ~80
Jailbreak block          jailbreak_confidence >= 65   ~80

@id("chat-injection-lower-threshold")
@severity("high")
forbid(principal, action == Guardrails::Action::"process_prompt", resource)
when { context has injection_confidence && context.injection_confidence >= 70 };

@id("chat-jailbreak-lower-threshold")
@severity("high")
forbid(principal, action == Guardrails::Action::"process_prompt", resource)
when { context has jailbreak_confidence && context.jailbreak_confidence >= 65 };

These lower thresholds will catch more borderline cases at the cost of a higher false positive rate. Review Monitor-mode detections during rollout to confirm signal quality against your user population before switching to Block.


privacy.cedar — Bidirectional PII protection

Extends the default PII policy, which covers inputs only, so that it also blocks PII in model outputs. This prevents the model from generating responses that contain personal information, including cases where the PII did not come from the user's input and the model is not simply echoing it back.

The direction context field distinguishes input vs. output evaluations — this rule applies to both.
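
As a rough sketch of what a bidirectional rule could look like, the policy below blocks whenever PII is flagged in either direction. Only the direction field is named on this page; the pii_detected attribute, the "input" and "output" values, the rule id, and the unconstrained action are assumptions made for illustration, so check the shipped privacy.cedar for the real names.

```cedar
// Hypothetical sketch: pii_detected and the direction values are assumed names,
// not confirmed fields of the shipped profile.
@id("chat-pii-bidirectional-sketch")
@severity("high")
forbid(principal, action, resource)
when {
  context has pii_detected && context.pii_detected &&
  context has direction &&
  (context.direction == "input" || context.direction == "output")
};
```

The action is left unconstrained here because this page only names the process_prompt action; a real rule would likely scope to the specific input and output actions.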


trust_safety.cedar — Toxicity and topic restrictions

Toxicity block:

Blocks violent, hateful, sexually explicit, and profane content at a toxicity score above 70. The default toxicity policy may be more permissive — this profile enforces a stricter threshold appropriate for consumer-facing deployments.
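
A minimal sketch of the toxicity threshold, written in the same shape as the security rules above. The toxicity_score attribute name and the rule id are assumptions; the strict comparison reflects the "above 70" wording.

```cedar
// Hypothetical sketch: toxicity_score is an assumed context attribute.
@id("chat-toxicity-block-sketch")
@severity("high")
forbid(principal, action, resource)
when { context has toxicity_score && context.toxicity_score > 70 };
```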

Topic restrictions:

Blocks responses where the topic classifier identifies content in restricted categories, when the classifier confidence is above 70:

Restricted topic        Examples
Weapons manufacturing   Instructions for producing firearms, explosives, chemical weapons
Illegal activity        Detailed guidance for committing specific crimes
Controlled substances   Manufacturing or procurement instructions
Financial fraud         Step-by-step guidance for fraud schemes

Topic restrictions apply to model outputs — the evaluation runs after the model generates its response. If a response is blocked, the user receives an error and the violation is recorded in Studio.
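
The output-side topic check could be expressed along these lines. This is a sketch under stated assumptions: the topic and topic_confidence attributes, the topic label strings, the "output" direction value, and the rule id are all hypothetical and may not match the shipped trust_safety.cedar.

```cedar
// Hypothetical sketch: attribute names and topic labels are assumptions.
@id("chat-restricted-topics-sketch")
@severity("high")
forbid(principal, action, resource)
when {
  // Only evaluate model responses, not user prompts.
  context has direction && context.direction == "output" &&
  context has topic && context has topic_confidence &&
  // "above 70" is read as a strict comparison.
  context.topic_confidence > 70 &&
  ["weapons_manufacturing", "illegal_activity",
   "controlled_substances", "financial_fraud"].contains(context.topic)
};
```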


Applying the profile


Customizing topic restrictions

The built-in topic list covers the most common restricted categories. Add your own topics using a custom Cedar rule alongside the profile:
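
A custom rule might look like the following sketch. The "medical_advice" label, the topic and topic_confidence attributes, and the rule id are hypothetical, and the sketch assumes your topic classifier actually emits the label you match on.

```cedar
// Hypothetical custom rule: the topic label and attribute names are assumptions.
@id("custom-topic-medical-advice")
@severity("high")
forbid(principal, action, resource)
when {
  context has topic && context.topic == "medical_advice" &&
  context has topic_confidence && context.topic_confidence > 70
};
```

Because any matching forbid policy in Cedar results in a deny, a rule like this adds a restriction alongside the profile without modifying the profile's own files.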


Rollout guidance

  1. Deploy in Monitor mode first. The lower injection and jailbreak thresholds will generate more events than the defaults. Run for 1–2 weeks to establish a detection baseline against your actual user traffic.

  2. Review false positives. In Observatory Threats, filter to mode = monitor and policy_category = security to see what the lower thresholds are catching. If legitimate queries are flagged, consider raising the thresholds slightly in a custom override.

  3. Enable PII output blocking early. The bidirectional PII rule has a low false positive rate on most chat applications and can be switched to Block without extended monitoring.

  4. Enable toxicity and topic restrictions in Block last. These depend on classifier confidence and may need tuning for your specific subject matter.

