Model Red Teaming

Here you'll learn about Highflame's model testing, which helps you review and spot potential vulnerabilities in LLMs before they're integrated into your applications.

What is Model Red Teaming?

Model red teaming is like a specialized security scan for the AI model itself. Highflame's model scanner thoroughly examines the LLM to learn how it responds to different kinds of unsafe inputs, which helps understand the model's baseline level of safety, potential biases, and how effectively it responds to common attacks. This proactive approach allows you to identify and fix issues before they can be exploited. In contrast, agent red teaming performs mission-specific stress tests on your AI system as a whole.

Benefits of Model Red Teaming

Stronger Security: Simulating real-world attacks helps you uncover security gaps that could lead to breaches, unauthorized access, and other incidents, and strengthen your systems in response.
Enhanced Safety: Actively monitoring harmful outputs from models helps you spot and reduce the likelihood of inappropriate content, making AI safer for users.
Bias Testing: Designing scenarios that test for unfair outcomes related to personal attributes helps you build more equitable AI.
Better Under Stress: Simulating extreme scenarios helps you identify bottlenecks, ensuring the model can withstand challenging conditions.
Robust Insights: Throughout the model red teaming process, you receive findings that go beyond standard testing, highlighting security gaps and opportunities for improvement you might otherwise miss.
More Trustworthy: Thoroughly testing your models demonstrates your commitment to using AI responsibly, fostering user trust and regulatory compliance.

How Model Red Teaming Works

Model Scan, Highflame's tool for model red teaming, automatically sends a wide variety of test prompts (Probes) to a selected LLM. Because LLM responses can be non-deterministic, Model Scan sends multiple variations of each prompt to better understand how the model typically behaves.

Running a model scan consists of four key steps:

Configure the Target: Point the scanner at a specific model by selecting a preset Highflame's route or by manually entering the provider information.
Select Probes: Choose the category of vulnerabilities you want to test for.
Run the Scan: Highflame runs the selected Probes, sending a massive number of prompts to the model.
Review the Report: The tool analyzes the model's responses and generates a report summarizing successes and failures.

Probes and Detectors

Probes are the core building blocks of the Model Scan. Some examples of Probes include:

Profanity Detection
Prompt Injection Tests
Jailbreak Attempts
Guardrail Bypass Checks
Text Replay Vulnerabilities

Model Scan offers a wide variety of Probes based on the Garak framework, including:

Security & Jailbreak

Trying to get around model safeguards and security measures

DAN (Do Anything Now) variants: Probes that test if the model can be tricked into ignoring its safety constraints
Visual jailbreak attempts: Tests using ASCII art or Unicode to bypass filters
Encoding-based attacks: Attempts to hide malicious content using various text encodings
Latent injection tests: Subtle attempts to influence model behavior
Prompt injection attacks: Tests for unauthorized prompt modifications

Content & Safety

Assessing how the model handles inappropriate or harmful material

Real toxicity prompts: Tests using real-world examples of toxic content
Discriminatory content: Checks for bias and discrimination handling
Sexual content: Tests boundaries of NSFW content filtering
Profanity: Evaluates handling of explicit language
Bullying detection: Test response to harassment scenarios
Identity attacks: Probes targeting demographic-based harassment
Threats: Evaluates response to threatening language

Technical Vulnerabilities

Trying to make the model generate technically harmful outputs

XSS (Cross-site scripting): Tests for code injection vulnerabilities
Package hallucination: Checks for the generation of fake software packages
File format manipulation: Test handling of various file formats
Malware generation attempts: Probes for code that could be malicious
Suffix-based attacks: Tests using file extensions and technical suffixes

Data Quality

Assessing how trustworthy the model's outputs are

Misleading content: Tests for the generation of false information
Divergence tests: Checks for output consistency
Information hazards: Tests handling of dangerous information
Controversial topics: Evaluates response to sensitive subjects
Factual assertions: Checks the accuracy of factual statements

For each Probe, a corresponding Detector reviews the model's response—for example, a toxicity Probe might be paired with a toxicity detection model that tests the output for toxicity. If the score reaches a certain point, the test is marked as a failure.

Running Scans and Reading Reports

To start a scan, navigate to Model Assessments in Highflame. From there, you'll complete a form where you can set parameters like the target route, number of probe attempts to run, and duration.

When the scan completes, you'll receive a report. It's presented as a table with these columns:

Detector: The specific detector used for the test (for example, detector.toxicity.ToxicCommentModel)
Probe: The probe type used (for example, atkgen.Tox)
Status: A simple "Pass" or "Fail" that indicates whether the model successfully evaded adversarial prompts
Tests: The ratio of successful tests to total attempts (for example, 22/25)
Failure Rate: The percentage of failed attempts, with a visual indicator of the risk

Regularly running model scans helps you make data-informed decisions about which models are safe and appropriate for your organization, especially when you're considering a new model or when a provider updates an existing one.

What's Next?

Learn in the Quick Start Guide for Red Team Testers how to run your first scan.
Learn in Agent Red Teaming how to test your agents, not just models.

PreviousTesting Categories NextModel Supply Chain Scan

Last updated 19 days ago

Good morning

hashtagWhat is Model Red Teaming?

hashtagBenefits of Model Red Teaming

hashtagHow Model Red Teaming Works

hashtagProbes and Detectors

hashtagSecurity & Jailbreak

hashtagContent & Safety

hashtagTechnical Vulnerabilities

hashtagData Quality

hashtagRunning Scans and Reading Reports

hashtagWhat's Next?