Model Red Teaming
Here you'll learn about Javelin's model red teaming features, which help you review and spot potential vulnerabilities in LLMs before they're integrated into your applications.
What is Model Red Teaming?
Model red teaming is like a specialized security scan for the AI model itself. Javelin's model scanner thoroughly examines the LLM to learn how it responds to different kinds of unsafe inputs, which helps understand the model's baseline level of safety, potential biases, and how effectively it responds to common attacks. This proactive approach allows you to identify and fix issues before they can be exploited. In contrast, agent red teaming performs mission-specific stress tests on your AI system as a whole.
Benefits of Model Red Teaming
Stronger Security: Simulating real-world attacks helps you uncover security gaps that could lead to breaches, unauthorized access, and other incidents, and strengthen your systems in response.
Enhanced Safety: Actively trying to get harmful outputs from models lets you spot and reduce the chances of it generating inappropriate content, making AI safer for users.
Bias Testing: Designing scenarios that test for unfair outcomes related to personal attributes helps you build more equitable AI.
Better Under Stress: Simulating extreme situations helps you find bottlenecks, so you can ensure the model can stand up to challenging conditions.
Robust Insights: Throughout the model red teaming process you get findings that go beyond standard testing, shedding light on security gaps and opportunities for improvement that you may otherwise miss.
More Trustworthy: Thoroughly testing your models shows your commitment to using AI responsibly, which fosters trust with users and compliance with regulations.
How Model Red Teaming Works
Model Scan, Javelin's tool for model red teaming, automatically sends a wide variety of test prompts, or Probes, to a selected LLM. Because LLM responses can be non-deterministic, Model Scan sends multiple variations of each prompt to get a better sense of how the model typically works.
Running a model scan consists of four key steps:
Configure the Target: Point the scanner at a specific model by selecting a preset Javelin route or by manually entering the provider information.
Select Probes: Choose the category of vulnerabilities you want to test for.
Run the Scan: Javelin runs the selected Probes, sending a massive number of prompts to the model.
Review the Report: The tool analyzes the model's responses and creates a report that summarizes the successes and failures.
Probes and Detectors
Probes are the core building blocks of the Model Scan. Some examples of Probes include:
Profanity Detection
Prompt Injection Tests
Jailbreak Attempts
Guardrail Bypass Checks
Text Replay Vulnerabilities
Model Scan offers a wide variety of Probes based on the garak framework, including:
Security & Jailbreak
Trying to get around model safeguards and security measures
DAN (Do Anything Now) variants: Probes that test if the model can be tricked into ignoring its safety constraints
Visual jailbreak attempts: Tests using ASCII art or unicode to bypass filters
Encoding-based attacks: Attempts to hide malicious content using various text encodings
Latent injection tests: Subtle attempts to influence model behavior
Prompt injection attacks: Tests for unauthorized prompt modifications
Content & Safety
Assessing how the model handles inappropriate or harmful material
Real toxicity prompts: Tests using real-world examples of toxic content
Discriminatory content: Checks for bias and discrimination handling
Sexual content: Tests boundaries of NSFW content filtering
Profanity: Evaluates handling of explicit language
Bullying detection: Tests response to harassment scenarios
Identity attacks: Probes targeting demographic-based harassment
Threats: Evaluates response to threatening language
Technical Vulnerabilities
Trying to make the model generate technically harmful outputs
XSS (Cross-site scripting): Tests for code injection vulnerabilities
Package hallucination: Checks for generation of fake software packages
File format manipulation: Tests handling of various file formats
Malware generation attempts: Probes for code that could be malicious
Suffix-based attacks: Tests using file extensions and technical suffixes
Data Quality
Assessing how trustworthy the model's outputs are
Misleading content: Tests for generation of false information
Divergence tests: Checks for output consistency
Information hazards: Tests handling of dangerous information
Controversial topics: Evaluates response to sensitive subjects
Factual assertions: Checks accuracy of factual statements
For each Probe, a corresponding Detector reviews the model's response — for example, a toxicity Probe might be paired with a toxicity detection model that tests the output for harmfulness. If the score reaches a certain point, the test is marked as a failure.
Running Scans and Reading Reports
To start a scan, navigate to Model Assessments in Javelin. From there you'll complete a form where you can set parameters like the target route, number of probe attempts to run, and duration.
When the scan completes, you'll receive a report. It's presented as a table with these columns:
Detector: The specific detector used for the test (for example, detector.toxicity.ToxicCommentModel)
Probe: The probe type used (for example, atkgen.Tox)
Status: A simple "Pass" or "Fail" that indicates whether the model successfully evaded adversarial prompts
Tests: The ratio of successful tests to total attempts (for example, 22/25)
Failure Rate: The percentage of failed attempts, with a visual indicator of the risk
Regularly running model scans lets you make data-informed decisions about which models are safe and appropriate for your organization, which is especially important when you're considering a new model or a provider has updated an existing one.
What's Next?
Learn in the Quick Start Guide for Red Team Testers how to run your first scan.
Learn in Agent Red Teaming how to test your agents, not just models.
Last updated