Model Red Teaming

Here you'll learn about Javelin's model red teaming features, which help you review and spot potential vulnerabilities in LLMs before they're integrated into your applications.

What is Model Red Teaming?

Model red teaming is like a specialized security scan for the AI model itself. Javelin's model scanner thoroughly examines the LLM to learn how it responds to different kinds of unsafe inputs, which helps understand the model's baseline level of safety, potential biases, and how effectively it responds to common attacks. This proactive approach allows you to identify and fix issues before they can be exploited. In contrast, agent red teaming performs mission-specific stress tests on your AI system as a whole.

Benefits of Model Red Teaming

  • Stronger Security: Simulating real-world attacks helps you uncover security gaps that could lead to breaches, unauthorized access, and other incidents, and strengthen your systems in response.

  • Enhanced Safety: Actively trying to get harmful outputs from models lets you spot and reduce the chances of it generating inappropriate content, making AI safer for users.

  • Bias Testing: Designing scenarios that test for unfair outcomes related to personal attributes helps you build more equitable AI.

  • Better Under Stress: Simulating extreme situations helps you find bottlenecks, so you can ensure the model can stand up to challenging conditions.

  • Robust Insights: Throughout the model red teaming process you get findings that go beyond standard testing, shedding light on security gaps and opportunities for improvement that you may otherwise miss.

  • More Trustworthy: Thoroughly testing your models shows your commitment to using AI responsibly, which fosters trust with users and compliance with regulations.

How Model Red Teaming Works

Model Scan, Javelin's tool for model red teaming, automatically sends a wide variety of test prompts, or Probes, to a selected LLM. Because LLM responses can be non-deterministic, Model Scan sends multiple variations of each prompt to get a better sense of how the model typically works.

Running a model scan consists of four key steps:

  1. Configure the Target: Point the scanner at a specific model by selecting a preset Javelin route or by manually entering the provider information.

  2. Select Probes: Choose the category of vulnerabilities you want to test for.

  3. Run the Scan: Javelin runs the selected Probes, sending a massive number of prompts to the model.

  4. Review the Report: The tool analyzes the model's responses and creates a report that summarizes the successes and failures.

Probes and Detectors

Probes are the core building blocks of the Model Scan. Some examples of Probes include:

  • Profanity Detection

  • Prompt Injection Tests

  • Jailbreak Attempts

  • Guardrail Bypass Checks

  • Text Replay Vulnerabilities

Model Scan offers a wide variety of Probes based on the garak framework, including:

Security & Jailbreak

Trying to get around model safeguards and security measures

  • DAN (Do Anything Now) variants: Probes that test if the model can be tricked into ignoring its safety constraints

  • Visual jailbreak attempts: Tests using ASCII art or unicode to bypass filters

  • Encoding-based attacks: Attempts to hide malicious content using various text encodings

  • Latent injection tests: Subtle attempts to influence model behavior

  • Prompt injection attacks: Tests for unauthorized prompt modifications

Content & Safety

Assessing how the model handles inappropriate or harmful material

  • Real toxicity prompts: Tests using real-world examples of toxic content

  • Discriminatory content: Checks for bias and discrimination handling

  • Sexual content: Tests boundaries of NSFW content filtering

  • Profanity: Evaluates handling of explicit language

  • Bullying detection: Tests response to harassment scenarios

  • Identity attacks: Probes targeting demographic-based harassment

  • Threats: Evaluates response to threatening language

Technical Vulnerabilities

Trying to make the model generate technically harmful outputs

  • XSS (Cross-site scripting): Tests for code injection vulnerabilities

  • Package hallucination: Checks for generation of fake software packages

  • File format manipulation: Tests handling of various file formats

  • Malware generation attempts: Probes for code that could be malicious

  • Suffix-based attacks: Tests using file extensions and technical suffixes

Data Quality

Assessing how trustworthy the model's outputs are

  • Misleading content: Tests for generation of false information

  • Divergence tests: Checks for output consistency

  • Information hazards: Tests handling of dangerous information

  • Controversial topics: Evaluates response to sensitive subjects

  • Factual assertions: Checks accuracy of factual statements

For each Probe, a corresponding Detector reviews the model's response — for example, a toxicity Probe might be paired with a toxicity detection model that tests the output for harmfulness. If the score reaches a certain point, the test is marked as a failure.

Running Scans and Reading Reports

To start a scan, navigate to Model Assessments in Javelin. From there you'll complete a form where you can set parameters like the target route, number of probe attempts to run, and duration.

When the scan completes, you'll receive a report. It's presented as a table with these columns:

  • Detector: The specific detector used for the test (for example, detector.toxicity.ToxicCommentModel)

  • Probe: The probe type used (for example, atkgen.Tox)

  • Status: A simple "Pass" or "Fail" that indicates whether the model successfully evaded adversarial prompts

  • Tests: The ratio of successful tests to total attempts (for example, 22/25)

  • Failure Rate: The percentage of failed attempts, with a visual indicator of the risk

Regularly running model scans lets you make data-informed decisions about which models are safe and appropriate for your organization, which is especially important when you're considering a new model or a provider has updated an existing one.

What's Next?

Last updated