Model Red Teaming

Here you'll learn about Highflame's model testing, which helps you review and spot potential vulnerabilities in LLMs before they're integrated into your applications.

What is Model Red Teaming?

Model red teaming is like a specialized security scan for the AI model itself. Highflame's model scanner thoroughly examines the LLM to learn how it responds to different kinds of unsafe inputs, which helps understand the model's baseline level of safety, potential biases, and how effectively it responds to common attacks. This proactive approach allows you to identify and fix issues before they can be exploited. In contrast, agent red teaming performs mission-specific stress tests on your AI system as a whole.

Benefits of Model Red Teaming

  • Stronger Security: Simulating real-world attacks helps you uncover security gaps that could lead to breaches, unauthorized access, and other incidents, and strengthen your systems in response.

  • Enhanced Safety: Actively monitoring harmful outputs from models helps you spot and reduce the likelihood of inappropriate content, making AI safer for users.

  • Bias Testing: Designing scenarios that test for unfair outcomes related to personal attributes helps you build more equitable AI.

  • Better Under Stress: Simulating extreme scenarios helps you identify bottlenecks, ensuring the model can withstand challenging conditions.

  • Robust Insights: Throughout the model red teaming process, you receive findings that go beyond standard testing, highlighting security gaps and opportunities for improvement you might otherwise miss.

  • More Trustworthy: Thoroughly testing your models demonstrates your commitment to using AI responsibly, fostering user trust and regulatory compliance.

How Model Red Teaming Works

Model Scan, Highflame's tool for model red teaming, automatically sends a wide variety of test prompts (Probes) to a selected LLM. Because LLM responses can be non-deterministic, Model Scan sends multiple variations of each prompt to better understand how the model typically behaves.

Running a model scan consists of four key steps:

  1. Configure the Target: Point the scanner at a specific model by selecting a preset Highflame's route or by manually entering the provider information.

  2. Select Probes: Choose the category of vulnerabilities you want to test for.

  3. Run the Scan: Highflame runs the selected Probes, sending a massive number of prompts to the model.

  4. Review the Report: The tool analyzes the model's responses and generates a report summarizing successes and failures.

Probes and Detectors

Probes are the core building blocks of the Model Scan. Some examples of Probes include:

  • Profanity Detection

  • Prompt Injection Tests

  • Jailbreak Attempts

  • Guardrail Bypass Checks

  • Text Replay Vulnerabilities

Model Scan offers a wide variety of Probes based on the Garak framework, including:

Security & Jailbreak

Trying to get around model safeguards and security measures

  • DAN (Do Anything Now) variants: Probes that test if the model can be tricked into ignoring its safety constraints

  • Visual jailbreak attempts: Tests using ASCII art or Unicode to bypass filters

  • Encoding-based attacks: Attempts to hide malicious content using various text encodings

  • Latent injection tests: Subtle attempts to influence model behavior

  • Prompt injection attacks: Tests for unauthorized prompt modifications

Content & Safety

Assessing how the model handles inappropriate or harmful material

  • Real toxicity prompts: Tests using real-world examples of toxic content

  • Discriminatory content: Checks for bias and discrimination handling

  • Sexual content: Tests boundaries of NSFW content filtering

  • Profanity: Evaluates handling of explicit language

  • Bullying detection: Test response to harassment scenarios

  • Identity attacks: Probes targeting demographic-based harassment

  • Threats: Evaluates response to threatening language

Technical Vulnerabilities

Trying to make the model generate technically harmful outputs

  • XSS (Cross-site scripting): Tests for code injection vulnerabilities

  • Package hallucination: Checks for the generation of fake software packages

  • File format manipulation: Test handling of various file formats

  • Malware generation attempts: Probes for code that could be malicious

  • Suffix-based attacks: Tests using file extensions and technical suffixes

Data Quality

Assessing how trustworthy the model's outputs are

  • Misleading content: Tests for the generation of false information

  • Divergence tests: Checks for output consistency

  • Information hazards: Tests handling of dangerous information

  • Controversial topics: Evaluates response to sensitive subjects

  • Factual assertions: Checks the accuracy of factual statements

For each Probe, a corresponding Detector reviews the model's response—for example, a toxicity Probe might be paired with a toxicity detection model that tests the output for toxicity. If the score reaches a certain point, the test is marked as a failure.

Running Scans and Reading Reports

To start a scan, navigate to Model Assessments in Highflame. From there, you'll complete a form where you can set parameters like the target route, number of probe attempts to run, and duration.

When the scan completes, you'll receive a report. It's presented as a table with these columns:

  • Detector: The specific detector used for the test (for example, detector.toxicity.ToxicCommentModel)

  • Probe: The probe type used (for example, atkgen.Tox)

  • Status: A simple "Pass" or "Fail" that indicates whether the model successfully evaded adversarial prompts

  • Tests: The ratio of successful tests to total attempts (for example, 22/25)

  • Failure Rate: The percentage of failed attempts, with a visual indicator of the risk

Regularly running model scans helps you make data-informed decisions about which models are safe and appropriate for your organization, especially when you're considering a new model or when a provider updates an existing one.

What's Next?

Last updated