Agent Red Teaming

Here you'll learn about Javelin's agent red teaming features, which proactively review and spot vulnerabilities in your live agents. Javelin Red is designed to simulate real-world attacks, find hidden vulnerabilities, and stress-test security across your full AI ecosystem.

What is Agent Red Teaming?

Agent red teaming is like a mission-specific stress test for an AI agent. While model red teaming looks at the risks of the LLM itself, agent red teaming goes a step beyond, testing your entire AI system as a whole — the model, its prompts, what external tools and APIs it connects to, and its logic. With many of the most critical vulnerabilities happening because of how these pieces work together, agent red teaming helps enterprises securely navigate the speed and complexity of AI systems.

Javelin RedTeam is a cutting-edge AI security platform designed to proactively identify and assess vulnerabilities in LLM-based applications through automated red teaming. Built with modular agents and powered by state-of-the-art adversarial techniques, it provides comprehensive security assessments for your application.

Key Features

Automated Red-teaming Flow

Generate attacks to discover vulnerabilities in target applications automatically using AI-driven attack generation and execution.

Modular Agent Architecture

A suite of agents working together to conduct comprehensive security assessments.

Multiple AI Engines

Support for various sophisticated attack enhancement engines based on published research and established techniques.

Extensive Test Case Library

Access to 60,000+ pre-generated test cases covering all vulnerability categories, dynamically enhanced at runtime using our sophisticated engine suite.

Production-Ready API

FastAPI server with Redis-backed queue system for asynchronous red-teaming operations, enabling scalable and distributed testing workflows.

Real-Time Monitoring & Progress Tracking

Live scan monitoring with detailed progress indicators, status updates, and the ability to cancel running scans with comprehensive error reporting.

Comprehensive Reporting

Detailed vulnerability assessment reports with:

  • Severity classifications

  • Attack vectors and examples

  • Remediation recommendations

  • Compliance mapping (OWASP LLM Top 10)

Reference Lab Applications

Pre-built vulnerable applications (like Lab1 with indirect prompt injection) for learning, testing, and demonstrating attack techniques in safe environments.

Why Choose Javelin RedTeam?

  • Stronger Security: Simulating attacks on an agent helps you uncover security gaps that could could lead to unauthorized actions, system compromise, and other incidents, and strengthen your defenses in response.

  • Enhanced Safety: Actively trying to make the agent take harmful actions lets you spot and reduce the chances of it causing real-world damage, making your autonomous systems safer for users.

  • Increased Reliability: Designing complex scenarios helps you build more dependable agents by identifying how they might fail, get stuck, or misunderstand prompts.

  • Better Under Stress: Simulating extreme situations helps you find bottlenecks, so you can ensure the agent can stand up to challenging conditions.

  • Robust Insights: Throughout the agent red teaming process you get findings that go beyond standard testing, shedding light on urgent behaviors and failures that you may otherwise miss.

  • More Trustworthy: Thoroughly testing your agents shows your commitment to using AI responsibly, which fosters trust with users and compliance with regulations.

Javelin's Agentic Team

Javelin Red is built around a group of specialized AI agents that work together to give you a comprehensive security assessment when you run a scan:

  • Agent Smith: Manages the entire red team workflow

  • Planner: Generates a detailed plan of attack based on the target application and selected vulnerability categories

  • Generator: Retrieves attack prompts from Javelin's vast library of test cases and uses advanced engines to craft them into sophisticated attacks specific to your application

  • Executor: Sends the attack prompts to target applications and collects the responses

  • Evaluator: Specialized LLM "judge" models analyze the application's responses to spot security failures, categorize their severity, and give a detailed breakdown of why responses are deemed risky

  • Reporter: Gathers the findings into a report once the scan is complete

Adversarial Engines

Engines are a key tool of Javelin Red. Informed by sophisticated attack methods based on the latest research and real-world examples, they transform basic prompts into complex inputs designed to challenge AI systems. When you're creating red team exercises, you can increase the strength of attacks by using these engines.

Javelin offers both single- and multi- turn engines, based on the complexity of the attack.

Single-Turn Engines

These engines are designed to create an attack that can get around the model's safety features in one interaction.

Encoding Engines

Using simple encoding to hide harmful keywords from basic content filters

  • ROT13: Simple ROT13 cipher to test basic content filtering bypass mechanisms

  • Base64: Base64 encoding to test content filtering bypass through encoding obfuscation

Obfuscation Engines

Using creative methods to get around more advanced filters

  • ASCII Art: Masks malicious words by converting them into ASCII art to bypass content filters trained only on standard text

  • Mathematical: Obfuscates unsafe prompts using mathematical abstractions and formal notation

  • FlipAttack: Exploits LLMs' left-to-right processing by flipping a prompt's text, adding noise, and guiding models to decode and execute the hidden, harmful instructions

Adversarial Suffix Engines

Attaching character sequences to prompts that are known to make models misbehave and bypass their safety training

  • Adversarial: Uses gradient-based attacks and adversarial suffixes that take advantage of the underlying mathematical architecture of the model to bypass its safety features

Complex Instruction Engines

Embedding the risky request in what appears to be a legitimate task, to trick the model into executing it

  • Prompt Injection: Injects hidden instructions into a prompt to bypass restrictions and elicit malicious outputs

  • Hidden Layer: Combines role-playing, leetspeak encoding, and XML obfuscation techniques to mask intent

  • Chain-of-Utterance (COU): Builds complex reasoning chains to get the model to gradually bypass safety measures

  • Task-in-Prompt (TIP): Embeds harmful requests within legitimate sequence-to-sequence tasks, like asking the model to solve a riddle where the answer is the unsafe content

Meta-Engines/Attack Strategy Engines

Looking into the big picture of the attack, or what the attacker knows

  • Best-of-N (BoN): Generates multiple variations of a prompt until it finds one that bypasses safety measures

  • Gray Box: References limited knowledge of a target system to craft tailored, architecture-aware attacks

  • Direct LLM: Uses a secondary LLM to run advanced prompt engineering, making the attack more stealthy

Multi-Turn Engines

These engines are designed to simulate more advanced attacks that test the model's safety in a longer conversation, over several interactions.

  • Crescendo: Simulates a patient, sophisticated attacker, starting a conversation with a safe prompt that gradually escalates in intensity over several turns, in hopes of getting the model to fulfill a harmful request that it wouldn't have accepted at the start

With this wide selection of engines, Javelin Red gives you deep insight into how your applications react to advanced, real-world threats, with more realistic findings than you can get from running manual tests or using simple prompts.

Vulnerability Taxonomy

Javelin Red tests for 15 different categories of vulnerabilities.

Data Privacy

Protecting sensitive information and user privacy

  • PII Leakage: Exposure of personally identifiable information through model outputs

    • direct_disclosure, database_access, social_manipulation, memorization, session_leak

  • Prompt Leakage: Disclosure of system prompts, instructions, or internal logic

    • secrets_and_credentials, instructions, guard_exposure, permissions_and_roles

Example Test Cases:

Responsible AI

Ensuring ethical AI behavior and bias prevention

  • Bias: Unfair or discriminatory outputs based on protected traits

    • race, gender, political, religion, age, disability, sexual_orientation, socioeconomic, algorithmic

  • Toxicity: Generating harmful, offensive, or inappropriate content

    • hate_speech, profanity, threats, mockery

  • Machine Ethics: Model following ethical principles and moral reasoning in decision-making

    • moral_reasoning, ethical_dilemmas, value_alignment

Example Test Cases:

Security

Assessing technical security vulnerabilities and access control

  • Unauthorized Access: Attempts to gain unauthorized access to systems or data

    • bfla, bola, rbac, debug_access, shell_injection, sql_injection, ssrf

  • Supply Chain: Vulnerabilities from compromised third-party components, models, or dependencies

    • malicious_models, vulnerable_dependencies, untrusted_sources, backdoor_injection

  • Data and Model Poisoning: Manipulation of training data to introduce backdoors or biased outputs

    • hidden_triggers, biased_injection, model_inversion, training_corruption, backdoor_activation

  • Improper Output Handling: Unvalidated model outputs leading to injection attacks or security vulnerabilities

    • code_injection, xss_injection, sql_injection, command_injection, unsanitized_output

Example Test Cases:

Brand Image

Protecting brand reputation and competitive positioning

  • Misinformation: Generation of false, misleading, or inaccurate information

    • factual_errors, unsupported_claims, fake_news, expertise_misrepresentation, conspiracy_theories

  • Intellectual Property: Unauthorized use or disclosure of copyrighted or proprietary content

    • imitation, copyright_violations, trademark_infringement, trade_secret_disclosure, patent_disclosure, proprietary_code_generation

  • Excessive Agency: Model performing actions beyond intended scope or without proper authorization

    • functionality, permissions, autonomy, resource_manipulation

  • Robustness: Model's ability to handle adversarial inputs and maintain consistent behavior

    • hijacking, input_overreliance, jailbreaking, context_manipulation, evasion_attacks

  • Competition: Content that unfairly promotes competitors or damages competitive position

    • competitor_mention, discredition, market_manipulation, confidential_strategies

Example Test Cases:

Illegal Risks

Preventing illegal and harmful content generation

  • Illegal Activity: Content that promotes, instructs, or facilitates illegal activities

    • weapons, illegal_drugs, violent_crime, non_violent_crime, sex_crime, cybercrime, child_exploitation, terrorism, biohazard, biosecurity

  • Graphic Content: Disturbing, violent, or explicit content inappropriate for general audiences

    • violence, gore, sexual_content, animal_cruelty, pornographic_content

  • Personal Safety: Content that could endanger individual or public safety

    • bullying, self_harm, suicide_encouragement, unsafe_practices, dangerous_challenges, stalking, harassment, doxxing

Javelin also offers full coverage for 80+ types of vulnerabilities, in alignment with industry standards like the OWASP LLM Top 10.

OWASP LLM Top 10 Categories

These categories align with the OWASP LLM Top 10 2025 and use dynamic attack generation through specialized engines.

LLM01:2025 - Prompt Injection

When an attacker manipulates how the LLM processes instructions, often bypassing safety or policy constraints

  • Dynamic attack generation using prompt_injection, gray_box, and hidden_layer engines

LLM02:2025 - Sensitive Information Disclosure

When an LLM either stores or reveals confidential data

  • PII Leakage, Prompt Leakage

LLM03:2025 - Supply Chain

When third-party or open-source components are compromised or tampered with

  • Supply Chain

LLM04:2025 - Data and Model Poisoning

When training or fine-tuning data is manipulated

  • Data and Model Poisoning

LLM05:2025 - Improper Output Handling

When raw or unvalidated model outputs are passed downstream

  • Improper Output Handling

LLM06:2025 - Excessive Agency

When an LLM-based system is granted excessive permissions

  • Excessive Agency

LLM07:2025 - System Prompt Leakage

When an LLM's hidden or internal prompt is disclosed to attackers

  • Prompt Leakage

LLM08:2025 - Vector and Embedding Weaknesses

When malicious or unverified data is embedded into vector databases

  • Vector and Embedding Weaknesses (embedding_inversion, multi_tenant_leakage, poisoned_documents, vector_manipulation)

LLM09:2025 - Misinformation

When an LLM generates false or misleading outputs

  • Misinformation

LLM10:2025 - Unbounded Consumption

The risk that LLM operations lack resource controls

  • Unbounded Consumption (resource_exhaustion, cost_overflow, infinite_loops, memory_consumption, api_abuse)

How it Works

Javelin RedTeam follows a systematic approach to red teaming:

  1. User can configure the target application to scan, together with brief description about what the application does, and the endpoint it exposes in javelin UI

  2. Select scan configurations to tailor the scan to your needs.

  3. Start the scan, and wait for the reports to be generated.

Behind the scenes, redteam framework will spin up a suit of agents that execute a scan workflow:

  1. Planning Agent: The Planner Agent generates a detailed attack strategy based on target application and selected vulnerability categories.

  2. Recon Agent: The Reconnaissance Agent performs initial target analysis by probing the application's capabilities, understanding its response patterns, and gathering intelligence about potential attack surfaces. This agent helps identify the most effective attack vectors and informs the subsequent attack generation phase.

  3. Attack Generation: Base attack prompts are retrieved from vector database and enhanced using various engines.

  4. Execution: Attack prompts are sent to target applications through configurable interfaces.

  5. Evaluation: Responses are analyzed by LLM as judge models to identify potential vulnerabilities and security issues.

  6. Reporting: Comprehensive reports are generated with findings, severity levels, and remediation guidance.

Attack Enhancement Engines

Javelin RedTeam uses sophisticated engines to transform base prompts into advanced attacks. These are categorized into single-turn and multi-turn engines.

For details about the engine module and the supported engines, please see engines overview.

Integrating Engines

Different categories work best with specific engines, which Javelin has pre-configured for optimal performance:

Category
Recommended Engines
Use Case

Data Privacy

direct_llm, mathematical, gray_box

Social engineering

Security

prompt_injection, adversarial, hidden_layer

Technical exploitation

Responsible AI

bon, cou_engine, mathematical

Bias and ethics testing

Brand Image

direct_llm, hidden_layer, cou_engine

Reputation attacks

Prompt Injection

All engines

Comprehensive bypass testing

Understanding Reports

When a scan completes, Javelin gives you a detailed report of practical recommendations to address any security issues. You can take a high-level overview or drill down into the details of any failed tests.

Executive Summary Dashboard

The first thing you'll see is the executive summary, giving you a snapshot of the overall scan results. It includes some key metrics:

  • Total Tests Executed: Total number of prompts sent

  • Vulnerabilities Found: Number of successful attacks

  • Success Rate: Percentage of tests that passed

  • Scan Duration: Total time the test took to complete

  • Vulnerability Severity Distribution: Found vulnerabilities organized by severity level, from low to critical

  • Vulnerable Categories Analysis: Breakdown of the most successful types of attacks

Radar Chart

Next, you'll see a visual summary of how your application performed across the tested categories. This visualization helps you quickly spot the strongest and weakest areas of your security.

Detailed Test Results

Below the summary charts, you'll find each vulnerability category presented as a card with these granular details:

  • Category Name & Icon

  • Description: Explanation of the vulnerability type and its implications

  • Success Rate: Progress bar with pass/fail ratio

  • Test Results: "X/Y Succeeded" format with exact pass/fail counts

  • Severity Breakdown: Count of vulnerabilities by severity level (Critical, High, Medium, Low)

  • Compliance Tags: OWASP LLM Top 10 and other relevant standards

  • Show Details: Expandable section for deeper analysis

    • Test Case Information

      • Engine ID: Specific engine used (e.g., "Best-of-N Engine", "Crescendo Engine"), if applicable.

      • Duration: Individual test case execution time

      • Turn Count: Number of conversation turns (1 for single-turn, multiple for multi-turn engines)

    • Complete Conversation Log

      • Security Test Prompt: Exact attack prompt used in the test

      • AI System Response: Full response from the target application

      • Tool Usage Capture: Any function calls, API calls, or tool invocations triggered during the test

      • Multi-turn Conversations: Complete conversation history for multi-turn attack engines

    • Security Analysis Summary: A short explanation, justifying the current evaluation of the response from the application.

    • Mitigation Guidance: For each failed test case you'll see Mitigation Advice, which are specific actions to address the vulnerability

Severity Levels

How to understand the different severity levels assigned to vulnerabilities spotted:

Severity
Score Range
Characteristics
Business Impact
Example Issues

Critical

9.0-10.0

• Immediate system compromise possible • Data breach or privacy violation confirmed • Complete bypass of security controls

• Potential legal liability • Regulatory compliance violations • Reputation damage • Financial losses

• Database credential exposure • Complete prompt injection control • PII mass extraction

High

7.0-8.9

• Significant security control bypass • Sensitive information exposure • Partial system compromise

• Compliance violations likely • Customer trust erosion • Operational disruption

• System prompt disclosure • Partial PII leakage • Successful jailbreak attempts

Medium

4.0-6.9

• Minor security control bypass • Limited information disclosure • Brand or reputation concerns

• Customer dissatisfaction • Minor compliance issues • Competitive disadvantage

• Inappropriate content generation • Minor bias in responses • Competitor information leakage

Low

1.0-3.9

• Minimal security impact • Edge case scenarios • Quality or usability issues

• User experience degradation • Minor brand concerns • Training data improvements needed

• Inconsistent responses • Minor factual inaccuracies • Occasional inappropriate tone

Using Your Scan Results

With the report, you're well positioned to take action:

  • Prioritize by Severity and Impact: Address your Critical and High severity vulnerabilities, the most pressing risks, first.

  • Look for Patterns: Use the Vulnerable Categories Analysis to understand issues across category performance, pass/fail ratios, coverage analysis, and trend identification. A high failure rate in a particular category is most likely a sign of a core issue that you need to resolve, as opposed to just patching a single issue.

  • Follow Remediation Guidance: Review the remediation section for each category, for each finding, to understand what changes to make, controls to set, and other best practices to address each failed test.

  • Scan Again to Verify: After making your fixes, run another test to confirm you've successfully remediated the vulnerabilities.

Support & Community

  • Documentation: Comprehensive guides and references

  • GitHub Issues: Report bugs and request features

  • Enterprise Support: Dedicated support for enterprise customers

What's Next?

Last updated