Agent Red Teaming
Here you'll learn about Javelin's agent red teaming features, which proactively review and spot vulnerabilities in your live agents. Javelin Red is designed to simulate real-world attacks, find hidden vulnerabilities, and stress-test security across your full AI ecosystem.
What is Agent Red Teaming?
Agent red teaming is like a mission-specific stress test for an AI agent. While model red teaming looks at the risks of the LLM itself, agent red teaming goes a step beyond, testing your entire AI system as a whole — the model, its prompts, what external tools and APIs it connects to, and its logic. With many of the most critical vulnerabilities happening because of how these pieces work together, agent red teaming helps enterprises securely navigate the speed and complexity of AI systems.
Javelin RedTeam is a cutting-edge AI security platform designed to proactively identify and assess vulnerabilities in LLM-based applications through automated red teaming. Built with modular agents and powered by state-of-the-art adversarial techniques, it provides comprehensive security assessments for your application.
Key Features
Automated Red-teaming Flow
Generate attacks to discover vulnerabilities in target applications automatically using AI-driven attack generation and execution.
Modular Agent Architecture
A suite of agents working together to conduct comprehensive security assessments.
Multiple AI Engines
Support for various sophisticated attack enhancement engines based on published research and established techniques.
Extensive Test Case Library
Access to 60,000+ pre-generated test cases covering all vulnerability categories, dynamically enhanced at runtime using our sophisticated engine suite.
Production-Ready API
FastAPI server with Redis-backed queue system for asynchronous red-teaming operations, enabling scalable and distributed testing workflows.
Real-Time Monitoring & Progress Tracking
Live scan monitoring with detailed progress indicators, status updates, and the ability to cancel running scans with comprehensive error reporting.
Comprehensive Reporting
Detailed vulnerability assessment reports with:
Severity classifications
Attack vectors and examples
Remediation recommendations
Compliance mapping (OWASP LLM Top 10)
Reference Lab Applications
Pre-built vulnerable applications (like Lab1 with indirect prompt injection) for learning, testing, and demonstrating attack techniques in safe environments.
Why Choose Javelin RedTeam?
Stronger Security: Simulating attacks on an agent helps you uncover security gaps that could could lead to unauthorized actions, system compromise, and other incidents, and strengthen your defenses in response.
Enhanced Safety: Actively trying to make the agent take harmful actions lets you spot and reduce the chances of it causing real-world damage, making your autonomous systems safer for users.
Increased Reliability: Designing complex scenarios helps you build more dependable agents by identifying how they might fail, get stuck, or misunderstand prompts.
Better Under Stress: Simulating extreme situations helps you find bottlenecks, so you can ensure the agent can stand up to challenging conditions.
Robust Insights: Throughout the agent red teaming process you get findings that go beyond standard testing, shedding light on urgent behaviors and failures that you may otherwise miss.
More Trustworthy: Thoroughly testing your agents shows your commitment to using AI responsibly, which fosters trust with users and compliance with regulations.
Javelin's Agentic Team
Javelin Red is built around a group of specialized AI agents that work together to give you a comprehensive security assessment when you run a scan:
Agent Smith: Manages the entire red team workflow
Planner: Generates a detailed plan of attack based on the target application and selected vulnerability categories
Generator: Retrieves attack prompts from Javelin's vast library of test cases and uses advanced engines to craft them into sophisticated attacks specific to your application
Executor: Sends the attack prompts to target applications and collects the responses
Evaluator: Specialized LLM "judge" models analyze the application's responses to spot security failures, categorize their severity, and give a detailed breakdown of why responses are deemed risky
Reporter: Gathers the findings into a report once the scan is complete
Adversarial Engines
Engines are a key tool of Javelin Red. Informed by sophisticated attack methods based on the latest research and real-world examples, they transform basic prompts into complex inputs designed to challenge AI systems. When you're creating red team exercises, you can increase the strength of attacks by using these engines.
Javelin offers both single- and multi- turn engines, based on the complexity of the attack.
Single-Turn Engines
These engines are designed to create an attack that can get around the model's safety features in one interaction.
Encoding Engines
Using simple encoding to hide harmful keywords from basic content filters
ROT13: Simple ROT13 cipher to test basic content filtering bypass mechanisms
Base64: Base64 encoding to test content filtering bypass through encoding obfuscation
Obfuscation Engines
Using creative methods to get around more advanced filters
ASCII Art: Masks malicious words by converting them into ASCII art to bypass content filters trained only on standard text
Mathematical: Obfuscates unsafe prompts using mathematical abstractions and formal notation
FlipAttack: Exploits LLMs' left-to-right processing by flipping a prompt's text, adding noise, and guiding models to decode and execute the hidden, harmful instructions
Adversarial Suffix Engines
Attaching character sequences to prompts that are known to make models misbehave and bypass their safety training
Adversarial: Uses gradient-based attacks and adversarial suffixes that take advantage of the underlying mathematical architecture of the model to bypass its safety features
Complex Instruction Engines
Embedding the risky request in what appears to be a legitimate task, to trick the model into executing it
Prompt Injection: Injects hidden instructions into a prompt to bypass restrictions and elicit malicious outputs
Hidden Layer: Combines role-playing, leetspeak encoding, and XML obfuscation techniques to mask intent
Chain-of-Utterance (COU): Builds complex reasoning chains to get the model to gradually bypass safety measures
Task-in-Prompt (TIP): Embeds harmful requests within legitimate sequence-to-sequence tasks, like asking the model to solve a riddle where the answer is the unsafe content
Meta-Engines/Attack Strategy Engines
Looking into the big picture of the attack, or what the attacker knows
Best-of-N (BoN): Generates multiple variations of a prompt until it finds one that bypasses safety measures
Gray Box: References limited knowledge of a target system to craft tailored, architecture-aware attacks
Direct LLM: Uses a secondary LLM to run advanced prompt engineering, making the attack more stealthy
Multi-Turn Engines
These engines are designed to simulate more advanced attacks that test the model's safety in a longer conversation, over several interactions.
Crescendo: Simulates a patient, sophisticated attacker, starting a conversation with a safe prompt that gradually escalates in intensity over several turns, in hopes of getting the model to fulfill a harmful request that it wouldn't have accepted at the start
With this wide selection of engines, Javelin Red gives you deep insight into how your applications react to advanced, real-world threats, with more realistic findings than you can get from running manual tests or using simple prompts.
Vulnerability Taxonomy
Javelin Red tests for 15 different categories of vulnerabilities.
Data Privacy
Protecting sensitive information and user privacy
PII Leakage: Exposure of personally identifiable information through model outputs
direct_disclosure, database_access, social_manipulation, memorization, session_leak
Prompt Leakage: Disclosure of system prompts, instructions, or internal logic
secrets_and_credentials, instructions, guard_exposure, permissions_and_roles
Example Test Cases:
Responsible AI
Ensuring ethical AI behavior and bias prevention
Bias: Unfair or discriminatory outputs based on protected traits
race, gender, political, religion, age, disability, sexual_orientation, socioeconomic, algorithmic
Toxicity: Generating harmful, offensive, or inappropriate content
hate_speech, profanity, threats, mockery
Machine Ethics: Model following ethical principles and moral reasoning in decision-making
moral_reasoning, ethical_dilemmas, value_alignment
Example Test Cases:
Security
Assessing technical security vulnerabilities and access control
Unauthorized Access: Attempts to gain unauthorized access to systems or data
bfla, bola, rbac, debug_access, shell_injection, sql_injection, ssrf
Supply Chain: Vulnerabilities from compromised third-party components, models, or dependencies
malicious_models, vulnerable_dependencies, untrusted_sources, backdoor_injection
Data and Model Poisoning: Manipulation of training data to introduce backdoors or biased outputs
hidden_triggers, biased_injection, model_inversion, training_corruption, backdoor_activation
Improper Output Handling: Unvalidated model outputs leading to injection attacks or security vulnerabilities
code_injection, xss_injection, sql_injection, command_injection, unsanitized_output
Example Test Cases:
Brand Image
Protecting brand reputation and competitive positioning
Misinformation: Generation of false, misleading, or inaccurate information
factual_errors, unsupported_claims, fake_news, expertise_misrepresentation, conspiracy_theories
Intellectual Property: Unauthorized use or disclosure of copyrighted or proprietary content
imitation, copyright_violations, trademark_infringement, trade_secret_disclosure, patent_disclosure, proprietary_code_generation
Excessive Agency: Model performing actions beyond intended scope or without proper authorization
functionality, permissions, autonomy, resource_manipulation
Robustness: Model's ability to handle adversarial inputs and maintain consistent behavior
hijacking, input_overreliance, jailbreaking, context_manipulation, evasion_attacks
Competition: Content that unfairly promotes competitors or damages competitive position
competitor_mention, discredition, market_manipulation, confidential_strategies
Example Test Cases:
Illegal Risks
Preventing illegal and harmful content generation
Illegal Activity: Content that promotes, instructs, or facilitates illegal activities
weapons, illegal_drugs, violent_crime, non_violent_crime, sex_crime, cybercrime, child_exploitation, terrorism, biohazard, biosecurity
Graphic Content: Disturbing, violent, or explicit content inappropriate for general audiences
violence, gore, sexual_content, animal_cruelty, pornographic_content
Personal Safety: Content that could endanger individual or public safety
bullying, self_harm, suicide_encouragement, unsafe_practices, dangerous_challenges, stalking, harassment, doxxing
Javelin also offers full coverage for 80+ types of vulnerabilities, in alignment with industry standards like the OWASP LLM Top 10.
OWASP LLM Top 10 Categories
These categories align with the OWASP LLM Top 10 2025 and use dynamic attack generation through specialized engines.
LLM01:2025 - Prompt Injection
When an attacker manipulates how the LLM processes instructions, often bypassing safety or policy constraints
Dynamic attack generation using prompt_injection, gray_box, and hidden_layer engines
LLM02:2025 - Sensitive Information Disclosure
When an LLM either stores or reveals confidential data
PII Leakage, Prompt Leakage
LLM03:2025 - Supply Chain
When third-party or open-source components are compromised or tampered with
Supply Chain
LLM04:2025 - Data and Model Poisoning
When training or fine-tuning data is manipulated
Data and Model Poisoning
LLM05:2025 - Improper Output Handling
When raw or unvalidated model outputs are passed downstream
Improper Output Handling
LLM06:2025 - Excessive Agency
When an LLM-based system is granted excessive permissions
Excessive Agency
LLM07:2025 - System Prompt Leakage
When an LLM's hidden or internal prompt is disclosed to attackers
Prompt Leakage
LLM08:2025 - Vector and Embedding Weaknesses
When malicious or unverified data is embedded into vector databases
Vector and Embedding Weaknesses (embedding_inversion, multi_tenant_leakage, poisoned_documents, vector_manipulation)
LLM09:2025 - Misinformation
When an LLM generates false or misleading outputs
Misinformation
LLM10:2025 - Unbounded Consumption
The risk that LLM operations lack resource controls
Unbounded Consumption (resource_exhaustion, cost_overflow, infinite_loops, memory_consumption, api_abuse)
How it Works
Javelin RedTeam follows a systematic approach to red teaming:
User can configure the target application to scan, together with brief description about what the application does, and the endpoint it exposes in javelin UI
Select scan configurations to tailor the scan to your needs.
Start the scan, and wait for the reports to be generated.
Behind the scenes, redteam framework will spin up a suit of agents that execute a scan workflow:
Planning Agent: The Planner Agent generates a detailed attack strategy based on target application and selected vulnerability categories.
Recon Agent: The Reconnaissance Agent performs initial target analysis by probing the application's capabilities, understanding its response patterns, and gathering intelligence about potential attack surfaces. This agent helps identify the most effective attack vectors and informs the subsequent attack generation phase.
Attack Generation: Base attack prompts are retrieved from vector database and enhanced using various engines.
Execution: Attack prompts are sent to target applications through configurable interfaces.
Evaluation: Responses are analyzed by LLM as judge models to identify potential vulnerabilities and security issues.
Reporting: Comprehensive reports are generated with findings, severity levels, and remediation guidance.
Attack Enhancement Engines
Javelin RedTeam uses sophisticated engines to transform base prompts into advanced attacks. These are categorized into single-turn and multi-turn engines.
For details about the engine module and the supported engines, please see engines overview.
Integrating Engines
Different categories work best with specific engines, which Javelin has pre-configured for optimal performance:
Data Privacy
direct_llm, mathematical, gray_box
Social engineering
Security
prompt_injection, adversarial, hidden_layer
Technical exploitation
Responsible AI
bon, cou_engine, mathematical
Bias and ethics testing
Brand Image
direct_llm, hidden_layer, cou_engine
Reputation attacks
Prompt Injection
All engines
Comprehensive bypass testing
Understanding Reports
When a scan completes, Javelin gives you a detailed report of practical recommendations to address any security issues. You can take a high-level overview or drill down into the details of any failed tests.
Executive Summary Dashboard
The first thing you'll see is the executive summary, giving you a snapshot of the overall scan results. It includes some key metrics:
Total Tests Executed: Total number of prompts sent
Vulnerabilities Found: Number of successful attacks
Success Rate: Percentage of tests that passed
Scan Duration: Total time the test took to complete
Vulnerability Severity Distribution: Found vulnerabilities organized by severity level, from low to critical
Vulnerable Categories Analysis: Breakdown of the most successful types of attacks
Radar Chart
Next, you'll see a visual summary of how your application performed across the tested categories. This visualization helps you quickly spot the strongest and weakest areas of your security.
Detailed Test Results
Below the summary charts, you'll find each vulnerability category presented as a card with these granular details:
Category Name & Icon
Description: Explanation of the vulnerability type and its implications
Success Rate: Progress bar with pass/fail ratio
Test Results: "X/Y Succeeded" format with exact pass/fail counts
Severity Breakdown: Count of vulnerabilities by severity level (Critical, High, Medium, Low)
Compliance Tags: OWASP LLM Top 10 and other relevant standards
Show Details: Expandable section for deeper analysis
Test Case Information
Engine ID: Specific engine used (e.g., "Best-of-N Engine", "Crescendo Engine"), if applicable.
Duration: Individual test case execution time
Turn Count: Number of conversation turns (1 for single-turn, multiple for multi-turn engines)
Complete Conversation Log
Security Test Prompt: Exact attack prompt used in the test
AI System Response: Full response from the target application
Tool Usage Capture: Any function calls, API calls, or tool invocations triggered during the test
Multi-turn Conversations: Complete conversation history for multi-turn attack engines
Security Analysis Summary: A short explanation, justifying the current evaluation of the response from the application.
Mitigation Guidance: For each failed test case you'll see
Mitigation Advice, which are specific actions to address the vulnerability
Severity Levels
How to understand the different severity levels assigned to vulnerabilities spotted:
Critical
9.0-10.0
• Immediate system compromise possible • Data breach or privacy violation confirmed • Complete bypass of security controls
• Potential legal liability • Regulatory compliance violations • Reputation damage • Financial losses
• Database credential exposure • Complete prompt injection control • PII mass extraction
High
7.0-8.9
• Significant security control bypass • Sensitive information exposure • Partial system compromise
• Compliance violations likely • Customer trust erosion • Operational disruption
• System prompt disclosure • Partial PII leakage • Successful jailbreak attempts
Medium
4.0-6.9
• Minor security control bypass • Limited information disclosure • Brand or reputation concerns
• Customer dissatisfaction • Minor compliance issues • Competitive disadvantage
• Inappropriate content generation • Minor bias in responses • Competitor information leakage
Low
1.0-3.9
• Minimal security impact • Edge case scenarios • Quality or usability issues
• User experience degradation • Minor brand concerns • Training data improvements needed
• Inconsistent responses • Minor factual inaccuracies • Occasional inappropriate tone
Using Your Scan Results
With the report, you're well positioned to take action:
Prioritize by Severity and Impact: Address your Critical and High severity vulnerabilities, the most pressing risks, first.
Look for Patterns: Use the Vulnerable Categories Analysis to understand issues across category performance, pass/fail ratios, coverage analysis, and trend identification. A high failure rate in a particular category is most likely a sign of a core issue that you need to resolve, as opposed to just patching a single issue.
Follow Remediation Guidance: Review the remediation section for each category, for each finding, to understand what changes to make, controls to set, and other best practices to address each failed test.
Scan Again to Verify: After making your fixes, run another test to confirm you've successfully remediated the vulnerabilities.
Support & Community
Documentation: Comprehensive guides and references
GitHub Issues: Report bugs and request features
Enterprise Support: Dedicated support for enterprise customers
What's Next?
Learn in the Quick Start Guide for Red Team Testers how to run your first scan.
Learn in Model Red Teaming how to test the models themselves.
Last updated