Abuse Detection and Control
Agentic abuse is distinct from prompt injection or data exfiltration — it targets the agent's execution lifecycle rather than its content. A looping agent, a session consuming unbounded tokens, or a tool call sequence quietly building toward exfiltration can each cause significant damage without triggering a single content-based detector.
Highflame detects and enforces against these patterns through a combination of session-aware signal tracking, behavioral pattern matching, and Cedar policies that evaluate context accumulated across multiple turns — not just the current request.
Abuse scenarios and controls
Agent loops
What it is: An agent repeatedly invokes the same tool without making progress — either stuck in an error-recovery cycle, manipulated by a prompt to repeat an action, or misconfigured such that the model always produces the same tool call.
Detectors:
loop_detected (boolean): fires when the same tool is invoked consecutively above a count threshold
loop_count (integer): the number of consecutive identical tool invocations observed so far in the session
Default policy (agentic_safety.cedar):
forbid(principal, action == Guardrails::Action::"call_tool", resource)
when { context has loop_detected && context.loop_detected == true };
Code agent profile (code_agent/agentic_security.cedar) adds a count-based gate. The block applies only once loop_count exceeds 5, which gives the agent room to legitimately retry a failed tool first:
@id("code-block-loops")
@severity("high")
forbid(principal is Guardrails::Agent, action == Guardrails::Action::"call_tool", resource)
when {
context has loop_detected && context.loop_detected == true &&
context has loop_count && context.loop_count > 5
};
Tuning: If your agents legitimately retry failed tool calls (e.g., a flaky external API), raise the loop_count threshold in a custom policy scoped to that agent identity. Do not disable loop detection entirely: even well-intentioned retries can spin indefinitely if the upstream is down.
Token budget overruns
What it is: A session consumes tokens far beyond what is expected for its task — either due to a model stuck in a reasoning loop, a context window growing uncontrolled across many turns, or an adversarial input designed to inflate token usage as a denial-of-service against your LLM API budget.
Detector:
budget_exceeded (boolean): fires when the session has consumed more tokens than its configured limit
Default policy (agentic_safety.cedar):
When budget_exceeded fires, the action is blocked and the session is effectively terminated — no further tool calls or prompts will be evaluated until a new session begins.
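A rule with the effect described above could be sketched as follows. This is an illustration, not the shipped agentic_safety.cedar text: it assumes budget_exceeded is exposed on context the same way loop_detected is, and it leaves the action unconstrained because both tool calls and prompts are described as blocked. The policy id is hypothetical.

```cedar
// Sketch only, not the shipped default policy.
// Assumption: budget_exceeded is a context field, like loop_detected.
// Assumption: the action is left unconstrained so that every request
// in the session (prompts and tool calls) is denied once the flag is set.
@id("sketch-block-budget-exceeded")
forbid(principal, action, resource)
when { context has budget_exceeded && context.budget_exceeded == true };
```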
Setting a budget: Configure token budgets per agent or per application in Highflame Studio → Shield → Policies. The budget applies to the sum of input and output tokens across all turns in the session. Set separate limits for development agents (low limit, fail fast) and production agents (higher limit, with alerting before the hard block).
Monitoring: The Observatory Sessions view shows budget_exceeded as a flag in the session detail panel, along with the total tokens in and out for each session. Use this to calibrate budgets: set the hard limit at roughly 150% of your 95th percentile session token usage. For example, if the 95th percentile session consumes 40,000 tokens, a hard limit near 60,000 fits this rule.
Suspicious action sequences
What it is: Multi-step tool call patterns that, taken individually, appear benign but together form a recognized attack trajectory. The pattern detector analyzes the sequence of actions across the session rather than evaluating each action in isolation.
Detector:
suspicious_pattern (boolean): fires when a known attack sequence is detected
pattern_type (string): identifies which sequence was matched
sequence_risk (integer, 0–100): confidence score for the matched pattern
Supported pattern types:
credential_theft: Read a credential file → encode the content → call a network or write tool
data_exfiltration: Access sensitive data → transform or aggregate → call an external endpoint
db_exfiltration: Query database → collect rows → transmit to external destination
destructive_sequence: Enumerate files or resources → delete or overwrite in bulk
Default policy (agentic_safety.cedar):
Code agent profile adds type-specific rules. For credential theft, for example, any non-first-party agent is blocked immediately when the pattern fires regardless of sequence_risk:
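Tuning: Sequence detection is inherently context-sensitive. If legitimate workflows match a pattern (e.g., an ETL agent that reads a config file, transforms it, and posts to an API), carve out the legitimate case with an exception scoped to that agent identity and tool set. Because a matching forbid overrides any permit in standard Cedar evaluation, express the exception as an exclusion on the forbid itself (for example, an unless clause) rather than as a standalone permit.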
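A scoped exception for the ETL case might look like the sketch below. The agent identifier "etl-nightly", the policy id, and the sequence_risk cutoff of 60 are hypothetical values chosen for illustration. The field names and the Guardrails::Agent entity type are taken from this page, and the rule is not the shipped default.

```cedar
// Sketch only: illustrative values, not a shipped policy.
// Blocks tool calls when a data_exfiltration sequence is matched with
// moderate or higher confidence, except for one named ETL agent calling
// tools in the external_api category.
@id("sketch-block-exfil-sequences")
forbid(principal is Guardrails::Agent, action == Guardrails::Action::"call_tool", resource)
when {
    context has suspicious_pattern && context.suspicious_pattern == true &&
    context has pattern_type && context.pattern_type == "data_exfiltration" &&
    context has sequence_risk && context.sequence_risk > 60
}
unless {
    // Hypothetical agent identity; replace with the real ETL agent.
    principal == Guardrails::Agent::"etl-nightly" &&
    context has tool_category && context.tool_category == "external_api"
};
```

Placing the exception in an unless clause keeps the carve-out and the block in one policy, so the exception cannot be silently overridden by the forbid it is meant to relax.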
Cumulative session risk
What it is: Individual events in a session each carry a risk contribution. As risk accumulates across turns — even if no single event crosses a block threshold — the session-level circuit breakers engage.
Session context fields:
session_cumulative_risk_score (integer): running total of risk contributions across all turns
session_threat_turns (integer): number of turns in which at least one threat signal fired
session_max_injection_score (integer): peak injection score seen in any single turn
These fields are evaluated on every request in the session, so policies can respond to the session's history — not just the current message.
MAS profile thresholds (multi_agent/agent_safety.cedar):
session_cumulative_risk_score > 200: non-first-party agents restricted from sensitive tools
session_cumulative_risk_score > 500 or session_threat_turns > 5: unverified agents fully locked out from all tool calls
A2A profile thresholds (a2a_security/escalation_detection.cedar) are tighter: the circuit breakers trip at a cumulative risk score of 150 and at 3 threat turns, because A2A sessions lack an orchestrator to catch escalating risk between turns.
Writing a custom circuit breaker:
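As a minimal sketch, a custom breaker can combine the session context fields listed above in a single forbid. The thresholds (300 and 4) and the policy id are illustrative assumptions, not shipped defaults. Pick values that sit between or below the profile thresholds depending on how early the breaker should trip.

```cedar
// Sketch only: thresholds are illustrative, not shipped defaults.
// Denies all tool calls once the session's accumulated risk or its
// number of threat-bearing turns crosses a custom limit.
@id("sketch-session-circuit-breaker")
@severity("high")
forbid(principal, action == Guardrails::Action::"call_tool", resource)
when {
    (context has session_cumulative_risk_score &&
     context.session_cumulative_risk_score > 300) ||
    (context has session_threat_turns &&
     context.session_threat_turns > 4)
};
```

Each has check guards the comparison that follows it, so the policy does not error on requests where a session field is absent.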
See the Cedar Cookbook for more session circuit breaker patterns.
Behavioral drift (rug pull)
What it is: An agent or tool behaves correctly for an initial period to establish trust, then pivots to a different objective — exfiltrating data, executing destructive operations, or ignoring its original instructions. This is the "rug pull" pattern: the behavior change happens after trust has been established, making it harder to catch with per-request policies.
Detectors:
rug_pull_detected (boolean): fires when behavioral drift is detected relative to the agent's established pattern in the session
rug_pull_score (integer, 0–100): confidence score for the drift signal; a higher score means a more sudden and significant deviation
The rug pull detector tracks the sequence of tool calls, action types, and output patterns across the session and flags when the pattern shifts abruptly after 3 or more normal calls.
A2A profile policy (a2a_security/supply_chain.cedar):
MAS and code agent deployments should apply similar rules. The rug pull detector is also used in the code agent supply chain profile to catch MCP server tools that change behavior mid-session.
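A similar rule for a MAS or code agent deployment could be sketched as follows. The score cutoff of 70 and the policy id are hypothetical, and this is not the text of a2a_security/supply_chain.cedar. It only uses the two detector fields described in this section.

```cedar
// Sketch only: the cutoff is an illustrative assumption.
// Blocks further tool calls once drift has been flagged with
// a high confidence score.
@id("sketch-block-rug-pull")
@severity("high")
forbid(principal, action == Guardrails::Action::"call_tool", resource)
when {
    context has rug_pull_detected && context.rug_pull_detected == true &&
    context has rug_pull_score && context.rug_pull_score > 70
};
```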
Tool risk gating
What it is: Some tools carry inherent risk regardless of context — shell execution, bulk file deletion, external HTTP calls. Tool risk gating ensures that high-risk tools are accessible only to agents and sessions that meet a minimum trust and risk threshold.
Context fields:
tool_risk_score (integer, 0–100): risk level assigned to the tool
tool_category (string): categorical classification, one of dangerous, sensitive, shell, file_system, external_api, mcp, database
tool_is_sensitive (boolean): convenience flag for tool_risk_score > 60
Default policy gates:
Dangerous tool block (tool_category == "dangerous"): blocked for all non-first-party agents
Shell execution block (tool_category == "shell"): blocked in code agent profile unless explicitly permitted
Sensitive tool threshold (tool_risk_score > 70): blocked in code agent profile
Autonomous agent cap (agent_type == "autonomous" and tool_risk_score > 70): blocked in MAS profile regardless of trust level
Custom per-agent tool allowlist:
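Standard Cedar evaluation has no most-specific-rule precedence: any matching forbid overrides every matching permit. For the deploy bot permit to take effect, the broad forbid must therefore exclude that agent, for example with an unless clause. See the Cedar Cookbook for more allowlist patterns.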
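One way to write a per-agent allowlist that holds up under standard Cedar evaluation, where a matching forbid wins over any permit, is to pair the permit with a forbid that excludes the allowlisted agent. The sketch below assumes a hypothetical agent identifier "deploy-bot" and hypothetical policy ids, and uses the shell category as the example tool set.

```cedar
// Sketch only: agent identifier and policy ids are hypothetical.
// Grants the deploy bot access to shell-category tools.
@id("sketch-permit-deploy-bot-shell")
permit(principal == Guardrails::Agent::"deploy-bot", action == Guardrails::Action::"call_tool", resource)
when { context has tool_category && context.tool_category == "shell" };

// Blocks shell-category tools for everyone else. The unless clause
// keeps this forbid from matching the deploy bot, so the permit above
// is not overridden.
@id("sketch-forbid-shell-except-deploy-bot")
forbid(principal, action == Guardrails::Action::"call_tool", resource)
when { context has tool_category && context.tool_category == "shell" }
unless { principal == Guardrails::Agent::"deploy-bot" };
```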
Monitoring abuse in Observatory
All of the above signals are visible in Observatory:
Threats view: Filter by Policy category = agentic_security to see all abuse-related detections across your fleet. Use the Event type filter to isolate tool call events specifically.
Sessions view: The session detail panel shows loop_detected, budget_exceeded, cumulative_risk, and turn_count for every session. Sort the session list by violations descending to surface the most active sessions first.
Command Center: The detector drift heatmap tracks firing rates for loop_detected, budget_exceeded, and suspicious_pattern over time. A sudden spike in any of these signals across many sessions indicates a systemic issue (model regression, a poisoned tool, or an active attacker campaign) rather than an isolated event.
Recommended controls by deployment type
Customer-facing chatbot: Token budget per session; chat_assistant profile; loop detection enabled
Autonomous code agent: code_agent profile; path security; sequence detection for credential theft and destructive ops; tool risk cap at 70 for autonomous agents
RAG / data pipeline: data_pipeline profile; sequence detection for data/DB exfiltration; PII zero-tolerance
Multi-agent orchestration: multi_agent profile; session circuit breakers at cumulative risk 200/500; post-detection lockdowns
Peer-to-peer agents: a2a_security profile; escalation detection with tighter thresholds (150 cumulative, 3 threat turns)
High-security / regulated: All of the above + advanced_detection profile; alert on every agentic_security event
Related
Policy Templates — which profiles include abuse detection rules
A2A Policies — session escalation detection in peer-to-peer agent deployments
Multi-Agent Policies — cross-turn session circuit breakers in orchestrated systems
Cedar Cookbook — writing custom thresholds and allowlists
Observatory Sessions — monitoring session-level agentic metrics