Abuse Detection and Control

Agentic abuse is distinct from prompt injection or data exfiltration — it targets the agent's execution lifecycle rather than its content. A looping agent, a session consuming unbounded tokens, or a tool call sequence quietly building toward exfiltration can each cause significant damage without triggering a single content-based detector.

Highflame detects and enforces against these patterns through a combination of session-aware signal tracking, behavioral pattern matching, and Cedar policies that evaluate context accumulated across multiple turns — not just the current request.


Abuse scenarios and controls

Agent loops

What it is: An agent repeatedly invokes the same tool without making progress — either stuck in an error-recovery cycle, manipulated by a prompt to repeat an action, or misconfigured such that the model always produces the same tool call.

Detectors:

  • loop_detected (boolean) — fires when the same tool is invoked consecutively above a count threshold

  • loop_count (integer) — the number of consecutive identical tool invocations observed so far in the session

Default policy (agentic_safety.cedar):

forbid(principal, action == Guardrails::Action::"call_tool", resource)
when { context has loop_detected && context.loop_detected == true };

Code agent profile (code_agent/agentic_security.cedar) adds a count-based gate: the block applies only once loop_count exceeds 5 (that is, from the sixth consecutive invocation onward), giving the agent room to legitimately retry a failed tool before it is stopped:

@id("code-block-loops")
@severity("high")
forbid(principal is Guardrails::Agent, action == Guardrails::Action::"call_tool", resource)
when {
    context has loop_detected && context.loop_detected == true &&
    context has loop_count && context.loop_count > 5
};

Tuning: If your agents legitimately retry failed tool calls (e.g., a flaky external API), raise the loop_count threshold in a custom policy scoped to that agent identity. Do not disable loop detection entirely — even well-intentioned retries can spin indefinitely if the upstream is down.
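As an illustration of the tuning advice, here is a minimal sketch of a custom rule with a higher threshold for one agent. The agent ID "flaky-api-agent", the policy ID, and the threshold of 10 are hypothetical; the entity type, action, and context fields are the ones used in the profile rule above.

```cedar
// Hypothetical agent ID and threshold, for illustration only.
@id("custom-loop-flaky-api")
@severity("high")
forbid(
    principal == Guardrails::Agent::"flaky-api-agent",
    action == Guardrails::Action::"call_tool",
    resource
)
when {
    context has loop_detected && context.loop_detected == true &&
    context has loop_count && context.loop_count > 10
};
```

In Cedar a matching forbid always wins, so this rule only raises the effective threshold if no stricter loop forbid still matches the same agent. Any broader rule that remains active would need an exception for this agent, for example an unless clause naming it.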


Token budget overruns

What it is: A session consumes tokens far beyond what is expected for its task — either due to a model stuck in a reasoning loop, a context window growing uncontrolled across many turns, or an adversarial input designed to inflate token usage as a denial-of-service against your LLM API budget.

Detector:

  • budget_exceeded (boolean) — fires when the session has consumed more tokens than its configured limit

Default policy (agentic_safety.cedar):

When budget_exceeded fires, the action is blocked and the session is effectively terminated — no further tool calls or prompts will be evaluated until a new session begins.
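The rule text is not reproduced above. A minimal sketch of what a budget gate could look like, written in the style of the loop rule: it assumes budget_exceeded is exposed in context like the other detectors, and leaves the action unconstrained so that both prompts and tool calls are blocked. The policy ID is hypothetical.

```cedar
// Sketch only: the policy ID is an assumption, not the shipped rule.
@id("block-budget-exceeded")
forbid(principal, action, resource)
when { context has budget_exceeded && context.budget_exceeded == true };
```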

Setting a budget: Configure token budgets per agent or per application in Highflame Studio → Shield → Policies. The budget applies to the sum of input and output tokens across all turns in the session. Set separate limits for development agents (low limit, fail fast) and production agents (higher limit, with alerting before the hard block).

Monitoring: The Observatory Sessions view shows budget_exceeded as a flag in the session detail panel, and the total tokens in/out for each session. Use this to calibrate budgets — set the hard limit at ~150% of your 95th percentile session token usage.


Suspicious action sequences

What it is: Multi-step tool call patterns that, taken individually, appear benign but together form a recognized attack trajectory. The pattern detector analyzes the sequence of actions across the session rather than evaluating each action in isolation.

Detector:

  • suspicious_pattern (boolean) — fires when a known attack sequence is detected

  • pattern_type (string) — identifies which sequence was matched

  • sequence_risk (integer, 0–100) — confidence score for the matched pattern

Supported pattern types:

| Pattern | Sequence detected |
|---|---|
| credential_theft | Read a credential file → encode the content → call a network or write tool |
| data_exfiltration | Access sensitive data → transform or aggregate → call an external endpoint |
| db_exfiltration | Query database → collect rows → transmit to external destination |
| destructive_sequence | Enumerate files or resources → delete or overwrite in bulk |

Default policy (agentic_safety.cedar):
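The default rule is not reproduced here. As a sketch in the style of the loop rule, a sequence gate might combine the boolean signal with a confidence cut-off. The threshold of 80 and the policy ID are assumptions for illustration, not documented defaults.

```cedar
// Sketch only: the 80 cut-off and the policy ID are hypothetical.
@id("block-suspicious-sequence")
forbid(principal, action == Guardrails::Action::"call_tool", resource)
when {
    context has suspicious_pattern && context.suspicious_pattern == true &&
    context has sequence_risk && context.sequence_risk >= 80
};
```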

Code agent profile adds type-specific rules. For credential theft, for example, any non-first-party agent is blocked immediately when the pattern fires regardless of sequence_risk:
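A sketch of what such a type-specific rule could look like. The context field agent_is_first_party and the policy ID are hypothetical stand-ins, since the way first-party status is exposed is not shown in this section. Note that the rule has no sequence_risk condition, so it fires at any confidence level.

```cedar
// Sketch only: agent_is_first_party is a hypothetical context field.
@id("code-block-credential-theft")
forbid(principal is Guardrails::Agent, action == Guardrails::Action::"call_tool", resource)
when {
    context has suspicious_pattern && context.suspicious_pattern == true &&
    context has pattern_type && context.pattern_type == "credential_theft"
}
unless {
    context has agent_is_first_party && context.agent_is_first_party == true
};
```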

Tuning: Sequence detection is inherently context-sensitive. If legitimate workflows match a pattern (e.g., an ETL agent that reads a config file, transforms it, and posts to an API), carve out the legitimate case by exempting that agent identity and tool set in the forbid's own conditions. A separate Cedar permit is not enough on its own, because a matching forbid always overrides a permit and policy order has no effect.


Cumulative session risk

What it is: Individual events in a session each carry a risk contribution. As risk accumulates across turns — even if no single event crosses a block threshold — the session-level circuit breakers engage.

Session context fields:

  • session_cumulative_risk_score (integer) — running total of risk contributions across all turns

  • session_threat_turns (integer) — number of turns in which at least one threat signal fired

  • session_max_injection_score (integer) — peak injection score seen in any single turn

These fields are evaluated on every request in the session, so policies can respond to the session's history — not just the current message.

MAS profile thresholds (multi_agent/agent_safety.cedar):

| Threshold | Effect |
|---|---|
| session_cumulative_risk_score > 200 | Non-first-party agents restricted from sensitive tools |
| session_cumulative_risk_score > 500 or session_threat_turns > 5 | Unverified agents fully locked out from all tool calls |

A2A profile thresholds (a2a_security/escalation_detection.cedar) are tighter: the circuit breakers trip at a cumulative risk score of 150 and at 3 threat turns, because A2A sessions lack an orchestrator to catch escalating risk between turns.

Writing a custom circuit breaker:
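A minimal sketch of a custom breaker built on the session fields listed above. The thresholds (300 and 4) and the policy ID are hypothetical; choose values from your own session data.

```cedar
// Sketch only: thresholds and policy ID are illustrative assumptions.
@id("custom-session-circuit-breaker")
@severity("high")
forbid(principal, action == Guardrails::Action::"call_tool", resource)
when {
    (context has session_cumulative_risk_score &&
     context.session_cumulative_risk_score > 300) ||
    (context has session_threat_turns &&
     context.session_threat_turns > 4)
};
```

Each branch guards its field with has, so the rule stays safe to evaluate on requests where only one of the two fields is populated.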

See the Cedar Cookbook for more session circuit breaker patterns.


Behavioral drift (rug pull)

What it is: An agent or tool behaves correctly for an initial period to establish trust, then pivots to a different objective — exfiltrating data, executing destructive operations, or ignoring its original instructions. This is the "rug pull" pattern: the behavior change happens after trust has been established, making it harder to catch with per-request policies.

Detectors:

  • rug_pull_detected (boolean) — fires when behavioral drift is detected relative to the agent's established pattern in the session

  • rug_pull_score (integer, 0–100) — confidence score for the drift signal; higher = more sudden and significant deviation

The rug pull detector tracks the sequence of tool calls, action types, and output patterns across the session and flags when the pattern shifts abruptly after 3 or more normal calls.

A2A profile policy (a2a_security/supply_chain.cedar):
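The policy text is not reproduced here. As a sketch under stated assumptions, a drift gate could pair the boolean signal with a score cut-off. The threshold of 70 and the policy ID are hypothetical, not the profile's actual values.

```cedar
// Sketch only: the 70 cut-off and the policy ID are hypothetical.
@id("a2a-block-rug-pull")
forbid(principal, action == Guardrails::Action::"call_tool", resource)
when {
    context has rug_pull_detected && context.rug_pull_detected == true &&
    context has rug_pull_score && context.rug_pull_score >= 70
};
```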

MAS and code agent deployments should apply similar rules. The rug pull detector is also used in the code agent supply chain profile to catch MCP server tools that change behavior mid-session.


Tool risk gating

What it is: Some tools carry inherent risk regardless of context — shell execution, bulk file deletion, external HTTP calls. Tool risk gating ensures that high-risk tools are accessible only to agents and sessions that meet a minimum trust and risk threshold.

Context fields:

  • tool_risk_score (integer, 0–100) — risk level assigned to the tool

  • tool_category (string) — categorical classification: dangerous, sensitive, shell, file_system, external_api, mcp, database

  • tool_is_sensitive (boolean) — convenience flag for tool_risk_score > 60

Default policy gates:

| Gate | Condition | Effect |
|---|---|---|
| Dangerous tool block | tool_category == "dangerous" | Blocked for all non-first-party agents |
| Shell execution block | tool_category == "shell" | Blocked in code agent profile unless explicitly permitted |
| Sensitive tool threshold | tool_risk_score > 70 | Blocked in code agent profile |
| Autonomous agent cap | agent_type == "autonomous" and tool_risk_score > 70 | Blocked in MAS profile regardless of trust level |

Custom per-agent tool allowlist:
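A sketch of a per-agent allowlist for shell tools, assuming a hypothetical agent ID "deploy-bot". The permit grants the deploy bot access, and the broad forbid carries an explicit exception for it so that the permit can take effect.

```cedar
// Hypothetical agent ID "deploy-bot", for illustration only.
permit(
    principal == Guardrails::Agent::"deploy-bot",
    action == Guardrails::Action::"call_tool",
    resource
)
when { context has tool_category && context.tool_category == "shell" };

// Broad block on shell tools, with the deploy bot carved out.
forbid(
    principal,
    action == Guardrails::Action::"call_tool",
    resource
)
when { context has tool_category && context.tool_category == "shell" }
unless { principal == Guardrails::Agent::"deploy-bot" };
```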

Cedar does not rank rules by specificity: a matching forbid always overrides a permit. For the deploy bot's permit to take effect, the broad forbid must exclude the deploy bot, for example with an unless clause. See the Cedar Cookbook for more allowlist patterns.


Monitoring abuse in Observatory

All of the above signals are visible in Observatory:

Threats view: Filter by Policy category = agentic_security to see all abuse-related detections across your fleet. Use the Event type filter to isolate tool call events specifically.

Sessions view: The session detail panel shows loop_detected, budget_exceeded, cumulative_risk, and turn_count for every session. Sort the session list by violations descending to surface the most active sessions first.

Command Center: The detector drift heatmap tracks firing rates for loop_detected, budget_exceeded, and suspicious_pattern over time. A sudden spike in any of these signals across many sessions indicates a systemic issue (model regression, a poisoned tool, or an active attacker campaign) rather than an isolated event.


| Scenario | Recommended controls |
|---|---|
| Customer-facing chatbot | Token budget per session; chat_assistant profile; loop detection enabled |
| Autonomous code agent | code_agent profile; path security; sequence detection for credential theft and destructive ops; tool risk cap at 70 for autonomous agents |
| RAG / data pipeline | data_pipeline profile; sequence detection for data/DB exfiltration; PII zero-tolerance |
| Multi-agent orchestration | multi_agent profile; session circuit breakers at cumulative risk 200/500; post-detection lockdowns |
| Peer-to-peer agents | a2a_security profile; escalation detection with tighter thresholds (150 cumulative, 3 threat turns) |
| High-security / regulated | All of the above + advanced_detection profile; alert on every agentic_security event |

