Threat Response

This guide walks enterprise administrators through responding to threat events detected by Overwatch across AI coding assistants and their MCP server integrations. It covers how threats appear in the console, how to triage them, what actions to take by severity, and how to tune policies after a false positive.

If you have not yet configured policies, start with Setting Up Policies. To understand which MCP servers are in use and who is using them, see Discovery & Metrics.


Overview: What Threat Events Look Like

When Overwatch detects a policy violation or security signal, it records a structured threat event. These events are visible in real time in the Highflame Studio console under Threats & Violations. Each event represents a single decision point — one prompt evaluation, tool call, file operation, or model response — where one or more detectors fired and a Cedar policy matched.

Threat events aggregate across all users and coding assistants in your organization. They are not session-level summaries; they are individual, discrete records of what happened, what was detected, and what the system decided (allowed or denied). This makes them suitable for both real-time response and post-incident audit.

Code Agent threat events also appear in the Observatory Threats view, where they can be correlated with browser violations, gateway blocks, and Shield guardrail events from the same user or session.


Threat Categories and Severity Levels

Overwatch organizes threats into categories that correspond to the policy categories defined in your policy configuration.

| Category | Description | Typical Severity |
| --- | --- | --- |
| Prompt Injection | Direct or indirect injection attempts, jailbreaks, multi-turn trajectory drift | High / Critical |
| Secrets Detection | API keys, tokens, credentials, or other secrets in prompts, tool calls, or responses | High |
| PII Detection | Personally identifiable information (credit cards, SSNs, names+contact data) in content | Medium / High |
| Dangerous Tool Usage | Shell execution, destructive file operations, mass-delete commands, external writes | High / Critical |
| Agent Security | Tool poisoning, rug pull attacks, indirect injection targeting coding agents | Critical |
| Content Safety | Violent, hateful, harmful, or sexually explicit content | Medium / High |
| Policy Violation | Action denied by an organization rule, scope restriction, or custom Cedar policy | Low / Medium / High |

Severity levels:

  • Low — policy matched but risk is minimal; no immediate action required. Examples: a tool call slightly exceeding a risk threshold, a keyword match on a borderline term.

  • Medium — policy matched with moderate risk; warrants review and possible policy tuning. Examples: PII detected in a file read, repeated borderline injection scores.

  • High — clear policy violation with meaningful risk; warrants blocking or notifying the user. Examples: a secret in a prompt, a high-confidence injection attempt, a dangerous tool executed outside an approved context.

  • Critical — severe risk requiring immediate response. Examples: tool poisoning attack, agent executing destructive shell commands with exfiltrated credentials, multi-step rug pull attempt.
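
As a small illustration of how the four severity levels order a triage queue, the sketch below models them as an ordered enum and sorts events most-severe first. The event dictionaries and the `triage_order` helper are hypothetical and are not part of any Overwatch API.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from this guide; a higher value means more urgent."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def triage_order(events):
    """Return events sorted most severe first; ties keep their input order."""
    return sorted(
        events,
        key=lambda e: Severity[e["severity"].upper()],
        reverse=True,
    )


# Hypothetical events, using the severity labels shown in the console.
events = [
    {"id": "e1", "severity": "Low"},
    {"id": "e2", "severity": "Critical"},
    {"id": "e3", "severity": "High"},
]
ordered = [e["id"] for e in triage_order(events)]  # ["e2", "e3", "e1"]
```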


Step-by-Step Triage Workflow

Finding Threats in the Console

  1. Open Highflame Studio and navigate to your organization's Overwatch console.

  2. In the left navigation, select Threats & Violations.

  3. The page shows a summary row at the top with four counts: Total, Allowed, Denied, and Threats. Use these to orient quickly — a large gap between Total and Denied means most threats are being logged but not blocked (check your enforcement mode).

  4. The event list below the summary shows individual threat records in reverse chronological order.

  5. Use the available filters to narrow the list:

    • Severity — filter to High or Critical to triage the most urgent events first.

    • Category — isolate a specific threat type (e.g., Agent Security only).

    • Code Agent — filter to a specific IDE or assistant (for example, Cursor or Claude Code).

    • MCP Server — filter to events involving a specific server.

    • User — filter to a specific developer.

    • Time range — focus on a specific window.
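
If you work with exported events outside the console, the summary row and filters above can be approximated as follows. This is a sketch over hypothetical event dictionaries; the field names are assumptions, and the Threats count is left out because this guide does not define how it is derived.

```python
from collections import Counter


def summarize(events):
    """Count events by decision, mirroring the Total/Allowed/Denied summary."""
    decisions = Counter(e["decision"] for e in events)
    return {
        "total": len(events),
        "allowed": decisions["allowed"],
        "denied": decisions["denied"],
    }


def filter_events(events, **criteria):
    """Keep events whose fields equal every given criterion."""
    return [
        e for e in events
        if all(e.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical exported events.
events = [
    {"decision": "allowed", "severity": "Low", "code_agent": "Cursor"},
    {"decision": "denied", "severity": "High", "code_agent": "Claude Code"},
    {"decision": "allowed", "severity": "High", "code_agent": "Cursor"},
]
summary = summarize(events)
high_cursor = filter_events(events, severity="High", code_agent="Cursor")
```

Here two of the three events were allowed, which is the kind of gap between Total and Denied that suggests checking the enforcement mode.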

Reading a Threat Event

Click any event row to open the detail view. The key fields:

| Field | What it tells you |
| --- | --- |
| Timestamp | When the event was recorded (UTC). |
| User | The developer whose agent session triggered the event. |
| Code Agent | Which AI coding assistant was in use (Cursor, Claude Code, etc.). |
| MCP Server | Which MCP server the tool call was directed to, if applicable. |
| Action | The runtime action: process_prompt, execute_tool, read_file, write_file, etc. |
| Decision | allowed or denied: what Overwatch actually did. |
| Severity | Low, Medium, High, or Critical. |
| Detected Threats | Structured list of what fired: injection score, secret presence, PII flag, tool risk, and so on. These are the projected context values that Cedar evaluated. |
| Matched Policies | Which Cedar policies contributed to the decision. |
| Payload Preview | A truncated view of the content that was evaluated (secrets and PII may be masked). |
| Session ID | The agent session this event belongs to. Click to view all events in the session, or open it in the Observatory Sessions view for a cross-product timeline including gateway and browser events. |
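
For teams that script against exported events, it can help to picture the fields above as a single record. The dataclass below is only a sketch: the attribute names and types are assumptions derived from the field list, not a documented Overwatch schema, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ThreatEvent:
    """One discrete decision point, as described in the detail view."""
    timestamp: str                      # recorded in UTC
    user: str
    code_agent: str                     # e.g. "Cursor" or "Claude Code"
    action: str                         # e.g. "execute_tool", "read_file"
    decision: str                       # "allowed" or "denied"
    severity: str                       # "Low", "Medium", "High", "Critical"
    session_id: str
    mcp_server: Optional[str] = None    # only set when a server is involved
    detected_threats: dict = field(default_factory=dict)
    matched_policies: list = field(default_factory=list)
    payload_preview: str = ""


# Invented example values for illustration only.
event = ThreatEvent(
    timestamp="2025-01-01T12:00:00Z",
    user="dev@example.com",
    code_agent="Cursor",
    action="execute_tool",
    decision="denied",
    severity="High",
    session_id="session-123",
    detected_threats={"injection_score": 72, "tool_risk": "high"},
)
```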

Determining True Positive vs. False Positive

Before taking a response action, determine whether the threat is genuine.

Signs it is a true positive:

  • The injection_score is 80 or above and the payload contains clear instruction-override language.

  • contains_secrets is true and the payload preview shows a recognizable credential format.

  • The tool risk is high and the action is a destructive shell command or external write outside any approved context.

  • The matched policies are from the Agent Security category (tool poisoning or rug pull indicators).

  • Multiple detectors fired on the same event (correlated signals increase confidence).

Signs it may be a false positive:

  • The injection score is in the 40–65 range and the payload is a rephrased technical question.

  • PII was detected but the content is a test fixture or synthetic data with fake personal information.

  • A tool is flagged as high risk but it is a well-understood internal tool operating in an approved environment.

  • The matched policy has a threshold that has not yet been calibrated to your organization's traffic patterns.

  • The same user has many similar events all flagged at the same low-medium severity with no escalation pattern.

If you are unsure, use the Playground to replay the payload against your current policy set and examine the projected context. See How to Tune Policies After a False Positive below.
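
The signals above can be turned into a rough first-pass sorter for a triage queue. The sketch below is a heuristic under stated assumptions: the function names and return labels are hypothetical, the rule that two or more strong signals suggest a true positive is an illustrative choice, and it cannot inspect the payload text, so it never replaces reading the payload preview.

```python
def count_strong_signals(detected, category):
    """Count the true-positive indicators listed in this guide."""
    signals = 0
    if detected.get("injection_score", 0) >= 80:
        signals += 1
    if detected.get("contains_secrets", False):
        signals += 1
    if detected.get("tool_risk") == "high":
        signals += 1
    if category == "Agent Security":
        signals += 1
    return signals


def first_pass(detected, category):
    """Suggest a starting hypothesis; a human still confirms it."""
    signals = count_strong_signals(detected, category)
    if signals >= 2:
        # Correlated signals increase confidence.
        return "likely true positive"
    if signals == 1:
        return "review payload"
    score = detected.get("injection_score")
    if score is not None and 40 <= score <= 65:
        return "possible false positive"
    return "review payload"


correlated = first_pass(
    {"injection_score": 85, "contains_secrets": True}, "Prompt Injection"
)
borderline = first_pass({"injection_score": 52}, "Prompt Injection")
single = first_pass({"tool_risk": "high"}, "Dangerous Tool Usage")
```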


Response Actions by Severity

Low and Medium Threats

For low and medium severity events where no immediate harm has occurred:

  1. Review the event detail. Confirm whether the detection was accurate. Check the payload preview and matched policies.

  2. Check for patterns. Navigate to the session view to see whether this is an isolated event or part of a sequence. Multiple low-severity events in a session can indicate escalation.

  3. Tune the policy if needed. If the event is a false positive, adjust the threshold or add an exception. Use the Playground to validate before deploying. See the tuning section below.

  4. Add to monitoring if borderline. If the signal is genuine but below your enforcement threshold, consider lowering the threshold incrementally in monitor mode to observe the rate before enforcing.

  5. No user notification required for most low/medium events unless a pattern is emerging.
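
The pattern check in step 2 can be sketched as a count of low and medium severity events per session. The `min_events` cutoff of 3 is an arbitrary illustrative value, and the event fields are assumptions about an exported record.

```python
from collections import defaultdict


def sessions_with_repeats(events, min_events=3):
    """Return session IDs with at least min_events Low/Medium events."""
    counts = defaultdict(int)
    for e in events:
        if e["severity"] in ("Low", "Medium"):
            counts[e["session_id"]] += 1
    return sorted(s for s, n in counts.items() if n >= min_events)


# Hypothetical events: session s1 has three borderline events, s2 has one.
events = [
    {"session_id": "s1", "severity": "Low"},
    {"session_id": "s1", "severity": "Medium"},
    {"session_id": "s2", "severity": "Low"},
    {"session_id": "s1", "severity": "Low"},
    {"session_id": "s2", "severity": "High"},
]
flagged_sessions = sessions_with_repeats(events)  # ["s1"]
```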

High Severity Threats

For high severity events where a clear policy violation has occurred:

  1. Confirm the decision. Check whether the event was denied or allowed. If allowed, your policies may be in monitor mode or the threshold was not reached. Consider whether the policy should be tightened.

  2. Block the specific tool or server if the risk is targeted. If the threat involves a specific MCP server or tool, disable that server from the console while you investigate. See How to Disable a Specific MCP Server below.

  3. Notify the affected user. Reach out to the developer whose session triggered the event. Explain what was detected and what action was taken. In most cases this is informational — the developer may not have been aware of the injection in a retrieved document or MCP tool description.

  4. Review the session. Click the Session ID to view the full event history for that session. For a deeper view including distributed traces and cross-product events, open the session in Observatory Sessions. Check whether other tools were called before or after the threat event that could indicate a broader compromise.

  5. Update policy thresholds. If the event was a true positive that was allowed because its score fell below the configured threshold, tighten the threshold and redeploy in enforce mode.

Critical Threats

Critical threats require immediate action and coordination.

  1. Revoke the agent session immediately. From the Sessions view, locate the session associated with the event and revoke it. This terminates the active agent connection and prevents further tool execution in that session.

  2. Disable the MCP server involved. If the threat originated from or was directed at a specific MCP server, disable it from the Discovery page. See How to Disable a Specific MCP Server below.

  3. Notify the affected user and their team. Inform the developer and, depending on what was accessed or executed, their team or manager.

  4. Preserve the event record. Do not delete or modify the threat event. It is the primary artifact for post-incident review.

  5. Escalate to your security team. Follow your organization's incident response process. See the Escalation Path section below.

  6. Conduct a session replay. Review all events in the session to reconstruct what the agent did, what data it accessed, and whether any exfiltration or destructive action completed before the block.

  7. Review for lateral exposure. Check whether the same MCP server is connected to other developers' agents. If the threat was a tool poisoning attack embedded in the MCP server's tool descriptions, all users of that server are potentially affected.
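
For the lateral exposure review in step 7, a simple pass over exported events can list every other developer whose agent has touched the same MCP server. The helper and field names below are hypothetical.

```python
def exposed_users(events, server, exclude_user=None):
    """Return the sorted set of users with events involving the given server."""
    return sorted({
        e["user"]
        for e in events
        if e.get("mcp_server") == server and e["user"] != exclude_user
    })


# Hypothetical events across three developers and two servers.
events = [
    {"user": "alice", "mcp_server": "docs-server"},
    {"user": "bob", "mcp_server": "docs-server"},
    {"user": "carol", "mcp_server": "build-server"},
    {"user": "bob", "mcp_server": "docs-server"},
]
others = exposed_users(events, "docs-server", exclude_user="alice")  # ["bob"]
```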


How to Disable a Specific MCP Server from the Console

When a threat is linked to a specific MCP server, you can disable that server for your organization directly from the Overwatch console.

  1. Navigate to Discovery in the left navigation.

  2. Locate the MCP server in the MCP Servers Overview table. You can search by server name or filter by the code agent.

  3. Click the server row to open the server detail panel.

  4. In the server detail panel, find the Status control and toggle it to Disabled.

  5. Confirm the action when prompted. The server will be blocked for all new connections across the organization immediately. Existing sessions may continue until revoked.

  6. Optionally, navigate to Sessions and revoke any active sessions that were connected to this server.

To re-enable the server after investigation is complete, return to the server detail panel and toggle Status back to Enabled. Document the reason for re-enabling in your incident record.


How to Tune Policies After a False Positive

A false positive is a legitimate request that was incorrectly denied. Tuning policies reduces false positives while preserving protection against genuine threats.

Step 1: Identify the matching policy.

Open the false-positive event detail in Threats & Violations and note the Matched Policies field. This tells you which Cedar rule caused the deny.

Step 2: Understand the projected context.

Note the Detected Threats field. These are the context key values that Cedar evaluated. For example: injection_score: 72, tool_risk: high, content_type: tool_call.

Step 3: Open the Playground.

Navigate to Policy in the left navigation, then open the Playground. The Playground lets you test policy changes against specific payloads interactively.

Step 4: Paste or type a representative payload.

Enter a payload that resembles the false-positive request. Run it against your current policy set to confirm it reproduces the deny decision.

Step 5: Adjust the policy.

Options for reducing false positives without removing protection:

  • Raise the threshold. If injection_score >= 60 is matching legitimate rephrased instructions, raise it to >= 75 and observe the impact.

  • Narrow the scope. Add a content_type or action condition to limit the policy to the cases you actually want to block. For example, apply injection blocking only to content_type == "prompt" if false positives are occurring on tool_call content.

  • Add an exception. Use a Cedar permit with a specific resource or principal condition to carve out the legitimate case while keeping the broader forbid in place.
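
To see how the first two options change outcomes before editing a real policy, the effect can be modeled in a few lines. The sketch below is a Python stand-in for a forbid rule, not Cedar syntax; the `denies` function and the sample payloads are hypothetical, while the context keys and the 60 and 75 thresholds come from the examples above.

```python
def denies(ctx, threshold=60, only_content_type=None):
    """Model a forbid rule: deny when injection_score meets the threshold,
    optionally restricted to a single content_type."""
    if only_content_type is not None and ctx["content_type"] != only_content_type:
        return False
    return ctx["injection_score"] >= threshold


# Hypothetical projected contexts: one false positive, one real attack.
samples = [
    {"name": "rephrased question", "injection_score": 72, "content_type": "tool_call"},
    {"name": "override attempt", "injection_score": 91, "content_type": "prompt"},
]

before = [s["name"] for s in samples if denies(s, threshold=60)]
after_raise = [s["name"] for s in samples if denies(s, threshold=75)]
after_scope = [
    s["name"] for s in samples
    if denies(s, threshold=60, only_content_type="prompt")
]
```

In this toy data both adjustments stop denying the rephrased question while still denying the override attempt. Real traffic will not separate this cleanly, which is why the next steps validate in the Playground and in monitor mode.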

Step 6: Validate in the Playground.

After making the change, re-run the false-positive payload. Confirm it now resolves to permit. Also test a known-malicious payload to confirm the policy still blocks it.

Step 7: Deploy in monitor mode first.

Apply the updated policy with the application still in monitor mode. Monitor the Threats & Violations feed for one to two days to confirm the false-positive rate has decreased without increasing the miss rate.
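
One way to make the monitor-mode comparison concrete is to have an analyst label a sample of events and compute both rates before and after the change. The sketch assumes each reviewed event carries a hypothetical `label` ("benign" or "malicious") assigned by a reviewer and a `flagged` boolean; neither is a documented Overwatch field.

```python
def rates(reviewed):
    """Return (false_positive_rate, miss_rate) for analyst-labeled events."""
    benign = [e for e in reviewed if e["label"] == "benign"]
    malicious = [e for e in reviewed if e["label"] == "malicious"]
    # False positive: a benign event that the policy flagged.
    fp_rate = sum(e["flagged"] for e in benign) / len(benign) if benign else 0.0
    # Miss: a malicious event that the policy did not flag.
    miss_rate = (
        sum(not e["flagged"] for e in malicious) / len(malicious)
        if malicious else 0.0
    )
    return fp_rate, miss_rate


# Hypothetical labeled sample collected after the policy change.
reviewed = [
    {"label": "benign", "flagged": True},
    {"label": "benign", "flagged": False},
    {"label": "benign", "flagged": False},
    {"label": "benign", "flagged": False},
    {"label": "malicious", "flagged": True},
    {"label": "malicious", "flagged": True},
]
fp_rate, miss_rate = rates(reviewed)  # (0.25, 0.0)
```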

Step 8: Switch to enforce.

Once you are satisfied with the tuned behavior, switch the application to enforce mode. Monitor closely for the first few hours after the mode change.


How to Set Up Alerts So Threats Reach Your Team

Threat events in the console are useful for triage, but for real-time response your team needs to receive alerts in the channels they already monitor.

Highflame supports two primary alert destinations:

  • Slack — send threat alerts to a channel via an incoming webhook.

  • Splunk — forward structured threat events to your SIEM via the HTTP Event Collector.

For setup instructions, see Alerts.

Scoping alerts to Code Agent threats specifically:

After configuring your integration, use the trigger_condition field to restrict alerting to the threat types and severity levels relevant to your code agent security program. For example, you can limit alerts to agent-security-specific threats.

This prevents high-volume low-severity events from flooding your alert channel while ensuring critical threats reach your team immediately. Refer to the Shield API reference for the full list of supported threat type identifiers.
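
The intended effect of such scoping can be modeled as a simple predicate. This is not the trigger_condition syntax, which is documented in Alerts and the Shield API reference; the function, category strings, and event fields below are assumptions used only to show which events would reach the channel.

```python
SEVERITY_ORDER = ["Low", "Medium", "High", "Critical"]


def should_alert(event, categories, min_severity):
    """Return True when the event matches a category and meets the severity floor."""
    meets_floor = (
        SEVERITY_ORDER.index(event["severity"])
        >= SEVERITY_ORDER.index(min_severity)
    )
    return event["category"] in categories and meets_floor


# Hypothetical scope: agent-security threats at High severity or above.
scope = {"Agent Security"}
poisoning = {"category": "Agent Security", "severity": "Critical"}
noisy = {"category": "Policy Violation", "severity": "Low"}

alert_poisoning = should_alert(poisoning, scope, "High")  # True
alert_noisy = should_alert(noisy, scope, "High")          # False
```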


Escalation Path for Critical Incidents

When a critical threat event requires escalation beyond the standard triage workflow:

  1. Immediate containment (0–15 minutes):

    • Revoke the affected agent session.

    • Disable the implicated MCP server if one is involved.

    • Confirm whether any destructive or exfiltration action completed before the block.

  2. Initial notification (15–60 minutes):

    • Notify the affected developer and their manager.

    • Page or message your security team via your standard incident channel.

    • Create an incident record in your ticketing system with the threat event ID, session ID, and a summary of what was detected.

  3. Investigation (1–4 hours):

    • Replay the full session in the console to reconstruct the sequence of actions. Use Observatory Sessions to see a unified timeline that includes gateway events and browser activity from the same user.

    • Use Observatory Traces to view the distributed trace for the agent run, including LLM calls, tool invocations, and latency breakdown.

    • Determine whether the threat was sourced from an external MCP server (tool poisoning) or originated in user input (direct injection).

    • Check whether other developers use the same MCP server or have had similar events.

    • Export the event record from Threats & Violations for your incident file.

  4. Remediation:

    • For tool poisoning: work with the MCP server owner to identify and remove the poisoned tool description. Keep the server disabled until it is remediated and re-scanned.

    • For credential exposure: rotate any secrets that were present in the detected payload immediately, regardless of whether exfiltration is confirmed.

    • For destructive execution: assess blast radius and restore from backup as appropriate.

  5. Post-incident:

    • Update Cedar policies to prevent recurrence.

    • Lower thresholds or add targeted rules for the detected pattern.

    • Document findings and distribute a brief post-mortem to the engineering team.

    • Re-enable blocked servers only after verification that the threat has been remediated.
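
The time windows in this escalation path can be encoded as a small lookup, which may be handy in an on-call runbook script. The function name is hypothetical, the 15-minute and 60-minute boundaries are assigned to the later phase as an arbitrary choice, and treating everything past four hours as remediation is an assumption, since this guide gives no time window for the last two phases.

```python
def escalation_phase(minutes_since_detection):
    """Map elapsed minutes to the escalation phase described above."""
    if minutes_since_detection < 15:
        return "immediate containment"
    if minutes_since_detection < 60:
        return "initial notification"
    if minutes_since_detection < 240:  # 1 to 4 hours
        return "investigation"
    return "remediation and post-incident"


phases = [escalation_phase(m) for m in (5, 15, 90, 300)]
```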

