Securing Model Calls

Every LLM request routed through the Agent Gateway passes through a layered set of security controls — before the request reaches the model provider, during provider interaction, and before the response is returned to the caller. This page documents each control layer.


Provider credential isolation

Applications never handle model provider API keys. When you register a provider in Highflame Studio, the API key is stored in the Secrets Vault (backed by AWS Secrets Manager with KMS encryption at rest). Applications reference a named route; the Gateway resolves the route to a provider and retrieves the credential from the vault at request time.

This means:

  • A compromised agent or application cannot leak the underlying model API key — it never has access to it

  • Rotating a provider key requires updating it in one place (the vault), not across every application

  • If the secrets detector fires on an inbound request, it has caught a key the caller should never have held, which may indicate an exfiltration attempt
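
The route-to-credential resolution described above can be sketched roughly as follows. This is an illustrative sketch, not the Gateway's implementation: `Route`, `InMemoryVault`, and `build_provider_call` are hypothetical names, and the in-memory vault stands in for the Secrets Vault.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Route:
    provider: str     # hypothetical provider identifier
    model: str        # model configured for this route
    secret_name: str  # name under which the provider key is stored


class InMemoryVault:
    """Stand-in for the Secrets Vault; a real deployment would query the vault service."""

    def __init__(self, secrets: Dict[str, str]) -> None:
        self._secrets = secrets

    def get(self, name: str) -> str:
        return self._secrets[name]


def build_provider_call(route_name: str, prompt: str,
                        routes: Dict[str, Route], vault: InMemoryVault) -> dict:
    """Resolve a named route and attach the provider key on the Gateway side only."""
    route = routes[route_name]              # the caller supplies only the route name
    api_key = vault.get(route.secret_name)  # fetched at request time, never sent to the caller
    return {
        "provider": route.provider,
        "model": route.model,
        "headers": {"Authorization": f"Bearer {api_key}"},
        "body": {"prompt": prompt},
    }
```

The caller's input is limited to `route_name` and `prompt`, so nothing in the application's process ever holds the provider key.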


Caller authentication

Every request to the Gateway must present a valid credential before any model call is initiated. Authentication is resolved in the following priority order:

| Priority | Method | Format |
| --- | --- | --- |
| 1 | Highflame API key | X-Highflame-APIKey: hf_sk_... |
| 2 | Service key (bearer) | Authorization: Bearer hf_sk_... |
| 3 | ZeroID JWT (OAuth) | Authorization: Bearer eyJ... (RS256-signed) |

Unauthenticated requests receive a 401 before any downstream processing occurs. JWTs are validated against the Highflame JWKS endpoint; signing keys are refreshed every 5 minutes, and an unknown key ID triggers a lazy refresh. Token claims carry the caller's account_id, project_id, and agent identity, which are propagated into guardrail context and Observatory traces.
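
As a rough illustration of the priority order above, the sketch below picks the highest-priority credential from a request's headers. The function name is hypothetical, and telling a service key from a JWT by the `hf_sk_` prefix and the three-segment shape is an assumption drawn from the formats listed above, not a documented rule. Signature verification is out of scope here.

```python
from typing import Mapping, Optional, Tuple


def resolve_credential(headers: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (method, credential) for the highest-priority credential, or None (-> 401)."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {k.lower(): v for k, v in headers.items()}

    # Priority 1: Highflame API key header.
    api_key = h.get("x-highflame-apikey", "")
    if api_key.startswith("hf_sk_"):
        return ("api_key", api_key)

    auth = h.get("authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() == "bearer" and token:
        # Priority 2: service key presented as a bearer token.
        if token.startswith("hf_sk_"):
            return ("service_key", token)
        # Priority 3: a JWT has three dot-separated segments; it must still be verified.
        if token.count(".") == 2:
            return ("jwt", token)
    return None
```

A `None` result corresponds to the 401 case: no model call or detection work should start before this check passes.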

Agents using ZeroID issue short-lived JWTs through the highflame-authn service. These tokens expire quickly, limiting the blast radius of a stolen credential and enabling continuous access evaluation (CAE) for real-time revocation.
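
The signing-key refresh behavior mentioned earlier (a 5-minute refresh plus a lazy refresh on unknown key IDs) could look something like the sketch below. `JwksCache` and its `fetch` callback are hypothetical; the real client and its key format are not shown in this page.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Caches signing keys by key ID; refreshes on a timer and on unknown key IDs."""

    def __init__(self, fetch: Callable[[], Dict[str, str]],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical: returns {kid: public_key}
        self._ttl = ttl_seconds      # 5 minutes, per the text above
        self._clock = clock
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def _refresh(self) -> None:
        self._keys = self._fetch()
        self._fetched_at = self._clock()

    def get_key(self, kid: str) -> Optional[str]:
        stale = (self._fetched_at is None
                 or self._clock() - self._fetched_at >= self._ttl)
        # Refresh when the cache is stale, or lazily when the key ID is unknown.
        if stale or kid not in self._keys:
            self._refresh()
        return self._keys.get(kid)
```

A production version would probably rate-limit the lazy path so that a stream of tokens with bogus key IDs cannot force a fetch on every request.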


Model pinning

Callers specify a route name, not a model. The route configuration — stored in Highflame Studio and controlled by administrators — determines which provider and model handle the request. Applications cannot override this at call time.

This prevents:

  • Model substitution attacks — an agent cannot be manipulated into routing a sensitive request to a different, less-safe model

  • Provider drift — model selection is an admin decision, not a developer decision, and changes are audited

  • Prompt-based model switching — even if an injected prompt attempts to redirect to a different model endpoint, the Gateway ignores it and uses the route's configured model

To change which model a route uses, an admin updates the route configuration in Studio. No application changes are required, and the change is reflected on the next request.
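
A minimal sketch of the pinning idea, assuming a hypothetical request shape with `route`, `model`, and `provider` fields: whatever the caller sends, the upstream payload takes its provider and model from the route configuration.

```python
def pin_model(request: dict, route_config: dict) -> dict:
    """Build the upstream payload using only the route's configured provider and model."""
    route = route_config[request["route"]]
    # Drop any caller-supplied model or provider so they cannot override the route.
    upstream = {k: v for k, v in request.items()
                if k not in ("route", "model", "provider")}
    upstream["provider"] = route["provider"]
    upstream["model"] = route["model"]
    return upstream
```

Because the override fields are discarded rather than validated, an injected instruction that sets `model` has no effect on routing.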


Input evaluation

Before any request is forwarded to the model provider, it passes through the Shield detection pipeline. Detection runs in three tiers ordered by latency:

Tier 1 — Fast (< 5 ms, deterministic)

Runs on every request; at under 5 ms, the impact on the latency budget is negligible:

| Detector | What it checks |
| --- | --- |
| Secrets | API keys, bearer tokens, SSH keys, PEM certificates, cloud credentials (AWS, GCP, Azure, GitHub) |
| PII | Email addresses, phone numbers, SSNs, credit card numbers, IP addresses |
| Security patterns | SQL injection, XSS, path traversal, command injection, invisible Unicode characters |
| Keyword matching | Aho-Corasick matching against configured blocklists |
| Tool risk scoring | Scores tool call arguments (0–100) and flags sensitive tool categories |
| Action pattern | State-machine detection of multi-step attack sequences (e.g., read file → HTTP post) |
| Loop detection | Detects repeated identical tool calls above the configured threshold (default: 5) |
| Token budget | Checks session token consumption against per-session limits |
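
To make one of the Tier 1 detectors concrete, here is a small sketch of loop detection. It assumes that "identical" means the same tool name with the same arguments, and that calls are counted per session rather than only when consecutive; both are assumptions, and `LoopDetector` is a hypothetical name.

```python
import json
from collections import Counter


class LoopDetector:
    """Flags a tool call once it has been repeated more than `threshold` times."""

    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold  # default of 5, per the table above
        self._counts: Counter = Counter()

    def observe(self, tool: str, arguments: dict) -> bool:
        """Record a call; return True when its count rises above the threshold."""
        # Serialise with sorted keys so argument order does not affect identity.
        key = (tool, json.dumps(arguments, sort_keys=True))
        self._counts[key] += 1
        return self._counts[key] > self.threshold
```

The check is a dictionary lookup and an increment, which is consistent with the deterministic, low-latency profile of this tier.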

Tier 2 — Standard (10–200 ms, ML/NLP)

Runs on requests where Tier 1 does not early-exit:

| Detector | What it checks |
| --- | --- |
| Injection | ML model for prompt injection and jailbreak attempts (produces injection_confidence 0–100) |
| Deep context | Multi-turn injection detection using conversation history across the session |
| Toxicity | Violence, hate speech, sexual content, criminal activity, weapons, profanity |
| Language | Language identification across 75 languages |
| Sentiment | Sentiment analysis for downstream policy use |

Tier 3 — Slow (50–500 ms, cloud APIs)

Runs when elevated risk signals warrant deeper inspection:

| Detector | What it checks |
| --- | --- |
| Enterprise DLP | Google Cloud DLP for regulated PII detection with fuzzy matching |
| Content Safety | Google Model Armor content safety API |
| Phishing | CheckPhish URL scanning for malicious links in prompts |

Early exit: If Tier 1 produces a deny decision and the Cedar policy does not require further evidence, Tiers 2 and 3 are skipped. This keeps the median latency impact under 5 ms for requests that are clearly safe or clearly blocked.
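
The early-exit behavior can be pictured with the simplified sketch below. It runs tiers strictly in order and stops once a hypothetical `can_stop` predicate says the collected signals are enough. The two detectors are toy stand-ins (a regex and a substring check), not the Shield detectors, and the escalation logic that decides when Tier 3 runs is not modeled.

```python
import re
from typing import Callable, Dict, List, Tuple

Signals = Dict[str, object]
Detector = Callable[[str], Signals]

# Matches the general shape of an AWS access key ID ("AKIA" plus 16 characters).
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def secrets_detector(text: str) -> Signals:
    """Tier 1 stand-in: deterministic pattern match."""
    return {"contains_secrets": bool(AWS_KEY_ID.search(text))}


def injection_detector(text: str) -> Signals:
    """Tier 2 stand-in: a real deployment would call an ML model here."""
    hit = "ignore previous instructions" in text.lower()
    return {"injection_confidence": 90 if hit else 0}


def run_tiers(text: str, tiers: List[List[Detector]],
              can_stop: Callable[[Signals], bool]) -> Tuple[Signals, int]:
    """Run tiers in order; stop as soon as the signals justify a final decision."""
    signals: Signals = {}
    tiers_run = 0
    for detectors in tiers:
        for detect in detectors:
            signals.update(detect(text))
        tiers_run += 1
        if can_stop(signals):
            break  # later, slower tiers are skipped
    return signals, tiers_run


def deny_on_secrets(signals: Signals) -> bool:
    """Hypothetical policy: a detected secret alone is enough to deny."""
    return bool(signals.get("contains_secrets"))
```

With `tiers = [[secrets_detector], [injection_detector]]`, a prompt containing a key-shaped string stops after one tier, while a clean prompt runs both.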

Cedar policy enforcement

Detector outputs are projected into Cedar context keys (injection_confidence, pii_detected, contains_secrets, tool_risk_score, etc.) and evaluated against your Cedar policies. The policy decision determines the enforcement action:

| Action | Behavior |
| --- | --- |
| Block | Request is rejected with 403; violation recorded in Observatory |
| Redact | Detected content is masked before the request is forwarded to the model |
| Alert | Request is forwarded; a detection event is emitted for review |
| Monitor | Request is forwarded; detection is recorded silently with no user notification |
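
The four enforcement actions can be summarized as a small dispatch function. This is only a sketch of the table above: `apply_action` and the shape of its return value are hypothetical, and the Cedar evaluation that produces `action` is not shown.

```python
from typing import Optional


def apply_action(action: str, payload: str, masked_payload: str) -> dict:
    """Map a policy action to what is forwarded upstream and what is recorded."""
    forward: Optional[str]
    if action == "block":
        # Rejected with 403; nothing reaches the model.
        return {"rejected_with": 403, "forward": None, "record": "violation"}
    if action == "redact":
        forward, record = masked_payload, "redaction"
    elif action == "alert":
        forward, record = payload, "detection_event"
    elif action == "monitor":
        forward, record = payload, "silent"
    else:
        raise ValueError(f"unknown action: {action}")
    return {"rejected_with": None, "forward": forward, "record": record}
```

Only `block` stops the request; `redact` changes what the model sees, while `alert` and `monitor` differ only in how the detection is surfaced.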


Session-aware context

The detection pipeline is session-aware. Detection signals and action history are persisted in Redis across turns, so policies can respond to the cumulative state of a conversation.

Session state includes:

  • session_pii_detected — whether PII has appeared in any turn

  • session_secrets_detected — whether secrets have appeared in any turn

  • session_injection_detected — whether an injection signal has fired in any turn

  • session_cumulative_risk_score — running total of risk contributions across all turns

  • session_threat_turns — number of turns with at least one threat signal

  • session_max_injection_score — peak injection score seen in any turn

This enables circuit-breaker policies that lock down sensitive operations after a threshold of risk accumulates — see Abuse Detection and Control for patterns.
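
As an illustration of how the session fields listed above might accumulate, the sketch below updates an in-memory state object per turn and checks a circuit-breaker condition. The field names come from the list above; `record_turn`, `circuit_open`, and both thresholds (80 and 150) are hypothetical, and a real deployment would persist the state in Redis rather than in a Python object.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    session_pii_detected: bool = False
    session_secrets_detected: bool = False
    session_injection_detected: bool = False
    session_cumulative_risk_score: int = 0
    session_threat_turns: int = 0
    session_max_injection_score: int = 0


def record_turn(state: SessionState, *, pii: bool, secrets: bool,
                injection_score: int, risk: int,
                injection_threshold: int = 80) -> SessionState:
    """Fold one turn's detection results into the running session state."""
    injection = injection_score >= injection_threshold  # assumed cut-off
    state.session_pii_detected |= pii
    state.session_secrets_detected |= secrets
    state.session_injection_detected |= injection
    state.session_cumulative_risk_score += risk
    state.session_max_injection_score = max(
        state.session_max_injection_score, injection_score)
    if pii or secrets or injection:
        state.session_threat_turns += 1
    return state


def circuit_open(state: SessionState, risk_limit: int = 150) -> bool:
    """Hypothetical circuit breaker: trip once cumulative risk reaches the limit."""
    return state.session_cumulative_risk_score >= risk_limit
```

Because the flags only ever move from False to True, a single early detection stays visible to policies for the rest of the session.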


Output evaluation

Model responses are evaluated by the same Shield pipeline before being returned to the caller. Output-phase detection checks for:

| Check | What it catches |
| --- | --- |
| PII in response | Personal information generated or repeated by the model |
| Secrets in response | Credentials the model may have retrieved or hallucinated |
| Hallucination | Factual inconsistency scored by the hallucination detector |
| Phishing links | Malicious URLs in model-generated content |
| Toxicity | Harmful content in the model's output |

If the output evaluation produces a deny, the response is blocked and the caller receives a 403 — the model's response is never returned. If redact applies, sensitive content is masked before the response leaves the Gateway.
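
The deny and redact outcomes on the output side can be sketched as follows. The two regular expressions are deliberately simple stand-ins for the PII detector, the placeholder strings are invented for this example, and treating every non-blocked response as a 200 is an assumption.

```python
import re
from typing import Optional, Tuple

# Simplified patterns for illustration only; real PII detection is broader.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_output(text: str) -> Tuple[str, int]:
    """Mask matches and return the masked text plus the number of replacements."""
    text, n_email = EMAIL.subn("[REDACTED_EMAIL]", text)
    text, n_ssn = SSN.subn("[REDACTED_SSN]", text)
    return text, n_email + n_ssn


def finalize_response(text: str, decision: str) -> Tuple[int, Optional[str]]:
    """Apply the output-phase decision before anything leaves the Gateway."""
    if decision == "deny":
        return 403, None  # the model's response is never returned
    if decision == "redact":
        return 200, redact_output(text)[0]
    return 200, text
```

For example, `finalize_response("Reach me at jane@example.com", "redact")` returns the text with the address replaced by the placeholder.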


Rate limiting and token budgets

Provider rate limits: The Gateway automatically retries on 429 (rate limited) and 5xx responses from model providers using exponential backoff, up to 3 attempts. This is transparent to callers.
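
A retry loop of this kind might look like the sketch below. It assumes "up to 3 attempts" means three total calls, and the 0.5-second base delay is an arbitrary choice for illustration; `call_with_retry` and the `send` callback are hypothetical.

```python
import time
from typing import Callable, Tuple


def call_with_retry(send: Callable[[], Tuple[int, str]],
                    max_attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call `send`, retrying on 429 and 5xx with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        retryable = status == 429 or 500 <= status <= 599
        if not retryable or attempt == max_attempts:
            return status, body
        # Delays double each time: base, 2*base, 4*base, ...
        sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps the function testable without real waiting. Production retry loops often add random jitter to the delay, which is omitted here.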

Per-session token budgets: Token consumption is tracked across turns in the session. When a session exceeds its configured budget, the budget_exceeded flag fires and Cedar policies can block further requests. This prevents runaway sessions from consuming unbounded model API spend.
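
Per-session budget tracking can be sketched as a running counter. `TokenBudget` is a hypothetical name, and the sketch assumes the flag fires only once usage is strictly above the limit; the returned boolean plays the role of the budget_exceeded flag.

```python
class TokenBudget:
    """Tracks token usage for one session against a fixed limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def record(self, input_tokens: int, output_tokens: int) -> bool:
        """Add one call's usage; return True once the session exceeds its budget."""
        self.used += input_tokens + output_tokens
        return self.used > self.limit
```

Whether an over-budget session is actually blocked is left to policy, matching the text above: the tracker only raises the flag.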

Cost tracking: Every model call is recorded with token counts and cost attribution in Observatory. The Command Center cost intelligence view surfaces per-agent and per-model cost breakdowns in real time.


Audit trail

Every model call generates a trace in Observatory containing:

  • Caller identity (agent ID, user, application)

  • Route and provider/model used

  • Request timestamp and end-to-end latency

  • Shield evaluation results: detector scores, Cedar policy decisions, determining policies

  • Token usage (input and output tokens)

  • Cost attribution

  • Full payload snapshot at each evaluation point (subject to your data handling policy)

Traces are retained for 30 days. Detection events are retained for 90 days. See Traces and Threats in Observatory for investigation surfaces.
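
The two retention windows can be expressed as a small lookup, shown here as a sketch. The record kind names and the `is_expired` helper are hypothetical; only the 30-day and 90-day figures come from the text above.

```python
from datetime import datetime, timedelta

# Retention windows stated above.
RETENTION = {
    "trace": timedelta(days=30),
    "detection_event": timedelta(days=90),
}


def is_expired(kind: str, created_at: datetime, now: datetime) -> bool:
    """Return True when a record of the given kind is past its retention window."""
    return now - created_at > RETENTION[kind]
```

A trace created 45 days ago is past its window, while a detection event of the same age is still available for investigation.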

