Data Pipeline

The data_pipeline profile is designed for agents that process, transform, or route data — RAG pipelines, vector database agents, ETL workflows, and any agent that reads from external sources and writes to internal systems. It applies stricter thresholds than the defaults and adds zero-tolerance rules for sensitive data types.


Why data pipelines need their own profile

Data pipeline agents face a specific threat that other agent types do not: the content they process is often retrieved from external or untrusted sources. A RAG pipeline that fetches documents from the web, a vector database that was populated from third-party data, or an ETL agent that consumes API responses — all of these bring in content that the agent will read and potentially act on.

This creates two primary risks:

Indirect injection via retrieved content. A malicious document in the retrieval corpus can carry instructions that redirect the agent's behavior. The data_pipeline profile lowers the indirect injection threshold to 65 (vs. the default ~80) to catch lower-confidence signals earlier. The consequences of a missed injection are more severe in a data pipeline: a redirected pipeline can exfiltrate an entire dataset rather than a single response.

PII and secrets flowing through uncontrolled paths. Data pipelines often process records that contain personal information or credentials. Without explicit zero-tolerance rules, these can pass through the pipeline and end up in model context, outputs, or downstream systems.


Profile files


privacy.cedar — Zero-tolerance PII

Blocks all PII in pipeline content, with explicit zero-tolerance rules for the most sensitive types.

General PII block:

Any detected PII in the pipeline is blocked, whether it appears in prompts, tool inputs, tool outputs, or model responses. This is stricter than the default PII policy, which blocks only when detection confidence crosses a threshold.

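// Forbid every request for which the PII detector has set pii_detected.
// The `has` guard avoids an evaluation error when the flag is absent.
@id("data-pii-block-all")
@severity("critical")
forbid(principal, action, resource)
when { context has pii_detected && context.pii_detected == true };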

Zero-tolerance sensitive types:

The following data types are blocked unconditionally, regardless of confidence score, when their specific detector fires:

Type                          Context flag
Social Security Numbers       pii_types contains ssn
Credit / debit card numbers   pii_types contains credit_card
Passport numbers              pii_types contains passport
Medical record identifiers    pii_types contains medical_id
Tax identification numbers    pii_types contains tax_id

These types are subject to regulatory requirements (HIPAA, PCI-DSS, GDPR) and have no legitimate reason to be present in pipeline content that passes through an LLM.
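As a rough illustration, the zero-tolerance check for the types listed above could be written as a single Cedar rule. This is a sketch rather than the profile's actual source: the policy id is made up, and it assumes pii_types is exposed in context as a set of strings.

```cedar
// Hypothetical sketch: assumes context.pii_types is a Set of String
// populated by the type-specific detectors.
@id("data-pii-zero-tolerance-sketch")
@severity("critical")
forbid(principal, action, resource)
when {
  context has pii_types &&
  context.pii_types.containsAny(
    ["ssn", "credit_card", "passport", "medical_id", "tax_id"]
  )
};
```

The rule contains no confidence comparison, which is what makes it unconditional once a detector has fired.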


security.cedar — Secrets and injection

Secrets — strict:

All secrets in pipeline content are blocked, both in inputs and outputs. Pipeline outputs may be logged, stored in vector databases, or passed to downstream consumers, and any of these paths could spread a leaked credential further.
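A strict secrets rule might look like the following sketch, modeled on the data-pii-block-all rule shown earlier. The policy id and the secrets_detected context flag are assumptions for illustration and may not match the names used in the shipped profile.

```cedar
// Hypothetical sketch: secrets_detected is an assumed boolean context flag.
// No direction check is made, so inputs and outputs are treated alike.
@id("data-secrets-block-all-sketch")
@severity("critical")
forbid(principal, action, resource)
when { context has secrets_detected && context.secrets_detected == true };
```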

Injection — lowered threshold:

The injection confidence threshold is lowered to 65 (vs. ~80 in the defaults) to account for the higher probability of injected content in externally sourced data.
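A threshold rule of this kind could be sketched as follows. The policy id and the injection_confidence attribute are hypothetical, the score is assumed to be an integer from 0 to 100, and whether the comparison at exactly 65 is inclusive is also an assumption.

```cedar
// Hypothetical sketch: injection_confidence is an assumed integer score (0-100).
// Cedar has no floating-point type, so the score must be a Long.
@id("data-indirect-injection-sketch")
forbid(principal, action, resource)
when {
  context has injection_confidence &&
  context.injection_confidence >= 65
};
```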

This lower threshold will generate more Monitor-mode events during initial rollout. Run in Monitor mode for 1–2 weeks to review the false positive rate against your specific data corpus before switching to Block.


agentic_security.cedar — Exfiltration and tool risk

Exfiltration patterns:

Blocks tool calls that match data- or database-exfiltration sequence patterns, for example an agent that queries a database and then attempts to transmit the results externally.
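Sequence detection itself happens outside Cedar, so the policy only needs to react to a flag that the detector sets. The sketch below assumes a hypothetical boolean exfiltration_pattern_detected in context; the real attribute name and the policy id may differ.

```cedar
// Hypothetical sketch: exfiltration_pattern_detected is an assumed flag set by
// an upstream detector that tracks query-then-transmit tool call sequences.
@id("data-exfiltration-sequence-sketch")
forbid(principal, action, resource)
when {
  context has exfiltration_pattern_detected &&
  context.exfiltration_pattern_detected == true
};
```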

Tool risk threshold:

The tool risk threshold is lowered to 60 (vs. 70 in the code_agent profile and ~85 for the dangerous tier in defaults). Data pipelines typically do not need high-risk tools — if a pipeline agent is invoking tools with a risk score above 60, that warrants review.
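The tool risk limit could be expressed in the same style. In this sketch tool_risk_score is a hypothetical integer attribute, and the strict comparison follows the wording "a risk score above 60".

```cedar
// Hypothetical sketch: tool_risk_score is an assumed integer score (0-100).
@id("data-tool-risk-sketch")
forbid(principal, action, resource)
when {
  context has tool_risk_score &&
  context.tool_risk_score > 60
};
```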


Applying the profile


Rollout guidance

The injection threshold (65) and the general PII block are the two rules most likely to generate false positives on real pipeline data. Recommended order:

  1. Start with Monitor mode. Apply the full profile in Monitor mode and run your pipeline against representative data for 1–2 weeks. Review detections in the Observatory Threats view, filtered to policy_category = privacy or policy_category = security.

  2. Tune exceptions. If legitimate data consistently triggers PII detection (e.g., test fixtures with synthetic personal data), add a scoped permit for that data path or pipeline step.

  3. Switch to Block for PII and secrets first. Once the exceptions from step 2 are in place, these rules have near-zero false positive rates on real pipeline data. Switch privacy.cedar and the secrets rules in security.cedar to Block.

  4. Switch injection and exfiltration to Block last. After validating the signal quality against your retrieval corpus, switch the remaining rules.
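For the exceptions in step 2, keep in mind that in standard Cedar evaluation a matching forbid always overrides any permit, so a carve-out is usually written as an unless clause on the forbid rule itself. The sketch below takes that approach; the pipeline_step attribute, its value, and the policy id are hypothetical, and how this product resolves permits against profile rules is not described here.

```cedar
// Hypothetical sketch: exempts one named pipeline step from the general PII block.
// pipeline_step and "synthetic-fixture-load" are illustrative names only.
@id("data-pii-block-all-with-exception-sketch")
@severity("critical")
forbid(principal, action, resource)
when { context has pii_detected && context.pii_detected == true }
unless {
  context has pipeline_step &&
  context.pipeline_step == "synthetic-fixture-load"
};
```

Keeping the exception narrow (a single step or data path) limits how much real PII could slip through it.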


Pair data_pipeline with advanced_detection for regulated environments. The advanced_detection profile adds:

  • ML classifier-based PII detection (higher accuracy on atypical formats)

  • Type-specific secret blocking (AWS IAM, GCP service accounts, database URLs)

  • Bulk PII detection (3+ PII matches in one response = data dump indicator)
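The bulk PII indicator in the last bullet amounts to a count comparison. A minimal sketch, assuming a hypothetical integer attribute pii_match_count that holds the number of PII matches in one response:

```cedar
// Hypothetical sketch: pii_match_count is an assumed per-response match counter.
@id("advanced-bulk-pii-sketch")
forbid(principal, action, resource)
when {
  context has pii_match_count &&
  context.pii_match_count >= 3
};
```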

