Data Pipeline
The data_pipeline profile is designed for agents that process, transform, or route data — RAG pipelines, vector database agents, ETL workflows, and any agent that reads from external sources and writes to internal systems. It applies stricter thresholds than the defaults and adds zero-tolerance rules for sensitive data types.
Why data pipelines need their own profile
Data pipeline agents face a specific threat that other agent types do not: the content they process is often retrieved from external or untrusted sources. A RAG pipeline that fetches documents from the web, a vector database that was populated from third-party data, or an ETL agent that consumes API responses — all of these bring in content that the agent will read and potentially act on.
This creates two primary risks:
Indirect injection via retrieved content. A malicious document in the retrieval corpus can carry instructions that redirect the agent's behavior. The data_pipeline profile lowers the indirect injection threshold to 65 (vs. the default ~80) to catch lower-confidence signals earlier, because the consequences of a missed injection in a data pipeline are more severe — a redirected pipeline can exfiltrate an entire dataset rather than a single response.
PII and secrets flowing through uncontrolled paths. Data pipelines often process records that contain personal information or credentials. Without explicit zero-tolerance rules, these can pass through the pipeline and end up in model context, outputs, or downstream systems.
Profile files
privacy.cedar — Zero-tolerance PII
Blocks all PII in pipeline content, with explicit zero-tolerance rules for the most sensitive types.
General PII block:
Any detected PII in the pipeline is blocked, whether it appears in prompts, tool inputs, tool outputs, or model responses. This is stricter than the default PII policy, which blocks only above a confidence threshold.
@id("data-pii-block-all")
@severity("critical")
forbid(principal, action, resource)
when { context has pii_detected && context.pii_detected == true };

Zero-tolerance sensitive types:
The following data types are blocked unconditionally, regardless of confidence score, when their specific detector fires:
Social Security Numbers: pii_types contains ssn
Credit / debit card numbers: pii_types contains credit_card
Passport numbers: pii_types contains passport
Medical record identifiers: pii_types contains medical_id
Tax identification numbers: pii_types contains tax_id
These types are subject to regulatory requirements (HIPAA, PCI-DSS, GDPR) and have no legitimate reason to be present in pipeline content that passes through an LLM.
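The zero-tolerance conditions listed above could be expressed in Cedar roughly as follows. This is a minimal sketch, not the profile's actual rule: it assumes `pii_types` is exposed on `context` as a set of strings, and the policy id is hypothetical.

```cedar
// Sketch only: assumes context.pii_types is a set of strings such as "ssn".
// The id below is illustrative and not taken from the shipped profile.
@id("data-pii-zero-tolerance-sketch")
@severity("critical")
forbid(principal, action, resource)
when {
  context has pii_types &&
  context.pii_types.containsAny(
    ["ssn", "credit_card", "passport", "medical_id", "tax_id"]
  )
};
```

Because the condition does not reference a confidence score, the rule fires whenever one of the listed detectors reports a match.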
security.cedar — Secrets and injection
Secrets — strict:
All secrets in pipeline content are blocked, both in inputs and outputs. Pipeline outputs may be logged, stored in vector databases, or passed to downstream consumers — any of these paths could expose a leaked credential further.
Injection — lowered threshold:
The injection confidence threshold is lowered to 65 (vs. ~80 in the defaults) to account for the higher probability of injected content in externally-sourced data.
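A threshold rule of this kind might look like the sketch below. The attribute name `injection_confidence`, its 0 to 100 integer scale, and the policy id are assumptions made for illustration; check the shipped security.cedar for the real names.

```cedar
// Sketch only: injection_confidence is a hypothetical context attribute,
// assumed to be an integer score from 0 to 100.
@id("data-injection-threshold-sketch")
forbid(principal, action, resource)
when {
  context has injection_confidence &&
  context.injection_confidence >= 65
};
```

Whether the comparison at exactly 65 is inclusive is also an assumption here.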
This lower threshold will generate more Monitor-mode events during initial rollout. Run in Monitor mode for 1–2 weeks to review the false positive rate against your specific data corpus before switching to Block.
agentic_security.cedar — Exfiltration and tool risk
Exfiltration patterns:
Blocks tool calls that match data or database exfiltration sequence patterns — an agent that queries a database and then attempts to transmit the results externally.
Tool risk threshold:
The tool risk threshold is lowered to 60 (vs. 70 in the code_agent profile and ~85 for the dangerous tier in defaults). Data pipelines typically do not need high-risk tools — if a pipeline agent is invoking tools with a risk score above 60, that warrants review.
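As an illustration of the tool risk threshold, a rule along these lines would block any tool call scored above 60. The attribute name `tool_risk_score` and the policy id are hypothetical; the actual agentic_security.cedar may name and structure this differently.

```cedar
// Sketch only: tool_risk_score is a hypothetical context attribute,
// assumed to be an integer risk score for the tool being invoked.
@id("data-tool-risk-threshold-sketch")
forbid(principal, action, resource)
when {
  context has tool_risk_score &&
  context.tool_risk_score > 60
};
```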
Applying the profile
Rollout guidance
The injection threshold (65) and the general PII block are the two rules most likely to generate false positives on real pipeline data. Recommended order:
Start with Monitor mode. Apply the full profile in Monitor mode and run your pipeline against representative data for 1–2 weeks. Review detections in the Observatory Threats view filtered to policy_category = privacy and policy_category = security.
Tune exceptions. If legitimate data consistently triggers PII detection (e.g., test fixtures with synthetic personal data), add a scoped permit for that data path or pipeline step.
Switch to Block for PII and secrets first. With exceptions tuned, these should have near-zero false positive rates on real pipeline data. Switch privacy.cedar and the secrets rules in security.cedar to Block.
Switch injection and exfiltration to Block last. After validating the signal quality against your retrieval corpus, switch the remaining rules.
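For the exception-tuning step, note that in standard Cedar evaluation any matching forbid overrides a permit, so one way to scope an exception is to narrow the forbid itself with an unless clause. The sketch below applies that to the general PII block. The `pipeline_step` attribute and its value are hypothetical, and this page does not say whether the product handles scoped permits differently.

```cedar
// Sketch only: pipeline_step and "synthetic_fixture_load" are hypothetical.
// The unless clause exempts one known-safe step from the general PII block.
@id("data-pii-block-all")
@severity("critical")
forbid(principal, action, resource)
when { context has pii_detected && context.pii_detected == true }
unless {
  context has pipeline_step &&
  context.pipeline_step == "synthetic_fixture_load"
};
```

Keeping the exception on a single named step limits how much traffic bypasses the rule.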
Recommended additions
Pair data_pipeline with advanced_detection for regulated environments. The advanced_detection profile adds:
ML classifier-based PII detection (higher accuracy on atypical formats)
Type-specific secret blocking (AWS IAM, GCP service accounts, database URLs)
Bulk PII detection (3+ PII matches in one response = data dump indicator)
Related
Policy Templates — all available profiles and selection guide
Advanced Detection Policies — ML-enhanced PII and secrets detection
Abuse Detection and Control — exfiltration sequence detection in depth
Cedar Cookbook — tuning injection thresholds and adding data path exceptions