# Data Pipeline

The `data_pipeline` profile is designed for agents that process, transform, or route data — RAG pipelines, vector database agents, ETL workflows, and any agent that reads from external sources and writes to internal systems. It applies stricter thresholds than the defaults and adds zero-tolerance rules for sensitive data types.

***

## Why data pipelines need their own profile

Data pipeline agents face a specific threat that other agent types do not: the content they process is often retrieved from external or untrusted sources. A RAG pipeline that fetches documents from the web, a vector database that was populated from third-party data, or an ETL agent that consumes API responses — all of these bring in content that the agent will read and potentially act on.

This creates two primary risks:

**Indirect injection via retrieved content.** A malicious document in the retrieval corpus can carry instructions that redirect the agent's behavior. The `data_pipeline` profile lowers the indirect injection threshold to 65 (vs. the default \~80) to catch lower-confidence signals earlier, because the consequences of a missed injection in a data pipeline are more severe — a redirected pipeline can exfiltrate an entire dataset rather than a single response.

**PII and secrets flowing through uncontrolled paths.** Data pipelines often process records that contain personal information or credentials. Without explicit zero-tolerance rules, these can pass through the pipeline and end up in model context, outputs, or downstream systems.

***

## Profile files

***

### privacy.cedar — Zero-tolerance PII

Blocks all PII in pipeline content, with explicit zero-tolerance rules for the most sensitive types.

**General PII block:**

Any detected PII in the pipeline (in prompts, tool inputs, tool outputs, or model responses) is blocked. This is stricter than the default PII policy, which blocks only when detection confidence crosses a threshold.

```cedar
@id("data-pii-block-all")
@severity("critical")
forbid(principal, action, resource)
when { context has pii_detected && context.pii_detected == true };
```

**Zero-tolerance sensitive types:**

The following data types are blocked unconditionally, regardless of confidence score, when their specific detector fires:

| Type                        | Context flag                       |
| --------------------------- | ---------------------------------- |
| Social Security Numbers     | `pii_types` contains `ssn`         |
| Credit / debit card numbers | `pii_types` contains `credit_card` |
| Passport numbers            | `pii_types` contains `passport`    |
| Medical record identifiers  | `pii_types` contains `medical_id`  |
| Tax identification numbers  | `pii_types` contains `tax_id`      |

These types are subject to regulatory requirements (HIPAA, PCI-DSS, GDPR) and have no legitimate reason to be present in pipeline content that passes through an LLM.
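As a rough illustration of the zero-tolerance check, the sketch below mirrors the table in plain Python. It is not the Shield engine: the type names come from the "Context flag" column, while the `context` dictionary shape, the `pii_confidence` key, and both function names are assumptions made for this example. It covers only the zero-tolerance types, not the general PII block above.

```python
# Hypothetical mirror of the zero-tolerance table. The type names come from
# the "Context flag" column; the dict shape and helper names are illustrative.
ZERO_TOLERANCE_TYPES = frozenset(
    {"ssn", "credit_card", "passport", "medical_id", "tax_id"}
)


def zero_tolerance_hits(context: dict) -> set:
    """Return the zero-tolerance types present in context["pii_types"]."""
    detected = set(context.get("pii_types", []))
    return detected & ZERO_TOLERANCE_TYPES


def is_blocked(context: dict) -> bool:
    # Confidence is deliberately ignored: one matching type is enough.
    return bool(zero_tolerance_hits(context))


# "pii_confidence" is an assumed key, included to show that it plays no role.
print(is_blocked({"pii_types": ["email", "ssn"], "pii_confidence": 12}))  # True
print(is_blocked({"pii_types": ["email"]}))                               # False
```

The point of the sketch is that membership in the set decides the outcome on its own; no score is consulted.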

***

### security.cedar — Secrets and injection

**Secrets — strict:**

All secrets in pipeline content are blocked, both in inputs and outputs. Pipeline outputs may be logged, stored in vector databases, or passed to downstream consumers — any of these paths could expose a leaked credential further.

```cedar
@id("data-secrets-strict")
@severity("critical")
forbid(principal, action, resource)
when { context has contains_secrets && context.contains_secrets == true };

@id("data-block-output-secrets")
@severity("critical")
forbid(principal, action == Guardrails::Action::"process_prompt", resource)
when {
    context has direction && context.direction == "output" &&
    context has contains_secrets && context.contains_secrets == true
};
```

**Injection — lowered threshold:**

The injection confidence threshold is lowered to 65 (vs. \~80 in the defaults) to account for the higher probability of injected content in externally-sourced data:

```cedar
@id("data-injection-defense")
@severity("high")
forbid(principal, action == Guardrails::Action::"process_prompt", resource)
when {
    context has injection_confidence && context.injection_confidence >= 65
};
```

Expect the lower threshold to produce more detections than the defaults, especially during initial rollout. Run in Monitor mode for 1–2 weeks and review the false positive rate against your own data corpus before switching to Block.
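To get a feel for how much extra volume the lower threshold adds, you can replay scores from a Monitor-mode run against both thresholds. The sketch below is a minimal example under stated assumptions: the score list is invented, the default threshold is treated as exactly 80 (the page only says roughly 80), and the default rule is assumed to use the same `>=` comparison as `data-injection-defense`.

```python
def flagged(scores, threshold):
    """Return the scores that a `>= threshold` rule would match."""
    return [s for s in scores if s >= threshold]


# Hypothetical injection_confidence values exported from a Monitor-mode run.
scores = [12, 40, 64, 65, 71, 79, 80, 93]

profile_hits = flagged(scores, 65)  # data_pipeline threshold
default_hits = flagged(scores, 80)  # approximate default threshold (assumed)

# Detections that only the data_pipeline profile would raise.
extra = [s for s in profile_hits if s not in default_hits]

print(len(profile_hits), len(default_hits), extra)  # 5 2 [65, 71, 79]
```

The `extra` list is the band worth reviewing by hand, since those detections are the ones the defaults would have let through.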

***

### agentic\_security.cedar — Exfiltration and tool risk

**Exfiltration patterns:**

Blocks tool calls that match data or database exfiltration sequence patterns, for example an agent that queries a database and then attempts to transmit the results externally.

```cedar
@id("data-block-exfiltration")
@severity("critical")
forbid(principal, action == Guardrails::Action::"call_tool", resource)
when {
    context has suspicious_pattern && context.suspicious_pattern == true &&
    context has pattern_type &&
    (context.pattern_type == "data_exfiltration" || context.pattern_type == "db_exfiltration")
};
```
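The condition in `data-block-exfiltration` can be read as three checks that must all hold. The Python sketch below restates that logic for readers less familiar with Cedar. It is an illustration only: the function name and the plain-dict `context` are assumptions, and the real evaluation happens in the policy engine.

```python
# Pattern types named in the data-block-exfiltration rule.
EXFIL_PATTERNS = {"data_exfiltration", "db_exfiltration"}


def matches_exfiltration_rule(action: str, context: dict) -> bool:
    """Python restatement of the `data-block-exfiltration` condition."""
    return (
        action == "call_tool"
        # Cedar's `context has x && context.x == true`: a missing key is False.
        and context.get("suspicious_pattern") is True
        and context.get("pattern_type") in EXFIL_PATTERNS
    )


print(matches_exfiltration_rule(
    "call_tool",
    {"suspicious_pattern": True, "pattern_type": "db_exfiltration"},
))  # True
print(matches_exfiltration_rule("call_tool", {"suspicious_pattern": True}))  # False
```

Note that a context missing either attribute never matches, which is what the `context has ...` guards in the Cedar rule achieve.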

**Tool risk threshold:**

The tool risk threshold is lowered to 60 (vs. 70 in the `code_agent` profile and \~85 for the dangerous tier in defaults). Data pipelines typically do not need high-risk tools — if a pipeline agent is invoking tools with a risk score above 60, that warrants review.
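The three thresholds mentioned above can be compared side by side. The sketch below is a hypothetical illustration: the numbers are taken from this page (with the default dangerous tier treated as exactly 85, although the page gives it as approximate), the dictionary and function names are invented, and the strict `>` comparison is an assumption based on the phrase "risk score above 60".

```python
# Thresholds as stated on this page; the default value is approximate.
TOOL_RISK_THRESHOLDS = {
    "data_pipeline": 60,
    "code_agent": 70,
    "defaults": 85,
}


def profiles_flagging(score: int) -> list:
    """Return the profiles whose threshold a tool risk score exceeds."""
    # Assumes a strict "above" comparison; the real rule may use >=.
    return [name for name, limit in TOOL_RISK_THRESHOLDS.items() if score > limit]


print(profiles_flagging(65))  # ['data_pipeline']
print(profiles_flagging(90))  # ['data_pipeline', 'code_agent', 'defaults']
```

A tool scoring 65 would pass unflagged under `code_agent` and the defaults but would be surfaced for review in a data pipeline.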

***

## Applying the profile

{% tabs %}
{% tab title="Python" %}

```python
from highflame.shield import GuardrailsClient

client = GuardrailsClient(api_key="...")

client.policies.load_profile("data_pipeline/*")
```

{% endtab %}

{% tab title="TypeScript" %}

```typescript
import { GuardrailsClient } from "@highflame/sdk";

const client = new GuardrailsClient({ apiKey: "..." });

await client.policies.loadProfile("data_pipeline/*");
```

{% endtab %}
{% endtabs %}

***

## Rollout guidance

The injection threshold (65) and the general PII block are the two rules most likely to generate false positives on real pipeline data. Recommended order:

1. **Start with Monitor mode.** Apply the full profile in Monitor mode and run your pipeline against representative data for 1–2 weeks. Review detections in the [Observatory Threats view](/observatory/threats.md) filtered to `policy_category = privacy` and `policy_category = security`.
2. **Tune exceptions.** If legitimate data consistently triggers PII detection (e.g., test fixtures with synthetic personal data), add a scoped `permit` for that data path or pipeline step.
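3. **Switch to Block for PII and secrets first.** After tuning, these should have near-zero false positive rates on real pipeline data. Switch `privacy.cedar` and the secrets rules in `security.cedar` (`data-secrets-strict`, `data-block-output-secrets`) to Block, leaving `data-injection-defense` in Monitor.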
4. **Switch injection and exfiltration to Block last.** After validating the signal quality against your retrieval corpus, switch the remaining rules.

***

## Recommended additions

Pair `data_pipeline` with `advanced_detection` for regulated environments. The `advanced_detection` profile adds:

* ML classifier-based PII detection (higher accuracy on atypical formats)
* Type-specific secret blocking (AWS IAM, GCP service accounts, database URLs)
* Bulk PII detection (3+ PII matches in one response = data dump indicator)

```python
client.policies.load_profile("data_pipeline/*")
client.policies.load_profile("advanced_detection/*")
```

***

## Related

* [Policy Templates](/agent-authorization-and-control-shield/policy-templates.md) — all available profiles and selection guide
* [Advanced Detection Policies](/agent-authorization-and-control-shield/policy-templates/advanced-detection-policies.md) — ML-enhanced PII and secrets detection
* [Abuse Detection and Control](/agent-authorization-and-control-shield/policy-templates/abuse-detection.md) — exfiltration sequence detection in depth
* [Cedar Cookbook](/agent-authorization-and-control-shield/cedar-cookbook.md) — tuning injection thresholds and adding data path exceptions

