# Cross Region

## Cross-Region Active-Passive Deployment on EKS

### Overview

Highflame's cross-region disaster recovery posture is **active-passive (warm-standby)**: a single AWS region (`us-east-1`, the *active* region) serves 100% of production traffic, while a second region (`us-west-2`, the *passive* region) runs a continuously-provisioned but idle replica of every workload, ready to absorb traffic on operator-initiated failover. This document describes the deployment topology, the data-replication contracts that bind the two regions, the failover runbook, and the operational invariants that keep the passive region restorable.

{% hint style="info" %}
The active and passive regions are **deployment-time parameters**, not a hard-coded pair. Throughout this document, any reference to specific regions (e.g. `us-east-1`, `us-west-2`) is **illustrative only**. Highflame supports any AWS region as either active or passive, subject to a single hard constraint: **every AWS service Highflame depends on must be generally available in both the chosen active and passive regions, with feature parity for the specific capabilities we use**.
{% endhint %}

### Topology

Each region is a self-contained, fully-provisioned EKS cluster: same Kubernetes version, same node groups, same Helm releases, same `app-config/` ConfigMaps. The passive cluster is **not** a "scaled-to-zero" cluster: control-plane components, ingress controllers, cert-manager, external-dns, the Cluster Autoscaler/Karpenter, observability sidecars, IRSA roles, and all stateful sets (Postgres clients, ClickHouse) are running. What differs is **traffic** (no upstream sends requests to it) and **write workload** (its databases are followers/restored from snapshots, not primaries).
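
Because the two clusters are meant to match in everything except traffic and write workload, that sameness can be checked mechanically. The following is a minimal Python sketch, not Highflame's actual tooling: the `RegionInventory` shape, its field names, and the idea of comparing ConfigMap content hashes are all hypothetical, and it assumes the inventories have already been collected from each cluster. Replica counts are deliberately left out of the comparison.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RegionInventory:
    """Hypothetical snapshot of what is deployed in one region."""
    region: str
    kubernetes_version: str
    # Helm release name -> chart version (illustrative shape)
    helm_releases: dict[str, str] = field(default_factory=dict)
    # ConfigMap name -> content hash (illustrative shape)
    configmap_hashes: dict[str, str] = field(default_factory=dict)


def _diff_maps(kind: str, active: dict[str, str], passive: dict[str, str]) -> list[str]:
    """Report every key whose value differs or exists on only one side."""
    findings = []
    for name in sorted(set(active) | set(passive)):
        a, p = active.get(name), passive.get(name)
        if a != p:
            findings.append(f"{kind} {name}: active={a} passive={p}")
    return findings


def find_drift(active: RegionInventory, passive: RegionInventory) -> list[str]:
    """Return human-readable drift findings; an empty list means no drift."""
    findings = []
    if active.kubernetes_version != passive.kubernetes_version:
        findings.append(
            f"kubernetes version: active={active.kubernetes_version} "
            f"passive={passive.kubernetes_version}"
        )
    findings += _diff_maps("helm release", active.helm_releases, passive.helm_releases)
    findings += _diff_maps("configmap", active.configmap_hashes, passive.configmap_hashes)
    return findings
```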
Concretely, the passive region's deployments run at reduced replica counts (typically `replicas: 1` for stateless services versus `replicas: 3–6` in the active region), sized for the minimum needed to keep image pulls warm.

The two clusters share **nothing at the Kubernetes layer**: separate control planes, separate VPCs, separate node IAM roles, separate KMS keys (with cross-region replication enabled on the keys that encrypt replicated data). Application configuration is identical and version-pinned via Helm value templates; the same templates apply to both regions. This is the single most important invariant: **drift between the two regions is the failure mode** that turns a 30-minute failover into a 6-hour incident. Every service deployment must update both regions at the same time.

#### Traffic Routing

The public entry point is an AWS Global Accelerator with one endpoint group per region: the active region's ALB is the primary endpoint and the passive region's load balancer is the secondary. Global Accelerator health-checks the probe endpoint exposed by each region's ingress.
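
One way to reason about the routing state is as a traffic dial per regional endpoint group. The sketch below models only the *desired* dial settings as plain Python data; it does not call any AWS API and does not attempt to reproduce Global Accelerator's own routing or health-check behavior. The function names are hypothetical, and the steady-state values (active at 100, passive at 0) are an assumption inferred from the failover steps in this document.

```python
from __future__ import annotations


def steady_state_dials(active: str, passive: str) -> dict[str, int]:
    """Assumed steady state: all traffic dialed to the active region."""
    return {active: 100, passive: 0}


def is_active_passive(dials: dict[str, int]) -> bool:
    """True when exactly one region is dialed to 100 and every other is at 0."""
    values = sorted(dials.values())
    return len(values) >= 2 and values[-1] == 100 and all(v == 0 for v in values[:-1])


def flip_dials(dials: dict[str, int], new_active: str) -> dict[str, int]:
    """Return the desired dial settings after failing over to `new_active`."""
    if new_active not in dials:
        raise ValueError(f"unknown region: {new_active}")
    return {region: (100 if region == new_active else 0) for region in dials}
```

A check such as `is_active_passive` could guard any change to the dials so that a split configuration (for example 50/50) is rejected before it is applied.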

#### Failover Runbook

Failover is **manual and irreversible without a planned failback**. The trigger is either (a) a regional AWS outage confirmed by AWS Health Dashboard or (b) sustained application-layer failure in the active region that cannot be resolved in-region within the SLA. The steps, in order:

1. **Declare the incident** in the `#incidents` channel and page the secondary on-call. Do not start failover unilaterally.
2. **Promote Aurora Global Database** in the passive region via the AWS console. Wait for the promoted cluster to reach `available` state (≈60 seconds). This promotes the passive region (R2) to become the new read/write primary, while the active region (R1) is demoted to a read-only replica.
3. **Restore ClickHouse** in the passive region: scale the `clickhouse-backup` sidecar (or run a one-shot Job) with `download` + `restore_remote` against the most recent CRR'd archive. Validate row counts against expected ranges before proceeding.
4. **Flip the target in the Global Accelerator.** Traffic begins arriving within the TTL window.
   1. New active region (R2): set traffic dial to 100.
   2. Original active region (R1): set traffic dial to 0.
5. **Scale up** the passive region's stateless deployments to production replica counts (`kubectl scale deployment --all --replicas=N`). Karpenter/Cluster Autoscaler provisions nodes as needed; expect \~3 minutes for the first batch of nodes to join.
6. **Validate** end-to-end by running the smoke-test suite against the new active region's public endpoint. Watch error rates and latency dashboards for the next 30 minutes.
7. **Communicate** status to customers per the SLA.

**Failback** to the original active region is a separate, scheduled operation that is never done under pressure during the incident. It requires verifying the services, replaying ClickHouse backups in the reverse direction, and performing a second planned failover during a maintenance window.
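
The row-count validation in step 3 works best as an explicit gate rather than an eyeball check. Below is a minimal Python sketch of such a gate. It assumes the actual counts have already been queried from the restored ClickHouse instance; the `ExpectedRange` type, the function name, and any table names passed in are hypothetical and not part of Highflame's tooling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedRange:
    """Inclusive bounds on an acceptable row count for one table."""
    minimum: int
    maximum: int


def validate_row_counts(
    actual: dict[str, int],
    expected: dict[str, ExpectedRange],
) -> list[str]:
    """Return a list of problems; an empty list means the restore passed the gate."""
    problems = []
    for table, bounds in sorted(expected.items()):
        count = actual.get(table)
        if count is None:
            # The table is expected but absent from the restored data.
            problems.append(f"{table}: missing from restored data")
        elif not bounds.minimum <= count <= bounds.maximum:
            problems.append(
                f"{table}: {count} rows outside [{bounds.minimum}, {bounds.maximum}]"
            )
    return problems
```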
#### RTO / RPO Targets

| Component                  | RPO                    | RTO                 |
| -------------------------- | ---------------------- | ------------------- |
| Aurora-backed services     | <1s                    | \~2 min (promotion) |
| ClickHouse (Observatory)   | 1–24h (cron-dependent) | 30–90 min (restore) |
| S3-backed assets           | \~15 min (RTC SLA)     | 0 (immediate)       |
| Redis cache state          | Total loss accepted    | 0 (cold-start)      |
| End-to-end (DNS converged) | n/a                    | 15–45 min           |

These are **operational targets**, not contractual SLOs. The contractual SLO published to customers should be padded (typically RTO ≤ 2h and RPO ≤ 1h) to absorb the long tail of DNS propagation and customer-side SDK retry behavior.

#### Known Gaps and Anti-Patterns

This design **does not protect against logical corruption**: a bad migration or a poisoned write replicates into the passive region within the RPO window. Mitigate with retained point-in-time backups (Aurora PITR, ClickHouse `BACKUP_KEEP ≥ 7`) and the ability to restore to a timestamp *before* the corruption was introduced. **Do not assume cross-region replication is a backup.** It is durability against region failure, not against operator error.

This design **does not preserve in-flight requests**. Any request mid-flight at failover time is lost; clients see a 5xx and must retry. Workloads that cannot tolerate this (synchronous billing, payment authorization) need application-layer idempotency keys and replay semantics, not better infrastructure.

This design **assumes the failover decision is bigger than any single service**. Partial-region failovers (Admin in `us-east-1`, Shield in `us-west-2`) are forbidden: they create a split-brain on the shared Aurora and ClickHouse state and have caused every major DR incident the team has experienced. If one regional service is unhealthy, fail the whole region.
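
The idempotency-key approach mentioned above for in-flight requests can be illustrated briefly. This is a minimal Python sketch under stated assumptions, not a description of how any Highflame service implements it: the `IdempotentProcessor` class is hypothetical, and its in-memory dictionary stands in for what would need to be a durable, replicated store for the recorded results to survive a failover.

```python
from __future__ import annotations

from typing import Callable


class IdempotentProcessor:
    """Runs a handler at most once per idempotency key and replays the stored result."""

    def __init__(self, handler: Callable[[dict], dict]) -> None:
        self._handler = handler
        # Illustration only: a real deployment would need durable storage here.
        self._results: dict[str, dict] = {}

    def process(self, idempotency_key: str, request: dict) -> dict:
        # A retry with a known key returns the recorded result
        # instead of executing the side effect a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._handler(request)
        self._results[idempotency_key] = result
        return result
```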
Finally, this design **costs roughly 1.5–2x of single-region operations** in steady state (full passive infra at reduced replica counts, plus Cross-Region Replication (CRR) egress and Aurora Global Database charges).