Cross-Region Active-Passive Deployment on EKS

Overview

Highflame's cross-region disaster recovery posture is active-passive (warm-standby): a single AWS region (us-east-1, the active region) serves 100% of production traffic, while a second region (us-west-2, the passive region) runs a continuously provisioned but idle replica of every workload, ready to absorb traffic on operator-initiated failover. This document describes the deployment topology, the data-replication contracts that bind the two regions, the failover runbook, and the operational invariants that keep the passive region restorable.

The active and passive regions are deployment-time parameters, not a hard-coded pair. Throughout this document, any reference to specific regions (e.g. us-east-1, us-west-2) is illustrative only — Highflame supports any AWS region as either active or passive, subject to a single hard constraint: every AWS service Highflame depends on must be generally available in both the chosen active and passive regions, with feature parity for the specific capabilities we use.
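The parity constraint above can be checked before a region pair is accepted. The following is a minimal sketch, assuming the caller supplies the set of required services and a per-region availability map; the function names and the short service identifiers ("eks", "rds", "s3") are hypothetical and not part of any Highflame tooling. It checks availability only, so feature parity for specific capabilities still needs a manual review.

```python
from typing import Dict, Set


def missing_services(
    required: Set[str],
    availability: Dict[str, Set[str]],
    active: str,
    passive: str,
) -> Dict[str, Set[str]]:
    """Return, per region, the required services that are not available there.

    An empty dict means both regions satisfy the availability constraint.
    `availability` is an assumed input, e.g. collected by hand or from AWS
    regional service listings.
    """
    gaps: Dict[str, Set[str]] = {}
    for region in (active, passive):
        missing = required - availability.get(region, set())
        if missing:
            gaps[region] = missing
    return gaps


def validate_region_pair(
    required: Set[str],
    availability: Dict[str, Set[str]],
    active: str,
    passive: str,
) -> None:
    """Raise ValueError if the pair cannot host an active-passive deployment."""
    if active == passive:
        raise ValueError("active and passive regions must be different")
    gaps = missing_services(required, availability, active, passive)
    if gaps:
        details = "; ".join(
            f"{region} lacks {sorted(names)}" for region, names in sorted(gaps.items())
        )
        raise ValueError(f"region pair rejected: {details}")
```

Calling validate_region_pair at deployment time, before any infrastructure is provisioned, keeps an unsupported pair from being discovered during a failover.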

Topology

Each region is a self-contained, fully provisioned EKS cluster with the same Kubernetes version, node groups, Helm releases, and app-config/ ConfigMaps. The passive cluster is not a "scaled-to-zero" cluster: control-plane components, ingress controllers, cert-manager, external-dns, the Cluster Autoscaler/Karpenter, observability sidecars, IRSA roles, and all stateful sets (Postgres clients, ClickHouse) are running. What differs is traffic (no upstream sends requests to it) and write workload (its databases are followers or restored from snapshots, not primaries). Concretely, the passive region's deployments run at reduced replica counts, typically replicas: 1 for stateless services versus replicas: 3–6 in the active region, sized for the minimum needed to keep image pulls warm.

The two clusters share nothing at the Kubernetes layer: separate control planes, separate VPCs, separate node IAM roles, separate KMS keys (with cross-region replication enabled on the keys that encrypt replicated data). Application configuration is identical and version-pinned via Helm value templates; the same templates apply to both regions. This is the single most important invariant: drift between the two regions is the failure mode that turns a 30-minute failover into a 6-hour incident. Every service deployment must therefore roll out to both regions in the same release, so that neither region runs a version the other lacks.
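One way to watch the no-drift invariant is to compare the rendered Helm values of the two regions and flag every difference that is not explicitly allowed. The following is a rough sketch, assuming the values have already been loaded into nested dictionaries; the allowed-difference keys ("replicas", "region") and the example service layout are assumptions, not the actual chart schema.

```python
from typing import Any, Dict, FrozenSet, List, Tuple

# Leaf keys that are expected to differ between regions (assumed names).
ALLOWED_TO_DIFFER: FrozenSet[str] = frozenset({"replicas", "region"})

_MISSING = object()  # sentinel: key absent in one region


def flatten(values: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted paths, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, val in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            flat.update(flatten(val, path))
        else:
            flat[path] = val
    return flat


def find_drift(
    active: Dict[str, Any],
    passive: Dict[str, Any],
    allowed: FrozenSet[str] = ALLOWED_TO_DIFFER,
) -> List[Tuple[str, Any, Any]]:
    """Return (path, active_value, passive_value) for every unexpected difference.

    A key present in only one region is reported with None on the other side.
    """
    a, p = flatten(active), flatten(passive)
    drift: List[Tuple[str, Any, Any]] = []
    for path in sorted(set(a) | set(p)):
        leaf = path.rsplit(".", 1)[-1]
        if leaf in allowed:
            continue
        if a.get(path, _MISSING) != p.get(path, _MISSING):
            drift.append((path, a.get(path), p.get(path)))
    return drift
```

Run against two regions whose Shield image tags differ, find_drift reports the image tag path and ignores the replica count. A check like this could run in CI after each deploy and fail the pipeline on any non-empty result.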

Traffic Routing

The public entry point is an AWS Global Accelerator with one endpoint group per region: the active region's ALB is the primary endpoint and the passive region's load balancer is the secondary. Global Accelerator health-checks the probe endpoint exposed by each region's ingress.
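What the probe endpoint reports determines how useful the health check is. The following is a minimal sketch of a probe that returns 200 only when every dependency check passes and 503 otherwise. The /healthz path, the check names, and the handler factory are all assumptions for illustration; the document does not specify how the real probe is implemented.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict, Tuple, Type


def probe_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every dependency check and map the result to an HTTP status.

    A check that raises is treated as failed, so a broken dependency client
    cannot make the probe itself crash.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return 503, "unhealthy: " + ",".join(sorted(failed))
    return 200, "ok"


def make_handler(
    checks: Dict[str, Callable[[], bool]],
    path: str = "/healthz",  # hypothetical probe path
) -> Type[BaseHTTPRequestHandler]:
    """Build a request handler class that serves the probe on `path`."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != path:
                self.send_error(404)
                return
            status, body = probe_status(checks)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args) -> None:
            # Silence per-request logging; probes arrive frequently.
            pass

    return ProbeHandler

# To serve it: HTTPServer(("0.0.0.0", 8080), make_handler(checks)).serve_forever()
```

In the passive region the checks should reflect standby health (for example, that the replica is reachable) rather than write availability, otherwise the standby would report unhealthy by design.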

Failover Runbook

Failover is manual and irreversible without a planned failback. The trigger is either (a) a regional AWS outage confirmed by AWS Health Dashboard or (b) sustained application-layer failure in the active region that cannot be resolved in-region within the SLA. The steps, in order:

  1. Declare the incident in the #incidents channel and page the secondary on-call. Do not start failover unilaterally.

  2. Promote the Aurora Global Database in the passive region via the AWS console. Wait for the promoted cluster to reach the available state (≈60 seconds). This makes the passive region (R2) the new read/write primary, while the original active region (R1) is demoted to a read-only replica once it is reachable again.

  3. Restore ClickHouse in the passive region: scale the clickhouse-backup sidecar (or run a one-shot Job) with download + restore_remote against the most recent CRR'd archive. Validate row counts against expected ranges before proceeding.

  4. Flip the target in Global Accelerator by adjusting the endpoint-group traffic dials. Traffic begins arriving in the new active region within the TTL window.

    1. New active region (R2): set the traffic dial to 100.

    2. Original active region (R1): set the traffic dial to 0.

  5. Scale up the passive region's stateless deployments to production replica counts (kubectl scale deployment/<name> --replicas=N for each service). Karpenter/Cluster Autoscaler provisions nodes as needed; expect ~3 minutes for the first batch of nodes to join.

  6. Validate end-to-end by running the smoke-test suite against the new active region's public endpoint. Watch error rates and latency dashboards for the next 30 minutes.

  7. Communicate status to customers per the SLA.
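Steps 3 and 4 of the runbook can be scripted so that the traffic flip is refused when the ClickHouse restore looks wrong. The following is a sketch under stated assumptions: the client is expected to behave like a boto3 Global Accelerator client (update_endpoint_group with EndpointGroupArn and TrafficDialPercentage), while the function names, table names, and expected row ranges are hypothetical and supplied by the operator.

```python
from typing import Any, Dict, Tuple


def rows_out_of_range(
    actual: Dict[str, int],
    expected: Dict[str, Tuple[int, int]],
) -> Dict[str, int]:
    """Return tables whose restored row count falls outside [low, high].

    A table missing from `actual` counts as 0 rows.
    """
    bad: Dict[str, int] = {}
    for table, (low, high) in expected.items():
        count = actual.get(table, 0)
        if not low <= count <= high:
            bad[table] = count
    return bad


def flip_traffic(ga_client: Any, new_active_arn: str, old_active_arn: str) -> None:
    """Apply step 4: dial the new active region up first, then the old one down."""
    ga_client.update_endpoint_group(
        EndpointGroupArn=new_active_arn, TrafficDialPercentage=100.0
    )
    ga_client.update_endpoint_group(
        EndpointGroupArn=old_active_arn, TrafficDialPercentage=0.0
    )


def fail_over_traffic(
    ga_client: Any,
    actual_rows: Dict[str, int],
    expected_rows: Dict[str, Tuple[int, int]],
    new_active_arn: str,
    old_active_arn: str,
) -> None:
    """Gate the traffic flip on the step 3 row-count validation."""
    bad = rows_out_of_range(actual_rows, expected_rows)
    if bad:
        raise RuntimeError(f"ClickHouse restore validation failed: {bad}")
    flip_traffic(ga_client, new_active_arn, old_active_arn)
```

With boto3 the client would be created as boto3.client("globalaccelerator", region_name="us-west-2"), because the Global Accelerator API is served from US West (Oregon) regardless of where the endpoints live. That is worth accounting for when us-west-2 is itself one of the regions in the pair.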

Failback to the original active region is a separate, scheduled operation — never done under pressure during the incident. It requires verifying the services, replaying ClickHouse backups in the reverse direction, and performing a second planned failover during a maintenance window.

RTO / RPO Targets

| Component | RPO | RTO |
| --- | --- | --- |
| Aurora-backed services | <1s | ~2 min (promotion) |
| ClickHouse (Observatory) | 1–24h (cron-dependent) | 30–90 min (restore) |
| S3-backed assets | ~15 min (RTC SLA) | 0 (immediate) |
| Redis cache state | Total loss accepted | 0 (cold-start) |
| End-to-end (DNS converged) | | 15–45 min |

These are operational targets, not contractual SLOs. The contractual SLO published to customers should be padded — typically RTO ≤ 2h and RPO ≤ 1h — to absorb the long tail of DNS propagation and customer-side SDK retry behavior.
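A quick way to sanity-check a padded SLO is to compare it against the worst-case operational targets per component. The sketch below uses the upper bounds from the table above and the example SLO of RTO ≤ 2h and RPO ≤ 1h; the dictionary layout and function name are assumptions, and Redis is omitted because its data loss is accepted by design.

```python
from datetime import timedelta
from typing import Dict, List, Tuple

# Worst-case operational targets (upper bounds of the ranges in the table).
TARGETS: Dict[str, Dict[str, timedelta]] = {
    "Aurora-backed services": {"rpo": timedelta(seconds=1), "rto": timedelta(minutes=2)},
    "ClickHouse (Observatory)": {"rpo": timedelta(hours=24), "rto": timedelta(minutes=90)},
    "S3-backed assets": {"rpo": timedelta(minutes=15), "rto": timedelta(0)},
}

# Example contractual SLO from the paragraph above.
SLO: Dict[str, timedelta] = {"rpo": timedelta(hours=1), "rto": timedelta(hours=2)}


def exceeds_slo(
    targets: Dict[str, Dict[str, timedelta]],
    slo: Dict[str, timedelta],
) -> List[Tuple[str, str]]:
    """Return (component, metric) pairs whose worst case is above the SLO."""
    violations: List[Tuple[str, str]] = []
    for component, values in targets.items():
        for metric in ("rpo", "rto"):
            if values[metric] > slo[metric]:
                violations.append((component, metric))
    return violations
```

With these inputs the only pair returned is the ClickHouse RPO, whose cron-dependent worst case of 24h is above the 1h example. That suggests either tightening the backup schedule or scoping the published RPO to exclude Observatory data.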

Known Gaps and Anti-Patterns

This design does not protect against logical corruption — a bad migration or a poisoned write replicates into the passive region within the RPO window. Mitigate with retained point-in-time backups (Aurora PITR, ClickHouse BACKUP_KEEP ≥ 7) and the ability to restore to a timestamp before the corruption was introduced. Do not assume cross-region replication is a backup. It is durability against region failure, not against operator error.
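Restoring to a point before the corruption comes down to picking the newest backup that is strictly older than the first bad write. The following is a small sketch, assuming the backup timestamps and the corruption time are already known as timezone-aware datetimes; the function name is hypothetical.

```python
from datetime import datetime
from typing import Iterable, Optional


def latest_clean_backup(
    backup_times: Iterable[datetime],
    corrupted_at: datetime,
) -> Optional[datetime]:
    """Return the newest backup taken strictly before `corrupted_at`.

    Returns None when every retained backup already contains the corruption,
    which is the situation a sufficiently long retention window is meant
    to prevent.
    """
    clean = [t for t in backup_times if t < corrupted_at]
    return max(clean) if clean else None
```

A None result means the retention window was too short for the time it took to detect the corruption, which is a useful number to track when choosing how many backups to keep.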

This design does not preserve in-flight requests. Any request mid-flight at failover time is lost; clients see a 5xx and must retry. Workloads that cannot tolerate this (synchronous billing, payment authorization) need application-layer idempotency keys and replay semantics, not better infrastructure.
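The idempotency-key pattern mentioned above can be illustrated as follows. This is a sketch only: the class name is hypothetical and the in-memory dictionary stands in for what would need to be a durable store that is replicated to the passive region (for example, a table in the Aurora Global Database), otherwise the keys are lost in the same failover they are meant to survive.

```python
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Run a handler at most once per idempotency key.

    A retry with a key that has already been processed returns the stored
    result instead of executing the side effect again.
    """

    def __init__(self, handler: Callable[[Any], Any]) -> None:
        self._handler = handler
        # Assumption: in production this is a durable, replicated store.
        self._results: Dict[str, Any] = {}

    def process(self, key: str, payload: Any) -> Any:
        if key in self._results:
            return self._results[key]
        result = self._handler(payload)
        self._results[key] = result
        return result
```

A client that receives a 5xx during failover resends the request with the same key; if the original write had already committed and replicated, the retry becomes a lookup rather than a second charge.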

This design assumes the failover decision is bigger than any single service. Partial-region failovers (Admin in us-east-1, Shield in us-west-2) are forbidden — they create a split-brain on the shared Aurora and ClickHouse state and have caused every major DR incident the team has experienced. If one regional service is unhealthy, fail the whole region.
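The whole-region rule can be enforced mechanically before any failover plan is executed. The following is a minimal sketch, assuming the plan is expressed as a mapping from service name to target region; the function name and that representation are hypothetical.

```python
from typing import Dict


def assert_single_region(placement: Dict[str, str]) -> str:
    """Return the one region all services are placed in, or raise ValueError.

    A plan that spreads services across regions is rejected because it would
    split writers across the shared Aurora and ClickHouse state.
    """
    if not placement:
        raise ValueError("placement is empty")
    regions = set(placement.values())
    if len(regions) != 1:
        by_region = ", ".join(
            f"{service} in {region}" for service, region in sorted(placement.items())
        )
        raise ValueError(f"partial-region failover is forbidden: {by_region}")
    return next(iter(regions))
```

The example from the paragraph above, Admin in us-east-1 with Shield in us-west-2, raises ValueError, while a plan that places both services in us-west-2 returns "us-west-2".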

Finally, this design costs roughly 1.5–2x as much as single-region operations in steady state (full passive infrastructure at reduced replica counts, plus Cross-Region Replication (CRR) egress and Aurora Global Database charges).
