◆ High Availability · Disaster Recovery

Built to never go dark.

Every component runs hot. Active and standby database nodes replicate continuously. A virtual IP floats above the cluster — keepalived watches the primary, and the moment it stops responding, traffic is on the standby before the average viewer notices a buffer. This isn't a Postgres tutorial bolted onto a SaaS — it's first-class infrastructure your operator never has to think about.

99.999%

Service-Level Objective

<30s

Failover Recovery

3×

Replication Targets

0

Operator Action Required
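As a rough check on how these figures relate, the sketch below converts an availability objective into a yearly downtime budget and counts how many worst-case failovers fit inside it. It is illustrative arithmetic only: treating every failover as a full 30-second outage is a pessimistic assumption, and the function names are hypothetical.

```python
# Downtime budget implied by an availability objective (illustrative arithmetic only).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, including leap days


def downtime_budget_seconds(slo_percent: float) -> float:
    """Seconds per year a service may be unavailable while still meeting the SLO."""
    return SECONDS_PER_YEAR * (1 - slo_percent / 100)


def failovers_within_budget(slo_percent: float, failover_seconds: float) -> int:
    """Whole failovers of the given length that fit in one year's budget."""
    return int(downtime_budget_seconds(slo_percent) // failover_seconds)


budget = downtime_budget_seconds(99.999)
print(f"{budget:.0f} s/year (about {budget / 60:.1f} min)")
print(failovers_within_budget(99.999, 30))
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, or about ten worst-case failovers.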

§ A · Topology

Cluster Anatomy

Three nodes, one truth.

An active/standby database pair replicates over a private network. A pair of nginx load-balancers participates in VRRP for a virtual IP that floats between them. The application cluster reads through a connection pool that flushes and retries when the primary moves. Every link is dual-pathed.

▲ Topology · HA-Cluster-001 · 3 nodes · vrrp v3

§ B · Failover

Sub-30-second walkthrough

What happens, second by second.

When a primary node fails, recovery happens in four ordered steps. Operator action: zero. Customer-side disruption: a write pause of under 5 seconds, with reads uninterrupted.

T+0s

Heartbeat lost

Standby's keepalived stops receiving the primary's VRRP advertisements. After 3 missed beacons (~1.5s), the standby promotes itself.

Detection · 1.5s
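The detection window can be sketched with the master-down timer defined for VRRPv3 in RFC 5798: three advertisement intervals plus a small priority-based skew. The 0.5-second interval is inferred from "3 missed beacons (~1.5s)", and the standby priority of 128 is a hypothetical value; neither is a confirmed setting of this cluster.

```python
def master_down_interval(advert_interval_s: float, backup_priority: int) -> float:
    """RFC 5798 master-down timer: 3 advertisement intervals plus a priority skew."""
    if not 1 <= backup_priority <= 254:
        raise ValueError("backup priority must be in 1..254")
    # Higher-priority backups get a smaller skew, so they time out (and take over) first.
    skew = (256 - backup_priority) * advert_interval_s / 256
    return 3 * advert_interval_s + skew


# Assumed values: 0.5 s advertisements, standby priority 128.
print(master_down_interval(0.5, 128))
```

With the skew included, the standby would take over slightly later than the rounded 1.5 s figure, which is consistent with the "~" in the step above.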

T+2s

Promotion + STONITH

The new primary fences the old node — power-off via cloud API — and accepts writes. Replication direction inverts; the recovered node will rejoin as standby.

Promotion · 2s

T+5s

Virtual IP migrates

The VRRP virtual IP moves to the new primary's NIC. The app layer's connection pool flushes; in-flight queries retry against the new endpoint.

VIP migration · 3s

T+<30s

Steady state

Service is restored. The recovered node, once reachable, resyncs and rejoins as a warm standby. The failover is recorded in the audit log, and the operator receives a single notification after the fact.

Full recovery · <30s

§ C · Matrix

What is — and isn't — preserved.

Reads

Continuous. The standby keeps serving reads during failover, so clients see no interruption on read traffic.

Writes

Briefly suspended (< 5 seconds) during promotion. In-flight writes are retried by the connection pool with idempotency keys.
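A minimal sketch of how a retried write can stay safe across a promotion: the idempotency key is generated once and reused on every attempt, so a server that deduplicates on the key cannot apply the write twice. The `execute` callable, the `PrimaryUnavailable` exception and the backoff values are hypothetical stand-ins, not the platform's actual client API.

```python
import time
import uuid


class PrimaryUnavailable(Exception):
    """Raised by the (hypothetical) database client while no primary accepts writes."""


def write_with_retry(execute, payload, attempts=5, base_delay_s=0.5, sleep=time.sleep):
    """Retry a write, reusing one idempotency key so a replay cannot apply twice."""
    key = str(uuid.uuid4())  # generated once, sent with every attempt
    for attempt in range(attempts):
        try:
            return execute(payload, idempotency_key=key)
        except PrimaryUnavailable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay_s * 2 ** attempt)  # waits of 0.5 s, 1 s, 2 s, 4 s
```

With these assumed defaults the waits add up to 7.5 seconds, enough to ride out a write pause of under 5 seconds.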

User sessions

JWT-based, stateless. Sessions persist through the failover with zero re-authentication.
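To illustrate why stateless sessions are untouched by a database failover, the sketch below verifies a token using only a shared signing key and the clock, with no database lookup. It assumes HS256 signing, which the text does not specify, and is written with the standard library for illustration; a production service would normally rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def _b64url_decode(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a compact HS256 token (helper for the example)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url_encode(sig)}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if algorithm, signature and expiry check out."""
    header, body, sig = token.split(".")
    if json.loads(_b64url_decode(header)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(body))
    if claims.get("exp", float("inf")) <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims
```

Because verification depends only on the key and the token itself, any application node can validate a session while the database primary is changing.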

Background jobs

BullMQ queues are Redis-backed and survive failover. In-progress jobs complete on the new primary.

Live streams

Origin servers are independent of the database. Stream ingestion and HLS delivery are unaffected.

Operator dashboard

Renders against the new primary the moment the VIP migrates. The audit log shows the failover event.

§ ∞ · Talk to us

Pilot a managed HA cluster

Engineer-grade uptime.

Talk to engineering →

See pricing