High Availability · Disaster Recovery

Built to never go dark.

Every component runs hot. Active and standby database nodes replicate continuously. A virtual IP floats above the cluster — keepalived watches the primary, and the moment it stops responding, traffic is on the standby before the average viewer notices a buffer. This isn't a Postgres tutorial bolted onto a SaaS — it's first-class infrastructure your operator never has to think about.

99.999%
Service-Level Objective
<30s
Failover Recovery
3×
Replication Targets
0
Operator Action Required
§ A · Topology
Cluster Anatomy

Three nodes, one truth.

An active/standby database pair replicates over a private network. A pair of nginx load-balancers participates in VRRP for a virtual IP that floats between them. The application cluster reads through a connection pool that flushes and retries when the primary moves. Every link is dual-pathed.

Topology · HA-Cluster-001 · 3 nodes · vrrp v3
[Topology diagram]
VIRTUAL IP · 10.0.0.100
● PRIMARY · db-01 · writes · accepting
○ STANDBY · db-02 · replicating · warm
APP CLUSTER · tuse · ott · prism
○ LB-A · nginx · keepalived · vrrp master
○ LB-B · nginx · keepalived · vrrp backup
CLIENT TRAFFIC ↑ 12.4M CCV
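The VRRP election behind the floating IP can be sketched in a few lines. This is an illustrative model, not keepalived's implementation: the healthy router with the highest priority claims the virtual IP, and the backup takes over the moment the master is unhealthy.

```typescript
type Router = { name: string; priority: number; healthy: boolean };

// VRRP election sketch: among healthy routers, the highest priority
// becomes master and answers for the virtual IP (10.0.0.100 above).
function electMaster(routers: Router[]): Router | undefined {
  return routers
    .filter((r) => r.healthy)
    .sort((a, b) => b.priority - a.priority)[0];
}

// Priorities are illustrative; LB-A is configured to win while healthy.
const lbA: Router = { name: "LB-A", priority: 150, healthy: true };
const lbB: Router = { name: "LB-B", priority: 100, healthy: true };

electMaster([lbA, lbB]);                         // LB-A while healthy
electMaster([{ ...lbA, healthy: false }, lbB]);  // LB-B after LB-A fails
```

The same election logic governs the database pair: whichever node holds the virtual IP receives writes, so clients never need to know which physical node is primary.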
§ B · Failover
Sub-30-second walkthrough

What happens, second by second.

When a primary node fails, recovery happens in four ordered steps. Operator action: zero. Customer-side disruption: a write pause of under five seconds, absorbed by connection-pool retries.

T+0s

Heartbeat lost

The standby's keepalived stops receiving the primary's VRRP advertisements. After three missed beacons (~1.5s), the standby begins promoting itself.

Detection · 1.5s
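A keepalived fragment along these lines would produce the ~1.5 s detection window described above. Values are illustrative, and sub-second advertisement intervals assume VRRPv3 (which the topology header indicates); per the VRRP spec, a backup declares the master down after roughly three missed advertisements.

```
# Hypothetical keepalived fragment for the standby database node.
# advert_int 0.5 means the primary advertises every 500 ms, so three
# missed beacons ≈ 1.5 s before the standby begins promotion.
vrrp_instance VI_DB {
    state BACKUP
    interface eth1          # private replication network
    virtual_router_id 51
    priority 100            # lower than the primary's
    advert_int 0.5          # sub-second intervals require VRRPv3
    virtual_ipaddress {
        10.0.0.100
    }
}
```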
T+2s

Promotion + STONITH

The new primary fences the old node — power-off via cloud API — and accepts writes. Replication direction inverts; the recovered node will rejoin as standby.

Promotion · 2s
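The promotion step can be sketched as two awaited calls whose ordering is the entire point. `fence` and `promote` here stand in for a cloud provider's power-off API and the database's promotion hook; both names are hypothetical.

```typescript
// Fencing-then-promotion sketch. The old primary must be fenced
// (STONITH) *before* the standby accepts writes, so two nodes can
// never accept writes at the same time (split-brain).
async function promoteStandby(
  fence: (node: string) => Promise<void>,   // hypothetical cloud API call
  promote: () => Promise<void>,             // hypothetical DB promotion hook
): Promise<void> {
  await fence("db-01"); // power off the old primary via the cloud API
  await promote();      // only now does db-02 begin accepting writes
}
```

If fencing fails, a real implementation must abort the promotion rather than proceed; accepting writes on two nodes is the one failure mode worse than downtime.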
T+5s

Virtual IP migrates

The VRRP virtual IP moves to the new primary's NIC. The app layer's connection pool flushes; in-flight queries retry against the new endpoint.

VIP migration · 3s
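The pool-flush-and-retry behavior can be sketched as a retry wrapper. This is an illustrative shape, not a real driver API: `runQuery` and `flushPool` are hypothetical hooks standing in for the app layer's connection pool.

```typescript
// Retry sketch: when a query fails mid-failover, drop pooled
// connections (which still point at the dead primary) and retry
// with backoff; new connections land on the VIP's new owner.
async function retryThroughFailover<T>(
  runQuery: () => Promise<T>,
  flushPool: () => Promise<void>,
  attempts = 3,
  backoffMs = 500,
): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await runQuery();
    } catch {
      await flushPool(); // discard connections to the dead primary
      await new Promise((r) => setTimeout(r, backoffMs * (i + 1)));
    }
  }
  return runQuery(); // final attempt surfaces the error if still failing
}
```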
T+<30s

Steady state

Service is restored. The recovered node, when reachable, syncs and joins as warm standby. Audit log entries are written. Operator receives a single notification, after the fact.

Full recovery · 30s
§ C · Matrix

What is — and isn't — preserved.

Reads
Continuous. The standby serves reads during failover; clients never see an interruption on read traffic.
Writes
Briefly suspended (< 5 seconds) during promotion. In-flight writes are retried by the connection pool with idempotency keys.
User sessions
JWT-based, stateless. Sessions persist through the failover with zero re-authentication.
Background jobs
BullMQ queues are Redis-backed and survive failover. In-progress jobs complete on the new primary.
Live streams
Origin servers are independent of the database. Stream ingestion and HLS delivery are unaffected.
Operator dashboard
Renders against the new primary the moment VIP migrates. Audit log shows the failover event.
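The idempotency-key guarantee behind the write retries can be shown in miniature. The write endpoint is stubbed as a Map here, and the key and value are illustrative: the point is that a write retried across the failover window applies exactly once.

```typescript
// Idempotent-write sketch: the server deduplicates by key, so the
// connection pool can safely retry a write it isn't sure landed.
const applied = new Map<string, number>();

function writeWithKey(key: string, value: number): number {
  if (!applied.has(key)) applied.set(key, value); // first delivery wins
  return applied.get(key)!;                        // retries are no-ops
}

writeWithKey("txn-123", 42); // original attempt, racing the failover
writeWithKey("txn-123", 42); // pool's retry after promotion: no double-apply
```

This is why the matrix can promise "briefly suspended" rather than "possibly duplicated": a retry with the same key is indistinguishable, server-side, from a delivery that never repeated.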
§ ∞ · Talk to us
Pilot a managed HA cluster

Engineer-grade uptime.