Built to never go dark.
Every component runs hot. Active and standby database nodes replicate continuously. A virtual IP floats above the cluster — keepalived watches the primary, and the moment it stops responding, traffic is on the standby before the average viewer notices a buffer. This isn't a Postgres tutorial bolted onto a SaaS — it's first-class infrastructure your operator never has to think about.
Three tiers, one truth.
An active/standby database pair replicates over a private network. A pair of nginx load-balancers participates in VRRP for a virtual IP that floats between them. The application cluster reads through a connection pool that flushes and retries when the primary moves. Every link is dual-pathed.
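A minimal sketch of the flush-and-retry behaviour described above, not the product's actual pool. The names `FailoverPool`, `PrimaryMoved`, and the `connect` callable are hypothetical, chosen only for illustration; a real pool would sit on top of a database driver.

```python
class PrimaryMoved(Exception):
    """Raised by a connection when the primary it was bound to is gone."""


class FailoverPool:
    """Tiny pool that drops cached connections and retries after a failover."""

    def __init__(self, connect, max_retries=3):
        self._connect = connect          # hypothetical callable returning a new connection
        self._idle = []                  # cached idle connections
        self._max_retries = max_retries

    def _acquire(self):
        # Reuse an idle connection if one exists, otherwise open a new one.
        return self._idle.pop() if self._idle else self._connect()

    def flush(self):
        """Discard every cached connection; they still point at the old primary."""
        self._idle.clear()

    def run(self, query):
        last_error = None
        for _ in range(self._max_retries + 1):
            conn = self._acquire()
            try:
                result = conn.execute(query)
            except PrimaryMoved as exc:
                last_error = exc
                self.flush()             # stale connections are useless now
                continue                 # reconnect through the virtual IP and retry
            self._idle.append(conn)      # healthy connection goes back to the pool
            return result
        raise last_error
```

Blind retries like this are only safe for idempotent queries; a write that may already have been applied needs its own deduplication.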
What happens, second by second.
When a primary node fails, recovery happens in four ordered steps. Operator action: zero. Customer-side disruption: about as long as the ~1.5 s detection window, which in-flight retries absorb.
Heartbeat lost
The standby's keepalived stops receiving the primary's VRRP advertisements. After three missed advertisements (~1.5 s), the standby promotes itself.
Promotion + STONITH
The new primary fences the old node — power-off via cloud API — and accepts writes. Replication direction inverts; the recovered node will rejoin as standby.
Virtual IP migrates
The VRRP virtual IP moves to the new primary's NIC. The app layer's connection pool flushes; in-flight queries retry against the new endpoint.
Steady state
Service is restored. The recovered node, once reachable, syncs and rejoins as a warm standby. Audit log entries are written. The operator receives a single notification, after the fact.
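The first two steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the shipped implementation: the 0.5 s advertisement interval is inferred from "three missed advertisements, ~1.5 s", the priority value is made up, and `fence_old_primary` and `enable_writes` are hypothetical callables standing in for the cloud power-off call and the database promotion. The timing formula is the one VRRPv3 (RFC 5798) defines for the master-down interval.

```python
class FencingFailed(Exception):
    """The old primary could not be confirmed powered off."""


def master_down_interval(advert_interval_s, priority):
    """Seconds of silence before a VRRP backup declares the master down.

    RFC 5798: three advertisement intervals plus a priority-dependent skew,
    so a higher-priority backup reacts slightly sooner than a lower one.
    """
    skew = (256 - priority) * advert_interval_s / 256
    return 3 * advert_interval_s + skew


def take_over(fence_old_primary, enable_writes, attempts=3):
    """Fence first, then accept writes; never the other way around.

    fence_old_primary: hypothetical callable, returns True once the old node
                       is confirmed powered off.
    enable_writes:     hypothetical callable that opens the new primary to writes.
    """
    for _ in range(attempts):
        if fence_old_primary():
            enable_writes()
            return
    # Without a confirmed fence, two writable primaries could coexist.
    raise FencingFailed("old primary not confirmed off; refusing writes")
```

With the assumed 0.5 s interval, `master_down_interval(0.5, 100)` comes to about 1.8 s; the skew shrinks toward zero as the backup's priority approaches 255, which is where the ~1.5 s figure sits. Refusing writes when fencing cannot be confirmed is the point of STONITH: it trades a longer outage for the guarantee that only one node ever accepts writes.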