The PostgreSQL primary database crashed and was offline for roughly 10 minutes. The cause was write-ahead log (WAL) files accumulating in the WAL directory inside the PostgreSQL data directory until they filled the disk.
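WAL files are written to disk and are used to stream changes to follower (replica) databases. We also archive WAL files so that, if something goes wrong, we can restore the database to a point in time. WAL archiving had been disabled by changing the archive_command setting rather than by turning off archive_mode. This is a common practice, because changing archive_mode requires a full database restart, and we want to avoid restarts because most of the ETL database is held in RAM.
For reasons we have not fully pinned down, this version of PostgreSQL did not handle the modified archive_command the way we expected. Instead of discarding the WAL files, it kept them in a location other than the archive directory until they filled the disk and crashed PostgreSQL. This is consistent with documented PostgreSQL behavior: a WAL segment is only removed or recycled after archive_command reports success for it, so a command that fails, or an empty archive_command while archive_mode is still on, causes segments to pile up.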
To prevent this from recurring, we will monitor both the archive directory and the WAL directory, and automatically delete older WAL files on a regular schedule.
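A minimal sketch of that monitoring and cleanup in Python, assuming the archive is a plain local directory and that file age (modification time) is an acceptable retention rule. The retention window, the 80% disk threshold, and every function name here are hypothetical. The sketch only deletes from the archive directory and defaults to a dry run. It deliberately leaves the live WAL directory alone, since PostgreSQL removes or recycles those segments itself once they have been archived and are no longer needed. If the archive backs base backups, PostgreSQL's own pg_archivecleanup utility, which removes archived segments older than a named WAL file, is probably a safer fit than an age-based rule.

```python
import os
import shutil
import tempfile
import time

# Hypothetical thresholds: tune for the real disk size and WAL volume.
DISK_ALERT_FRACTION = 0.80                  # alert when the volume is 80% full
ARCHIVE_RETENTION_SECONDS = 7 * 24 * 3600   # keep one week of archived WAL


def dir_size_bytes(path):
    """Sum the sizes of regular files directly inside `path` (not recursive)."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total


def volume_is_nearly_full(path, threshold=DISK_ALERT_FRACTION):
    """True when the filesystem holding `path` is more than `threshold` full."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total > threshold


def prune_archive(archive_dir, retention_seconds=ARCHIVE_RETENTION_SECONDS,
                  now=None, dry_run=True):
    """Delete archived WAL files whose mtime is older than the retention window.

    Returns the sorted names of files that were removed (or, with
    dry_run=True, would have been). Only touches `archive_dir`.
    """
    now = time.time() if now is None else now
    expired = []
    with os.scandir(archive_dir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > retention_seconds:
                expired.append(entry.name)
    expired.sort()
    if not dry_run:
        for name in expired:
            os.remove(os.path.join(archive_dir, name))
    return expired


if __name__ == "__main__":
    # Demo against a throwaway directory instead of a real archive.
    with tempfile.TemporaryDirectory() as archive:
        for name, age_days in [("000000010000000000000001", 10),
                               ("000000010000000000000002", 1)]:
            path = os.path.join(archive, name)
            with open(path, "wb") as f:
                f.write(b"\0" * 1024)
            mtime = time.time() - age_days * 24 * 3600
            os.utime(path, (mtime, mtime))
        print("archive bytes:", dir_size_bytes(archive))
        print("would remove:", prune_archive(archive))
        print("volume nearly full:", volume_is_nearly_full(archive))
```

One way to use it is to run it from cron or a systemd timer with dry_run=True at first, review the list of files it reports, and only then switch to dry_run=False; volume_is_nearly_full() is the piece to wire into alerting for both the archive and WAL volumes.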
If you’ve gotten this far, thanks for reading and we appreciate you.