Database Outage

Incident Report for Helium

Postmortem

The PostgreSQL primary database crashed and was offline for roughly 10 minutes. The cause was write-ahead log (WAL) files accumulating until they filled the wal_log directory inside the pgsql data directory.

WAL files are stored on disk and used to send data to follower databases. We also keep WAL files in archive mode in case there is an issue and we need to restore to a point in time. WAL archiving had been disabled by manipulating the archive command. This is a common practice, because turning archive mode off requires restarting the whole database, and we don’t want to do that because we keep most of the ETL database in RAM. For reasons we have not fully determined, this version of PostgreSQL did not handle the manipulated archive command as expected: it kept writing the WAL files to a location other than the archive location until the disk filled and PostgreSQL crashed.

To prevent a recurrence, we will monitor the archive directory and the wal_log directory, and automatically delete older WAL files on a regular schedule.
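
As a rough illustration of the kind of check described above, here is a minimal Python sketch. It is not the tooling we run: the function names, the 80% threshold, and the seven-day retention window are all hypothetical values chosen for the example. It measures how full the filesystem is and, above the threshold, selects files older than the retention window for deletion. It defaults to a dry run, and it is meant to be pointed at an archive directory only; removing files from a live WAL directory by hand can make a database unrecoverable, and PostgreSQL ships its own pg_archivecleanup tool for trimming archives.

```python
import os
import shutil
import time


def disk_used_fraction(path):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def prune_old_files(directory, max_age_seconds, now=None, dry_run=True):
    """Select regular files in `directory` last modified more than max_age_seconds ago.

    Returns the sorted names of the selected files. They are only deleted
    when dry_run is False.
    """
    now = time.time() if now is None else now
    selected = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip subdirectories and symlinks; only plain files are candidates.
            if not entry.is_file(follow_symlinks=False):
                continue
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age_seconds:
                selected.append(entry.name)
                if not dry_run:
                    os.remove(entry.path)
    return sorted(selected)


def check_and_prune(directory, threshold=0.80, max_age_seconds=7 * 24 * 3600, dry_run=True):
    """Prune old files only when the disk is fuller than `threshold` (hypothetical defaults)."""
    used = disk_used_fraction(directory)
    if used < threshold:
        return used, []
    return used, prune_old_files(directory, max_age_seconds, dry_run=dry_run)
```

Run on a schedule (from cron, for example), check_and_prune returns the current usage fraction and the list of files it selected, which can be sent to an alerting system before dry_run is ever switched off.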

If you’ve gotten this far, thanks for reading and we appreciate you.

Posted Oct 15, 2021 - 09:52 UTC

Resolved

This issue has been fully resolved. We apologize for any inconvenience.

Posted Oct 15, 2021 - 09:43 UTC

Monitoring

Something rapidly consumed the disk on our primary database for the API and caused a crash. We have recovered everything, but continue to hunt the culprit.

Posted Oct 15, 2021 - 09:02 UTC

Investigating

We are currently experiencing a database outage for the Helium API. We are working as quickly as possible to diagnose and bring it back online. More details soon.

Posted Oct 15, 2021 - 08:55 UTC