API Databases Overloaded

Incident Report for Helium

Postmortem

Use of Explorer, API, and Helium Apps Degraded

We noticed severe service degradation in the Helium API today. The issue significantly affected the phone apps, Explorer, and the API.

Diagnosis

The core databases behind helium-api spiked to a load average of over 200 (which is through the roof), and API workers were crashing. A recovery backoff was triggered to keep the API servers from continuously restarting and exacerbating the issue.
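For readers unfamiliar with the idea, a restart backoff can be sketched roughly as below. This is an illustration only, not the mechanism actually used on the API servers: the function name `restart_delay`, the 1-second base, and the 300-second cap are all assumptions made for the example.

```python
import random


def restart_delay(attempt, base=1.0, cap=300.0, rng=random.random):
    """Seconds to wait before restart number `attempt` (0-based).

    Hypothetical helper: `base` and `cap` are illustrative values only.
    """
    # Double the ceiling on every failed attempt, but never exceed `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a point in [0, ceiling) so that crashed workers
    # do not all restart at the same instant and hit the database together.
    return rng() * ceiling


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(restart_delay(attempt), 2))
```

The cap keeps a long outage from pushing the wait to hours, and the jitter spreads restarts out so the recovering databases are not hit by every worker at once.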

We verified that there was no hardware problem with the core databases.
We verified that the API worker cluster was not overloaded and had no hardware problems.
We verified that there were no issues with the content delivery network.

At this point we knew we had to identify someone who was accidentally creating a denial of service against the API. With an API that handles millions of requests per hour, this is a bit like finding a needle in a haystack, even more so when everything is failing.
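One common way to narrow this kind of search is to count requests per client over a recent window and look at the heaviest callers. The sketch below assumes request records are available as `(client_id, path)` pairs, for example parsed from access logs. That record shape and the `top_clients` helper are hypothetical and do not describe how the investigation here was actually done.

```python
from collections import Counter


def top_clients(records, n=5):
    """Return the n busiest clients as (client_id, request_count) pairs.

    `records` is an iterable of (client_id, path) tuples; this shape is
    an assumption made for the example.
    """
    counts = Counter(client_id for client_id, _path in records)
    # most_common sorts by count, highest first.
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        ("client-a", "/v1/hotspots"),
        ("client-b", "/v1/accounts"),
        ("client-a", "/v1/hotspots"),
        ("client-a", "/v1/blocks"),
    ]
    print(top_clients(sample, n=2))
```

In practice the client identifier might be an IP address, a user agent, or an API key, depending on what the logs record.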

Recovery

After some time we identified the user whose traffic was affecting us and stopped them. Database and API performance have completely recovered, and things are looking good.

Remediation Plan

We are working on a better process that will allow us to easily identify API consumers. This will let us impose limits on API consumers that are not attached to the phone apps, Explorer, and other core Helium services. Stay tuned, and thanks for hanging in there.
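To give a sense of what per-consumer limits can look like, here is a minimal token-bucket sketch. It is not the planned implementation. The class names, the `exempt` set for core services, and the rate and capacity values are all assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self._now = now
        self.updated = now()

    def allow(self):
        current = self._now()
        # Refill in proportion to the time elapsed, up to the bucket capacity.
        elapsed = current - self.updated
        self.updated = current
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class ConsumerLimiter:
    """One bucket per identified consumer; exempt consumers are never limited."""

    def __init__(self, rate, capacity, exempt=(), now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.exempt = set(exempt)  # hypothetical IDs for core services
        self._now = now
        self.buckets = {}

    def allow(self, consumer_id):
        if consumer_id in self.exempt:
            return True
        if consumer_id not in self.buckets:
            self.buckets[consumer_id] = TokenBucket(self.rate, self.capacity, self._now)
        return self.buckets[consumer_id].allow()


if __name__ == "__main__":
    limiter = ConsumerLimiter(rate=5, capacity=10, exempt={"explorer"})
    print([limiter.allow("third-party") for _ in range(12)])
```

A limiter like this only works once each request can be tied to a consumer, which is why identifying API consumers comes first.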

Posted Oct 23, 2021 - 22:23 UTC

Resolved

Traffic has gone back to normal. Thanks for hanging in there, folks.

Posted Oct 23, 2021 - 22:00 UTC

Update

We will continue to monitor, but system performance should be back to normal. Before closing this incident, we will publish a postmortem and remediation plan.

Posted Oct 23, 2021 - 21:15 UTC

Monitoring

We've identified the "bad API client" and have temporarily remediated the issue while we continue to investigate. Performance should be starting to recover.

Posted Oct 23, 2021 - 21:07 UTC

Update

We are still working to diagnose the source of the issue and will update you as soon as we can with a report. The explorer and API are still in a degraded state.

Posted Oct 23, 2021 - 20:56 UTC

Investigating

Something is currently overloading the API databases and degrading the user experience in Explorer and the Helium phone apps. We are investigating and searching for the rogue process.

Posted Oct 23, 2021 - 18:56 UTC