API Databases Overloaded
Incident Report for Helium
Postmortem

Use of Explorer, API, Helium Apps Degraded

We noticed severe degradation of the Helium API today. This issue greatly impacted use of the phone apps, Explorer, and the API.

Diagnosis

The core databases behind helium-api spiked to a load average of over 200 (which is through the roof), and API workers were crashing. A recovery backoff was triggered to keep the API servers from continuously restarting and exacerbating the issue.
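
For readers unfamiliar with the term, a recovery backoff spaces out restart attempts so that crashing workers do not hammer an already overloaded database. The sketch below is a generic illustration of exponential backoff with jitter, not the actual mechanism used here; the function name `restart_delay` and the `base` and `cap` values are assumptions chosen for the example.

```python
import random


def restart_delay(attempt, base=1.0, cap=300.0, jitter=True, rng=random):
    """Return the seconds to wait before restart attempt `attempt` (0-based).

    Hypothetical helper: `base` and `cap` are illustrative values only.
    """
    # Double the wait on every failed attempt, but never exceed the cap.
    delay = min(cap, base * (2 ** attempt))
    if jitter:
        # Full jitter: pick a random wait in [0, delay] so that many
        # workers do not all restart at the same instant.
        return rng.uniform(0, delay)
    return delay


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, restart_delay(attempt, jitter=False))
```

Without jitter the waits grow as 1, 2, 4, 8, ... seconds until they reach the cap; with jitter each wait is drawn at random below that ceiling.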

  • We verified there was no hardware problem with the core databases.
  • We verified that the API worker cluster was not overloaded and had no hardware problems.
  • We verified there were no issues with the content delivery network.

At this point we knew we had to identify whoever was accidentally creating a denial of service against the API. With an API handling millions of requests per hour, this is a bit like finding a needle in a haystack, even more so when everything is failing.
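
One common way to narrow such a search is to rank clients by request volume over a recent window of access logs. The sketch below illustrates that idea only and does not describe the tooling actually used during this incident; the record shape `(client_id, path)` and the function name `top_clients` are assumptions made for the example.

```python
from collections import Counter


def top_clients(records, n=3):
    """Rank clients by request count.

    `records` is an iterable of (client_id, path) tuples, a hypothetical
    log format used only for this illustration. Returns a list of
    (client_id, request_count, share_of_total) for the n busiest clients.
    """
    counts = Counter(client for client, _path in records)
    total = sum(counts.values())
    # most_common() returns an empty list when there are no records,
    # so the division below never runs with total == 0.
    return [(client, count, count / total)
            for client, count in counts.most_common(n)]


if __name__ == "__main__":
    sample = [("client-a", "/v1/hotspots")] * 6 + [("client-b", "/v1/blocks")] * 2
    for client, count, share in top_clients(sample):
        print(f"{client}: {count} requests ({share:.0%})")
```

A single client holding a disproportionate share of requests is a strong candidate for closer inspection, although volume alone does not prove it is the cause of the database load.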

Recovery

After some time we identified the user whose traffic was causing the overload, and we stopped them. Database and API performance have completely recovered, and things are looking good.

Remediation Plan

We are working on a better process that will let us easily identify API consumers. This will allow us to impose limits on API consumers that are not attached to the phone apps, Explorer, and other core Helium services. Stay tuned, and thanks for hanging in there.
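
To make the idea of per-consumer limits concrete, here is a minimal token-bucket sketch in which identified core services are exempt and every other consumer gets its own budget. It is an illustration under assumptions, not the planned implementation: the class names, the `exempt` set, and the rate and capacity values are all hypothetical.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill tokens for the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class ConsumerLimiter:
    """One bucket per consumer; consumers listed in `exempt` are never limited."""

    def __init__(self, rate, capacity, exempt=()):
        self.rate = rate
        self.capacity = capacity
        self.exempt = set(exempt)
        self.buckets = {}

    def allow(self, consumer_id, now=None):
        if consumer_id in self.exempt:
            return True
        bucket = self.buckets.get(consumer_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[consumer_id] = bucket
        return bucket.allow(now)


if __name__ == "__main__":
    # Hypothetical consumer names and limits, for illustration only.
    limiter = ConsumerLimiter(rate=1, capacity=2, exempt={"explorer"})
    print([limiter.allow("third-party", now=0.0) for _ in range(3)])
```

A limiter like this only works once each request can be tied to a consumer identity, which is why identifying API consumers comes first in the plan.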

Posted Oct 23, 2021 - 22:23 UTC

Resolved
Traffic has gone back to normal. Thanks for hanging in there, folks.
Posted Oct 23, 2021 - 22:00 UTC
Update
We are going to continue to monitor, but system performance should be back to normal. Before closing this incident, we will post a postmortem and remediation plan.
Posted Oct 23, 2021 - 21:15 UTC
Monitoring
We've identified the "bad api client" and have temporarily remediated the issue while we continue to investigate. Performance should be starting to recover.
Posted Oct 23, 2021 - 21:07 UTC
Update
We are still working to diagnose the source of the issue and will update you with a report as soon as we can. Explorer and the API are still in a degraded state.
Posted Oct 23, 2021 - 20:56 UTC
Investigating
Something is currently overloading the API databases and degrading the user experience in Explorer and the Helium phone apps. We are investigating and searching for this rogue process.
Posted Oct 23, 2021 - 18:56 UTC
This incident affected: User Apps (API) and Explorer, Helium App.