We saw severe service degradation on the Helium API today. The issue significantly affected the phone apps, Explorer, and the API itself.
The core databases behind helium-api spiked to a load average of over 200 (which is through the roof), and API workers were crashing. A recovery backoff kicked in to keep the API servers from continuously restarting and exacerbating the issue.
At this point we knew we had to identify whoever was accidentally creating a denial of service against the API. With an API that handles millions of requests per hour, this is a bit like finding a needle in a haystack, even more so when everything is failing.
After some time we identified the user responsible and stopped them. Database and API performance have fully recovered, and things are looking good.
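For readers curious what that kind of search can look like, here is a minimal sketch of one common approach: counting requests per consumer and looking at the heaviest ones. It is not a description of the tooling we actually used. The record shape, the `consumer` field, and the sample values are all assumptions made up for illustration.

```python
from collections import Counter


def top_consumers(requests, n=3):
    """Return the n most frequent consumers as (consumer, count) pairs.

    `requests` is an iterable of dicts with a hypothetical "consumer" key
    (for example a client IP or user agent taken from access logs).
    """
    counts = Counter(req["consumer"] for req in requests)
    return counts.most_common(n)


# Hypothetical sample records, not real traffic.
sample = [
    {"consumer": "10.0.0.5", "path": "/v1/hotspots"},
    {"consumer": "10.0.0.9", "path": "/v1/blocks"},
    {"consumer": "10.0.0.5", "path": "/v1/hotspots"},
    {"consumer": "10.0.0.5", "path": "/v1/accounts"},
    {"consumer": "10.0.0.7", "path": "/v1/blocks"},
]

if __name__ == "__main__":
    for consumer, count in top_consumers(sample, n=2):
        print(consumer, count)
```

In practice the hard part is getting usable request records at all while workers are crashing, which is why this took a while.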
We are working on a better process for easily identifying API consumers. That will let us impose limits on API consumers that are not attached to the phone apps, Explorer, and other core Helium services. Stay tuned, and thanks for hanging in there.
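To give a rough idea of what per-consumer limits could look like, here is a minimal token-bucket sketch. It is only an illustration under stated assumptions, not the design we have settled on: the consumer identifiers, the `CORE_CONSUMERS` set, and the rate and capacity values are all hypothetical.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        # Refill based on elapsed time, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical identifiers for core services that are exempt from limits.
CORE_CONSUMERS = {"phone", "explorer"}


class ConsumerLimiter:
    """Keeps one token bucket per non-core consumer."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, consumer_id):
        if consumer_id in CORE_CONSUMERS:
            return True
        bucket = self.buckets.get(consumer_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[consumer_id] = bucket
        return bucket.allow()


if __name__ == "__main__":
    limiter = ConsumerLimiter(rate=1.0, capacity=2)
    # Three back-to-back calls against a bucket of 2: the last is rejected.
    print([limiter.allow("third-party-script") for _ in range(3)])
```

The point of the sketch is that a limit like this only works once each request can be tied to a consumer identity, which is the piece we are working on first.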