Elevated API Errors

Incident Report for Minds

Postmortem

Incident attendees:

Zack
Martin
Ben
Mark

Summary of Issue:

Generalized slowness of the site, with feeds loading slowly and a latency spike for engine. Email and SMS 2FA were not functioning: verification codes were being sent, but engine responded with a 401 when users entered them.

How did this affect end users?

Slowness when loading feeds or otherwise using the app, as well as being unable to use Email or SMS 2FA.

How was the incident discovered?

End user reports as well as reports from within the team. Eventually an alarm was triggered for engine latency.

What steps were taken to resolve the incident?

Performed a rolling update for Redis master and replicas.

Issue Timeline (UTC)

02:00 UTC (4/18): First reports of slowness from end users.
13:03 UTC (4/18): First alarm triggered for production engine latency.
17:00 UTC (4/18): Attempted scaling up Engine deployment in response to engine latency.
18:40 UTC (4/18): Increased the volume size for the Elasticsearch data node.
21:22 UTC (4/18): Added Elasticsearch data node.
21:42 UTC (4/18): Attempted adding Redis replica.
21:51 UTC (4/18): Restarted the Redis master and replicas.
21:52 UTC (4/18): Issue resolved.
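
For reference, the detection and resolution gaps implied by the timeline can be worked out directly. This is a small sketch using only the timestamps listed above; the year 2022 is taken from the posting dates of this report, and the helper name `utc` is just for illustration.

```python
from datetime import datetime, timezone

def utc(hhmm: str) -> datetime:
    # Every timeline entry above falls on 4/18 (UTC); 2022 comes from the post dates.
    parsed = datetime.strptime(f"2022-04-18 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)

first_report = utc("02:00")  # first end user reports of slowness
first_alarm = utc("13:03")   # first engine latency alarm
resolved = utc("21:52")      # issue resolved

time_to_alarm = first_alarm - first_report
time_to_resolve = resolved - first_report

print(time_to_alarm)    # 11:03:00
print(time_to_resolve)  # 19:52:00
```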

Root Cause Analysis

Symptoms: application latency and 2FA codes not working.
Cause: the cache was not being replicated to the Redis replicas, which resulted in a large number of cache misses.
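
One way to spot a broken replication link is to inspect the output of `INFO replication` on each replica. The sketch below is not the tooling used during this incident. It only parses the text that Redis returns and checks the `master_link_status` and `master_sync_in_progress` fields. The function names and the sample output are made up for illustration.

```python
def parse_info(text: str) -> dict:
    """Parse Redis INFO output ("key:value" lines) into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Replication"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields

def replica_link_ok(info_text: str) -> bool:
    """True if the replica reports its master link as up and is not mid-sync."""
    info = parse_info(info_text)
    link_up = info.get("master_link_status") == "up"
    syncing = info.get("master_sync_in_progress") == "1"
    return link_up and not syncing

# Hypothetical sample output, not captured from the affected cluster.
broken_replica = """# Replication
role:slave
master_link_status:down
master_sync_in_progress:0
"""

healthy_replica = """# Replication
role:slave
master_link_status:up
master_sync_in_progress:0
"""

print(replica_link_ok(broken_replica))   # False
print(replica_link_ok(healthy_replica))  # True
```

In practice the text would come from running `redis-cli INFO replication` against each replica, or from the equivalent exporter metrics.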

Follow up action (GitLab Issues)

Infrastructure/#19
- Include redis_master_link_up and redis_master_sync_in_progress in Redis dashboard
- Create alert for redis_master_link_up
Infrastructure/#20
- Track Elasticsearch replication
- Unassigned shards alert
Infrastructure/#21
- Look at Elasticsearch replica distribution across data nodes and delete unused indices
Infrastructure/#22
- Investigate missing Redis logs from ES
Engine/#2302
- Move 2FA codes out of Redis
Docs/#55
- Create alarms for request latency, with lower and higher priority levels based on how high the latency is
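
As an illustration of the unassigned shards alert listed under Infrastructure/#20, the sketch below classifies a response from the Elasticsearch cluster health API (`GET _cluster/health`), which reports a `status` of green, yellow, or red and an `unassigned_shards` count. The severity names, the function name, and the sample response are assumptions for this example, not the alert rules that were actually created.

```python
from typing import Optional

def shard_alert(health: dict) -> Optional[str]:
    """Map a cluster health response to an alert severity, or None if healthy."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    if status == "red":
        return "critical"  # at least one primary shard is unassigned
    if status == "yellow" or unassigned > 0:
        return "warning"   # replicas are unassigned; data is still available
    return None

# Hypothetical response body, trimmed to the fields used here.
sample = {"cluster_name": "example", "status": "yellow", "unassigned_shards": 4}

print(shard_alert(sample))  # warning
```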

Posted Apr 19, 2022 - 19:51 UTC

Resolved

The issue was identified as being caused by a large number of cache misses and was resolved by performing a rolling update on the Redis master and replicas.

Posted Apr 19, 2022 - 17:14 UTC

Investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.

Posted Apr 18, 2022 - 13:09 UTC

This incident affected: Core services (API, Web, Search / Feeds).