Elevated API Errors
Incident Report for Minds
Postmortem

Incident attendees:

  • Zack
  • Martin
  • Ben
  • Mark

Summary of Issue:

The site was generally slow, and feeds loaded slowly. Engine latency spiked. Email and SMS 2FA did not work: verification codes were sent, but when users entered them, engine responded with a 401.

How did this affect end users?

Users experienced slowness when loading feeds or otherwise using the app, and were unable to use Email or SMS 2FA.

How was the incident discovered?

End user reports as well as reports from within the team. Eventually an alarm was triggered for engine latency.

What steps were taken to resolve the incident?

Performed a rolling update for Redis master and replicas.

Issue Timeline (UTC)

  • 02:00 UTC (4/18): First reports of slowness from end users.
  • 13:03 UTC (4/18): First alarm triggered for production engine latency.
  • 17:00 UTC (4/18): Attempted scaling up Engine deployment in response to engine latency.
  • 18:40 UTC (4/18): Increased size on volume for Elasticsearch data node.
  • 21:22 UTC (4/18): Added Elasticsearch data node.
  • 21:42 UTC (4/18): Attempted adding Redis replica.
  • 21:51 UTC (4/18): Restarted the Redis master and replicas.
  • 21:52 UTC (4/18): Issue resolved.

Root Cause Analysis

  • Symptoms: application latency and 2FA codes not working.
  • Cause: cache data was not being replicated to the Redis replicas.
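
A replication failure like the one above can, in principle, be detected from the output of the Redis `INFO replication` command, which on a replica reports `master_link_status` and `master_sync_in_progress`. The following is a minimal sketch, not the check used in production; the sample text, the host address, and the function names are hypothetical and chosen for illustration only.

```python
def parse_info(text: str) -> dict:
    """Parse the key:value lines of a Redis INFO section into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_is_healthy(info: dict) -> bool:
    """Return True when a replica reports an up master link and no sync in progress.

    Assumption: nodes that do not report role "slave" have no upstream
    link to check, so they are treated as healthy here.
    """
    if info.get("role") != "slave":
        return True
    return (
        info.get("master_link_status") == "up"
        and info.get("master_sync_in_progress") == "0"
    )


# Hypothetical INFO replication output from a replica with a broken link.
SAMPLE_BROKEN = """# Replication
role:slave
master_host:10.0.0.1
master_port:6379
master_link_status:down
master_sync_in_progress:0
"""

if __name__ == "__main__":
    info = parse_info(SAMPLE_BROKEN)
    print("replica healthy:", replica_is_healthy(info))
```

In practice the text would come from running `INFO replication` against each replica, and an unhealthy result would feed an alert.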

Follow up action (GitLab Issues)

  • Infrastructure/#19

    • Include redis_master_link_up and redis_master_sync_in_progress in Redis dashboard
    • Create alert for redis_master_link_up
  • Infrastructure/#20

    • Track Elasticsearch replication
    • Unassigned shards alert
  • Infrastructure/#21

    • Look at Elasticsearch replica distribution across data nodes and delete unused indices
  • Infrastructure/#22

    • Investigate missing Redis logs from ES
  • Engine/#2302

    • Move 2FA codes out of Redis
  • Docs/#55

    • Create alarms for request latency, with lower and higher priority levels based on how high the latency is
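
The tiered latency alarm in the last item could work roughly as sketched below: compute a percentile over recent request latencies and map it to a priority. This is only an illustration under stated assumptions; the thresholds, the use of the 95th percentile, and all names are hypothetical and not taken from the actual alerting configuration.

```python
import math
from typing import Optional, Sequence

# Hypothetical thresholds in milliseconds; real values would need tuning.
LOW_PRIORITY_MS = 500.0
HIGH_PRIORITY_MS = 2000.0


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify_latency(p95_ms: float) -> Optional[str]:
    """Map a p95 latency to an alarm priority, or None when no alarm should fire."""
    if p95_ms >= HIGH_PRIORITY_MS:
        return "high"
    if p95_ms >= LOW_PRIORITY_MS:
        return "low"
    return None


if __name__ == "__main__":
    recent = [120.0] * 18 + [2500.0] * 2  # hypothetical request latencies in ms
    print(classify_latency(percentile(recent, 95)))
```

Two levels let a moderate slowdown raise a lower-priority notification while a severe one pages someone immediately.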
Posted Apr 19, 2022 - 19:51 UTC

Resolved
The issue was traced to a large number of cache misses and was resolved by performing a rolling update on the Redis master and replicas.
Posted Apr 19, 2022 - 17:14 UTC
Investigating
We're experiencing an elevated level of API errors and are currently looking into the issue.
Posted Apr 18, 2022 - 13:09 UTC
This incident affected: Web, API, and Search / Feeds.