Incident attendees:
Summary of Issue:
General slowness across the site, with feeds loading slowly and a latency spike on engine. Email and SMS 2FA were not functioning: verification codes were sent, but engine responded with a 401 when users entered them.
How did this affect end users?
Users experienced slowness when loading feeds or otherwise using the app, and were unable to use Email or SMS 2FA.
How was the incident discovered?
End user reports as well as reports from within the team. Eventually an alarm was triggered for engine latency.
What steps were taken to resolve the incident?
Performed a rolling update for Redis master and replicas.
Issue Timeline (UTC)
- 02:00 UTC (4/18): First reports of slowness from end users.
- 13:03 UTC (4/18): First alarm triggered for production engine latency.
- 17:00 UTC (4/18): Attempted scaling up the engine deployment in response to engine latency.
- 18:40 UTC (4/18): Increased the volume size for the Elasticsearch data node.
- 21:22 UTC (4/18): Added Elasticsearch data node.
- 21:42 UTC (4/18): Attempted adding Redis replica.
- 21:51 UTC (4/18): Restarted the Redis master and replicas.
- 21:52 UTC (4/18): Issue resolved.
Root Cause Analysis
- Symptoms: application latency and 2FA codes not working.
- Cause: cache data was not being replicated from the Redis master to the Redis replicas.
Follow-up actions (GitLab Issues)
Infrastructure/#19
- Include redis_master_link_up and redis_master_sync_in_progress in Redis dashboard
- Create alert for redis_master_link_up
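As an illustration of the check behind these two metrics, the sketch below parses the text that Redis returns for INFO replication and flags a replica whose link to the master is down or that is still syncing. It is a minimal sketch and not the team's actual alerting setup: the function names and the sample output (including the host name redis-master) are hypothetical, and a real alert would more likely be defined on the exported metrics themselves.

```python
def parse_info(text: str) -> dict:
    """Parse the key:value lines of Redis INFO output, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replication_problems(info: dict) -> list:
    """Return a list of problems for a replica; an empty list means healthy."""
    problems = []
    if info.get("role") != "slave":
        return problems  # a master has no upstream link to check
    if info.get("master_link_status") != "up":
        problems.append("master link is down")
    if info.get("master_sync_in_progress") == "1":
        problems.append("full sync in progress")
    return problems


# Hypothetical sample of INFO replication output from a replica.
sample = """# Replication
role:slave
master_host:redis-master
master_link_status:down
master_sync_in_progress:0
"""

print(replication_problems(parse_info(sample)))
```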
Infrastructure/#20
- Track Elasticsearch replication
- Unassigned shards alert
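One possible shape for the unassigned shards check is sketched below. It assumes the alert is driven by the JSON body of the Elasticsearch cluster health API, which reports unassigned_shards and status; the function name, the sample payload, and the message format are hypothetical.

```python
import json
from typing import Optional


def shard_alert(health: dict) -> Optional[str]:
    """Return an alert message if any shards are unassigned, else None."""
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        status = health.get("status", "unknown")
        return f"{unassigned} unassigned shard(s), cluster status {status}"
    return None


# Hypothetical cluster health response, trimmed to the relevant fields.
sample = json.loads(
    '{"cluster_name": "example", "status": "yellow", "unassigned_shards": 3}'
)

print(shard_alert(sample))
```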
Infrastructure/#21
- Look at Elasticsearch replica distribution across data nodes and delete unused indices
Infrastructure/#22
- Investigate missing Redis logs from ES
Engine/#2302
- Move 2FA codes out of Redis
Docs/#55
- Create alarms for request latency, with lower and higher priority levels depending on how high the latency is
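The tiered latency alarm could be sketched roughly as follows. The threshold values, the choice of p95 as the measured statistic, and all names here are assumptions for illustration only; the actual thresholds would need to come from observed engine latency.

```python
from typing import Optional

# Hypothetical thresholds in milliseconds; tune against real latency data.
LOW_PRIORITY_MS = 500.0
HIGH_PRIORITY_MS = 2000.0


def latency_priority(p95_ms: float) -> Optional[str]:
    """Map a latency reading to an alarm priority, or None if no alarm is needed."""
    if p95_ms >= HIGH_PRIORITY_MS:
        return "high"
    if p95_ms >= LOW_PRIORITY_MS:
        return "low"
    return None


for reading in (120.0, 750.0, 3200.0):
    print(reading, latency_priority(reading))
```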