January 8, 2023
Summary of Issue:
This week, we rolled out additional components intended to increase the resilience of Redis to failure (namely Sentinel and HAProxy). As part of this effort, we also now take backups of our cache that can be restored on restart. As we've now learned, larger datasets (such as our production cache) require much more CPU to be allotted for these backup processes than we had planned. During one of these backups, the CPU for our container began to throttle. Eventually, this caused a Kubernetes liveness probe to fail, and the container restarted. Replication was seemingly unable to recover after the restart (likely due to resource constraints), leading to subsequent probe failures and restarts.
Another component of interest is the Minds backend, which can be refactored to fall back to requesting from the origin server if the cache is unavailable. This would mean that if Redis fails, users would experience slowness while the application remains usable. In its current state, Redis is a critical dependency that breaks the application when it is down.
In closing, CPU constraints seem to be the root cause here. That said, there's much that can be done to both harden our caching layer and the application itself to be more resilient in the face of such failures. Please see the "Follow up actions" section for more details.
How did this affect end users? (Link Severity/Priority)
- Failing logins.
- Generalized latency.
- Error messages (NOREPLICAS)
How was the incident discovered?
Alarm was triggered on API latency and Redis master status.
What steps were taken to resolve the incident?
- Attempted restart of statefulset.
- Flushed Redis cache by recreating pods w/o RDB file present.
Issue Timeline (UTC)
- [02:22] - Alarm triggered for high API latency.
- [02:29] - Issue identified as Redis being down. Kubernetes liveness probes failed, and containers were repeatedly restarting.
- [02:30] - Attempted rolling restart of Redis stateful set; however, replication was unable to recover.
- [03:00] - Recreated Redis cache, flushing existing keys.
- [03:03] - After flushing the cache, latency spiked and eventually returned to a normal state.
- [03:21] - Closed incident.
Root Cause (5 Whys)
- Enabling Redis backups likely increases the number of running processes, as Redis forks a background process to perform backup operations.
- This would increase overall CPU utilization, consistent with our metrics.
- With the CPU being throttled, this could potentially starve the Redis exporter. This may cause slowness when responding to the liveness probe.
- Liveness probe fails, and Redis restarts. Replication is then unable to recover (more testing required to reproduce).
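One way to check the first two points is to sample the Redis INFO output while a backup runs and line it up with the CPU metrics. The sketch below is only an illustration: the function names are hypothetical, and it assumes the raw INFO text has already been fetched (for example via redis-cli). The fields it reads (rdb_bgsave_in_progress, rdb_last_bgsave_time_sec, latest_fork_usec) are standard INFO fields.

```python
from typing import Dict


def parse_info(info_text: str) -> Dict[str, str]:
    """Parse raw Redis INFO output ("name:value" lines) into a dict."""
    fields: Dict[str, str] = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Persistence".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        if sep:
            fields[name] = value
    return fields


def bgsave_snapshot(info_text: str) -> Dict[str, object]:
    """Hypothetical helper: extract the fields relevant to backup load."""
    fields = parse_info(info_text)
    return {
        # True while a forked RDB save is running.
        "bgsave_in_progress": fields.get("rdb_bgsave_in_progress") == "1",
        # Duration of the last completed RDB save, in seconds (-1 if none).
        "last_bgsave_seconds": int(fields.get("rdb_last_bgsave_time_sec", "-1")),
        # Time the most recent fork took, in microseconds.
        "last_fork_usec": int(fields.get("latest_fork_usec", "0")),
    }
```

If bgsave_in_progress is consistently true during the throttling windows, that would support the theory that the backup fork is what pushes the container past its CPU limit.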
Follow up actions
- Increase CPU request for Redis.
- Enable latency monitoring on the Redis side. This will complement our Grafana metrics and provide more useful debug info in the future.
- Attempt to reproduce replication failure w/Litmus tests. We should confirm our above suspicions regarding CPU constraints being the culprit here.
- Introduce a timeout when retrieving things from cache. Rather than failing the request, we can instead request the origin if Redis is down.
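The last item could look roughly like the sketch below. It is written in Python purely for illustration and is not the Minds backend code. All names (get_with_fallback, cache_get, origin_get, cache_errors) are hypothetical, and it assumes the cache client is configured with its own timeout and raises an exception when that timeout is hit or the connection fails.

```python
from typing import Callable, Optional, Tuple, Type

# Assumption: replace these with the exception types your cache client
# actually raises on timeout or connection failure.
DEFAULT_CACHE_ERRORS: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError)


def get_with_fallback(
    key: str,
    cache_get: Callable[[str], Optional[str]],
    origin_get: Callable[[str], str],
    cache_errors: Tuple[Type[BaseException], ...] = DEFAULT_CACHE_ERRORS,
) -> str:
    """Return the cached value, or ask the origin if the cache misses or fails."""
    try:
        value = cache_get(key)
    except cache_errors:
        # Cache is down or too slow: treat it as a miss instead of
        # failing the whole request.
        value = None
    if value is not None:
        return value
    # Slower path: the request still succeeds, just with higher latency.
    return origin_get(key)
```

With this shape, a Redis outage degrades into slower responses rather than errors. The origin would then take the full request load, so it may be worth pairing this with rate limiting or a circuit breaker.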