Date
January 11, 2023
Incident attendees:
Summary of Issue:
This service disruption was caused by a large volume of requests to the Minds Nostr relay. Due to limitations in the existing architecture, heavy relay traffic places significant strain on the engine. Each REQ message sent to the relay currently triggers an HTTP request to the Minds backend to retrieve post data, and in this case those requests overwhelmed the engine pods. To ensure we can provide the best experience for using Minds on Nostr, we are briefly taking the relay offline while we work to address these inefficiencies.
How did this affect end users?
- The site was completely inaccessible because all engine pods were in a failed state.
How was the incident discovered?
Alarms were triggered on API latency and Redis master status.
What steps were taken to resolve the incident?
- Attempted to scale up engine pods to accommodate additional requests from the Minds Nostr relay.
- Scaled down Minds Nostr relay to 0 replicas, making it inaccessible.
Issue Timeline (UTC)
- [02:10] - Alarms triggered for high production latency.
- [02:12] - First response, troubleshooting begins.
- [02:20] - Rolled out additional compute to attempt to accommodate additional requests.
- [03:20] - Nostr relay scaled down to 0 replicas, taking the service offline.
Root Cause (5 Whys)
- Minds Nostr relay receives a high number of requests.
- Due to limitations with the current architecture, this places a tremendous strain on the engine (Minds backend).
- Eventually the engine is unable to accommodate the load, and pods begin to fail liveness probes.
- As some of our pods fail, the remaining healthy pods take on even more load and become unable to serve requests.
- Site becomes inaccessible.
Follow up actions
- Refactor data architecture for Minds Nostr relay to address existing inefficiencies.
- Implement circuit-breakers to prevent downstream dependencies from being able to overwhelm the Minds engine.
- Limit message rate for clients on the relay to reduce strain from individual subscriptions.
- Improve access logging for easier searchability.
- Migrate Kubernetes Ingress for Minds Nostr relay to Traefik IngressRoute for Traefik metrics.
- Implement Prometheus metrics for the relay to better track active subscriptions and message rate.
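
As a rough illustration of two of the follow-up actions above (per-client message rate limiting and a circuit breaker in front of the engine), here is a minimal Python sketch. It is not the relay's actual code: the class names `TokenBucket` and `CircuitBreaker`, their parameters, and the thresholds are all hypothetical, and a production version would also need per-client bookkeeping and thread safety.

```python
import time


class TokenBucket:
    """Hypothetical per-client limiter: at most `capacity` messages in a
    burst, refilled at `rate` messages per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        # Refill according to elapsed time, then try to spend one token.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CircuitOpenError(Exception):
    """Raised when calls to the backend are temporarily suspended."""


class CircuitBreaker:
    """Hypothetical breaker: after `max_failures` consecutive failures,
    reject calls for `reset_after` seconds, then allow a trial call."""

    def __init__(self, max_failures, reset_after, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("backend calls suspended")
            # Cool-down elapsed: let one trial call through (half-open).
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        return result
```

In this sketch the relay would check `allow()` before processing each client message, and wrap each backend HTTP request in `call()`, so that a struggling engine sees fewer requests instead of more. The injectable `clock` argument exists only to make the behavior easy to test.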