Minds API Inaccessible

Incident Report for Minds

Postmortem

Date

January 11, 2023

Incident attendees:

Fausto
Zack
Mark

Summary of Issue:

This disruption to service was caused by a large number of requests to the Minds Nostr relay. Because of limitations in the existing architecture, the relay places significant load on the engine (the Minds backend) in situations like this: each REQ message sent to the relay currently triggers an HTTP request to the Minds backend to retrieve post data, and in this case the volume overwhelmed the engine pods. To ensure we can provide the best experience for using Minds on Nostr, we'll be taking the relay offline briefly while we work to address these inefficiencies.

How did this affect end users?

The site was completely inaccessible because all engine pods were in a failed state.

How was the incident discovered?

Alarms were triggered on API latency and Redis master status.

What steps were taken to resolve the incident?

Attempted to scale up engine pods to accommodate the additional requests from the Minds Nostr relay.
Scaled down Minds Nostr relay to 0 replicas, making it inaccessible.

Issue Timeline (UTC)

[02:10] - Alarms triggered for high production latency.
[02:12] - First response, troubleshooting begins.
[02:20] - Rolled out additional compute to attempt to accommodate the additional requests.
[03:20] - Nostr relay scaled down to 0 replicas to take it out of service.

Root Cause (5 Whys)

Minds Nostr relay receives a high number of requests.
Due to limitations with the current architecture, this places a tremendous strain on the engine (Minds backend).
Eventually the engine is unable to accommodate the load, and pods begin to fail liveness probes.
As some pods fail, the remaining healthy pods take on even more load and become unable to serve requests.
Site becomes inaccessible.
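
The cascade described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: load is spread evenly across healthy pods, each pod has a fixed capacity, one overloaded pod fails per round, and restarts are ignored. The function name and all numbers are hypothetical and do not reflect measured production values.

```python
def simulate_cascade(total_rps: float, pods: int, capacity_rps: float) -> list[int]:
    """Return the healthy pod count after each round of liveness-probe failures.

    Toy model: traffic is split evenly across healthy pods, and a pod whose
    share exceeds its capacity fails its probe and drops out of rotation.
    """
    healthy = pods
    history = [healthy]
    while healthy > 0:
        per_pod = total_rps / healthy
        if per_pod <= capacity_rps:
            break  # the load fits, so the system is stable
        healthy -= 1  # an overloaded pod fails and its share moves to the rest
        history.append(healthy)
    return history


if __name__ == "__main__":
    # Hypothetical numbers: 10 pods that can each handle 100 requests/second.
    print(simulate_cascade(900, 10, 100))   # load fits, no pod fails
    print(simulate_cascade(1200, 10, 100))  # every failure raises the per-pod load
```

In this model, once the per-pod load exceeds capacity, each failure pushes more traffic onto the survivors, so the pod count falls all the way to zero rather than settling at a lower level.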

Follow up actions

Refactor data architecture for Minds Nostr relay to address existing inefficiencies.
Implement circuit breakers to prevent downstream dependencies from overwhelming the Minds engine.
Limit message rate for clients on the relay to reduce strain from individual subscriptions.
Improve access logging for easier searchability.
Migrate Kubernetes Ingress for Minds Nostr relay to Traefik IngressRoute for Traefik metrics.
Implement Prometheus metrics for the relay to better track active subscriptions and message rate.
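
As an illustration of the circuit-breaker follow-up, here is a minimal sketch of the general pattern. It is not the implementation planned for the relay. The class name, thresholds, and the wrapped function are all hypothetical. The idea is that after a number of consecutive backend failures, the caller stops sending requests for a cooldown period instead of adding load to a struggling engine.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the backend."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Still cooling down: fail fast and spare the backend.
                raise CircuitOpenError("backend calls are suspended")
            # Cooldown elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A relay could wrap each backend fetch in `breaker.call(...)` and answer clients with an error or an empty result while the circuit is open. The threshold and timeout would need tuning against real traffic.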

Posted Jan 13, 2023 - 17:16 UTC

Resolved

A bug in the way we interface with Nostr brought our core infrastructure down. Our relay will be unavailable until we fix this issue.

We'll publish a full post-mortem tomorrow with the details of the outage.

Posted Jan 11, 2023 - 03:54 UTC

Investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.

Posted Jan 11, 2023 - 02:10 UTC

This incident affected: Core services (API, Web, Search / Feeds, Notifications), Video (Video Service), and Legacy chat (chat.minds.com).