Postmortem of outage on 20th December

On 20 December, Cachix experienced a six-hour downtime, the second significant outage since the service started operating on 1 June 2018.

Here are the details of what exactly happened and what has been done to prevent similar events from happening.

Timeline (UTC)

Root cause analysis

All ~24k HTTP requests that reached the backend during the outage failed with the following exception:

Dec 20 09:01:14 cachix-production cachix-server[654]: CallStack (from HasCallStack):
Dec 20 09:01:14 cachix-production cachix-server[654]: error, called at ./Control/AutoUpdate.hs:139:48 in auto-update-0.1.6-F3d4kPU62BK2oPp4eHKJaA:Control.AutoUpdate
Dec 20 09:01:14 cachix-production cachix-server[654]: [2020-12-20 09:01:14][Cachix][Error][cachix-production-cachix][PID 654][ThreadId 643235][cachix-server-0.1.0.0-LHzKyzKAQiLEVcLRmpg6VW:Cachix.Server src/Cachix/Server.hs:108:13] Unhandled exception: Control.AutoUpdate.mkAutoUpdate: worker thread exited with exception: thread blocked indefinitely in an MVar operation

This looks like the infamous deadlock caused by a bug in threading code: a thread was waiting on an MVar that no other thread could ever fill.
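To make the error message concrete, here is a minimal sketch of how the GHC runtime produces this exception. It is not the code path that failed in production (that is still unknown); `waitForever` is a name made up for this illustration, and it uses only `base`.

```haskell
import Control.Concurrent.MVar (MVar, newEmptyMVar, takeMVar)
import Control.Exception (BlockedIndefinitelyOnMVar, try)

-- | Hypothetical helper: block on an MVar that no other thread references.
-- When the garbage collector finds that nothing can ever fill the MVar,
-- the runtime throws BlockedIndefinitelyOnMVar into the blocked thread.
waitForever :: IO (Either BlockedIndefinitelyOnMVar ())
waitForever = do
  mv <- newEmptyMVar :: IO (MVar ())
  try (takeMVar mv)

main :: IO ()
main = do
  result <- waitForever
  case result of
    Left e   -> putStrLn ("caught: " ++ show e)
    Right () -> putStrLn "unexpectedly received a value"
```

The exception only says that a thread was stuck, not which code was supposed to fill the MVar, which is why the log line alone is not enough to locate the bug.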

I recently upgraded to GHC 8.10.3, switched to the non-moving GC, and bumped to Stackage nightly. I haven’t seen this error before, so the bug could be anywhere in the Haskell software stack.

Unfortunately, it’s not currently possible to debug this without a reproducible test case.

The only hope here would be to have exception provenance in GHC, so that rethrown exceptions preserve the stack trace of the original exception.

Failure to wake me up

A Pagerduty phone call never reached my phone, and the logs give no details as to why.

Neither did my phone receive an SMS about a missed call. Signal was weak, but I have confirmed that my phone is able to accept calls on the nightstand.

I used Pagerduty to place a test call to my phone and nothing happened. This is extremely disappointing, as I am paying Pagerduty specifically for this feature.

Pagerduty support said that their telco provider found discrepancies and that the call was supposed to ring for 15 s.

Improvements

Pagerduty notifications

I asked around to find a way to have my phone on silent while also letting the Pagerduty application use sound, rather than relying on phone calls.

Unfortunately, this is not possible on Android when the phone is on silent: there is no way for an application to override silent mode.

However, on my Android device, Do Not Disturb allows overrides, and a similar effect can be achieved when it is configured properly: under the Exceptions section I needed to set "Don't allow any messages", and under the Behavior section restrict notifications to "No sound from notifications". Settings may vary depending on the Android flavor and version.

Systemd watchdog health checks

The backend is already using the warp-systemd package, which allows specifying the interval at which the systemd watchdog is notified.

I have extended it to support running a function as a health check. The check makes an HTTP request, and the backend is restarted if errors like these happen again. And they do happen.
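The idea can be sketched roughly as follows. This is not the warp-systemd API: `isHealthy`, `watchdogTick` and `watchdogLoop` are hypothetical names, the health check is passed in as a plain `IO Bool` (in production it would make the HTTP request), and `notify` stands in for whatever sends the watchdog ping to systemd. The sketch assumes the unit is configured so that a missed watchdog ping leads to a restart.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, try)
import Control.Monad (forever, when)
import System.Timeout (timeout)

-- | Run a health check with a deadline in microseconds.
-- A False result, an exception, or a timeout all count as unhealthy.
-- Catching SomeException is deliberate here: any failure means "unhealthy".
isHealthy :: Int -> IO Bool -> IO Bool
isHealthy timeoutMicros check = do
  outcome <- try (timeout timeoutMicros check)
  pure $ case (outcome :: Either SomeException (Maybe Bool)) of
    Right (Just True) -> True
    _                 -> False

-- | One watchdog tick: ping the watchdog only when the check passes.
-- Returns whether the backend was considered healthy.
watchdogTick :: Int -> IO Bool -> IO () -> IO Bool
watchdogTick timeoutMicros check notify = do
  ok <- isHealthy timeoutMicros check
  when ok notify
  pure ok

-- | Repeat the tick forever, sleeping between ticks. If the check keeps
-- failing, no ping is sent and the service manager's watchdog expires.
watchdogLoop :: Int -> Int -> IO Bool -> IO () -> IO ()
watchdogLoop intervalMicros timeoutMicros check notify = forever $ do
  _ <- watchdogTick timeoutMicros check notify
  threadDelay intervalMicros

main :: IO ()
main = do
  -- Stub checks and a stub notifier, for illustration only.
  ok  <- watchdogTick 1000000 (pure True) (putStrLn "watchdog pinged (stub)")
  bad <- watchdogTick 1000000 (ioError (userError "backend stuck")) (putStrLn "not reached")
  print (ok, bad)
```

The point of the design is that the watchdog ping is tied to an end-to-end request succeeding, so a process that is alive but failing every request stops looking healthy to systemd.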

Domen