Crash-only Design 💥

I'm a self-taught software engineer, so I learned what I know about distributed systems on the job. I believe I first encountered "Crash-only Design" in the course of my work at Laszlo Systems. I don't remember exactly how it came up, but I think it may have been part of our effort to make the Laszlo Presentation Server (a J2EE app) OSGi compliant. OSGi, like a lot of frameworks, includes a spec for "graceful" shutdown. I was surprised when one of the senior engineers told me not to implement this or to extend it into our own framework, referencing this now-classic paper.

The precept of crash-only is relatively simple: don't ever write "normal" termination for your service. When your time is up, you just exit as quickly as possible, whether you are in a "good" state or not. If you're running from the command line, this means that your app should always be fine if you just hit CTRL-C. It also suggests that crashing is kind of a good thing, or at least it shouldn't be a super bad thing. If you don't know exactly what your program should do when it encounters an unexpected error, you should propagate the exception and allow it to crash.
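To make the CTRL-C point concrete, here is a minimal sketch of one way to keep on-disk state safe no matter when the process dies. It is my own illustration, not something from the paper: the names `save_state` and `load_state` are hypothetical, and it assumes the state is a small JSON document that lives on a filesystem where renaming a file is atomic.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write `state` so the file at `path` is always the old or the new version."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # push the bytes to disk before the rename
    # If we are killed before this line, `path` still holds the previous state.
    os.replace(tmp_path, path)


def load_state(path, default):
    """Return the saved state, or `default` if nothing has been saved yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

Note that there is no cleanup handler here. A crash mid-write can leave a stray temp file behind, and the crash-only answer is to sweep those at startup rather than to trap the signal. Stricter durability would probably also want an fsync on the directory, which I've left out.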

This lesson made a big impression on me because it is relatively easy to understand, yet it has many of the properties common to the principles of distributed systems. It's beyond counter-intuitive and almost perverse that crashing more makes systems more reliable. It embraces the inevitability of failure, which is the over-arching theme of working with distributed systems. And it promotes thinking about error pathways and exceptional conditions, which is the cornerstone of robust system design but which is hard to incorporate into the everyday routine of incremental software development.

The best argument for crash-only comes from simplicity. You know your program is going to crash at some point, so you'll need a startup path that allows it to recover from crashes. Now if this is an alternative to the "normal" startup path, this special handling increases the complexity not just of the development, but also of the design, verification and testing of your system. The paper says this as well as I ever could.

Since crashes are unavoidable, software must be at least as well prepared for a crash as it is for a clean shutdown. But then—in the spirit of Occam’s Razor—if software is crash-safe, why support additional, non-crash mechanisms for shutting down?
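Here is a rough sketch of what "the recovery path is the only startup path" can look like. Everything in it is an assumption of mine for illustration: the `Counter` class, the one-JSON-object-per-line journal format, and the choice to treat a half-written final line as a torn write from a crash.

```python
import json
import os


class Counter:
    """A toy service with no shutdown method; every start replays the journal."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.value = 0
        self._recover()  # the one and only startup path
        self._journal = open(journal_path, "ab")

    def _recover(self):
        good_bytes = 0
        try:
            f = open(self.journal_path, "rb")
        except FileNotFoundError:
            return  # first run is just recovery from an empty journal
        with f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # torn final write: the process died mid-append
                try:
                    entry = json.loads(raw)
                except ValueError:
                    break  # unreadable entry: stop replaying here
                self.value += entry["delta"]
                good_bytes += len(raw)
        # Drop anything after the last complete entry so new appends start clean.
        os.truncate(self.journal_path, good_bytes)

    def add(self, delta):
        # Journal first, then update memory; a crash in between is replayed later.
        self._journal.write(json.dumps({"delta": delta}).encode("utf-8") + b"\n")
        self._journal.flush()
        os.fsync(self._journal.fileno())
        self.value += delta
```

Because there is no clean-shutdown branch, the replay code runs on every single start, which means it gets exercised constantly instead of only on the worst day of the year.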

As with most systems that are intended for dealing with exceptional situations, there is immense danger that any recovery system will actually compound failure. Furthermore, since it's a path that's exercised relatively rarely, it is hard to predict or debug what will happen when the recovery pathway is exercised in production, at a time when a bunch of stuff is (by definition!) already going wrong.

Maybe a more intuitive argument comes from the software developer's inevitable double-life as hack IT support for their friends and family. Turn-it-off-and-turn-it-on-again remains a surprisingly effective way to resolve computer problems. Similarly, every production system I've ever worked on has some kind of provision for "churn," capturing the idea that system performance and sometimes even correctness slowly degrades with uptime. By preferring to crash, we are accelerating this churn and encouraging it to be fast and painless. If the system always comes up on the recovery pathway, then the developers will have the right incentives to keep that pathway fast. The more we encourage churn in our systems, the harder we make it to build up the kind of unrecoverable warm state that is so dangerous for a distributed system.

But maybe the best argument, to my mind, is that crash-tolerance is an extreme form of concurrency. Now, let me remind you that concurrency is not parallelism. Concurrency is more like "interruptibility," and crashing is the ultimate form of interruption. There's no faster way to fail than just not being available at all, so in some senses we can think of crashing as a way of promoting responsiveness. The original paper places high importance on timeouts, but I found this a little confusing. Timeouts are super important but they are also really hard to get right, since they tend to have a bunch of hidden dependencies in the form of matching configuration across system boundaries. In general, I think it's better to find out quickly that the system that you are trying to talk to just isn't there, rather than wait for a timeout. This is why, as this line of research demonstrated, crashing and restarting often ends up being faster than trying to recover. Propagating unavailability back to a caller, even if it goes all the way back to an end-user, is generally preferable to tolerating slow or (worse) incorrect operation.

Now, crash-only isn't a panacea. One aspect of this line of thinking that I don't think has really withstood the test of time is the idea of microcrash/microreboot. This is the basis of the Erlang/OTP error-handling strategy, and I don't think that's really practical. I have a lot of love and respect for the ideas in Erlang, and I know that real-world results show that it can be used to build extremely reliable systems. But my limited personal experience with Erlang includes a time when we identified a module built for an early version of Facebook chat that was "crashing" hundreds of thousands of times a day. Because crashes were so lightweight, we were missing important signals about errors in the application. When crashes get small enough to effectively be errors, some of the power of this paradigm is lost.

But this is only a quibble. Crash-only is the right philosophical stance for a distributed systems programmer. It states unequivocally that we don't tolerate errors in production. When we receive a message we don't understand, or we encounter a state that we thought was unreachable, we should crash. Crashes are good for the same reason that build errors are preferable to build warnings. We want to be conservative in how we define errors and aggressive in how we handle them. For the most part, we don't want application developers to be careful or cunning in how they address sticky situations; we want to fail loudly and spectacularly in order to bring error conditions to the fore.
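In code, this stance mostly shows up as what you leave out. The sketch below is a made-up example (the message shapes, `apply_message`, and `run` are all hypothetical, and I'm assuming some upstream validation is supposed to make overdrafts impossible): it handles the messages it understands and raises on everything else, and the loop that drives it has no catch-all handler.

```python
def apply_message(balances, message):
    """Apply one message to `balances`, raising on anything we don't understand."""
    kind = message.get("type")
    if kind == "deposit":
        account = message["account"]
        balances[account] = balances.get(account, 0) + message["amount"]
    elif kind == "withdraw":
        account = message["account"]
        new_balance = balances.get(account, 0) - message["amount"]
        if new_balance < 0:
            # Assumed to be unreachable; if we get here, our model is wrong.
            raise RuntimeError(f"invariant violated: negative balance for {account!r}")
        balances[account] = new_balance
    else:
        # No logging-and-skipping, no best guess: an unknown message is fatal.
        raise RuntimeError(f"unknown message type: {kind!r}")


def run(messages):
    """Process messages in order; deliberately no try/except around the loop."""
    balances = {}
    for message in messages:
        apply_message(balances, message)
    return balances
```

The tempting alternative is to wrap the loop body in a broad `except Exception` that logs and continues. That is exactly the build-warning version of error handling: the process stays up, and the bug stays hidden.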