🆙 Higher and Higher

For a long time, Facebook Chat had this feature where you could access it using an XMPP client. This was when the browser-based chat was so unreliable that I guess XMPP seemed like a reasonable alternative. But this meant that a couple of the engineers on my team were saddled with the odious task of translating the under-specified internal Facebook Chat API to the icky specifics of XMPP. During a rant about this doomed system, my friend alluded to "the end-to-end argument," which I had never heard of before. Reading up, I was blown away by the simple and obvious truth of it.

“Functions placed at low levels of a system may be redundant or of little value.”

One of the few accepted axioms of distributed systems

The end-to-end argument is like an Occam’s Razor of systems design. If failures can still reach the application no matter what, why bother to suppress them in low-level modules?

The most solid footing for the argument is the extraordinarily successful design of modern networking. Sure, it may be tempting to add delivery acknowledgement or duplicate suppression to the network layer, but even with those low-level guarantees, the same problems could still arise in the application. Since they must be handled there, the only reason to handle them in lower-level modules is as an optimization.
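To make that concrete, here’s a toy sketch (invented names, not any real stack’s API) of reliability implemented at the endpoints: the transport underneath is free to drop or duplicate, and the application still converges because it retries against an idempotent, deduplicating receiver.

```python
import uuid

class Receiver:
    """Application-level endpoint: deduplicates by message ID, acks every copy."""
    def __init__(self):
        self.seen = set()
        self.log = []

    def deliver(self, msg_id, payload):
        # Duplicate suppression lives here, where "duplicate" is well defined.
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.log.append(payload)
        return msg_id  # the acknowledgement

def send(receiver, payload, channel):
    """Retry until acknowledged; safe because delivery is idempotent."""
    msg_id = str(uuid.uuid4())
    while True:
        if channel(receiver, msg_id, payload) == msg_id:
            return

# A channel that drops the first attempt and duplicates the second.
attempts = []
def flaky(receiver, msg_id, payload):
    attempts.append(msg_id)
    if len(attempts) == 1:
        return None                               # dropped on the floor
    receiver.deliver(msg_id, payload)
    return receiver.deliver(msg_id, payload)      # delivered twice

r = Receiver()
send(r, "hello", flaky)
print(r.log)  # the payload lands exactly once: ['hello']
```

The point isn’t that real systems look like this; it’s that the retry and the dedup *have* to live at the endpoints for the result to be correct, so anything the channel does below is only an optimization.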

The paper points out that networking isn’t the only thing that can fail. Disks corrupt; power breaks; memory corrupts absolutely. Your sad, unsuspecting code can get zapped by a gamma ray or burned in a fire. In a scaled, distributed system, anything that can happen will happen. The culmination of the end-to-end argument is that failures are so common in the software that the only possible definition of correctness is in terms of the application as a whole.

Maybe this conclusion seems obvious now, but it was not immediately so. In fact, most of our intuitions about simultaneity and synchronization do not stand up to scrutiny. Although we experience time as a continuous dimension, it seems more likely to me that time is a construct of the mind. So this is an area where our intuition tends to fail us. It turns out that it's more important to optimize the common case by being non-blocking than to optimize the worst case by trying to synchronize. This was so surprising that the design of datagrams was not taken seriously until it was tested in simulation! Almost no one expected that the way to make systems more reliable would be not to suppress errors, but to accommodate them.

And the batshit insanity of XMPP is testament to this conclusion. Although the stupidities of this protocol are uncountable, the worst thing about it is that it conflates a network message with a user message. The end-to-end argument encourages us to think about application-level data transmission in terms of state synchronization, rather than message passing, because message passing can fail. But XMPP spends a ton of effort on the sending and receiving of network messages and very little on how clients synchronize state. All of the other mistakes that compound the problems with XMPP, even the baffling choice of XML as a streaming transport encoding, pale in comparison to that fundamental flaw.
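Here’s a minimal sketch of the state-synchronization framing, with invented names throughout: the server keeps versioned, authoritative state, and a client repairs any number of lost notifications with a single sync.

```python
class ConversationStore:
    """Server side: authoritative, versioned state. (Toy sketch.)"""
    def __init__(self):
        self.version = 0
        self.messages = []   # list of (version, text)

    def append(self, text):
        self.version += 1
        self.messages.append((self.version, text))

    def changes_since(self, client_version):
        """State sync: hand back the delta, not individual deliveries."""
        delta = [(v, t) for v, t in self.messages if v > client_version]
        return self.version, delta

class Client:
    def __init__(self):
        self.version = 0
        self.messages = []

    def sync(self, store):
        # Correctness is defined end to end: after sync, client state
        # matches the server no matter which notifications were lost.
        self.version, delta = store.changes_since(self.version)
        self.messages.extend(t for _, t in delta)

store = ConversationStore()
client = Client()
store.append("hi")
store.append("are you there?")   # imagine the push for this one was lost
client.sync(store)               # one sync repairs any number of misses
print(client.messages)           # ['hi', 'are you there?']
```

In this framing, a dropped network message costs nothing but a little latency, because the unit of correctness is the converged state, not the delivery.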

But the original authors of the end-to-end argument really weren’t talking about user-level protocols like XMPP. To them, TCP/IP was “high in the stack.” I wonder what they would make of a typical modern web app, which comprises layers upon layers of modules, all lying along a multidimensional frontier that is itself shifting as we invent new paradigms and new techniques. The question about the end-to-end argument is how far you take it. Where exactly do you draw the boundary between high- and low-level? What even are the strata of a modern software application?

It’s remarkable that in the forty years since this paper was published, the answers to these questions have only become less clear. It's also what makes working on software so much fun. It means that every team still gets to grapple with fundamental problems like error handling and fault recovery. There’s this lovely example in the paper pointing out the difference in fault tolerance between a voice call and a voice message, even though these applications seem so similar. If you're on a call, latency trumps fidelity: you can always ask the other caller to repeat themselves. For a message, latency doesn't matter and fidelity is critical. It's subtle, and it's a great example of why the best engineers are not the ones who can hide an extra bit flag in an unsigned short; they're the ones who truly empathize with the users. Although at this point, I hope we can all agree that it’s pretty much always an error to leave a voicemail, unless your intentions are overtly hostile.

I think one of the most interesting contemporary applications of the end-to-end argument is to certify the superiority of pull over push. I used to think that push was “the right way” to deliver updates to a long-lived client for applications that have a soft real-time element. But my experience working on these apps at scale convinced me otherwise. The argument for polling is almost like the argument for crash-only design: the app is going to need to refresh its state when it initializes or when it misses a message for any reason, so why not optimize for that?

Startup speed is the most important feature for most mature apps, and startup is almost always about pulling the latest state. Even when update latency is at a premium, it is rarely worth the tremendous effort it takes to reverse the entire dataflow of your application. Directed polling and incremental client updates are usually better designs than pure push. It seems to me like the industry is generally settling on this conclusion, notching another win for the end-to-end argument.
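A hedged sketch of what directed polling can look like (all names invented): the pull path is the only data path, and a push is merely a hint to poll now, so a dropped push delays an update but can never lose one.

```python
class Feed:
    """Server: an ordered list of items; the pull path is the only data path."""
    def __init__(self):
        self.items = []

    def pull(self, cursor):
        # Return the new cursor plus everything the client hasn't seen.
        return len(self.items), self.items[cursor:]

class App:
    """Client: cold start and push handling share one code path."""
    def __init__(self, feed):
        self.feed = feed
        self.cursor = 0
        self.items = []

    def refresh(self):
        self.cursor, new = self.feed.pull(self.cursor)
        self.items.extend(new)

    def on_push_hint(self, _payload=None):
        # The push carries no data; it just says "poll now." Dropping it
        # delays the next refresh but can't leave the client wrong.
        self.refresh()

feed = Feed()
app = App(feed)
feed.items.append("a")
app.refresh()            # cold start: pull everything
feed.items.append("b")
feed.items.append("c")
app.on_push_hint()       # whether or not the push arrives, same pull path
print(app.items)         # ['a', 'b', 'c']
```

This is the crash-only argument in miniature: since the refresh path has to exist and has to be fast anyway, make it the only path.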

As strong as this evidence is, I think you could also make the case that the end-to-end argument has been less generally applicable than maybe it first seemed. Perhaps the end-to-end argument is really only about networking. When it comes to persistence, the most trusted solutions in the industry are generally SQL-based databases like Postgres or SQLite. This seems to me to be a violation of the end-to-end principle. Basic SQL concepts like Transaction Isolation Levels move complex questions about failure and success out of the domain of the application.

If the end-to-end principle were truly universal, I would think that a much simpler persistence strategy, specifically one that is based on event sourcing, would be more competitive with solutions that rely on a centralized database. But, sadly, none of these solutions have made it to the mainstream. The basic abstraction that SQL offers, one of total order and unambiguous simultaneity, is precisely one that a distributed application can't maintain. Sometimes I think that a lot of the problems in modern distributed apps come from trying to maintain this pretense. We take the convenient shortcut of centralizing state in a database which fundamentally cannot scale, and then spend a ton of effort on scaling by breaking that state back up into caches.
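For what it’s worth, the event-sourcing idea is tiny to sketch (a toy example, invented names): an append-only log is the source of truth, and every view of the state is just a replay of that log.

```python
def apply(balance, event):
    """Fold one event into the current state."""
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw":
        return balance - amount
    return balance

log = []                       # append-only; nothing is updated in place
log.append(("deposit", 100))
log.append(("withdraw", 30))
log.append(("deposit", 5))

def replay(log):
    # Recovery, caching, and building new views are all the same
    # operation: replay the log from the start (or from a snapshot).
    balance = 0
    for event in log:
        balance = apply(balance, event)
    return balance

print(replay(log))             # 75
```

There is no pretense of a single, totally ordered "now" here, only an ordered history, which is exactly the abstraction a distributed application can actually maintain.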

So many of our other abstractions rely on this fundamental assumption that maybe we will be forced down a path that is tied to central storage forever. But sometimes I wonder what would have happened if we had invented networking before disks. Would we make the same assumptions about our data? Would we be so comfortable giving it up to a central entity? Would we accept that it is stored unencrypted by default? Would we expect to be able to easily search it in aggregate? Right now the trend is towards centralizing storage in increasingly opaque services that are managed by corporate behemoths. I'm not happy with this direction, but I'm still hopeful that there's something we haven't yet discovered, something akin to the fundamental insight that packets brought to networking, that will make good on the idealistic promise of the early internet.