A taxonomy of complexity
We can think of software applications — whether large distributed systems or small, self-contained apps — as containing a certain amount of essential complexity: complexity that is inherent to the problem at hand, and which cannot be reduced any further. Complexity that is not essential is called incidental. One of our jobs as engineers is to minimize the amount of incidental complexity in the systems we build.
Incidental complexity is inevitable: we might use a complicated framework because it saves us time; a standard protocol because it’s universal; or we may be forced to reckon with the complexities of virtual machines because we want to make use of public cloud infrastructure. It’s simply not a good use of our time to re-invent a full computing stack to suit the needs of a single application.
What we gain by taking on this incidental complexity, then, is leverage, time saved, and reusability. In other words, we become more productive as higher levels of abstraction help us to focus on the problem at hand — what it is that makes a particular application unique — instead of building bespoke computing infrastructure to suit the application.
But we also lose. As complexity increases — much of it incidental — systems become harder to operate and reason about. Often it becomes more difficult to extend an application over time because the computing infrastructure it relies on isn’t quite general enough, or the abstractions provided don’t quite match its needs. When this happens, we often build workarounds, or intentionally violate abstraction boundaries. This in turn makes the application even harder to reason about, as it may violate assumptions made in other layers of the computing stack.
Thus, incidental complexity is something we should take on reluctantly, with great caution, employing careful cost-benefit analysis. All too often we see the benefits of generality without considering its costs.
Incidental complexity through distribution
A large part of software design is about making choices pertaining to how we structure the software we’re building. Often this involves defining components that hide complexity and provide layering. Such “component architecture” is a well-worn method that helps us build complex systems out of simpler parts. Well-designed components are often compositional, providing a sort of software LEGO brick set that can be combined in countless ways, and whose potential is limited only by the user’s imagination.
A major tension in such software component design is how complexity is distributed. That is, how much complexity is hidden by which components, and how your “complexity budget” is distributed. Are there a number of components that each hide about the same amount of complexity? Or is the total complexity of the system hidden behind a few choice components? Note that I’m not talking about leaky abstractions—that is, the degree to which abstractions fail to hide the details of their implementations—but rather about the _amount_ of complexity hidden by each component. For example, a trivial component may simply check if a number is odd, while another may provide a full SQL database.
A common example of this trade-off is the choice of database when building at-scale distributed systems. We might choose a strongly consistent, transactional database. Such a database abstracts a lot of complexity, and provides a strong foundation on which to build a data-oriented application. An alternative is an eventually consistent database. While this often scales better, it provides a much weaker abstraction: the application has to assume responsibility for data reconciliation, inconsistencies, and sometimes even time travel! If we look at these two systems naively, we may find that their total complexity is about the same (they provide roughly the same amount of functionality, for roughly the same amount of engineering complexity), only it’s distributed differently. In the former, much of the complexity resides in the database itself, while the application can rely on the strong guarantees extended by the database and remain comparatively simple. In the latter, the database might be quite a bit simpler, but the application assumes much of the complexity of data management.
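To make the eventually consistent side concrete, here is a minimal sketch of the reconciliation work that falls on the application: a last-write-wins merge of two replicas that accepted writes independently. The function name and data shapes are hypothetical, chosen purely for illustration — real systems use version vectors or CRDTs, but the point stands: the application, not the database, must decide what “newer” means and what happens on ties.

```python
def merge_last_write_wins(replica_a, replica_b):
    """Key-by-key merge of two divergent replica states; the entry with
    the newer timestamp wins. It is application code, not the database,
    that carries this policy."""
    merged = {}
    for key in replica_a.keys() | replica_b.keys():
        a, b = replica_a.get(key), replica_b.get(key)
        if a is None or (b is not None and b[1] >= a[1]):
            merged[key] = b
        else:
            merged[key] = a
    return merged

# Two replicas that accepted writes independently, e.g. during a partition.
# Each value is a (data, timestamp) pair.
replica_a = {"cart": (["book"], 100), "name": ("Ada", 90)}
replica_b = {"cart": (["book", "pen"], 105)}

merged = merge_last_write_wins(replica_a, replica_b)
assert merged["cart"] == (["book", "pen"], 105)  # newer write wins
assert merged["name"] == ("Ada", 90)             # only one replica had it
```

With a strongly consistent database, none of this code exists in the application — the equivalent logic is concentrated inside the database’s transaction machinery.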
Let’s call the complexity distribution of the former (strongly consistent, transactional database) “concentrated”, while the latter is “diffuse”. What I mean by this is: in the former, a lot of complexity is abstracted by a single component; in the latter, the same guarantees can only be provided by making a much larger swath of the application responsible for data consistency—the complexity related to data management is diffuse.
My claim is that generally, we should prefer concentrated complexity to diffuse complexity. While the two systems may seem to contain roughly the same amount of complexity (and indeed, the system with a diffuse complexity distribution often seems simpler on the surface), the diffuse system is much more difficult to manage, extend, and reuse.
What’s more, it is usually the case that systems with concentrated complexity become more adaptable over time. Because, say, the complexity of managing data consistency is concentrated, we can apply changes—optimizations, bug fixes, improvements—within a small, well-defined boundary. Indeed, a good heuristic for systems design is: are most changes isolated, or do they require coordination across multiple layers of the stack?
I also want to note here that this is not a matter of leaky abstractions, but rather of weak abstractions. That is to say, you can define a component in such a way that using it pushes a lot of responsibility to its users. You might call that a “weak” abstraction. A strong abstraction hides more complexity, but this doesn’t mean it is leak-proof.
Design choices through the lens of complexity distribution
Our working example so far — that of consistency guarantees in databases — represents a common trade-off when designing systems for modern cloud architectures. But there are many others; let’s look at a few classic examples through the lens of complexity distribution.
Manual vs. managed memory. Languages that require manual memory management provide a weak abstraction and push the responsibilities of managing object lifetimes to the programmer. Manual memory management is comparatively simple to implement, and in many cases provides good and predictable performance. The trade-off is complexity diffusion. The semantics of object lifetimes pervade the program, adding little pieces of complexity — like paper cuts — almost everywhere. This becomes especially difficult to manage well for concurrent code, and is an evergreen source of bugs. Managed memory languages, on the other hand, hide the complexity of memory management from the programmer. The programmer allocates objects freely; they are automatically reclaimed when no longer needed. Implementing a performant and predictable garbage collector is no easy task, but it helps to redistribute the overall complexity of the program: memory management is concentrated in one place (the language runtime) and provides a strong set of guarantees to its users. (There is an interesting and emerging middle ground here, too: languages like Rust capture object lifetimes in their type systems, pushing some additional—machine-checked—complexity to the user while providing strong guarantees about object lifetimes. Do such middle grounds exist elsewhere?)
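The diffusion can be sketched with a toy model (in Python, purely for illustration — `ManualHeap` and its handles are hypothetical, not a real allocator): under manual management, every call site shares responsibility for freeing each allocation exactly once, and any slip anywhere becomes a use-after-free or double-free.

```python
class ManualHeap:
    """Toy model of manual memory management. Every allocation must be
    explicitly freed by whoever holds it, so lifetime rules are diffused
    across every call site rather than concentrated in a runtime."""

    def __init__(self):
        self._live = {}
        self._next_handle = 0

    def alloc(self, value):
        handle = self._next_handle
        self._next_handle += 1
        self._live[handle] = value
        return handle

    def read(self, handle):
        if handle not in self._live:
            raise RuntimeError("use-after-free")  # the classic diffuse bug
        return self._live[handle]

    def free(self, handle):
        if handle not in self._live:
            raise RuntimeError("double free")
        del self._live[handle]


heap = ManualHeap()
h = heap.alloc("buffer")
assert heap.read(h) == "buffer"
heap.free(h)  # every caller, everywhere, must remember this — exactly once
```

In a managed language, the bookkeeping in `free` is the collector’s job: the same complexity exists, but it is concentrated in one component with a strong guarantee (live objects are never reclaimed) instead of being re-implemented at every call site.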
Caching systems vs. serving systems. A common pattern in cloud systems design is to introduce lookaside caches (usually memcached or Redis) to improve cost and performance. The benefits are tantalizing: a seemingly simple, standalone component can provide temporary storage for frequently accessed data—it’s a handy performance band-aid. However, systems employing such tactics often end up with a diffuse complexity distribution: how do we handle invalidation, read-repair, concurrency? These questions are pushed to the (often multiple) clients of these systems, and, more often than not, it becomes increasingly difficult to reason about data consistency, integrity, and semantics—these have become the shared responsibility of a constellation of systems. An alternative approach is to build a serving system. Such a system is write-through and read-through. That is, it is responsible for providing a scalable and performant interface to a set of other systems. Serving systems may very well deploy memcached internally to store data in memory, but this is an internal implementation detail, and no longer exposed to clients. Building serving systems is usually harder than deploying a managed memcached cluster with some lightweight client code, but you are well-served by concentrating the complexity. Incidentally, reimagining your system as a serving system will also require you to think through what the interface actually is; adding a caching band-aid usually does not.
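The contrast can be sketched in a few lines of Python (all names here are illustrative stand-ins, not a real cache or database API). In the lookaside pattern, every client must repeat the check/fill/invalidate dance; in the serving system, the same cache logic lives in exactly one place behind a single interface.

```python
class Database:
    """Stand-in for the backing store."""
    def __init__(self):
        self._rows = {}
    def read(self, key):
        return self._rows.get(key)
    def write(self, key, value):
        self._rows[key] = value


# Lookaside pattern: cache-consistency logic is diffused across every
# client, each of which must get check, fill, and invalidation right.
def lookaside_read(cache, db, key):
    if key not in cache:
        cache[key] = db.read(key)
    return cache[key]

def lookaside_write(cache, db, key, value):
    db.write(key, value)
    cache.pop(key, None)  # omit this in ONE client and it serves stale data


# Serving system: write-through and read-through; the cache is an
# internal detail, and invalidation lives in exactly one place.
class ServingSystem:
    def __init__(self, db):
        self._db = db
        self._cache = {}
    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._db.read(key)
        return self._cache[key]
    def put(self, key, value):
        self._db.write(key, value)
        self._cache[key] = value
```

Real deployments add expiry, concurrency control, and failure handling, but the distribution of responsibility is the same: the lookaside variant spreads it over all clients, the serving system concentrates it behind `get`/`put`.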
RISC vs. CISC. This is actually a counter-example! (Maybe.) We could think of CISC architectures as concentrating complexity: by providing a rich set of high-level instructions, CISC architectures concentrate complexity into hardware. The compiler can generate higher-level code which is then interpreted by the processor itself. RISC architectures, on the other hand, present a simpler, less powerful interface that requires the compiler to fill in more gaps. Though, notably, in most cases the complexity is only redistributed a little—from hardware to the compiler; it is not diffuse. (Or maybe we should think of RISC as a middle ground. The idea of RISC taken to its extreme, in the form of VLIW architectures, has failed rather spectacularly.)
What can we do about it?
I think that complexity distribution is a useful lens to apply to systems design. The main observation I’ve made in this post is that, generally speaking, systems with concentrated complexity distributions tend to have fewer bugs, be more extensible, and scale better with total systems complexity. It’s worth thinking about how you might improve a system by redistributing its complexity.
It’s also important to think about complexity distribution at the meta level: local versus distributed decision making. If each team makes local decisions about how to minimize the complexity of its own subsystem, the overall system often becomes unwieldy and overly complex—it accrues a lot of diffuse complexity. This can usually be prevented by thoughtful systems design: maybe there is a set of locally complex components that can help to reduce the total complexity of the system?