Anyone who has spent any amount of time in a large enterprise knows that there are lots of intractable problems. Academics (and consultants and business schools) often claim they can get to the “root” of a problem and give you a “surgical” solution that will solve the problem and keep everything else the same. Ironically, the magical thinking that a root cause can be found and expertly fixed might itself be the root cause of management fads and other fads.
There are two kinds of magical thinking going on here. The first is that a complex systems problem has a “root” cause. The second (and more subtle one) is that there exists a “superhero” who can find and fix it and make everything perfect again. At this point, you might be asking: if there’s no “root cause”, what exactly are we supposed to “solve” or fix?
The insight I want to share is that in any complex system that has evolved over time, there are always several layers of complexity. Each layer is an attempt to cover or hedge against some risks or uncertainties in another layer, and each such layer introduces its own uncertainties and risks. We see this kind of layering pretty much everywhere. Individual cells band together to form larger organisms, and the individual cells shed some capabilities and take on specialized responsibilities. Groups of people form a tribe or village, and each individual is no longer completely independent and often takes on a specialized profession. This is the “fractal” view of the domain.
In all these (and countless other) examples, there are layers of complexity where each layer does something specialized for the system as a whole while sacrificing some other trait. The trick then is to see how the “problem”, and not just the symptom, manifests in multiple layers. And the solution is to add additional layers that transform a problem you find difficult to solve into something you can potentially reason about and deal with.
Let’s see this in action in the case of slow performance in your database system. “Root cause” thinking will tell you to find the slowest query and tune it by adding an index, rewriting it so that it uses the right index, or using whatever other magical incantations your DBAs give you. This might even solve the most recent “problem”, but it has only treated the symptom.
Let’s look at the same problem fractally. This “problem” is always going to be there, and today’s tuning exercise has only temporarily solved it. We need to look at all the layers and figure out which additional layers we need to add. This might reveal that your application has poorly written querying patterns. You can’t rewrite that much code, but you could wrap the worst parts in a caching layer to reduce how often bad queries hit the database. You might also add some profiling in various layers of your system to collect response times, throughput and other such measurements, so that you can identify slowdowns and deal with them proactively in the future. You might need some cross-training between your dev and ops teams so that they can identify performance issues and make each other’s lives easier the next time one strikes. You might also want to talk to your users to understand how often this is likely to happen and think about alternatives you could offer them in such scenarios. This “solution” doesn’t seem very clean or elegant, probably because it mirrors how real-life systems evolve.
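To make the caching and profiling layers a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `orders_for_customer` function stands in for a badly written query you can’t afford to rewrite, the `time.sleep` call simulates the slow database, and the names `cached`, `timed`, `timings` and `db_stats` are hypothetical. A real system would more likely use a shared cache and a proper metrics library, but the layering idea is the same.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process store of response times, keyed by a label.
timings = defaultdict(list)

# Hypothetical counter of how often the "database" is actually hit.
db_stats = {"calls": 0}


def timed(label):
    """Profiling layer: record how long every call takes under `label`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings[label].append(time.perf_counter() - start)
        return wrapper
    return decorator


def cached(ttl_seconds, clock=time.monotonic):
    """Caching layer: reuse a result for `ttl_seconds` before asking again."""
    def decorator(fn):
        store = {}  # positional args -> (expiry time, value)

        @wraps(fn)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: skip the database
            value = fn(*args)
            store[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator


@timed("orders_for_customer")   # outer layer: times hits and misses alike
@cached(ttl_seconds=30)         # inner layer: shields the database
def orders_for_customer(customer_id):
    """Stand-in for a poorly written query (assumption, not real SQL)."""
    db_stats["calls"] += 1
    time.sleep(0.01)  # simulated slow query
    return [f"order-{customer_id}-{n}" for n in range(3)]


if __name__ == "__main__":
    for _ in range(5):
        orders_for_customer(42)
    # Five requests, one trip to the database, five recorded timings.
    print(db_stats["calls"], len(timings["orders_for_customer"]))
```

Note what this does and does not do. The bad query is still bad, and the cache now introduces its own risk (stale results for up to 30 seconds). That is the point: the new layer trades a problem you can’t easily fix for one you can reason about, and the timings give you something to look at before the next slowdown arrives.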
The key takeaway I would like you to have is that any given problem you hear about is most likely a symptom. Root-cause analysis will likely give you a way to alleviate that symptom, but solving the root cause alone is not sufficient. Look at the layers above and below the symptom and at how they need to be layered further to transform the problem into something more manageable. Also, have the humility to know that you have not really solved the problem, only figured out a temporary way to contain and manage it.