Causes in Real Life – How Organizations Perform Root Cause Analyses (RCAs)

Posted on July 19, 2018 by

Having spent considerable time studying the vexing problems related to causation in philosophy, I was immediately intrigued when I learned that companies and other organizations routinely perform what they call root cause analyses (RCAs). I recently had the opportunity to take the courses and training required to perform RCAs, and have since performed a few such analyses. The purpose of this note is simply to share this information with philosophers – not because I think these concepts will help solve any of the philosophical problems surrounding causation, but rather to provide further insight into how the concept of causation is used in practice. If anything, I hold out more hope of philosophy improving the root cause analysis method, either directly or by at least providing a better understanding of its inherent limits. Lastly, I should mention that this is not meant as an instruction manual for how to do a root cause analysis: I only touch on the highlights (i.e., points I happen to find interesting) and I do not provide enough information to enable a person to actually perform an RCA.

First of all, as might be expected, companies (I’ll use ‘companies’, but this applies to organizations in general) occasionally have incidents or events that they do not wish to repeat. What they do is engage a qualified person or group (either external or internal) to perform an RCA. The idea is to have an independent or objective investigator identify the causes of the incident, and then recommend measures to implement so that the event will not happen again (called ‘corrective actions’).

There are several methods that companies have used to identify root causes. Some simple RCA methods include the ‘5 whys’ method and the fishbone method. In this note I talk about a more sophisticated method that has been used for decades around the world and in every industry. This method is taught in numerous cities with various courses focused on different aspects of root cause analyses. I won’t identify the company that has developed this method, since I am here just interested in the more abstract aspects of this popular (if not dominant) way of performing RCAs today.

RCAs start with identifying the incident or event to analyze. The incident is defined very pragmatically: it is (typically) the worst thing that happened. For instance, although numerous events might have happened the same afternoon as an explosion in an industrial building, such as reduced worker productivity or perhaps a minor car accident in the parking lot, the root cause analysis would most likely focus on the explosion itself. Speaking from experience, in some cases the incident may not be a single event at all, but instead it could simply be an undesirable state in which the organization finds itself. The incident will, at any rate, be the reason why you are performing the root cause analysis in the first place.

Witnesses are then interviewed and all forms of evidence such as documents and emails are reviewed. A timeline of actions, events or generally ‘things that happened’ is constructed. This timeline is constructed first, without making judgments about what did not happen or perhaps what should have happened. Once constructed, conditions are then listed under each event that happened. The conditions can be thought of as properties or facts about the event itself, such as place, time, and whether a procedure was used or available for that action. They can then also include things that did not happen or should have happened, such as the lack of use of a particular procedure.

At this point there is no explicit use of any particular causal concept. Events in the timeline are not necessarily taken to be causally connected (they are arranged merely chronologically), and the conditions listed beneath each event are meant to further describe and provide background information. Of course, both the events and conditions are only included because they are suspected of being possibly relevant to the incident in some way, but their actual relevance will be determined at a later stage. At this stage it is more important that all the possibly relevant facts, as best as can be determined, are listed.

The next step is to identify a specific subset of causes. The operating concept of cause here is the subjunctive conditional: causes are actions or conditions that, if they were corrected, would have prevented the incident, or would have significantly mitigated its consequences. But more specifically, we are looking for actions or conditions that are considered mistakes – something that someone did that they shouldn’t have, or something that they should have done but failed to do. So here we have the subjunctive conditional at work in determining what counts as a cause, and also a normative judgment about what someone ‘ought’ to have done, in order to focus more narrowly on the causes of interest.
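For readers who think more easily in code, the two-part test just described can be sketched as a simple filter. This is only an illustration under my own assumptions: the `Factor` class, its field names, and the sample factors are hypothetical and are not part of the RCA method itself. In a real analysis both judgments are made by the investigator from the evidence, not computed.

```python
from dataclasses import dataclass


@dataclass
class Factor:
    """A hypothetical record for one action or condition from the timeline."""
    description: str
    # Counterfactual judgment: had this been corrected, would the incident
    # have been prevented or its consequences significantly mitigated?
    would_prevent_or_mitigate: bool
    # Normative judgment: was this something someone should not have done,
    # or something they should have done but did not?
    is_mistake: bool


def causes_of_interest(factors):
    """Keep only factors that pass both the counterfactual and normative tests."""
    return [f for f in factors if f.would_prevent_or_mitigate and f.is_mistake]


# Made-up factors for an explosion scenario (illustrative only).
factors = [
    Factor("Liquid transfer procedure followed incorrectly", True, True),
    Factor("Flammable liquid was present on site", True, False),
    Factor("Minor car accident in the parking lot", False, True),
]

selected = causes_of_interest(factors)
print([f.description for f in selected])
```

The second factor passes the counterfactual test but is not a mistake, and the third is a mistake that is irrelevant to the incident, so only the first would be carried forward for further analysis.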

Once these causes of interest (i.e., mistakes) are identified, we are close to getting to the ‘root’ cause. This was perhaps the more interesting part for me given what I learned in philosophy: what could this system possibly take to be a ‘root cause,’ and how could one identify it in practice? The first thing I learned was that, despite the name, there is no requirement or presumption that there is only one ‘root’ cause. There will in fact often be more than one, and that is okay. This is actually to be expected, because there will always be at least one root cause for every mistake identified.

The next surprise was that this method utilizes preloaded buckets or categories (with sub-categories) of causes. You first identify which one of the pre-set broad cause categories applies in the case of each and every mistake (they are done separately). An example of such a broad cause category would be ‘Human Performance Difficulty.’ These broad categories or buckets have been crafted from experience, and just to make sure every cause fits into one category or another, there is a catch-all category called ‘other.’ The interesting part is that the investigators I talked with who had decades of experience in performing RCAs never had to use this category. In other words, empirical data (or experience) was used to generate very successful broad cause categories into which the causal factors almost always fit.

Once each mistake has been analyzed under the broad cause categories, it then gets analyzed through a series of ‘yes or no’ questions to determine which smaller buckets apply, again preloaded from experience. At this level, more than one category can apply. But you are not done yet: each mistake then gets analyzed and refined two more times into even smaller and smaller buckets. The smallest buckets are the “root” causes. The root causes, just like the broader causes above them, are based on experience and the list is meant to be exhaustive – every mistake will have been caused by one or more of these root causes at the end of the analysis.

I’ll provide an example here to make this description more concrete. Suppose an explosion occurred, and that at least one mistake was identified: a person didn’t follow (at least successfully) a procedure for transferring a flammable liquid from one container to another. Suppose we first determine that this is a ‘Human Performance Difficulty’ (rather than ‘Sabotage’ for example), and we then analyze it further. As an example, let’s assume that a procedure was available, and was used, but a mistake was made despite this, and that further the work was being performed in an adverse environment. The ‘yes or no’ questions would funnel our questions into the smaller buckets called ‘procedures’ and ‘human engineering.’

Next, we take the facts we know about the situation, and (for example) determine that for the procedure: it was followed incorrectly (rather than the procedure not being available, for example), and that this was caused by poor equipment identification (rather than the procedure being too complex, for example). Secondly, regarding human engineering, we may identify that the work environment was a factor (rather than that the system was too complex, for example), and that more specifically the lighting needs improvement (rather than it being too noisy, for example). The root causes of this incident, then, are poor equipment identification and poor lighting.
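The funnelling in this example can be pictured as a walk down a small tree of nested buckets. The sketch below is my own simplification, not the actual method: the `TREE` contents are limited to the labels mentioned in the example, the empty branches are placeholders, and the `findings` set stands in for the investigator's answers to the ‘yes or no’ questions.

```python
# Hypothetical fragment of a category tree:
# broad category -> smaller bucket -> sub-bucket -> root causes.
TREE = {
    "Human Performance Difficulty": {
        "procedures": {
            "followed incorrectly": [
                "poor equipment identification",
                "procedure too complex",
            ],
            "not available": [],  # placeholder, not elaborated in the example
        },
        "human engineering": {
            "work environment": ["lighting needs improvement", "too noisy"],
            "system too complex": [],  # placeholder
        },
    },
    "Sabotage": {},
    "other": {},
}


def root_causes(tree, findings):
    """Return root causes whose whole path through the tree was affirmed."""
    roots = []
    for broad, buckets in tree.items():
        if broad not in findings:
            continue
        for bucket, sub_buckets in buckets.items():
            if bucket not in findings:
                continue
            for sub_bucket, leaves in sub_buckets.items():
                if sub_bucket not in findings:
                    continue
                roots.extend(leaf for leaf in leaves if leaf in findings)
    return roots


# The investigator's affirmed answers for the liquid-transfer mistake.
findings = {
    "Human Performance Difficulty",
    "procedures",
    "followed incorrectly",
    "poor equipment identification",
    "human engineering",
    "work environment",
    "lighting needs improvement",
}

result = root_causes(TREE, findings)
print(result)
```

Note that more than one bucket can be affirmed below the broad category, which is why a single mistake can end with two root causes, as it does here.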

Each root cause has several suggested corrective actions – those universal changes that are suggested for implementation in order to prevent the incident from happening again. As before, the corrective actions are pre-loaded for each root cause based on experience. People are free to come up with their own corrective actions depending on the needs of the organization, of course, but it is helpful to see what is generally thought to be best practice as a starting point. In the example above, the corrective actions might be fairly obvious (improved lighting and better equipment identification), but that is not always the case. This concludes my basic description of how RCAs are commonly done in practice.

To summarize, the system in general uses the subjunctive conditional (one of Hume’s definitions) with the normative criterion to identify the initial (broad) causes. The entire subsequent categorization and analysis scheme is generated from experience – the analysis is not left open-ended, but rather there are predetermined buckets into which the causes fall. This summary is simplified and does not show how the system deals with the more difficult cases or common errors people make while doing an analysis. However, it appears as though problematic cases or common errors simply get incorporated into the system: either new sub-categories or root causes can always be added over time, or where this would be redundant, instructions are provided for analyzing those causes.

There does appear to be a strong conceptual connection between root causes, corrective actions, and indeed the general purpose of doing RCAs. RCAs are done in order to avoid repeating the incident, but on the other hand it is only actions or conditions that are considered to be mistakes that need or should be further analyzed and ultimately corrected – we don’t expect any person or system to do that which is not reasonable to do. This judgment is typically captured by looking at industry best practices – that is, the very best standards currently in use within an industry. The idea is that we ‘ought’ to meet these best standards. Thus the list of root causes (there are about one hundred) is already very restricted: it excludes most causes someone could or might dream up, such as ‘not enough guidance from our alien overlord.’ In practice, such a cause is not considered a mistake at all, and there would be no associated corrective action issued for such a cause. The number of root causes that are listed at the end of the day thus appears to be simply the minimum number of distinct ways that someone can make a mistake or fail to meet the best standards within an industry.

As mentioned at the outset, I did not intend this note to shed any light on what (metaphysically) it is to be a cause. Since this is a blog, I’m wondering if philosophers believe there are any insights here, or if they found any of this surprising (as I did). Comments are welcome.

Mike can be found on Twitter at @DrMikeSteiner