Today’s root-cause methodologies often employ the following failure-investigation process. It’s straightforward and practical. And the failure-cause spreadsheet that results from the process will provide a convenient record-keeping document to use in tracking the subsequent investigation.
♦ Initially, a committee is assembled, usually of people from within the company who are expected to have expertise with the equipment or process in which the failure occurred. The committee prepares a spreadsheet listing all the potential causes of the failure that they think may be involved. Next to each possible-cause entry in the spreadsheet are at least two corresponding blanks: one to list evidence that supports this being a cause of the failure, the other to list evidence that supports this not being a cause of the failure.
♦ Having prepared the potential failure-cause list, the committee directs various company technicians and mechanics to investigate the failure and look for and record any relevant evidence to fill in the blanks in the spreadsheet. Some items may be sent to outside laboratories for specialized examination and evaluation.
♦ When the blanks of the spreadsheet have been sufficiently filled in to the satisfaction of the committee, the entry with the most convincing evidence to confirm that it is the cause of the failure is generally declared the “winner.” Likewise, entries that the committee believes contain convincing evidence that they are definitely not a cause of the failure are dismissed.
♦ After evaluation of the evidence is complete and a root cause has been determined, a plan is put forth to implement corrective actions to fix the problem and, presumably, prevent recurrence.
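To make the record-keeping concrete, here is a minimal Python sketch of one way such a failure-cause spreadsheet could be represented. The names CauseEntry, evidence_for, evidence_against, and entries_needing_evidence are illustrative assumptions, not part of any standard root-cause tool, and the sample rows are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CauseEntry:
    """One row of a failure-cause spreadsheet (hypothetical layout)."""
    cause: str
    evidence_for: List[str] = field(default_factory=list)      # supports this being a cause
    evidence_against: List[str] = field(default_factory=list)  # supports this NOT being a cause


def entries_needing_evidence(entries: List[CauseEntry]) -> List[str]:
    """Return the causes whose 'for' and 'against' blanks are both still empty."""
    return [e.cause for e in entries if not e.evidence_for and not e.evidence_against]


# Invented example rows, for illustration only.
sheet = [
    CauseEntry("bearing lubrication failure", evidence_for=["oil sample shows contamination"]),
    CauseEntry("shaft misalignment"),
    CauseEntry("operator error", evidence_against=["control logs show normal startup"]),
]

print(entries_needing_evidence(sheet))  # ['shaft misalignment']
```

A helper like this only shows which blanks are still empty. It says nothing about how convincing the recorded evidence is, which remains a judgment for the committee.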
This process, however, has some pitfalls that can significantly undermine the method’s effectiveness. Those pitfalls need to be recognized and understood.
UNDERPINNINGS OF THE METHOD
No matter how the above process is named, it is basically a variant of the Method of Exhaustion that’s often described in fundamental logic texts. The idea is that if all possible cause hypotheses of a failure are listed AND evidence related to the failure is objectively and thoroughly gathered, the root cause will eventually reveal itself.
♦ Competing hypotheses that are incorrect will be dismissed, one by one, as evidence that falsifies them is gathered.
♦ The real cause of the failure will not have verifiable evidence to eliminate it from contention, AND will have verifiable evidence to validate that it is the cause of the failure.
In cases where there is a scarcity of evidence to indicate which hypothesis in the list is the actual cause, and if enough hypotheses in the list are sufficiently falsified by verifiable evidence, it is sometimes argued that the remaining hypothesis must be the correct one.
This variant of the Method of Exhaustion is more commonly called the Process of Elimination. The method was made semi-famous by Sir Arthur Conan Doyle’s short story, “The Adventure of the Blanched Soldier,” in which his fictional detective Sherlock Holmes stated, “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.”
While that statement sounds plausible and has the support of no less than Sherlock Holmes himself, the logic underpinning it has two significant requirements that should not be overlooked.
1. The initial list of possible causes must be complete. That is, the list must contain all the root-cause possibilities that could have led to the failure.
2. Every listed cause, except the root-cause candidate itself, must be rigorously falsified and eliminated from contention by verifiable evidence, AND diligent efforts to falsify (disprove) the root-cause candidate entry must fail to do so.
These two requirements can be difficult to satisfy. They essentially put a burden upon the committee that prepared the list to have complete knowledge of the possible failure causes that could be involved.
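The two requirements can be sketched as a simple check. The following Python is only an illustration under stated assumptions: the Hypothesis record, its field names, and the function conclude_by_elimination are hypothetical, and the list_is_complete flag is something the committee must assert for itself, since no code can verify that a list of causes is complete.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypothesis:
    """A candidate cause and the state of the evidence about it (hypothetical fields)."""
    name: str
    falsified: bool                # verifiable evidence rules this cause out
    falsification_attempted: bool  # diligent efforts to disprove it were actually made


def conclude_by_elimination(hypotheses: List[Hypothesis],
                            list_is_complete: bool) -> Optional[str]:
    """Return the surviving cause only if both requirements are met; otherwise None."""
    # Requirement 1: the list must contain every possible cause.
    if not list_is_complete:
        return None
    # Requirement 2a: every cause but one must be falsified.
    survivors = [h for h in hypotheses if not h.falsified]
    if len(survivors) != 1:
        return None
    # Requirement 2b: the survivor must have withstood real attempts to disprove it.
    candidate = survivors[0]
    if not candidate.falsification_attempted:
        return None
    return candidate.name


# Invented example data, for illustration only.
candidates = [
    Hypothesis("seal degradation", falsified=True, falsification_attempted=True),
    Hypothesis("thermal overload", falsified=False, falsification_attempted=True),
    Hypothesis("installation error", falsified=True, falsification_attempted=True),
]

print(conclude_by_elimination(candidates, list_is_complete=True))   # thermal overload
print(conclude_by_elimination(candidates, list_is_complete=False))  # None
```

Returning None rather than a "best guess" is a deliberate choice in this sketch: when either requirement is unmet, elimination alone does not justify a conclusion.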
Obviously, the actual cause must be included in the list. AND evidence used to eliminate or support any item in the list should have a high probability of doing so correctly. Whenever possible, evidence cited to support or eliminate an item should be unique to either supporting or falsifying a hypothesis. Evidence cited as disproving a certain hypothesis, for example, should not have a reasonable alternate explanation.
Consider this example of unique evidence: The introduction of DNA testing has allowed some people sentenced to death or lengthy imprisonment to be set free, despite all the safeguards in the legal system to ensure that innocent people are not wrongly convicted. These people were originally convicted of serious crimes by jurors who had been persuaded by apparently convincing “for” and “against” evidence, similar to the “for” and “against” evidence listed in a failure-cause spreadsheet. Those jurors didn’t have a “shadow of a doubt” when they voted to convict. Such is the value of unique evidence.
Pitfall: Actual Cause Not on the List
If the actual cause is not included in the spreadsheet list for whatever reason, a cause that is on the list (perhaps one similar to the actual cause) may be concluded to be the cause. Once such a failure-cause list is prepared, there’s a tendency for one of the items on it to be determined to be the cause, even if the real cause is not listed.
If a cause mode is not listed, evidence to verify or falsify that cause may not be sought, although it might be found by accident. When that occurs, there is a good chance that the accidental discovery will be dismissed by the investigators because it was not on the list. Further, any corrective actions proposed for a listed cause that is not the actual failure cause will likely be ineffective and a waste of company time and resources. It is like taking medicine for the wrong disease. In such cases, it is very likely that the failure will recur in more or less the same way down the line.
The press of time and circumstances, such as a major production line being down or a primary power plant being off-line, can often cause an investigation committee to decide to accept something in the prepared failure-cause list as “probably right,” despite scant evidence having been gathered to make a well-informed decision. When such a decision point is reached due to economic or managerial pressures, it is worth considering not only what a correct decision may save in downtime versus another day or so of investigating, but also what a wrong decision may cost if the failure repeats.
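The trade-off described above can be framed as a rough expected-cost comparison. The Python below is a sketch under loud assumptions: the function names, the simple probability-times-cost model, and every number in the example are invented for illustration and would need to be replaced with a plant’s own estimates.

```python
def expected_cost_restart_now(p_wrong: float, repeat_failure_cost: float) -> float:
    """Expected cost of accepting the 'probably right' cause today.

    p_wrong: estimated chance the accepted cause is not the real one (assumed input).
    repeat_failure_cost: estimated cost if the failure recurs (assumed input).
    """
    return p_wrong * repeat_failure_cost


def expected_cost_keep_investigating(extra_days: float,
                                     downtime_cost_per_day: float,
                                     p_wrong_after: float,
                                     repeat_failure_cost: float) -> float:
    """Expected cost of investigating longer before restarting.

    Adds the extra downtime to the (presumably smaller) risk of a repeat failure.
    """
    return extra_days * downtime_cost_per_day + p_wrong_after * repeat_failure_cost


# Invented figures, for illustration only.
restart = expected_cost_restart_now(p_wrong=0.5, repeat_failure_cost=800_000)
investigate = expected_cost_keep_investigating(
    extra_days=1,
    downtime_cost_per_day=150_000,
    p_wrong_after=0.25,
    repeat_failure_cost=800_000,
)

print(restart)      # 400000.0
print(investigate)  # 350000.0
```

With these made-up numbers, another day of investigating comes out cheaper than restarting on scant evidence. Different estimates could easily reverse the result, which is why the comparison is worth making explicitly.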
If a certain failure has occurred before, and the team working on the problem has decided to do what was done before to put the production line or plant back on-line, the question should be posed: was the prior cause of failure really found and addressed to prevent recurrence, or is the team simply replacing parts and again leaving the underlying cause of failure in place?
In several of her detective novels, the author Agatha Christie used the process of elimination as a red herring to fool her readers. In one story, for example, the reader is led to believe that the last person to die in the manor on the island would have to be the murderer. Instead, the actual murderer committed suicide earlier in the story. He was the third person to die and had contrived ways that the others would eventually die after him. In another Christie classic, the reader is led to believe that the murderer was one of the individuals being investigated, as described by the narrator of the story. In the end, though, the murderer turned out to be the narrator.
Pitfall: Confirmation Bias
Confirmation bias occurs when evidence is gathered mainly to prove a favored hypothesis. This is sometimes the favored hypothesis of the ranking person on a root-cause committee. Consequently, the team presumes that investigating the other possibilities is a waste of time and money. The authority figure pushing the favored hypothesis often will assert that he or she “has seen this same failure before, and this event is just like it.” Investigators on the team then may find it more convenient to please the boss than to disagree with him or her.
Confirmation bias causes technical investigators to mainly pursue evidence that supports the favored hypothesis and to neglect, or perhaps dismiss, evidence related to other possible causes. If the favored hypothesis turns out not to be the actual cause of the failure, the corrective actions to prevent recurrence will be ineffective and the failure will likely recur.
Pitfall: Anti-Confirmation Bias
The committee preparing the failure-cause list is often composed of company managers, line supervisors, and upper-level administrators. Depending upon the company’s management style, it may be detrimental to an investigator’s career or company standing to suggest that any of the departments associated with members of the committee had any involvement in causing the failure.
In short, some departments or company functions may tacitly be omitted from the possible failure-cause list to avoid displeasing management or certain persons in authority. “We are not the problem” will be the silent message transmitted to the rank and file in the team, and the team will all act accordingly.
Pitfall: A Repair List Instead of a Possible Cause List
This occurs when the list of potential failure causes degenerates into a list of physical items that may have failed and simply need to be fixed or repaired. Items on the list are checked to see if they failed, but no one asks why the item failed, why the failure was unexpected, and why administrative barriers did not prevent it.
It’s a given that all equipment and processes fail when left alone. It is only through human intervention that equipment and process failures are anticipated and averted by maintenance plans, regular monitoring, inspections, and administrative processes. While random failures do of course occur, unexpected failures are generally the consequence of flawed or incomplete maintenance and administrative processes meant to prevent such failures.
When the failure-cause list is initially prepared, there is a tendency to primarily include physical items, processes, or conditions, or perhaps even personal errors by operators, and such. As pointed out before, if an item is not included in the list, there is a high likelihood of it being overlooked and the root cause attributed to one of the other items on the list. This situation results in the underlying cause being overlooked.
For this reason, it may be expedient to start a second failure-cause list when the first one has been more or less completed and corrective actions to prevent recurrence are being considered. The second failure-cause list should be of a “higher level” in that it evaluates the evidence already gathered to fill in the first spreadsheet and asks the following types of questions.
♦ What administrative action or actions could have prevented the failure?
- Was the replacement or maintenance period too long?
- Was the operating environment as expected in the design assumptions?
- Has the environment changed since the original configuration was put in place?
♦ If the failure was due to a human error, was the error random or systemic?
- If the failure was due to human error, what has been the frequency of this human error based upon past records here and in other plants? Was this frequency taken into account?
- Was there appropriate training? For example, if the failure occurred due to poor equipment alignment, did the installers really understand how to conduct a hot versus cold alignment of the equipment?
- Could the chance of human error be avoided or even obviated by better design? Was training being used as a cheap substitute for good design? Were the risks understood?
THE BOTTOM LINE
The keepers of the equipment and processes, meaning you and your superiors and subordinates, essentially give permission for failure to occur when administrative processes and procedures are inadequate. The equipment or the process is not imbued with contrariness or spiritual animus. It simply follows the rules of physics that you allow it to follow. Thus, in the hunt for the root cause of a failure, it’s crucial to ensure that prevention measures are reviewed and addressed.
ABOUT THE AUTHOR
Randall Noon is a registered professional engineer and author of several books and articles about failure analysis. He has conducted root-cause investigations for four decades, in both nuclear and non-nuclear power facilities. Contact him at [email protected].