The Second Law of Thermodynamics is the one about entropy tending to increase in an isolated system as it moves toward equilibrium. A corollary of the Second Law relating to maintenance engineering is that all operating machinery eventually fails if left alone. It is only through intervention by people or, perhaps, by automated maintenance systems devised by people, that machinery and equipment remain functional over time.
Fundamentally, maintenance programs are based upon experience. Wear, excessive heat or cold, chemical attack, lubrication loss, excessive vibration, material degradation, and time in service are some of the usual characteristics that are correlated with impending failure. Through experience, maintenance engineers have learned to avoid failure by monitoring and trending some or all of these characteristics and then intervening appropriately.
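As a minimal illustration of trending, consider the following hypothetical sketch (not from the article). It fits a straight-line trend to periodic vibration readings on a bearing and extrapolates to estimate when an alert threshold would be reached; all readings, names, and limits here are invented for illustration.

```python
# Hypothetical sketch: trending a monitored condition indicator against an
# alert threshold so maintenance can intervene before projected failure.
# All readings, names, and limits are invented for illustration.

def linear_trend(times, readings):
    """Ordinary least-squares slope and intercept for the readings."""
    n = len(times)
    mean_t = sum(times) / n
    mean_r = sum(readings) / n
    slope = (sum((t - mean_t) * (r - mean_r) for t, r in zip(times, readings))
             / sum((t - mean_t) ** 2 for t in times))
    return slope, mean_r - slope * mean_t

def hours_until(times, readings, threshold):
    """Extrapolate the fitted trend to estimate time remaining to threshold."""
    slope, intercept = linear_trend(times, readings)
    if slope <= 0:
        return None  # no worsening trend detected
    return (threshold - intercept) / slope - times[-1]

# Hypothetical weekly vibration readings (mm/s RMS) on one pump bearing.
hours = [0, 168, 336, 504, 672]
vibration = [2.1, 2.3, 2.6, 3.0, 3.5]
ALERT = 7.1  # invented alarm limit for this machine class

remaining = hours_until(hours, vibration, ALERT)
if remaining is not None:
    print(f"Projected to reach alert level in about {remaining:.0f} operating hours")
```

In practice, the indicators, limits, and trending methods come from the condition-monitoring standards applicable to the equipment; the point here is only that trending turns experience into a scheduled intervention.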
INVESTIGATING FAILURE
Hidden within the preceding fundamentals is the fact that in any unexpected failure, whether it occurs in a machine, a machine element, or a whole operating system, a thorough root-cause analysis must reckon simultaneously with two types of causes: the physical or mechanical cause of the failure, and the administrative shortcoming or omission that gave the failure permission to occur. Both types of causes are important and must be addressed to reliably prevent recurrence of the failure. Of course, the whole point of a root-cause investigation is to prevent recurrence. To do otherwise simply permits the equipment to regularly run to failure.
The first part of the investigation (the physical cause) answers the question: What mechanistically happened that culminated in failure? Translation: What sequence of events, conditions, and consequences eventually resulted in the unexpected failure? This typically is determined by close examination of the physical evidence, and sound engineering analysis of that evidence.
The second part of the investigation (determining the shortcoming or omission that gave the equipment permission to fail) establishes what was not done that would have otherwise prevented the failure from unexpectedly occurring. In other words, the second part determines what kind of human intervention should have been done. This part is just as important as understanding the first part, but often is neglected or given short shrift.
If no administrative procedures were in place to prevent the unexpected failure, the investigator can make a clean, initial start. One effective way to determine what type of administrative intervention is appropriate to prevent recurrence is to lay out a flow chart of the physical sequence of events, conditions, and consequences, as sketched below. In examining the flow chart, points within the sequence can be identified where the failure pathway can be most conveniently interrupted and the otherwise eventual consequence of failure prevented. Since a failure can often be prevented at several points in the failure pathway, factors such as production scheduling, economics, and manpower availability can be weighed in choosing among them.
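To make the flow-chart idea concrete, here is a minimal, hypothetical sketch (not from the article) that represents a failure pathway as an ordered chain of events, marks the points where an intervention could interrupt it, and ranks those candidate interruption points by a rough cost figure. Every event, intervention, and cost shown is invented.

```python
# Hypothetical sketch: a failure pathway as an ordered chain of events,
# with candidate interruption points marked and ranked. Every event,
# intervention, and cost figure below is invented for illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    event: str                          # event, condition, or consequence
    intervention: Optional[str] = None  # a way to interrupt the pathway here
    cost_rank: int = 0                  # rough ranking input (downtime, manpower, ...)

# Invented failure pathway for a pump-bearing seizure.
pathway = [
    Step("lubricant contaminated by seal leak",
         intervention="periodic oil sampling and analysis", cost_rank=1),
    Step("bearing wear accelerates"),
    Step("vibration amplitude rises",
         intervention="route-based vibration monitoring", cost_rank=2),
    Step("bearing overheats",
         intervention="temperature alarm and automatic trip", cost_rank=3),
    Step("bearing seizes; pump shaft fails"),
]

# List every point where the pathway could be interrupted, cheapest first,
# so scheduling, economics, and manpower can be weighed against each option.
for step in sorted((s for s in pathway if s.intervention),
                   key=lambda s: s.cost_rank):
    print(f"Interrupt at '{step.event}' via {step.intervention} "
          f"(cost rank {step.cost_rank})")
```

The value of such a layout is the discipline it imposes: each step in the chain either has a practical interruption point or it does not, and the cheapest adequate interruption can be chosen deliberately rather than by default.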
If, in fact, an administrative procedure were in place to prevent this particular failure, the task would be to evaluate and fix the existing procedure. This can be done in a fashion similar to that already noted, except that the existing procedure’s intentions can be compared to what actually occurred. Doing so, however, is not necessarily as simple as it sounds.
Because the existing administrative procedure failed in some fashion to prevent the failure, issues of personal blame and incompetence may be inferred that can complicate the procedure-change process, especially if subordinates are investigating the efficacy of a procedure promulgated by a superior. Keep in mind that ego is often a difficult barrier to change.
There also may be an issue concerning whether the procedure was executed as intended. Mechanics and technicians, for example, may not read and understand procedures written by managers or engineers the same way that the managers or engineers intended those procedures to be understood. Complex sentence structure, poor punctuation, pronouns with unclear antecedents, vague instructions, unclear technical vocabulary, and error-prone procedure steps can lead to inadvertent errors and omissions when the procedure is executed, perhaps months later. There may also be issues related to the skill and knowledge of the mechanics and technicians. Were they properly trained to do the work specified in the procedure? Were conditions conducive to successful execution of the procedure as imagined by its authors? Consequently, even where the procedure’s intent was good and its content accurate, the execution may have been inadequate for a variety of reasons.
If this is a failure that has recurred, it could be a further indication that the prior failure investigation was inadequate. A procedure designed to prevent recurrence of a particular failure will likely not be effective if the physical-cause determination was inaccurate. Since the second part of an effective root-cause investigation, determining the shortcoming that permitted the failure to occur, depends upon an accurate determination of the physical cause, a procedure predicated on the wrong causal factor has little chance of success no matter how well it is executed.
Last is the fact that while some failures look the same, they may have been generated by different or only somewhat related physical causes. Assuming that two failures that look alike and have the same consequences were caused by the same physical phenomenon can be a mistake. Enough evidence should be gathered and evaluated to verify that the same physical causes were at work in the two failures being compared. Perhaps the existing procedure was adequate to prevent one failure pathway but inadequate to stop a second pathway that produced the same failure result. Changing the existing procedure to prevent the second failure pathway might then allow the first pathway to recur.
A CASE IN POINT
One of the more famous failures in the U.S. space program was the loss of the Space Shuttle Columbia on Feb. 1, 2003, when the orbiter disintegrated upon re-entry. All seven crew members were killed. Significantly, this was the second such disaster in the Space Shuttle program. The first was in 1986, when the Space Shuttle Challenger broke up soon after launch.
Briefly, the failure of Columbia was determined to have been physically caused by a piece of foam insulation that dislodged from the external tank during launch and struck the leading edge of the orbiter’s left wing. The impact damaged the thermal protection that shielded the wing during re-entry. During re-entry, hot gases penetrated the damaged wing and attacked the structure of the orbiter itself. This destabilized the orbiter, which then broke up.
Initially, the failure of the Columbia was blamed on human error relating to installation of the spray-on foam insulation on the external tank. Employees who did this task were then thoroughly retrained to install such foam insulation without defects.
In Dec. 2005, however, nearly three years after the event, it was announced that a more thorough investigation of the physical mechanisms that permitted the foam insulation to dislodge had found that a fundamental mechanism in Columbia’s failure pathway was thermal expansion and contraction. The expansion and contraction occurred during filling of the craft’s external tank with cryogenic propellants. This otherwise well-understood effect, i.e., expansion and contraction due to temperature changes, allowed cracks to initiate between the foam insulation and the external tank. Consequently, pieces of foam insulation could dislodge during launch.
To be clear, the failure was not caused by human error in the technicians’ installation of the foam insulation on the external tank, as had been presumed for nearly three years. When the Dec. 2005 announcement was made concerning the discovery of the expansion-and-contraction causal factor, NASA apologized to the insulation technicians who had carried the burden of blame for the catastrophe.
Further, there was a second, significant administrative shortcoming. Smaller impacts and near misses by pieces of foam insulation and ice during previous launches were downplayed, as were the impact effects of ice or foam pieces predicted by NASA’s computer simulations.
Chapter 7, “The Accident’s Organizational Causes,” in the Columbia Accident Investigation Board Report, published in Aug. 2003, details some of the shortcomings in the NASA organization of that time that contributed to or led to the catastrophe of Feb. 1, 2003. Chapter 7 is certainly worth reading in its entirety for those whose jobs involve root-cause analysis. With regard to the importance of determining both the mechanistic cause and the administrative cause that permit a significant failure to occur, however, the first paragraph of that chapter is quoted in full below:
“Many accident investigations make the same mistake in defining causes. They identify the widget that broke or malfunctioned, then locate the person most closely connected with the technical failure: the engineer who miscalculated an analysis, the operator who missed signals or pulled the wrong switches, the supervisor who failed to listen, or the manager who made bad decisions. When causal chains are limited to technical flaws and individual failures, the ensuing responses aimed at preventing a similar event in the future are equally limited: they aim to fix the technical problem and replace or retrain the individual responsible. Such corrections lead to a misguided and potentially disastrous belief that the underlying problem has been solved. The [NASA] Board did not want to make these errors. A central piece of our expanded cause model involves NASA as an organizational whole.”
In short, operating equipment, given time, naturally will fail unless people administratively intervene to prevent failure. Administrative shortcomings give failure permission to occur.
ABOUT THE AUTHOR
Randall Noon is a registered professional engineer and author of several books and articles about failure analysis. He has conducted root-cause investigations for four decades, in both nuclear and non-nuclear power facilities. Contact him at [email protected].
Tags: reliability, availability, maintenance, RAM, root-cause analysis, failure investigation, failure analysis, Columbia Accident Investigation Board Report