At its core, reliability engineering is an analytical process. We’re the actuaries for the manufacturing-process industries. We analyze data to determine failure rates and mean times between/to failure (MTBF/MTTF) or mean time to repair (MTTR). We employ Weibull analysis to understand the risk profile as a function of time, cycles, km, etc. We plug findings into reliability block diagrams so that we can make informed decisions about production, safety, and/or environmental risk in the process.
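To make those metrics concrete, here is a minimal Python sketch of how MTBF, MTTR, and the resulting steady-state availability might be computed from a handful of failure records. The `FailureEvent` structure, its field names, and the hour values are hypothetical and for illustration only. They don't come from any particular EAM system.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    # Hypothetical record layout, assumed for illustration only.
    hours_run_before_failure: float  # operating hours since the previous restart
    hours_to_repair: float           # downtime until the asset was back in service


def mtbf(events):
    """Mean operating hours between failures."""
    return sum(e.hours_run_before_failure for e in events) / len(events)


def mttr(events):
    """Mean hours to repair."""
    return sum(e.hours_to_repair for e in events) / len(events)


def availability(events):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    up = mtbf(events)
    return up / (up + mttr(events))


# Illustrative (made-up) history for one asset.
events = [
    FailureEvent(400.0, 4.0),
    FailureEvent(600.0, 8.0),
    FailureEvent(500.0, 3.0),
]

print(f"MTBF: {mtbf(events):.1f} h")
print(f"MTTR: {mttr(events):.1f} h")
print(f"Availability: {availability(events):.4f}")
```

The arithmetic is trivial. The hard part, as the rest of this article argues, is getting event records clean enough to trust the inputs.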
I could go on, but the bottom line is that we analyze data, a significant amount of which is quantitative, so that we may advise those who must make decisions to manage risk. Unfortunately, in most plants we’re forced to work with poor-quality data. We must improve this situation by creating standard taxonomies for failure modes, mechanisms, and causes and by standardizing the data-collection process.
AT A PARETO DEAD END
In my role as a consulting reliability engineer, I’ve been involved in a great number of reliability-data analysis projects in support of clients over the past 25 years. For companies in the nascent stages of their reliability-maturity journeys, the process is very predictable. We employ Pareto analysis to identify our top five or 10 “bad actors” based upon maintenance cost, forced downtime, etc.
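As a rough illustration of that first Pareto pass, the Python sketch below ranks assets by maintenance cost and reports the cumulative share of the total. The asset tags and cost figures are invented for the example, and the `pareto` function is a hypothetical helper rather than part of any EAM product.

```python
def pareto(cost_by_asset):
    """Rank assets by cost, largest first, with cumulative percent of total."""
    total = sum(cost_by_asset.values())
    if total <= 0:
        raise ValueError("total cost must be positive")
    ranked = sorted(cost_by_asset.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0.0
    for asset, cost in ranked:
        running += cost
        rows.append((asset, cost, round(100 * running / total, 1)))
    return rows


# Made-up annual maintenance costs per asset tag (illustrative only).
costs = {"M-310": 15000, "P-101": 50000, "F-007": 5000, "C-200": 30000}

for asset, cost, cumulative_pct in pareto(costs):
    print(f"{asset:6} {cost:>8} {cumulative_pct:>6}%")
```

The same function works on forced-downtime hours instead of cost. Only the input dictionary changes.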
Once we know which assets, or classes of assets, to focus on, we turn our attention to the enterprise asset management (EAM) system to evaluate work notifications, work plans, and work closeout notes for the assets under investigation. Our goal is to create a Pareto chart of the failure modes for each of the top-priority assets. This is where the process often topples over: We struggle to logically group the events.
One of the great leverage points for reliability engineering in the plant is the repeatable nature of failure modes and mechanisms and associated corrective actions. We have very few truly new and unique failures in the plant. If, however, we can’t logically group failure events, we can’t leverage their repeatability in our analysis.
TAXONOMIES TO THE RESCUE
If failures and associated corrective actions are indeed repeatable, we can give them a name. A taxonomy is a standardized classification system. For plant reliability engineering, we need standardized taxonomies for failure reporting, failure modes, failure mechanisms, and failure causes. The taxonomies for a pump and for a conveyor will probably be different. But, if they share failure modes, the classification of those common modes should be identical to simplify the process and minimize the risk of human failures.
Let’s consider the example of a leaking, motor-driven, directly coupled centrifugal water pump. Without a standardized taxonomy for notifications, three operators writing notifications on the same leaky pump are likely to provide three (or more) different stories. One operator may write a notification that says, “Pump is broken.” Another may write a notification that says, “Pump needs maintenance.” A third operator may write a notification that says, “Pump is leaking water at the flange.” You get the picture. How, as reliability engineers conducting a Pareto analysis, would we be able to group those three varying notifications?
We need a taxonomy for writing notifications. In our example, leaking is a common failure mode for a centrifugal pump. The pump can, conceivably, leak two fluids: water and lubricant. In the above example, it’s water. So, if the pump is leaking water, there are four places where it could occur: 1) inlet flange; 2) mechanical seal or packing; 3) outlet flange; or 4) a welded joint.
If the operator notification read, “Pump is leaking water at outlet flange,” each time that failure mode was observed, our Pareto analysis would be much easier and much more productive. It’s not that complicated. Likewise, when the maintainer tends to a leaky flange, the possible failure mechanisms might include: 1) missing flange fasteners; 2) loose or under-tensioned fasteners; or 3) a missing, damaged, or misaligned gasket. Creating and enforcing the use of simple taxonomies enables reliability engineers to see patterns and make reliability apportionment and resource allocation recommendations more effectively.
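One way to picture such a taxonomy is as a small set of allowed values that every notification must draw from. The Python sketch below encodes only the pump example above. The class, field, and constant names are hypothetical. Leak locations for lubricant are left out because they aren't enumerated here, and a real system would more likely enforce the pick-lists inside the EAM itself.

```python
from collections import Counter
from dataclasses import dataclass

# Allowed values, taken from the centrifugal-pump example in the text.
FAILURE_MODES = {"leaking"}
LEAK_LOCATIONS = {
    "water": {
        "inlet flange",
        "mechanical seal or packing",
        "outlet flange",
        "welded joint",
    },
}
FLANGE_LEAK_MECHANISMS = {
    "missing flange fasteners",
    "loose or under-tensioned fasteners",
    "missing, damaged, or misaligned gasket",
}


@dataclass(frozen=True)
class Notification:
    asset_id: str       # hypothetical asset tag
    failure_mode: str
    fluid: str
    location: str

    def __post_init__(self):
        # Reject anything that is not in the taxonomy.
        if self.failure_mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {self.failure_mode!r}")
        allowed = LEAK_LOCATIONS.get(self.fluid)
        if allowed is None:
            raise ValueError(f"no locations defined for fluid: {self.fluid!r}")
        if self.location not in allowed:
            raise ValueError(f"unknown leak location: {self.location!r}")

    def text(self):
        """Render the standardized notification wording."""
        return f"Pump is {self.failure_mode} {self.fluid} at {self.location}"


def check_mechanism(mechanism):
    """Validate a maintainer's closeout mechanism for a flange leak."""
    if mechanism not in FLANGE_LEAK_MECHANISMS:
        raise ValueError(f"unknown failure mechanism: {mechanism!r}")
    return mechanism


# Three notifications on the same (made-up) pump now group cleanly.
notifications = [
    Notification("P-101", "leaking", "water", "outlet flange"),
    Notification("P-101", "leaking", "water", "outlet flange"),
    Notification("P-101", "leaking", "water", "inlet flange"),
]
counts = Counter((n.failure_mode, n.fluid, n.location) for n in notifications)
print(counts.most_common())
```

Because free text such as “Pump is broken” can’t pass validation, every record that reaches the analyst is already grouped by failure mode and location.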
NOT REINVENTING THE WHEEL
I regularly turn to the following three standards when undertaking the process of standardizing taxonomy systems:
♦ IEC 60812: Failure Modes & Effects Analysis (FMEA and FMECA)
♦ ISO 14224: Petroleum, Petrochemical and Natural Gas Industries – Collection and Exchange of Reliability and Maintenance Data for Equipment
♦ DOE-NE-STD-1004-92: Root Cause Analysis Guidance Document.
IEC 60812 offers a very-high-level taxonomy of failure modes, but it doesn’t get too detailed about mechanisms and causes. It’s a good starting point.
ISO 14224 is a detailed treatment of failure modes, mechanisms and causes. It’s a real workhorse.
ISO 14224 is petrochemical-industry-oriented. But with some creativity, I’ve successfully leveraged it in food-processing, mining and pulp & paper industries.
DOE-NE-STD-1004-92 offers the best taxonomy of failure causes I’ve ever encountered. Plus, it’s a document that you can download for free. (Email email@example.com if you can’t find it, and I’ll send you a copy.)
Another document that any mechanical-reliability engineer should have in his or her library is the Naval Surface Warfare Center’s Handbook of Reliability Prediction Procedures for Mechanical Equipment. It’s another free document with over 500 pages that are packed with information about various mechanical components and parts, including standardized FMEA. (Again, reach out to me if you can’t find it, and I’ll send you a copy.)
REMEMBER THE BOTTOM LINE
As the title of this article notes, data is the difference between deciding and guessing. That said, if your data isn’t organized, you can’t effectively analyze it. And, if you can’t analyze the data, it can’t support effective decisions.
Creating standardized taxonomies for failure reporting, modes, mechanisms, and causes might appear to be a daunting task. Remember, though, that much of the heavy lifting has already been done for you and is available in the standards.
Organizations will be challenged to drive reliability improvements without strong analytics. So don’t delay. Develop standardized taxonomies for your operations and train your team to use them faithfully.
ABOUT THE AUTHOR
Drew Troyer has 30 years of experience in the RAM arena. Currently a Principal with T.A. Cook Consultants, he was a Co-founder and former CEO of Noria Corporation. A trusted advisor to a global blue chip client base, this industry veteran has authored or co-authored more than 250 books, chapters, course books, articles, and technical papers and is a popular keynote and technical speaker at conferences around the world. Drew is a Certified Reliability Engineer (CRE) and Certified Maintenance & Reliability Professional (CMRP), holds B.S. and M.B.A. degrees, and is a Master’s degree candidate in Environmental Sustainability at Harvard University. Contact him directly at 512-800-6031 or firstname.lastname@example.org.
Tags: reliability, availability, maintenance, RAM, Pareto Analysis, root-cause analysis, RCA, IEC 60812, ISO 14224, DOE-NE-STD-1004-92