
Root Cause Analysis Methods for Equipment Failures: A Practical Guide

Why Failures Repeat — And How to Stop the Cycle

A pump seal fails. Maintenance replaces it. Three months later, the same seal fails again. Maintenance replaces it again. This cycle repeats for years, consuming parts, labor, and patience — until someone finally asks: why does this seal keep failing?

That question is the beginning of root cause analysis. RCA is the discipline of investigating failures beyond the immediate physical cause to find the underlying reasons. The seal didn’t fail because “seals wear out.” The seal failed because the shaft runout exceeded the seal’s tolerance, which happened because the bearing was worn, which happened because contaminated oil destroyed the bearing surface, which happened because the breather vent was missing after the last overhaul. Fix the breather vent policy, and the entire failure chain stops.

Plants that perform RCA on significant failures and implement corrective actions typically reduce repeat failures by 50-70% within two years. The investigation time pays for itself many times over in avoided repeat repairs.

When to Perform RCA

Not every failure warrants a full root cause investigation. You need a threshold that captures significant failures without overwhelming your team.

Trigger an RCA when:

  • The failure caused safety or environmental consequences
  • The total cost (repair + downtime + consequential damage) exceeds a defined threshold — $10,000-25,000 is common for mid-size plants
  • The failure has occurred more than twice on the same equipment in 12 months
  • The failure affected equipment classified as critical
  • Management requests investigation of a specific event

This typically results in 10-30 RCA investigations per year for a mid-size facility — a manageable workload that delivers high-value insights.
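
The trigger criteria above can be written down as a simple screening check. The following is a minimal sketch, not a prescribed implementation: the `FailureEvent` fields, the `rca_triggers` function, and the $15,000 threshold are all assumptions chosen for illustration, and each plant would set its own values.

```python
from dataclasses import dataclass

# Assumed values for illustration only. The cost threshold sits inside the
# $10,000-25,000 range mentioned above and should be set per plant.
COST_THRESHOLD_USD = 15_000
REPEAT_LIMIT = 2  # "more than twice" on the same equipment in 12 months


@dataclass
class FailureEvent:
    repair_cost: float
    downtime_cost: float
    consequential_cost: float
    safety_or_environmental: bool = False
    failures_last_12_months: int = 1  # count includes this event
    critical_equipment: bool = False
    management_request: bool = False

    @property
    def total_cost(self) -> float:
        # Total cost = repair + downtime + consequential damage
        return self.repair_cost + self.downtime_cost + self.consequential_cost


def rca_triggers(event: FailureEvent) -> list[str]:
    """Return the trigger criteria this failure meets (empty list = no RCA)."""
    reasons = []
    if event.safety_or_environmental:
        reasons.append("safety/environmental consequence")
    if event.total_cost > COST_THRESHOLD_USD:
        reasons.append("cost above threshold")
    if event.failures_last_12_months > REPEAT_LIMIT:
        reasons.append("repeat failure")
    if event.critical_equipment:
        reasons.append("critical equipment")
    if event.management_request:
        reasons.append("management request")
    return reasons


# Hypothetical event: third seal failure in a year, modest cost.
seal_failure = FailureEvent(repair_cost=2_500, downtime_cost=6_000,
                            consequential_cost=0, failures_last_12_months=3)
print(rca_triggers(seal_failure))  # ['repeat failure']
```

Returning the list of reasons rather than a bare yes/no keeps a record of why each investigation was opened, which helps when reviewing whether the thresholds produce a manageable workload.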

The Five Levels of Cause

Understanding the hierarchy of causation helps structure your investigation and ensures you dig deep enough.

  1. Failure mode — What happened? The seal faces separated and allowed leakage.
  2. Physical cause — What physical mechanism caused it? Excessive shaft runout caused dynamic seal face separation.
  3. Human cause — What human action or inaction contributed? The bearing was not replaced during the last scheduled overhaul despite showing early wear indicators.
  4. Latent cause — What system or process allowed the human cause? The PM work instructions didn’t include bearing inspection criteria, so the decision was left to individual judgment.
  5. Organizational cause — What organizational factor created the latent cause? The maintenance planning process doesn’t include a formal review of PM work scopes against OEM recommendations and failure history.

Most maintenance investigations stop at level 2, the physical cause. “The bearing failed” is the conclusion, and a new bearing is installed. The failure repeats because the reasons behind the bearing failure were never addressed. Effective RCA pushes to levels 3-5 where the systemic fixes live.

    RCA Methods: Choosing the Right Tool

    5 Whys

    The simplest RCA technique. Start with the problem statement and ask “why” repeatedly until you reach a root cause that can be addressed with a corrective action.

    Example:

    • Problem: Gearbox failed catastrophically.
    • Why? Gear teeth stripped due to overload.
    • Why was it overloaded? The coupling between motor and gearbox was misaligned, creating excessive load.
    • Why was it misaligned? Alignment was not checked after the motor was replaced last month.
    • Why wasn’t alignment checked? The work order for motor replacement didn’t include an alignment step.
    • Why didn’t it include alignment? Standard work procedures for motor replacement haven’t been updated to include laser alignment verification.

    Root cause: Incomplete standard work procedures. Corrective action: Update the motor replacement procedure to require laser alignment with documented tolerances before returning to service.

    The 5 Whys method works well for straightforward failure chains. It falls apart when multiple causal factors interact, because it follows a single linear path. For complex failures with multiple contributing factors, use a fishbone diagram or fault tree.
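
One way to see why 5 Whys follows a single linear path is to record the gearbox example as a plain ordered list. This is only a sketch: the `WhyChain` class and its method names are hypothetical, and the answers are copied from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class WhyChain:
    """A 5 Whys record: one problem, then an ordered list of (question, answer)."""
    problem: str
    steps: list[tuple[str, str]] = field(default_factory=list)

    def why(self, question: str, answer: str) -> "WhyChain":
        # Each answer has exactly one follow-up question, so the chain
        # cannot branch. That is the method's built-in limitation.
        self.steps.append((question, answer))
        return self

    def deepest_answer(self) -> str:
        """The last answer recorded, i.e. the candidate root cause."""
        return self.steps[-1][1] if self.steps else self.problem

    def render(self) -> str:
        lines = [f"Problem: {self.problem}"]
        for depth, (question, answer) in enumerate(self.steps, start=1):
            lines.append(f"{'  ' * depth}{question} {answer}")
        return "\n".join(lines)


gearbox = (
    WhyChain("Gearbox failed catastrophically.")
    .why("Why?", "Gear teeth stripped due to overload.")
    .why("Why was it overloaded?",
         "The coupling between motor and gearbox was misaligned.")
    .why("Why was it misaligned?",
         "Alignment was not checked after the motor was replaced.")
    .why("Why wasn't alignment checked?",
         "The work order for motor replacement didn't include an alignment step.")
    .why("Why didn't it include alignment?",
         "Standard work procedures haven't been updated to include "
         "laser alignment verification.")
)

print(gearbox.render())
print("Candidate root cause:", gearbox.deepest_answer())
```

Because each step holds a single answer, a second contributing factor (for example, an undersized gearbox) has nowhere to go in this structure, which is the situation where a fishbone diagram or fault tree fits better.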

    Fishbone (Ishikawa) Diagram

    The fishbone diagram organizes potential causes into categories. The standard manufacturing categories are: Machine, Method, Material, Man (People), Measurement, and Environment. For maintenance RCA, a more useful set is: Equipment, Procedures, People, Materials/Spares, Conditions, and Management Systems.

    The team brainstorms potential causes in each category and maps them on the diagram. Then evidence from the investigation is used to confirm or eliminate each potential cause. The fishbone is a structuring tool — it ensures the team considers a broad range of causes rather than jumping to the first plausible explanation.
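
The brainstorm-then-verify workflow can be tracked with a small structure keyed by the maintenance categories listed above. This is an illustrative sketch only: the `Fishbone` class, the `Status` values, and the example causes are assumptions, not part of any standard tool.

```python
from enum import Enum


class Status(Enum):
    OPEN = "open"              # brainstormed, not yet checked against evidence
    CONFIRMED = "confirmed"    # evidence supports it
    ELIMINATED = "eliminated"  # evidence rules it out


# The maintenance-oriented categories from the text above.
CATEGORIES = ["Equipment", "Procedures", "People",
              "Materials/Spares", "Conditions", "Management Systems"]


class Fishbone:
    def __init__(self, problem: str):
        self.problem = problem
        self.causes = {category: {} for category in CATEGORIES}

    def add(self, category: str, cause: str) -> None:
        """Record a brainstormed cause; it starts as OPEN."""
        self.causes[category][cause] = Status.OPEN

    def resolve(self, category: str, cause: str, status: Status) -> None:
        """Confirm or eliminate a cause once the evidence is in."""
        if cause not in self.causes[category]:
            raise KeyError(f"{cause!r} was never added under {category!r}")
        self.causes[category][cause] = status

    def empty_categories(self) -> list[str]:
        """Categories nobody has considered yet, a prompt to look wider."""
        return [c for c, items in self.causes.items() if not items]

    def with_status(self, status: Status) -> list[tuple[str, str]]:
        return [(c, cause) for c, items in self.causes.items()
                for cause, s in items.items() if s is status]


# Hypothetical candidate causes for a gearbox failure.
diagram = Fishbone("Gearbox failed catastrophically")
diagram.add("Equipment", "Coupling misaligned")
diagram.add("Procedures", "No alignment step in work order")
diagram.add("Materials/Spares", "Wrong oil grade")
diagram.resolve("Equipment", "Coupling misaligned", Status.CONFIRMED)
diagram.resolve("Materials/Spares", "Wrong oil grade", Status.ELIMINATED)

print(diagram.empty_categories())
print(diagram.with_status(Status.OPEN))
```

Listing the empty categories reflects the diagram's main purpose: it makes visible where the team has not yet looked, rather than ranking the causes already found.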

    Fault Tree Analysis

    Fault tree analysis works backward from the failure event using logic gates (AND, OR) to map how combinations of events and conditions lead to the top-level failure. It’s more rigorous than the fishbone and particularly useful for failures involving multiple simultaneous conditions.

    An OR gate means any one of the inputs can cause the output. An AND gate means all inputs must be present for the output to occur. This distinction matters for corrective action: every input to an OR gate has to be addressed individually, because any one of them left in place can still cause the output, while an AND gate can be broken by eliminating any single input.
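
The gate logic can be checked with a few lines of code. The sketch below assumes a tiny, hypothetical fault tree loosely based on the contaminated-oil example from the introduction; the event names, the `Basic` and `Gate` classes, and the `occurs` function are illustrative, not taken from a fault tree library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Basic:
    """A basic event: a condition that is either present or not."""
    name: str


@dataclass(frozen=True)
class Gate:
    kind: str      # "AND" or "OR"
    inputs: tuple  # child nodes: Basic events or other Gates


def occurs(node, present: set[str]) -> bool:
    """Evaluate whether a node's output occurs given the present basic events."""
    if isinstance(node, Basic):
        return node.name in present
    results = [occurs(child, present) for child in node.inputs]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate kind: {node.kind!r}")


# Hypothetical tree: contamination enters the oil by either route (OR),
# and it goes undetected because no oil sampling is done (AND).
contamination_enters = Gate("OR", (Basic("breather vent missing"),
                                   Basic("dirty top-up oil")))
bearing_destroyed = Gate("AND", (contamination_enters,
                                 Basic("no oil sampling")))

everything = {"breather vent missing", "dirty top-up oil", "no oil sampling"}

print(occurs(bearing_destroyed, everything))                              # True
# AND gate: removing one input is enough to prevent the top event.
print(occurs(bearing_destroyed, everything - {"no oil sampling"}))        # False
# OR gate: removing one input is not enough, the other route remains.
print(occurs(bearing_destroyed, everything - {"breather vent missing"}))  # True
```

In this toy tree, introducing oil sampling breaks the AND gate on its own, whereas fixing only the breather vent leaves the dirty top-up oil route open.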

    Fault trees take more time to construct but provide a clearer picture of complex failure mechanisms. They’re most valuable for high-consequence failures where understanding the full causal structure justifies the additional investigation effort.

    Conducting the Investigation

    Preserve Evidence

    The first rule of failure investigation: preserve the evidence before anyone cleans up, disposes of parts, or repairs the equipment. Failed components should be retained for examination. Photographs of the failure scene capture details that memories lose. Operating data from the DCS or SCADA system should be downloaded before it’s overwritten.

    This requires a culture change in many plants. The natural response to a failure is to fix it as fast as possible and get back into production. Taking 30 minutes to document and preserve evidence before starting repairs feels counterproductive in the moment, but it makes the difference between solving the problem permanently and repeating it.

    Interview People

    Talk to the operators who were running the equipment when it failed. Talk to the maintenance technicians who last worked on it. Talk to the person who found the problem. Ask open-ended questions: “Walk me through what happened.” “What did you notice before the failure?” “Was anything different about how the equipment was operating recently?”

    Don’t lead witnesses. “Did you notice the vibration was high?” presupposes the answer. “Did you notice anything unusual about the equipment?” lets them tell you what they actually observed.

    Analyze Failed Components

    Failed bearings, gears, seals, and other components carry evidence of how they failed. Bearing failure patterns — spalling, brinelling, smearing, electrical discharge, corrosion — each point to different root causes. Gear wear patterns — pitting, scoring, scuffing, tooth breakage — indicate specific loading and lubrication conditions.

    If your maintenance team doesn’t have component failure analysis expertise, send critical failed components to a metallurgical lab. A $500-1,000 lab analysis on a bearing that failed in a $50,000 gearbox is money well spent if it identifies a fixable root cause.

    Document and Act

    Write a concise RCA report: problem statement, investigation findings, root causes identified, and recommended corrective actions with responsible persons and target dates. Keep it under two pages. Long reports don’t get read.

    Track corrective actions to completion. An RCA that identifies a root cause but doesn’t result in an implemented corrective action is an academic exercise. Enter actions into your CMMS or action tracking system and review them monthly until closed.
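
For plants without a dedicated action tracker, the monthly review can be approximated with a simple list. This is a rough sketch under stated assumptions: the `CorrectiveAction` fields, the `open_actions` function, and the example actions and dates are hypothetical and do not reflect any particular CMMS.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CorrectiveAction:
    description: str
    owner: str
    target_date: date
    closed_on: Optional[date] = None  # None means still open


def open_actions(actions, today: date):
    """Return (action, days_overdue) for every open action, most overdue first.

    A negative days_overdue value means the target date has not yet passed.
    """
    pending = [(a, (today - a.target_date).days)
               for a in actions if a.closed_on is None]
    return sorted(pending, key=lambda pair: pair[1], reverse=True)


# Hypothetical actions from the gearbox investigation.
actions = [
    CorrectiveAction("Add laser alignment step to motor replacement procedure",
                     "Planner", date(2024, 5, 15)),
    CorrectiveAction("Train technicians on alignment tolerances",
                     "Supervisor", date(2024, 6, 30)),
    CorrectiveAction("Verify alignment on the replaced motor",
                     "Technician", date(2024, 5, 1), closed_on=date(2024, 4, 28)),
]

for action, days in open_actions(actions, today=date(2024, 6, 1)):
    label = f"{days} days overdue" if days > 0 else f"due in {-days} days"
    print(f"{action.owner}: {action.description} ({label})")
```

Passing the review date in explicitly, rather than reading the system clock inside the function, keeps the monthly report reproducible.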

    Share findings across the plant. If a root cause applies to similar equipment, extend the corrective action to all affected assets. A gearbox failure caused by contaminated oil doesn’t just affect one gearbox — it’s a signal to check lubrication practices across the facility.

    Building an RCA Culture

    The hardest part of RCA is cultural. In many plants, failures are treated as inevitable — “stuff breaks.” Blame follows, and people learn to hide problems rather than investigate them. An effective RCA program requires a no-blame environment where the goal is to fix the system, not punish the person.

    Make RCA results visible. Post summaries on maintenance area bulletin boards. Discuss findings in weekly maintenance meetings. Celebrate when a root cause fix prevents a repeat failure. Over time, the team internalizes the idea that understanding why things fail is more valuable than just fixing them fast.
