FMEA for Maintenance Teams: A Step-by-Step Approach to Failure Mode and Effects Analysis

By Rob Calloway, Director of Reliability Engineering

Why FMEA Belongs in Maintenance, Not Just Engineering

Failure Mode and Effects Analysis started in the U.S. military, with the MIL-P-1629 procedure issued in 1949, and was later taken up by the aerospace industry. It migrated to automotive manufacturing through the AIAG standard. But FMEA also has a direct, practical application in industrial maintenance that most plants underuse.

Maintenance teams deal with failure modes every day. They replace bearings, rebuild pumps, rewire motors, and fix leaks. The problem is that these repairs are almost always reactive — responding to failures after they occur. FMEA flips this around by systematically identifying failure modes before they happen and prioritizing action based on risk.

FMEA Terminology: Speaking the Same Language

Before you can run an FMEA, everyone on the team needs to agree on what the terms mean.

  • Function — What the item or system is supposed to do. Express functions in terms of performance requirements: “Deliver 200 GPM at 80 psig” rather than “Pump liquid.”
  • Failure Mode — A specific way the item can fail to perform its function. Failure modes should be specific enough to identify a maintenance action. “Bearing failure” is too broad. “Inner race spalling due to fatigue” tells you something actionable.
  • Failure Effect — The consequence of the failure mode at the local, system, and plant level. What does the operator see? What production is lost? What safety risks exist?
  • Failure Cause — The root mechanism driving the failure mode. Contamination, overloading, improper installation, material defect — these are causes.
  • Current Controls — What maintenance or monitoring is currently in place to detect or prevent this failure mode?

The Risk Priority Number: Useful but Imperfect

Traditional FMEA ranks risk using the Risk Priority Number (RPN) — the product of Severity, Occurrence, and Detection ratings, each scored 1-10.

  • Severity (S) — How bad is the effect if the failure occurs? A score of 1 means negligible impact. A 10 means potential safety hazard or catastrophic production loss.
  • Occurrence (O) — How likely is this failure mode? Based on historical frequency or engineering judgment when data isn’t available. A 1 means virtually impossible. A 10 means almost certain.
  • Detection (D) — How likely is it that current controls will detect the failure before it reaches the customer (or in maintenance terms, before it causes a functional failure)? A 1 means current controls will almost certainly detect it. A 10 means the failure is undetectable with current methods.

RPN = S × O × D. The maximum possible score is 1,000. Many teams act on scores above a cutoff somewhere between 100 and 200, but there’s no universal threshold.
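
As a minimal sketch of the calculation, the snippet below validates the three 1-10 ratings and multiplies them. The Ratings class and its field names are illustrative assumptions, not part of any FMEA standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ratings:
    """Severity, Occurrence, and Detection ratings, each on a 1-10 scale."""
    severity: int
    occurrence: int
    detection: int

    def __post_init__(self):
        # Reject scores outside the 1-10 scale described above.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: S x O x D."""
        return self.severity * self.occurrence * self.detection


print(Ratings(severity=7, occurrence=4, detection=6).rpn)     # 168
print(Ratings(severity=10, occurrence=10, detection=10).rpn)  # 1000
```

Validating the range up front keeps a typo such as a 0 or an 11 from quietly distorting the ranking.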

The RPN Problem

RPN treats all three factors equally. A failure mode with Severity 10, Occurrence 1, Detection 1 (RPN = 10) gets the same score as one with Severity 1, Occurrence 2, Detection 5 (RPN = 10). But the first one is a potential catastrophe that current controls catch well. The second is a trivial failure happening occasionally. Equal RPN, completely different risk profiles.

The AIAG/VDA joint FMEA standard published in 2019 addresses this with an Action Priority (AP) table that uses severity-occurrence-detection combinations rather than multiplication. If you’re setting up a new FMEA program, adopt the AP approach from the start. If you’re already using RPN, supplement it with a rule: any failure mode with Severity of 9 or 10 requires action regardless of RPN.
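
The supplementary rule for existing RPN programs can be sketched as a simple check. The function name and the default cutoff of 150 are assumptions chosen for illustration; the cutoff should be whatever your plant already uses. This sketch does not reproduce the AIAG/VDA Action Priority table.

```python
def needs_action(severity, occurrence, detection, rpn_threshold=150):
    """Flag a failure mode for action.

    A Severity of 9 or 10 always requires action, regardless of RPN.
    Otherwise the RPN is compared with a plant-specific cutoff
    (150 here is only a placeholder).
    """
    rpn = severity * occurrence * detection
    return severity >= 9 or rpn >= rpn_threshold


# The two RPN = 10 cases from the example above no longer look identical.
print(needs_action(10, 1, 1))  # True: severity override
print(needs_action(1, 2, 5))   # False: low severity, low RPN
print(needs_action(6, 5, 6))   # True: RPN of 180 exceeds the cutoff
```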

Running an FMEA Session: Practical Steps

Pre-Session Preparation

Don’t walk into an FMEA session cold. Prepare by gathering:

  • Equipment drawings and manuals
  • CMMS work order history for the equipment (minimum 2 years)
  • Operating procedures
  • Existing PM task lists
  • Any previous failure investigation reports

Pre-populate the FMEA worksheet with known functions and failure modes from the work order history. This gives the team a starting point rather than a blank page.
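
One way to seed the worksheet is to count how often each component and problem pairing shows up in the work order export. The sketch below assumes a hypothetical export in which every corrective work order carries "component" and "problem" fields; real CMMS exports will use different field names and usually need cleanup first.

```python
from collections import Counter

# Hypothetical CMMS export: one dict per corrective work order.
work_orders = [
    {"asset": "P-101", "component": "mechanical seal", "problem": "leaking"},
    {"asset": "P-101", "component": "mechanical seal", "problem": "leaking"},
    {"asset": "P-101", "component": "drive-end bearing", "problem": "high vibration"},
    {"asset": "P-102", "component": "mechanical seal", "problem": "leaking"},
    {"asset": "P-102", "component": "coupling", "problem": "misaligned"},
]


def seed_failure_modes(orders):
    """Return candidate worksheet rows, most frequent first."""
    counts = Counter((wo["component"], wo["problem"]) for wo in orders)
    return [
        {"component": component, "failure_mode": problem, "work_orders": n}
        for (component, problem), n in counts.most_common()
    ]


for row in seed_failure_modes(work_orders):
    print(row)
```

The counts are only a starting point for discussion. The team still has to restate each entry as a specific failure mode and add the modes that never appeared in the history.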

The Session

Team composition matters. Include an operator who runs the equipment daily, a maintenance technician who works on it, a reliability engineer or planner, and a facilitator who keeps the process on track. The facilitator doesn’t need to be an equipment expert — their job is to manage the process and challenge assumptions.

Work through the equipment systematically. Start with the primary function and work through each component or subsystem. For each failure mode identified:

  1. Describe the failure effect clearly. What does the operator experience? What happens to production? Any safety or environmental consequences?
  2. Identify the failure cause or mechanism.
  3. Document current detection and prevention controls.
  4. Score Severity, Occurrence, and Detection.
  5. Calculate RPN or determine Action Priority.
  6. Assign recommended actions for high-risk items with responsible person and target date.
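
The six steps map naturally onto one worksheet row per failure mode. The sketch below is one possible layout, not a standard schema: the FmeaRow class, its field names, the 150 cutoff, and the sample effect text are all assumptions. It also flags high-risk rows that still lack an action, owner, or target date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FmeaRow:
    """One worksheet line: a single failure mode and its assessment."""
    function: str
    failure_mode: str
    effect: str                                # step 1
    cause: str                                 # step 2
    current_controls: str                      # step 3
    severity: int                              # step 4, each scored 1-10
    occurrence: int
    detection: int
    recommended_action: Optional[str] = None   # step 6
    owner: Optional[str] = None
    target_date: Optional[date] = None

    @property
    def rpn(self) -> int:                      # step 5
        return self.severity * self.occurrence * self.detection

    def missing_follow_up(self, rpn_threshold: int = 150) -> bool:
        """True when a high-risk row lacks an action, owner, or target date."""
        high_risk = self.severity >= 9 or self.rpn >= rpn_threshold
        assigned = all([self.recommended_action, self.owner, self.target_date])
        return high_risk and not assigned


row = FmeaRow(
    function="Deliver 200 GPM at 80 psig",
    failure_mode="Inner race spalling due to fatigue",
    effect="Rising vibration, then pump seizure and loss of flow",
    cause="Contamination",
    current_controls="Monthly vibration route",
    severity=8, occurrence=5, detection=4,
)
print(row.rpn)                  # 160
print(row.missing_follow_up())  # True
```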

Limit sessions to 2-3 hours. Fatigue sets in quickly with this level of detailed analysis. Multiple shorter sessions produce better results than marathon sessions.

Common Pitfalls

Going too deep. Analyzing at the individual fastener level is a waste of time for most industrial equipment. Stay at the component level — bearings, seals, impellers, windings — unless a specific sub-component has a known problematic failure mode.

Underrating Occurrence scores. Teams that haven’t personally seen a specific failure mode tend to rate it as unlikely. Check your data. A failure that happens once every three years on a single machine might not seem frequent, but across 50 similar machines in the plant, that’s roughly 17 failures per year (50 ÷ 3 ≈ 16.7).
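
The fleet arithmetic above is easy to check for your own equipment counts. This sketch assumes every machine fails at the same average rate, which is a simplification.

```python
def fleet_failures_per_year(machines, years_between_failures):
    """Expected failures per year across a fleet of similar machines,
    assuming each one fails once every `years_between_failures` on average."""
    return machines / years_between_failures


rate = fleet_failures_per_year(machines=50, years_between_failures=3)
print(f"{rate:.1f} failures per year")  # 16.7 failures per year
```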

Ignoring Detection. Many teams rush through detection scoring. This is where the actionable insight lives. A high-severity, high-occurrence failure mode with good detection (low D score) is being managed. The same failure mode with poor detection (high D score) is a ticking bomb.

No follow-through. Recommended actions that never get implemented waste everyone’s time. Track FMEA actions in your CMMS or action tracking system. Review progress monthly until all high-priority actions are closed.
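
For the monthly review, a short filter over the action log is often enough to surface what has slipped. The log layout below is hypothetical; in practice the records would come from your CMMS or action tracking system.

```python
from datetime import date

# Hypothetical action log entries.
actions = [
    {"id": "A-1", "priority": "high", "due": date(2024, 3, 1), "closed": False},
    {"id": "A-2", "priority": "high", "due": date(2024, 6, 1), "closed": False},
    {"id": "A-3", "priority": "high", "due": date(2024, 2, 1), "closed": True},
    {"id": "A-4", "priority": "low", "due": date(2024, 1, 15), "closed": False},
]


def overdue_high_priority(log, today):
    """IDs of open high-priority actions whose target date has passed."""
    return [
        a["id"]
        for a in log
        if a["priority"] == "high" and not a["closed"] and a["due"] < today
    ]


print(overdue_high_priority(actions, today=date(2024, 4, 1)))  # ['A-1']
```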

Turning FMEA Results Into Maintenance Strategy

FMEA results directly inform your maintenance program:

  • High Severity + Poor Detection = Add a predictive maintenance task to improve detection. Choose vibration monitoring, oil analysis, or thermography to match the specific failure mode.
  • High Severity + High Occurrence = Address the root cause. Change materials, modify operating procedures, or redesign the component. Maintenance alone can’t fix a design or operational problem.
  • Low Severity + Any Occurrence = Run-to-failure is likely appropriate. Don’t waste resources preventing failures that don’t matter.
  • Good Detection (low D score, failure easily caught early) = Validate that your current detection methods are actually being performed and acted upon. A vibration program that collects data but doesn’t analyze it offers zero detection capability despite being listed as a control.
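
The rules above can be sketched as a small decision helper. The cutoffs used here (7 and above counts as "high" or "poor", 3 and below counts as "low" or "good") are assumptions for illustration only; set them to match your own rating scales.

```python
def suggest_strategies(severity, occurrence, detection, high=7, low=3):
    """Map S/O/D scores to the maintenance strategy rules listed above.

    Remember that a high Detection score means POOR detection.
    """
    if severity <= low:
        # Low severity: any occurrence level leads to the same answer.
        return ["run-to-failure is likely appropriate"]

    suggestions = []
    if severity >= high and detection >= high:
        suggestions.append("add a predictive maintenance task to improve detection")
    if severity >= high and occurrence >= high:
        suggestions.append("address the root cause (materials, procedures, or redesign)")
    if detection <= low:
        suggestions.append("verify detection controls are performed and acted upon")
    return suggestions


print(suggest_strategies(severity=9, occurrence=8, detection=8))
print(suggest_strategies(severity=2, occurrence=9, detection=9))
print(suggest_strategies(severity=8, occurrence=3, detection=2))
```

A failure mode can match more than one rule, so the helper returns a list. An empty list means the scores fall in the middle range, where the team has to make a judgment call.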

Revisit your FMEA when significant changes occur — new operating conditions, equipment modifications, or failure events that weren’t captured in the original analysis. A living FMEA that evolves with your equipment is a strategic asset. A static FMEA that sits in a folder is just paperwork.
