What Is Failure Mode and Effects Analysis?
Failure mode and effects analysis (FMEA) is a structured, systematic methodology for identifying every way a component, subsystem, or system can fail, evaluating the consequences of each failure, and prioritizing maintenance or design actions to mitigate the highest-risk failure modes. It is both a reliability engineering tool and a risk management framework, one that transforms equipment maintenance from experience-based guesswork into an analytically justified strategy in which every maintenance task exists for a documented, defensible reason.
The methodology originated in the U.S. military in the late 1940s as MIL-P-1629, later formalized as MIL-STD-1629A, “Procedures for Performing a Failure Mode, Effects and Criticality Analysis.” It was adopted by NASA during the Apollo program, entered the automotive industry through Ford Motor Company in the 1970s, and has since become a foundational tool in aerospace (SAE ARP5580), automotive (SAE J1739, AIAG-VDA FMEA Handbook), medical device manufacturing (ISO 14971), and industrial maintenance and reliability consulting. Its endurance across seven decades and multiple industries reflects a fundamental truth: understanding how things fail is the prerequisite to preventing failure efficiently.
At its core, FMEA works through a disciplined sequence: define the function of each component or system, identify every mode in which that function can be lost or degraded, determine the effect of each failure mode on the system and the operation, assess the severity of the consequence, estimate the likelihood of occurrence, evaluate the ability of current controls to detect the failure before it reaches the consequence stage, and combine these factors into a risk priority that directs maintenance and design resources toward the failure modes that matter most.
FMEA vs. FMECA: Understanding the Distinction
The terms FMEA and FMECA (Failure Mode, Effects, and Criticality Analysis) are frequently used interchangeably, but they represent different levels of analytical depth. FMEA identifies failure modes, their effects, and a qualitative or semi-quantitative risk ranking. FMECA extends the analysis with a formal criticality assessment, typically using quantitative failure rate data to calculate a criticality number for each failure mode from the failure rate of the part, the fraction of that part’s failures attributable to the mode, the conditional probability that the mode will result in the identified effect, and the operating time or number of cycles. MIL-STD-1629A defines both procedures, with the criticality analysis (CA) as a distinct step that can be added to the base FMEA.
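To make the criticality calculation concrete, here is a minimal Python sketch that follows the general form of the failure mode criticality number in MIL-STD-1629A: conditional probability of the effect, times failure mode ratio, times part failure rate, times operating time. The class name, field names, and every numeric input are hypothetical illustrations, not data from a real asset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeInputs:
    """Inputs for one failure mode. All values used below are illustrative."""
    name: str
    beta: float          # conditional probability that the mode produces the stated effect
    alpha: float         # failure mode ratio: share of the part's failures in this mode
    failure_rate: float  # part failure rate, failures per operating hour
    hours: float         # operating time covered by the analysis


def mode_criticality(m: ModeInputs) -> float:
    """Failure mode criticality number: beta * alpha * failure rate * time."""
    return m.beta * m.alpha * m.failure_rate * m.hours


# Hypothetical part with two failure modes sharing one part failure rate.
modes = [
    ModeInputs("seal leak", beta=0.5, alpha=0.6, failure_rate=2e-5, hours=8000),
    ModeInputs("bearing seizure", beta=1.0, alpha=0.4, failure_rate=2e-5, hours=8000),
]

# Rank the modes from most to least critical.
for m in sorted(modes, key=mode_criticality, reverse=True):
    print(f"{m.name}: {mode_criticality(m):.3f}")
```

The point of the sketch is the data requirement: each mode needs a failure rate and a mode ratio, which is exactly the quantitative input that many maintenance organizations do not have.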
In industrial maintenance applications, the FMEA with Risk Priority Number (RPN) scoring is the most widely used approach because it provides actionable prioritization without requiring the detailed quantitative failure rate data that full FMECA demands. The RPN methodology uses three factors — severity, occurrence, and detection — each rated on a scale of 1 to 10, multiplied together to produce a composite score ranging from 1 to 1,000. This scoring system, while imperfect in its mathematical properties (an RPN of 100 could result from a 10-10-1 combination or a 5-5-4 combination, with very different risk implications), provides a practical framework for comparing and prioritizing hundreds or thousands of failure modes across a complex asset base.
The RPN Scoring Framework
Severity (S) rates the consequence of the failure mode on a 1-10 scale, where 1 represents a negligible effect with no meaningful impact on operations and 10 represents a catastrophic consequence involving potential injury, regulatory violation, or complete loss of a critical system. Severity is the one factor that cannot be reduced by maintenance — it is inherent to the failure mode and the system design. A catastrophic turbine blade liberation event has a severity of 9 or 10 regardless of how frequently it occurs or how well it is monitored. This is why high-severity failure modes receive disproportionate attention even when their occurrence probability is low.
Occurrence (O) rates the likelihood of the failure mode developing during the assessment period, typically on a 1-10 scale where 1 represents an extremely unlikely event (less than 1 in 1,000,000 operating hours) and 10 represents a near-certain occurrence (failure is almost inevitable during the assessment interval). Occurrence ratings are based on historical failure data where available, supplemented by engineering judgment, industry reliability databases (such as OREDA for offshore and process equipment), and OEM-published failure rate data. In maintenance applications, occurrence ratings reflect both the inherent reliability of the component and the effectiveness of the current maintenance strategy in preventing the failure mode.
Detection (D) rates the probability that the failure mode will be detected by current controls — inspections, condition monitoring, process alarms, operator observations — before it progresses to the consequence described in the effects column. A rating of 1 means the failure mode is almost certain to be detected before any consequence occurs; a rating of 10 means there is no current means of detecting the failure before it reaches its full consequence. Detection is the factor most directly influenced by maintenance strategy: implementing vibration monitoring on a critical pump bearing changes the detection rating from perhaps 8 (no monitoring, failure detected only at functional failure) to 2 or 3 (developing defect detected months before functional failure).
The RPN is calculated as S x O x D. A failure mode with high severity, moderate occurrence, and poor detection — say 9 x 4 x 8 = 288 — demands immediate attention. A failure mode scoring 2 x 2 x 5 = 20 is appropriately managed with existing controls or accepted as a run-to-failure candidate.
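The arithmetic above can be sketched in a few lines of Python. This is an illustration only: the function name `rpn`, the range check, and the short descriptions are assumptions made for the example, and the two sets of ratings are the ones quoted in the text.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: S x O x D, each factor rated 1 to 10."""
    for label, rating in (
        ("severity", severity),
        ("occurrence", occurrence),
        ("detection", detection),
    ):
        if not 1 <= rating <= 10:
            raise ValueError(f"{label} must be between 1 and 10, got {rating}")
    return severity * occurrence * detection


# (description, S, O, D) using the ratings quoted in the text.
failure_modes = [
    ("minor, infrequent", 2, 2, 5),
    ("high severity, poor detection", 9, 4, 8),
]

# Sort so the highest-risk failure mode is listed first.
ranked = sorted(failure_modes, key=lambda fm: rpn(*fm[1:]), reverse=True)
for description, s, o, d in ranked:
    print(f"{description}: RPN = {rpn(s, o, d)}")
```

The same function reproduces the ambiguity noted earlier: `rpn(10, 10, 1)` and `rpn(5, 5, 4)` both return 100, even though the underlying risks are very different.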
Functional Failure Analysis: The Foundation of Meaningful FMEA
The quality of an FMEA depends entirely on the quality of the functional analysis that precedes it. Before failure modes can be identified, the functions of the equipment must be explicitly defined — not in generic terms but in specific, quantified performance standards. A centrifugal pump’s function is not simply “to pump fluid.” Its function is “to deliver 500 GPM of process water at 150 PSIG discharge pressure with less than 2% flow variation under normal operating conditions.” This specificity matters because failure modes are defined as the loss or degradation of function. A pump that delivers 400 GPM instead of 500 GPM has experienced a functional failure even though it is still running — and the failure modes, effects, and appropriate responses for a capacity degradation are entirely different from those for a complete cessation of flow.
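A quantified performance standard is what allows a functional failure to be checked rather than debated. The Python sketch below is a simplified illustration: it covers only the flow requirement, and it treats the 2% figure as the allowed shortfall below the required flow, which is an assumption made for the example. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowStandard:
    """Performance standard for one function (flow only, for illustration)."""
    required_gpm: float
    tolerance: float  # allowed fractional shortfall, e.g. 0.02 for 2% (assumed reading)


def classify_flow(standard: FlowStandard, measured_gpm: float) -> str:
    """Compare measured flow with the standard and name the functional state."""
    if measured_gpm <= 0:
        return "total functional failure"
    if measured_gpm < standard.required_gpm * (1 - standard.tolerance):
        return "partial functional failure"
    return "function met"


pump = FlowStandard(required_gpm=500, tolerance=0.02)
for flow in (500, 400, 0):
    print(flow, "GPM ->", classify_flow(pump, flow))
```

Under this standard a pump delivering 400 GPM is classified as a partial functional failure even though it is still running, which is the distinction the paragraph above draws.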
This functional approach also reveals hidden failures — failure modes that are not evident to the operating crew under normal conditions but which eliminate a protective function that will be needed in an abnormal situation. A standby pump that has seized due to corrosion in its casing has experienced a hidden failure: the operators may not know it has failed until they need it. A pressure relief valve that has corroded shut has failed in a hidden mode that will only become apparent when overpressure protection is needed. Hidden failure modes are among the most dangerous in any industrial system, and they are systematically identified only through the functional analysis discipline that FMEA imposes.
What Are the Signs Your Facility Needs Failure Mode and Effects Analysis?
FMEA delivers the greatest value in facilities where the current maintenance strategy has evolved organically rather than analytically — where PM tasks were established based on OEM manuals, vendor recommendations, or predecessor experience rather than a systematic evaluation of what actually fails and why. The following indicators suggest that your maintenance program will benefit from FMEA-driven optimization.
- Your preventive maintenance program is based primarily on manufacturer recommendations and time-based intervals, with limited consideration of operating context, failure history, or actual equipment criticality
- PM tasks are consuming significant labor hours, but their effectiveness at preventing failures is uncertain — you are performing maintenance activities without clear evidence that they address the failure modes actually occurring on your equipment
- The same types of failures recur despite what appears to be an adequate preventive maintenance program, suggesting that the PM tasks are not targeting the correct failure modes or are not detecting degradation before functional failure
- You are implementing or expanding a predictive maintenance program and need to determine which monitoring technologies should be applied to which equipment based on dominant failure modes rather than blanket coverage
- Capital modification or redesign projects are being proposed, and you need a systematic evaluation of failure risk to justify the investment and ensure the modification addresses the actual problem
- Your facility is pursuing reliability-centered maintenance (RCM) implementation and needs the failure mode identification and consequence evaluation that forms the analytical core of the RCM process per SAE JA1011 and JA1012
- New equipment is being commissioned, and you want to establish an optimized maintenance strategy from day one rather than defaulting to generic PM templates and adjusting after failures occur
- Regulatory requirements — OSHA Process Safety Management, EPA Risk Management Program, FDA CGMP — require documented risk assessment for equipment whose failure could affect safety, environmental compliance, or product quality
- Your maintenance team disagrees about the appropriate strategy for critical equipment, and you need an evidence-based framework to resolve competing opinions and establish a defensible maintenance plan
- Insurance underwriters or risk engineers have recommended more rigorous failure risk evaluation for high-consequence equipment, and you need a methodology that produces documented, auditable results
Our Failure Mode and Effects Analysis Approach
Our FMEA methodology is rooted in the reliability-centered maintenance (RCM) decision logic established in SAE JA1011 and refined through practical application across hundreds of industrial asset types. We apply FMEA not as an academic exercise that produces impressive documentation but as a working analytical tool that produces specific, implementable maintenance task recommendations — each task justified by a documented failure mode, a quantified risk, and a clear rationale for why that task is the most effective and cost-efficient response.
Operating Context Definition
Every FMEA begins with a clear definition of the operating context in which the equipment functions. The same pump model operating in clean water service at ambient temperature and steady-state conditions will experience fundamentally different failure modes than an identical pump in slurry service at elevated temperature with frequent start-stop cycling. OEM failure data and generic FMEA templates do not account for these contextual differences. Our analysis does. We define the operating context — process conditions, environmental exposure, operating profile, quality of incoming utilities, and the consequences of failure specific to the equipment’s role in your process — before the first failure mode is identified.
Functional Analysis and System Boundary Definition
We define each asset’s functions in specific, measurable terms and establish clear system boundaries that identify what is included in the analysis and what is addressed by adjacent systems. Functions are categorized as primary (the reason the asset exists), secondary (additional requirements such as containment, environmental compliance, structural support), and protective (functions that exist to mitigate the consequences of other failures, such as relief valves, emergency shutoffs, and backup systems). This categorization is essential because protective functions generate hidden failure modes that must be addressed with failure-finding tasks at intervals calculated from the relationship between the hidden failure mode’s failure rate and the required availability of the protective function.
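One commonly cited RCM approximation for that relationship sets the failure-finding interval to twice the allowed unavailability of the protective function multiplied by the mean time between failures of the protective device. The Python sketch below uses that approximation as an assumption; it applies to a single protective device with a roughly constant failure rate and a small allowed unavailability, and it is not necessarily the calculation used in any particular analysis. The function name and the input values are hypothetical.

```python
def failure_finding_interval(mtbf: float, required_availability: float) -> float:
    """Approximate failure-finding interval, in the same time unit as mtbf.

    Uses interval = 2 * (1 - required availability) * MTBF, a simplification
    for a single protective device with a constant failure rate.
    """
    if mtbf <= 0:
        raise ValueError("mtbf must be positive")
    if not 0 < required_availability < 1:
        raise ValueError("required_availability must be between 0 and 1")
    allowed_unavailability = 1 - required_availability
    return 2 * allowed_unavailability * mtbf


# Hypothetical relief valve: MTBF of 120 months, 99% required availability.
interval = failure_finding_interval(mtbf=120, required_availability=0.99)
print(f"Test at least every {interval:.1f} months")
```

The sketch shows why the interval cannot be set by habit: raising the required availability or lowering the device's MTBF shortens the interval proportionally.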
Failure Mode Identification
For each function, we identify every reasonably likely failure mode — the specific mechanism by which the function can be lost or degraded. We draw on multiple sources: facility-specific failure history from CMMS records and maintenance team experience, industry failure databases, OEM technical bulletins and known failure modes, applicable standards and codes, and the engineering analysis of the equipment’s materials, design, and operating stresses. Our goal is completeness without absurdity — we identify every failure mode that is reasonably likely given the operating context while avoiding hypothetical scenarios that have no practical relevance.
Each failure mode is classified as evident or hidden. Evident failures are apparent to the operating crew during normal duties — the pump stops, the pressure gauge reads zero, the conveyor belt stops moving. Hidden failures are not apparent under normal conditions and will only be revealed by a specific test, inspection, or the occurrence of the situation the hidden function was designed to protect against. This classification directly determines the type of maintenance task appropriate for each failure mode. Hidden failures require scheduled failure-finding tasks; evident failures are addressed through condition-based, scheduled restoration, scheduled discard, or redesign strategies depending on their consequences and the availability of effective detection methods.
Risk Evaluation and Prioritization
We evaluate each failure mode using the RPN methodology with severity, occurrence, and detection scales calibrated to your facility’s specific risk tolerance and consequence definitions. Our scales are not generic 1-10 ratings from a textbook — they are customized to reflect your facility’s production economics, safety standards, environmental permit conditions, and regulatory obligations. A severity rating of 8 means the same thing across every asset in the analysis, tied to specific, defined consequences relevant to your operation.
We supplement RPN scoring with a severity-occurrence risk matrix that addresses one of the RPN method’s recognized limitations: the inability to distinguish between high-severity/low-occurrence and low-severity/high-occurrence combinations that produce identical RPNs. A failure mode with a severity of 10 and occurrence of 2 (catastrophic but rare) requires a fundamentally different response than a failure mode with severity 2 and occurrence 10 (trivial but constant), even though both contribute an S x O product of 20. Our risk matrix captures this distinction and ensures that high-severity failure modes receive appropriate attention regardless of their RPN ranking.
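A severity-occurrence matrix of this kind can be sketched as a small lookup in Python. The zone names, the thresholds, and the severity override below are hypothetical values chosen for illustration; a real matrix would use the facility's own calibrated scales.

```python
def risk_zone(severity: int, occurrence: int) -> str:
    """Place a failure mode in a risk zone from its S and O ratings (1 to 10)."""
    if not (1 <= severity <= 10 and 1 <= occurrence <= 10):
        raise ValueError("ratings must be between 1 and 10")
    # Severity override (assumed rule): catastrophic consequences are always
    # treated as high risk, however rarely they occur.
    if severity >= 9:
        return "high"
    product = severity * occurrence
    if product >= 40:
        return "high"
    if product >= 15:
        return "medium"
    return "low"


catastrophic_but_rare = (10, 2)
trivial_but_constant = (2, 10)

for label, (s, o) in (
    ("catastrophic but rare", catastrophic_but_rare),
    ("trivial but constant", trivial_but_constant),
):
    print(f"{label}: S x O = {s * o}, zone = {risk_zone(s, o)}")
```

Both failure modes have an S x O product of 20, but the override places the catastrophic one in the high zone while the trivial one lands in the medium zone.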
Maintenance Task Selection
The output of our FMEA is not a risk register — it is a set of specific maintenance task recommendations for each failure mode that exceeds the acceptable risk threshold. Task selection follows the RCM decision logic: condition-based maintenance is the preferred strategy when an applicable and effective condition monitoring technique exists (the P-F interval is long enough to be useful, the failure progression is detectable, and the cost of monitoring is justified by the consequence). Where condition monitoring is not technically feasible or economically justified, scheduled restoration or discard tasks are evaluated based on the existence of an identifiable wear-out age for the failure mode. Where no preventive task is applicable and effective, the options are redesign (for safety and environmental consequences) or run-to-failure (for economic consequences where the cost of prevention exceeds the cost of failure).
Each recommended task specifies the task description, the failure mode it addresses, the recommended interval, the craft and skill level required, the reference documents and procedures needed, and the specific condition indicators that should trigger escalation from routine monitoring to corrective action.
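The task selection logic described in this section can be sketched as a short decision function. This Python example is a deliberate simplification: the field names, the 14-day minimum useful P-F interval, and the example failure modes are hypothetical, and the full RCM decision logic in SAE JA1011 and JA1012 contains more questions than are shown here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Consequence(Enum):
    SAFETY_OR_ENVIRONMENTAL = "safety/environmental"
    ECONOMIC = "economic"


@dataclass(frozen=True)
class FailureMode:
    name: str
    hidden: bool
    consequence: Consequence
    pf_interval_days: Optional[float]  # None when degradation is not detectable
    monitoring_cost_justified: bool
    has_wear_out_age: bool


def select_task(fm: FailureMode, min_useful_pf_days: float = 14.0) -> str:
    """Return a task type for one failure mode (simplified decision order)."""
    # Hidden failures are handled with failure-finding tasks.
    if fm.hidden:
        return "scheduled failure-finding"
    # Condition-based maintenance needs a detectable, long-enough P-F interval
    # and a monitoring cost that the consequence justifies.
    detectable_in_time = (
        fm.pf_interval_days is not None
        and fm.pf_interval_days >= min_useful_pf_days
    )
    if detectable_in_time and fm.monitoring_cost_justified:
        return "condition-based maintenance"
    # Time-based tasks only make sense when there is an identifiable wear-out age.
    if fm.has_wear_out_age:
        return "scheduled restoration or discard"
    # No applicable preventive task: redesign or accept the failure.
    if fm.consequence is Consequence.SAFETY_OR_ENVIRONMENTAL:
        return "redesign"
    return "run-to-failure"


# Hypothetical examples.
bearing = FailureMode("bearing wear", False, Consequence.ECONOMIC, 60.0, True, False)
relief_valve = FailureMode("relief valve stuck shut", True,
                           Consequence.SAFETY_OR_ENVIRONMENTAL, None, False, False)
lamp = FailureMode("indicator lamp burnout", False, Consequence.ECONOMIC, None, False, False)

for fm in (bearing, relief_valve, lamp):
    print(f"{fm.name}: {select_task(fm)}")
```

Encoding the logic this way makes each recommendation traceable to the specific answers that produced it, which is the property the documented task rationale is meant to provide.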
Systems and Equipment Typically Covered
Process-Critical Rotating Equipment
Centrifugal and positive displacement pumps, compressors, fans, blowers, turbines, agitators, mixers, and centrifuges. FMEA on rotating equipment typically identifies 15-40 failure modes per asset depending on complexity, spanning bearing systems, sealing systems, shaft assemblies, impellers and rotors, lubrication systems, coupling and alignment, driver interfaces, and instrumentation. The analysis consistently reveals failure modes that are not addressed by standard OEM-recommended PM tasks — particularly failure modes related to operating context factors such as cavitation, hydraulic instability, and thermal cycling that generic maintenance templates do not contemplate.
Electrical Power and Distribution Systems
Transformers, circuit breakers, switchgear, motor control centers, protective relays, variable frequency drives, and power cables. Electrical system FMEA addresses insulation degradation, connection integrity, protective device coordination, and the hidden failure modes inherent in protective functions (relays that fail to trip, breakers that fail to open, transfer switches that fail to operate). These hidden failures are particularly consequential in electrical systems because a failed protective device may not be detected until a fault condition occurs — at which point the protection is needed and unavailable.
Pressure Equipment and Piping Systems
Pressure vessels, heat exchangers, columns, reactors, piping systems, and associated relief and safety devices. FMEA on pressure equipment integrates with risk-based inspection (RBI) methodologies per API 580/581, with failure modes categorized by degradation mechanism (internal corrosion, external corrosion, stress corrosion cracking, fatigue, creep, erosion, hydrogen damage) and consequences evaluated in terms of safety, environmental release, and production impact. The analysis supports the determination of inspection intervals, inspection methods, and monitoring locations that target the specific degradation mechanisms active in each service environment.
Safety Instrumented Systems and Protective Devices
Emergency shutdown valves, pressure relief devices, fire and gas detection systems, safety interlocks, and safety instrumented functions (SIFs). FMEA on safety systems is particularly critical because these systems exist solely to mitigate the consequences of other failures — they are, by definition, hidden-function equipment whose failure will not be apparent until the protective function is demanded. IEC 61511 and ISA 84 standards require systematic failure analysis of safety instrumented systems, and FMEA provides the structured methodology to identify dangerous failure modes, determine appropriate proof test intervals based on target safety integrity levels (SIL), and document the analysis in a format that satisfies functional safety audit requirements.
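The link between proof test interval and SIL can be illustrated with the widely used single-channel approximation, in which the average probability of failure on demand is the dangerous undetected failure rate times the proof test interval, divided by two. The Python sketch below assumes that approximation, a low-demand application, and a hypothetical failure rate; it ignores diagnostics, common cause failures, repair time, and imperfect testing, so it is not a substitute for a full SIL verification.

```python
from typing import Optional

# Low-demand SIL bands as (lower, upper) bounds on average probability of
# failure on demand: lower <= PFDavg < upper.
SIL_PFD_BANDS = {
    1: (1e-2, 1e-1),
    2: (1e-3, 1e-2),
    3: (1e-4, 1e-3),
    4: (1e-5, 1e-4),
}


def pfd_avg(lambda_du: float, test_interval_hours: float) -> float:
    """Simplified single-channel PFDavg = lambda_DU * TI / 2."""
    return lambda_du * test_interval_hours / 2


def interval_limit_hours(lambda_du: float, target_sil: int) -> float:
    """Proof test interval at which PFDavg reaches the top of the SIL band.

    The actual interval must be shorter than this value.
    """
    _, upper = SIL_PFD_BANDS[target_sil]
    return 2 * upper / lambda_du


def achieved_sil(pfd: float) -> Optional[int]:
    """Return the SIL band containing pfd, or None if it falls outside all bands."""
    for sil, (lower, upper) in SIL_PFD_BANDS.items():
        if lower <= pfd < upper:
            return sil
    return None


# Hypothetical shutdown valve: 2e-6 dangerous undetected failures per hour.
lambda_du = 2e-6
yearly_pfd = pfd_avg(lambda_du, test_interval_hours=8760)
print(f"Annual proof test: PFDavg = {yearly_pfd:.5f}, SIL {achieved_sil(yearly_pfd)}")
print(f"SIL 2 interval limit: {interval_limit_hours(lambda_du, 2):.0f} hours")
```

With these assumed inputs an annual proof test sits inside the SIL 2 band, and stretching the interval toward the printed limit erodes the margin.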
Material Handling and Conveying Systems
Belt conveyors, screw conveyors, bucket elevators, pneumatic conveying systems, overhead cranes, and hoists. These systems present unique FMEA challenges because their failure modes often involve structural fatigue, belt tracking and splice integrity, drive system degradation, and environmental factors (dust, moisture, impact loading) that accelerate degradation beyond rates predicted by generic reliability data.
HVAC, Utility, and Support Systems
Compressed air systems, cooling water systems, steam generation and distribution, HVAC systems, and water treatment plants. While individually these assets may not rank as the highest-criticality equipment in a facility, their failure modes frequently affect multiple production systems simultaneously. FMEA on utility systems often reveals that the maintenance strategy underestimates the cascading consequence of failure — a compressed air system failure that halts an entire packaging line, or a cooling water pump failure that forces a process unit shutdown.
What Results Do Companies Typically See?
FMEA-driven maintenance optimization produces measurable results across maintenance efficiency, equipment reliability, and maintenance cost management. The improvements are most dramatic in facilities that have been operating with legacy PM programs based on OEM recommendations and institutional habit, because the analytical rigor of FMEA consistently reveals both over-maintenance (unnecessary tasks consuming labor without reducing risk) and under-maintenance (unaddressed failure modes generating avoidable failures).
- PM task list optimization of 20-40%. FMEA analysis typically finds that 15-25% of existing PM tasks do not address any failure mode that is reasonably likely in the actual operating context, while 10-20% of active failure modes are not addressed by any current maintenance activity. The net result is a PM program that is both more focused (fewer tasks with no clear purpose) and more complete (new tasks that address previously unmanaged risks). Total PM labor hours may increase, decrease, or stay roughly the same — but the labor is reallocated from low-value activities to high-value activities that directly mitigate identified risks.
- Unplanned failure reduction of 25-50% on analyzed equipment. When maintenance tasks are specifically targeted at the failure modes that actually occur on your equipment in your operating context, the probability of those failure modes progressing to functional failure decreases substantially. Facilities that implement FMEA-derived maintenance strategies on their critical and semi-critical assets consistently report meaningful reductions in unplanned failure events within 12-24 months of implementation.
- Condition monitoring program justification and optimization. FMEA provides the analytical basis for determining which assets should receive predictive maintenance monitoring, which technologies are applicable to the dominant failure modes, and what monitoring intervals are appropriate given the P-F intervals for each failure mode. This prevents both the waste of monitoring assets where condition-based detection is ineffective and the risk of failing to monitor assets where detectable, high-consequence failure modes are active.
- Spare parts inventory rationalization. When the failure modes for each critical asset are documented with their expected degradation rates and the components they affect, spare parts stocking decisions can be based on failure risk rather than tradition. High-consequence failure modes with long lead time replacement components justify stocking; low-consequence failure modes on equipment with readily available parts do not. FMEA findings consistently identify opportunities to reduce spare parts investment by 10-20% while simultaneously improving parts availability for the repairs that matter most.
- Documented risk basis for maintenance strategy. In regulated industries and in facilities pursuing ISO 55001 asset management certification, FMEA provides the documented, auditable evidence that maintenance activities are based on systematic risk assessment rather than arbitrary schedules or subjective judgment. This documentation satisfies regulatory and audit requirements, supports management of change processes when maintenance strategies are revised, and provides the basis for informed discussions between maintenance, operations, and management about risk acceptance and resource allocation.
- Design and procurement improvement. FMEA findings frequently identify design deficiencies or material selection issues that no amount of maintenance can overcome. When a failure mode is driven by inadequate design margin, inappropriate material selection, or a component specification that does not match the service environment, the FMEA documents this finding and supports the capital investment case for redesign or upgrade. Similarly, recurring failure modes traced to procurement specification gaps — generic bearing replacements where application-specific bearings are required, for example — lead to specification improvements that reduce failure recurrence at the procurement level rather than the maintenance level.
- Organizational alignment on maintenance strategy. One of the least quantifiable but most frequently cited benefits of the FMEA process is the alignment it creates across maintenance, operations, reliability engineering, and management. The cross-functional team involved in the analysis develops a shared understanding of how equipment fails, what the consequences are, and why specific maintenance activities exist. This shared understanding reduces the friction between departments competing for maintenance resources, provides a common language for discussing risk, and builds organizational support for maintenance investments that are backed by documented analysis rather than opinion.
The long-term value of FMEA extends beyond the initial analysis. As your facility accumulates operating experience, failure data, and condition monitoring results, the FMEA becomes a living document that is updated to reflect new knowledge. Failure modes that were rated as low-occurrence based on limited data may need re-evaluation as operating hours accumulate. New failure modes may be identified as equipment ages or process conditions change. The analytical framework established by the initial FMEA provides the structure for incorporating this evolving knowledge into an increasingly refined and effective maintenance strategy — one that improves continuously because it is built on a foundation of systematic analysis rather than static assumptions.