Pumps and Systems, February 2007

A large refinery or power plant may operate up to a dozen units of the same design, all installed 20 or 30 years earlier. In this mature equipment environment, some level of reactivity is a fact of life, at least initially. The trick is to react proactively, with tools that raise the plant's long-term equipment reliability level.

A mature plant that has been neglected for some time requires that problem areas be addressed first and that they be moved through a series of steps:

1. Reactive Response

Obviously, equipment that is broken must be fixed first. To be effective, even in a reactive mode, the maintenance operation must be able to identify, prioritize, plan, and schedule work; operate and maintain a CMMS that drives a reasonable time-driven Preventive Maintenance schedule; and manage materials and other resources to support the execution of work.

If the maintenance organization needs training to reach this level of operation, then provide the time and resources. This is the necessary starting point (Stage I) for many companies.

2. Proactive Reactivity

Once PM and repairs are being scheduled and executed effectively, implement a Reliability Continuous Improvement (RCI) initiative that focuses on perennial equipment problems and stamps them out in a series of short projects that enhance the equipment as it is being repaired.

In the mature equipment world, perennial problems typically develop where equipment has a combination of problems. RCI addresses this situation by analyzing the mix of material, equipment, engineering, training, and other causes that contribute to the perennial problems. Properly planned - and with unrelenting follow-through on projects - this initiative can eliminate the 10 percent of equipment problems that cause 90 percent of production losses. This is the second step (Stage II) for many companies.

Proactive tools typically include condition monitoring, further craft skills enhancements, refinement of equipment history capture, and the development of true asset healthcare strategies for key equipment. With mature equipment, a survey of potential failure modes may have to be postponed to pursue solutions to actual failure modes.

To start, use Cost of Unreliability (CoUR) analysis and safety reporting to identify the equipment that causes the most grief. CoUR combines some costs that are already being tracked, such as maintenance labor and material, lost production, expedited shipping, overtime, and equipment rental. Assign these costs to the equipment groups that cause them. The result can be expressed as a graph like Figure 1. Note that the X-axis is not time; it is the company's "fleet" of equipment.

Figure 1. Equipment Failure Pareto - CoUR (Cost of Unreliability).
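
The CoUR roll-up behind a chart like Figure 1 can be sketched in a few lines of Python. This is a minimal illustration only, not the author's tool: the equipment groups, cost categories, dollar figures, and the function names cour_pareto and top_contributors are all hypothetical, and the 80 percent cut-off is simply a common Pareto threshold chosen for the example.

```python
from collections import defaultdict

# Hypothetical cost records: (equipment_group, cost_category, annual_cost_usd).
# Every name and figure here is an illustrative assumption, not plant data.
COST_RECORDS = [
    ("Charge pumps", "maintenance labor and material", 120_000),
    ("Charge pumps", "lost production", 480_000),
    ("Cooling water pumps", "maintenance labor and material", 60_000),
    ("Cooling water pumps", "overtime", 40_000),
    ("Boiler feed pumps", "lost production", 150_000),
    ("Boiler feed pumps", "equipment rental", 50_000),
    ("Sump pumps", "maintenance labor and material", 30_000),
    ("Transfer pumps", "expedited shipping", 20_000),
    ("Condensate pumps", "maintenance labor and material", 50_000),
]


def cour_pareto(records):
    """Sum costs per equipment group and rank groups from worst to best.

    Returns a list of (group, total_cost, cumulative_share) tuples, where
    cumulative_share is the running fraction of total CoUR.
    """
    totals = defaultdict(float)
    for group, _category, cost in records:
        totals[group] += cost

    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

    rows = []
    running = 0.0
    for group, cost in ranked:
        running += cost
        rows.append((group, cost, running / grand_total))
    return rows


def top_contributors(rows, threshold=0.8):
    """Return the worst groups that together reach `threshold` of CoUR."""
    selected = []
    for group, _cost, cumulative in rows:
        selected.append(group)
        if cumulative >= threshold:
            break
    return selected


if __name__ == "__main__":
    rows = cour_pareto(COST_RECORDS)
    for group, cost, cumulative in rows:
        print(f"{group:<22} ${cost:>10,.0f}  {cumulative:6.1%}")
    print("Start with:", ", ".join(top_contributors(rows)))
```

The ranked groups form the X-axis of the Pareto chart, and the short list returned by top_contributors is a candidate for the "top 10" problem equipment list.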

Pareto's rule nearly always applies to this kind of distribution. Five to 20 percent of all equipment will be causing 80 to 90 percent of CoUR. Start here for two reasons: 1) it delivers the most improvement for the resources spent; 2) it helps the rest of the organization understand that reliability engineering focuses on the same goals they do - safety, quality, productivity, and cost management. Once the most significant problem equipment is identified, the RCI steps are set into motion:

  1. Identify Problem Equipment. The top 10 pieces of problem equipment should be on the tip of the tongue at all times, or at least on a card in the desk. Related financial information should be kept here as well. If CoUR directs the effort to the problem equipment, then CoUR also begins the business case for fixing it.
  2. Schedule Interventions for Problem Equipment. The reliability engineer is in a good position to understand all the nuances of what must be done and to select a team with the knowledge to do it. Financial, safety, and productivity information helps the organization understand the urgency of what reliability engineering is trying to do.
  3. Facilitate Interventions. Reliability engineering involves several skills that are pivotal to diagnosing problem equipment and prescribing fixes for it. These include driving and moderating the discussion of problems and solutions efficiently and accurately between sometimes partisan operations groups who try to assign the blame for equipment problems. The good reliability team is a clearinghouse for information and a training function to all these groups. Because the financial information the team needs comes from CoUR, keep it associated with the issues the team is addressing.
  4. Develop Action Steps. This cross-functional team should perform the necessary gap and root cause analysis (RCA) to determine what steps will solve immediate problems. It also determines where the current asset strategies (maintenance plans, schedules, etc.) miss important problems and modifies those strategies so that predictive tools will address known failure modes. Associate the financial data with the fixes to provide a business case for each, or for groups of fixes.

Because problem equipment in a mature environment typically has multiple causes that create repetitive failures, teams that look for one true cause of all problems get frustrated. A versatile facilitator can guide and enable the team to analyze multiple root causes by broadening RCA procedures to treat "bundles" of related problems. By finding multiple root causes for repetitive failures, the team can generate multiple action steps for establishing reliability. A modified FMEA approach can provide the missing tool for this analysis and also provide an ideal training ground for building teamwork between the maintenance and production staffs by having them share the same steps in their approach to asset healthcare.

  5. Follow Up on Action Steps. After analyzing the problem equipment, the team creates a list of things that must happen for the problems to be solved. To really improve the reliability of the plant, they must see that the action steps are actually completed and the results delivered. This is where most plans fail. A list of things that ought to happen is a necessary step, but the job isn't done until someone actually makes them happen.
  6. Document and Deliver Results. If the team's follow-up has been the key to fixing equipment that has malfunctioned for years, then they have earned the right to carry the results to top management and take partial credit for them. This is an opportunity to reward those people who helped make it happen, and makes the next intervention less painful and more productive.
  7. Use Information and Results to Drive the Annual Budget Process. Here the financial analysis really pays off. If the team fixes important equipment problems, they save the organization from repeating those perennial problems. It is therefore appropriate for them to take credit for the change and encourage the business unit to include the new financial picture in its annual budgeting process. Reliability has secured its proper place in the organization when management asks, "What savings do you have for us next year, and what do you need to make it happen?"

In Figure 2, this sequence of RCI steps is characterized as a flywheel driving RCI in the business unit. Once this flywheel begins to turn, formal reports on progress toward solving the equipment problems that have plagued the area should become a monthly event.

Figure 2. The Reliability Flywheel.

As time goes on, problem equipment will become a smaller part of the organizational picture. The focus of RCI gradually shifts from intervention in problem equipment to embedding normal reliability engineering practices throughout the plant. It makes sense, however, to maintain the equipment Pareto (see Figure 1) as a guide in this effort as well. It always pays to start with the equipment that needs the most help.

All the above must take place in the context of an annual planning/continuous improvement structure that ensures follow-up of action steps in equipment improvement and documents the value of the work that takes place. This approach helps provide the resources and recognition needed to drive RCI on the shop floor.

3. Proactive Asset Healthcare

Once the organization is humming along effectively on the strength of sound asset healthcare strategies for known problem equipment, time becomes available for reliability engineers to develop equipment criticality information and pursue prevention of future failure modes.

By this stage, the organization will have developed faith in the ability of maintenance to drive asset healthcare for known failure modes. Reliability engineering's shift in focus to future failure modes will no longer seem irrelevant to the maintenance and production organizations. FMEA, Life Cycle Analysis, and equipment design participation by reliability engineers will seem natural - once the emergencies stop.