Carefully following a proven process, team knowledge and management support are critical for success.

Root cause analysis is a core objective in many facilities, yet problems continue to hinder significant improvements. The difficulty may lie in the sheer number of problems that must be solved, in root cause analyses that are not thorough enough to keep problems from recurring, or in both. This article covers the wide range of views on what root cause analysis is and how it should be performed. It also discusses the systems an organization needs for root cause analysis to be a productive business process.

Problem Solving

Many methods of problem solving are available. They range from home-grown processes developed by the facility to vendor-provided methods. The methods may be easy-to-use or somewhat complex. They also vary in their ability to eliminate the recurrence of problems or equipment failures.

Some methods are called problem solving and some are called root cause analysis (RCA). These terms require definitions. Without them, the same method could be labeled troubleshooting, problem solving or RCA interchangeably, and often inaccurately. The following definitions help clarify the methods:

  • Troubleshooting is a process of elimination (trial and error) that rules out the potential causes of a problem one by one.
  • Problem solving is a systematic search for the source of a problem so that it can be solved.
  • RCA examines problems down to their latent root causes, which may include deficiencies in management systems and restraining cultural norms that allowed the failure to occur.

Troubleshooting is neither problem solving nor root cause analysis because it is not a systematic approach. Troubleshooting is a form of trial and error, and its ability to solve problems is solely dependent on the troubleshooter’s skill and experience with the problem.

Problem solving fits the blueprint for continuous improvement but leaves the depth or shallowness of the investigation up to the user. All too often, problem solving stops too early in the investigation process to eliminate a problem’s entire failure mechanism. In many cases, problem solving identifies the physical root causes of a problem, but it is not designed to uncover latent system issues.

RCA investigates a problem to a depth at which the physical, human and system deficiencies are exposed for resolution. Correcting deficiencies at this depth is what keeps a problem from recurring, and the corrections can be leveraged in other areas where the same system problems exist (see Figure 1).

Figure 1. Root cause analysis versus other problem solving methods

Most organizations do not compile a mission statement for the problem solving process because they believe the mission should be apparent. Organizations that do compile a mission usually wrap it into another program requirement, such as continuous improvement.

Continuous improvement is a term used often in problem investigation as a part of the mission or in some cases as the mission. As with many other terms, continuous improvement has many interpretations. When a problem is solved, it is often considered to have met the continuous improvement requirements. However, how often does the same problem repeat at some later date? If the problem recurs, does the solution still qualify as continuous improvement?

The answer depends on the definition of problem solving used by the organization. Many problem solving methods meet continuous improvement interpretations by simply returning operations to an uninterrupted work process and postponing the return of the problem to another time.

Should incremental repairs count as continuous improvement, or does labeling them this way encourage employees to improve problems only slightly and settle for mediocrity?

Problem elimination is a more in-depth problem solving mission. The failure mechanism must be identified and eliminated so that there is little to no chance of recurrence. This also meets the continuous improvement requirements, but as a quantum improvement rather than an incremental one.

Both definitions of continuous improvement are important when problems are divided into a two-track approach for failure avoidance. A two-track approach is proactive because an opportunity analysis tool separates problems into two categories: significant few and random many.

The “significant few” problems are the 20 percent of issues that account for 80 percent of the losses, such as money spent on repairs. The “random many” problems are the remaining 80 percent of issues, which account for only 20 percent of losses (see Figure 2).

Figure 2. Problem events
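The split between the significant few and the random many can be sketched in a few lines of code. The example below is only an illustration, not a tool described in this article: the names ProblemEvent, annual_loss, split_significant_few and loss_share are hypothetical, and it assumes that a loss figure has already been assigned to each recurring problem. It ranks problems by loss and collects the largest ones until they cover the chosen share of total losses.

```python
from dataclasses import dataclass


@dataclass
class ProblemEvent:
    """One recurring problem and its loss (hypothetical structure)."""
    name: str
    annual_loss: float  # assumed unit: cost per year attributed to the problem


def split_significant_few(events, loss_share=0.80):
    """Return (significant_few, random_many).

    Events are ranked from largest to smallest loss. The largest ones are
    collected until together they reach at least `loss_share` of the total
    loss; everything after that point is the "random many".
    """
    ranked = sorted(events, key=lambda e: e.annual_loss, reverse=True)
    total = sum(e.annual_loss for e in ranked)
    significant_few, random_many = [], []
    running = 0.0
    for event in ranked:
        if total > 0 and running < loss_share * total:
            significant_few.append(event)
            running += event.annual_loss
        else:
            random_many.append(event)
    return significant_few, random_many


if __name__ == "__main__":
    # Invented figures, for illustration only.
    losses = [500, 320, 50, 40, 30, 20, 15, 10, 10, 5]
    events = [ProblemEvent(f"problem-{i + 1}", loss)
              for i, loss in enumerate(losses)]
    few, many = split_significant_few(events)
    print("Solve for elimination:", [e.name for e in few])
    print("Solve for incremental gains:", [e.name for e in many])
```

Real loss data rarely falls on an exact 80/20 line, so the threshold is left as a parameter. The point of the sketch is the two-track outcome: a short list to eliminate and a long list to improve incrementally.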

Significant issues should always be solved for elimination, and all other problems should be solved for incremental continuous improvement gains. What cannot be eliminated should be prevented and/or made fail-safe. Often, problem mechanisms can be eliminated, but only if operators take the time to conduct an in-depth RCA investigation.

Problem Solving for Elimination Is More Than a Method

Many problem solving methods are available for purchase, and each consists of a step-by-step method for problem resolution. Some are as simple as asking the question “Why?” five times to reach the cause. Some involve sitting as a group or team, brainstorming the reasons that a problem occurs and leaving with a solution to implement. Others require rigorous cause-and-effect analysis, including logic trees with hypothesis verifications, which uncover several levels of root causes. No matter which method an organization adopts, more than just the method will be required for successful problem solving, particularly if the mission is to eliminate recurrence.

The first consideration is whether the support systems are in place. Problem elimination means an in-depth investigation. When conducting an in-depth investigation, technical support for proving and disproving hypotheses is necessary. Other support, such as providing the time and resources to perform an investigation, must also be considered.
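A logic tree with hypothesis verification can be pictured as a simple data structure. The sketch below is an assumption-laden illustration rather than any vendor's method: the names Hypothesis, CauseLevel, Status, proven_causes, open_questions and reached_latent are hypothetical, and the pump bearing example is invented. Each hypothesis is marked proven or disproven only once evidence is in hand, and the investigation counts as complete only when a proven chain reaches a latent (system) cause.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class CauseLevel(Enum):
    PHYSICAL = "physical"
    HUMAN = "human"
    LATENT = "latent"  # management-system or cultural deficiency


class Status(Enum):
    UNVERIFIED = "unverified"
    PROVEN = "proven"
    DISPROVEN = "disproven"


@dataclass
class Hypothesis:
    statement: str
    level: CauseLevel
    status: Status = Status.UNVERIFIED
    evidence: str = ""  # what proved or disproved the hypothesis
    children: List["Hypothesis"] = field(default_factory=list)

    def add(self, child):
        """Attach a deeper hypothesis and return it for chaining."""
        self.children.append(child)
        return child


def proven_causes(node):
    """Collect hypotheses along branches that are proven all the way down."""
    if node.status is not Status.PROVEN:
        return []
    found = [node]
    for child in node.children:
        found.extend(proven_causes(child))
    return found


def open_questions(node):
    """Collect unverified hypotheses on branches not yet disproven."""
    if node.status is Status.DISPROVEN:
        return []
    found = [node] if node.status is Status.UNVERIFIED else []
    for child in node.children:
        found.extend(open_questions(child))
    return found


def reached_latent(node):
    """True once a proven chain of causes includes a latent root cause."""
    return any(h.level is CauseLevel.LATENT for h in proven_causes(node))


if __name__ == "__main__":
    # Invented example, for illustration only.
    top = Hypothesis("Pump bearing failed", CauseLevel.PHYSICAL, Status.PROVEN)
    lube = top.add(Hypothesis("Bearing ran with inadequate lubrication",
                              CauseLevel.PHYSICAL, Status.PROVEN))
    top.add(Hypothesis("Shaft misalignment overloaded the bearing",
                       CauseLevel.PHYSICAL, Status.DISPROVEN))
    skipped = lube.add(Hypothesis("Lubrication round was skipped",
                                  CauseLevel.HUMAN, Status.PROVEN))
    skipped.add(Hypothesis("No system assigns or audits lubrication routes",
                           CauseLevel.LATENT))
    print("Still to verify:", [h.statement for h in open_questions(top)])
    print("Latent cause proven:", reached_latent(top))
```

In this invented case the tree has reached the human level but the latent hypothesis is still unverified, which is the point at which many problem solving efforts stop and an RCA would continue.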

Next, is there a standard for evidence collection? Do all employees know what to do after an incident occurs? In many cases, they do not, which creates a barrier to problem elimination. The hourly workforce is rarely trained to perform RCA. They are trained to fix the problem, discard the failed components and start up the equipment. Once the failed components are discarded, the evidence they carried may never be recovered, and a successful analysis becomes nearly impossible.

Another consideration is whether the investigator knows and follows the chosen investigative method. Most methods are designed to obtain the correct answers, but they are seldom implemented and practiced as designed.

An investigator who can read the fractured surfaces of failed electrical and mechanical components speeds up an analysis because there is no need to wait for a third-party metallurgical analysis. Basic knowledge in this area is enough to resolve approximately 80 percent of the analyses conducted.

One method alone cannot provide this internal knowledge. A solid understanding of why humans make mistakes is also necessary when hypothesizing about human involvement in an incident. Human error is manageable when the human error drivers are understood. Identifying deficient management systems or latent root causes will uncover human error drivers.

When the investigator is a practiced “true lead investigator” supported by management, eliminating significant problems adds to the bottom line, and that is where success is measured.