Got Risk? Using FMEA to Improve Quality Assurance

Balancing Cost, Quality and Delivery

Let's talk about risk-based testing. I should start by saying that this is a sensitive subject. It’s why a career in quality is not for the faint of heart. The fact is that cost, quality and delivery are at constant odds in every industry, including software. Quality professionals play a key role in this balancing act. And, whenever things become "unbalanced," we quality professionals are often the bad guys, or at least we feel that way.

For a long time, I felt sorry for myself. Why couldn’t people understand how difficult my job was? Fighting so hard each and every day against those who might sacrifice quality in the name of cost and/or delivery, standing up for truth and justice and...well, okay, so there were no villains, I’m not a superhero and it wasn’t quite so dramatic. But keeping the perfect balance between quality and delivery is challenging!

Over the years, however, it's become evident that there is no correlation between the amount of time we spend testing and the quality of the products we deliver. What a relief, right? Well, not really. It doesn't mean ensuring quality is easier; it just means "more of the same" isn't the answer. It's much more complex than that.

I've learned that quality assurance is not about how much we test; it’s about how we test. Truth be told, quite often it’s not about testing at all (more on that in a future post).

Risk-Based Testing

So, how do we balance cost, quality and delivery at Minitab? Very delicately. I'd love to say that we nail it every time but I'd be lying—and one thing a quality professional needs is rigorous honesty. Again, being a quality improvement professional is not for the faint of heart! But I can say that Minitab takes quality very seriously, and everyone who works here plays a part in continuously improving it.

While this subject could fill many posts, we’ll start with risk-based testing (RBT). Since there is a nearly infinite number of test cases for any feature or release, running every possible test would be both cost- and time-prohibitive. The role of a quality professional is to determine which tests should be run, how often they should be run, and how they are run. I liken this to my days in manufacturing, when we used sampling plans for inspection. We couldn't inspect everything, but we could be very smart about what we did inspect.

Failure Modes and Effects Analysis, or FMEA

At Minitab, we use RBT to develop test strategies based on the risk of failure, which is a function of the probability of occurrence and severity. To assist our efforts, we utilize Minitab Engage's Failure Modes and Effects Analysis (FMEA).
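To make the idea of risk as a function of occurrence and severity concrete, here is a minimal Python sketch. It is not how Minitab or Engage computes anything: the `RiskItem` class and `prioritize` function are hypothetical names, multiplying the two ratings is just one common way to combine them, and every rating except the access key values (severity 3, occurrence 2, taken from the example later in this post) is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskItem:
    name: str
    occurrence: int  # 1 (extremely unlikely) to 10 (near certain)
    severity: int    # 1 (minor) to 10 (critical)

    @property
    def risk(self) -> int:
        # Assumption: risk is modeled as occurrence x severity.
        return self.occurrence * self.severity


def prioritize(items):
    """Return the items ordered from highest to lowest risk."""
    return sorted(items, key=lambda item: item.risk, reverse=True)


items = [
    RiskItem("access keys", occurrence=2, severity=3),
    RiskItem("statistical results", occurrence=2, severity=9),  # hypothetical ratings
    RiskItem("graph rendering", occurrence=3, severity=7),      # hypothetical ratings
]

for item in prioritize(items):
    print(f"{item.name}: risk {item.risk}")
```

Sorting by a score like this gives a starting order for deciding where test effort goes first; the ratings themselves still come from the team's judgment.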

I first began using an FMEA while working in the medical device industry in the early 90s. It worked well then and, almost 20 years later, it’s still one of my go-to tools. Of course, Engage has made it much easier to manage than it was 20 years ago! As illustrated below, FMEA is used to help prioritize failures based on seriousness, frequency of occurrence, and ease of detection. Based on this assessment, clear strategies for risk mitigation can be identified and implemented.

Excerpt from an FMEA conceived by the Minitab Team:

Step#	Process Map - Activity	Potential Failure Mode	Potential Failure Effects	SEV	OCC	Current Controls	DET	RPN
10	Dialog Testing	Access key validation isn’t automated for each dialog	The hot keys are initially validated within development. If a developer makes a change that breaks a hot key assignment, it may not be immediately detected.	3	2	Initial manual testing and follow up manual regression testing of changes for each dialog. General automated testing of access keys.	5	30

Actions Recommended	Responsibility	Target End Date	Actions Taken	Actual End Date
An evaluation will be done to determine if some automated testing can be performed	JRoan (Test Architect)	3/1/2011	Automated monitoring of dialog file to detect changes in assignments.	2/15/2011

Revised metrics
SEV	OCC	DET	RPN
3	2	1	6

Steps for Completing the FMEA

1) In Process Map - Activity, enter each process step, feature or type of activity. In the example above, it's dialog testing at a feature level.

2) In Potential Failure Mode, identify ways the process can fail for each activity. Multiple failure modes may exist. In the example above, we don't have test scripts running to validate the functionality of each access key in each dialog.

3) In Potential Failure Effects, enter potential failure effects for each failure mode. Any failure mode can have multiple failure effects. The potential failure effect in the example above is that if a developer makes a change after the dialog has been verified, and that change breaks an access key, the failure may not be detected.

4) In SEV (Severity Rating), estimate the severity of each failure effect. Use a 1 to 10 scale, where 10 signifies high severity and 1 signifies low severity. This is a relative assignment. In our world, the access keys would have a lower severity than, for example, the statistical or graphical results. We assigned this a severity rating of 3.

5) In OCC (Occurrence Rating), estimate the probability of occurrence of the cause. Use a 1 to 10 scale, where 10 signifies high frequency (guaranteed ongoing problem) and 1 signifies low frequency (extremely unlikely to occur). In our example, the probability of making an access key change after the initial assignment is low, but not impossible. We assigned this an occurrence rating of 2.

6) In Current Controls, enter the manner in which the failure causes/modes are detected or controlled. In dialog testing, we manually validate each access key, automate the general validation of access keys, and perform quick tests.

7) In DET (Detection Rating), evaluate the ability of each control to detect or control the failure cause/mode. Use a 1 to 10 scale, where 10 signifies poor detection/control (the customer will almost surely receive a flawed output) and 1 signifies high detection/control (almost certain detection, generally finding the cause before it has a chance to create the failure mode). In our example, the probability of detecting the problem wasn't high, but we might still find it through the manual testing performed throughout. We assigned a detection rating of 5.

8) In RPN (Risk Priority Number), evaluate the result. The RPN is the product of the SEV, OCC, and DET scores, and it is the overall score for a combination of Mode/Effect/Cause. The higher the RPN, the more severe, more frequent, or less controlled a potential problem is, indicating a greater need for immediate attention. In our example, the RPN was 3 × 2 × 5 = 30, and we decided to evaluate whether any of the testing could be automated. Our test architect identified an alternative solution that would improve our probability of detection.

9) Once corrective action has been taken, enter the new SEV, OCC, and DET values to calculate a revised RPN. In our example, the detection rating improved from 5 to 1, so the RPN was reduced to 3 × 2 × 1 = 6, which is very acceptable!
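The RPN arithmetic in steps 8 and 9 is simple enough to check in a few lines of Python. This is only a sketch: the `rpn` function is a hypothetical helper, not part of Engage, and the range check simply mirrors the 1 to 10 scales described in the steps above. The ratings are the ones from the access key example.

```python
def rpn(sev: int, occ: int, det: int) -> int:
    """Risk Priority Number: the product of the three 1-10 ratings."""
    for label, value in (("SEV", sev), ("OCC", occ), ("DET", det)):
        if not 1 <= value <= 10:
            raise ValueError(f"{label} must be between 1 and 10, got {value}")
    return sev * occ * det


initial = rpn(sev=3, occ=2, det=5)  # ratings before corrective action
revised = rpn(sev=3, occ=2, det=1)  # detection improved after corrective action

print(initial, revised)  # 30 6
```

Because the three ratings are multiplied, improving any one of them lowers the RPN proportionally; here only detection changed, and the score dropped from 30 to 6.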

Making FMEA Easy

The FMEA tool in Engage has provided a great way to analyze, manage and communicate our risk-based assessments and resulting test strategies. In the example above, an automated system was created to alert QA of changes to the files. This was a cost-effective risk mitigation strategy for the issue at hand. The FMEA provided a structure to guide the teams through the analysis and then document the reasoning behind the decision.
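This post doesn't describe how that alerting system was built, so the following is only a guess at the general idea: take a fingerprint of each dialog file, and flag any file whose fingerprint later differs from the recorded baseline. Everything here is an assumption for illustration, including the function names, the use of SHA-256 hashes, and the `dialog.rc` file name and its contents.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(paths):
    """Map each file path to its current digest."""
    return {str(p): file_digest(Path(p)) for p in paths}


def changed_files(baseline, current):
    """List paths whose digest differs from, or is missing in, the baseline."""
    return sorted(p for p, digest in current.items() if baseline.get(p) != digest)


# Small demonstration using a temporary, hypothetical dialog file.
with tempfile.TemporaryDirectory() as tmp:
    dialog = Path(tmp) / "dialog.rc"      # hypothetical file name
    dialog.write_text("&Open\n&Save\n")   # hypothetical access key assignments
    baseline = snapshot([dialog])

    dialog.write_text("&Open\nSa&ve\n")   # a developer reassigns an access key
    flagged = changed_files(baseline, snapshot([dialog]))
    flagged_names = [Path(p).name for p in flagged]

print(flagged_names)  # ['dialog.rc']
```

A check like this only says that a file changed, not whether an access key broke; in a real setup the flagged files would be handed to QA for a targeted retest.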

The FMEA is also maintained and available as a record of these types of decisions. I don’t have to tell any quality professional how helpful that is in preventing future “Why exactly did we decide to do that?” discussions. I just wish Engage had been around when I started doing FMEAs 20 years ago!