Equivalence Testing for Quality Analysis (Part II): What Difference Does the Difference Make?

My previous post examined how an equivalence test can shift the burden of proof when you perform a hypothesis test of the means. This allows you to more rigorously test whether the process mean is equivalent to a target or to another mean.

Here’s another key difference: To perform the analysis, an equivalence test requires that you first define, upfront, the size of a practically important difference between the mean and the target, or between two means.

Truth be told, even when performing a standard hypothesis test, you should know the value of this difference. Because you can’t really evaluate whether your analysis will have adequate power without knowing it. Nor can you evaluate whether a statistically significant difference in your test results has significant meaning in the real world, outside of probability distribution theory.
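To make the point about power concrete, here is a rough sketch of how the size of the difference you care about drives power. It uses a normal approximation rather than the exact t-based calculation, and the standard deviation, sample size, and candidate differences are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided 1-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma  # true difference in standard-error units
    # Chance of landing in either rejection region when the true difference is delta.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical inputs: sigma = 0.7, n = 100, and three candidate differences.
for delta in (0.1, 0.2, 0.42):
    print(f"difference {delta}: power ~ {approx_power(delta, 0.7, 100):.3f}")
```

Without a value for the difference, there is nothing to plug in for `delta`, which is the whole problem.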

But since a standard t-test doesn’t require you to define this difference, people often run the analysis with a fuzzy idea, at best, of what they’re actually looking for. It’s not an error, really. It’s more like using a radon measuring device without knowing what levels of radon are potentially harmful.

Defining Equivalence Limits: Your Call

How close does the mean have to be to the target value or to another mean for you to consider them, for all practical purposes, “equivalent”?  

The zone of equivalence is defined by a lower equivalence limit and/or an upper equivalence limit. The lower equivalence limit (LEL) defines your lower limit of acceptability for the difference. The upper equivalence limit (UEL) defines your upper limit of acceptability for the difference. Any difference from the target (or between the two means) that falls within this zone is considered unimportant.

In some fields, such as the pharmaceutical industry, equivalence limits are set by regulatory guidelines. If there aren’t guidelines for your application, you’ll need to define the zone of equivalence using knowledge of your product or process.

Here’s the bad news: There isn’t a statistician on Earth who can help you define those limits. Because it isn’t a question of statistics. It’s a question of what size of a difference produces tangible ramifications for you or your customer.

A difference of 0.005 mg from the mean target value? A 10% shift in the process mean?  Obviously, the criteria aren't going to be the same for the diameter of a stent and the diameter of a soda can.

Equivalence Test in Practice

Here's a quick example of a 1-sample equivalence test, adapted from Minitab Help. To follow along, you can download the revised data here. If you don't have Minitab, download a free trial version here.

Suppose a packaging company wants to ensure that the force needed to open its snack food bags is within 10% of the target value of 4.2 N (Newtons). From previous testing, they know that a force more than 10% below the target causes the bags to open too easily and reduces product freshness. A force more than 10% above the target makes the bags too difficult to open. They randomly sample 100 bags and measure the force required to open each one.

To test whether the mean force is equivalent to the target, they choose Stat > Equivalence Tests > 1-Sample and fill in the dialog box as shown below:

Tip: Use the Multiply by Target box when you want to define the equivalence limits for a difference in terms of a percentage of the target. In this case, the lower limit is 10% less than the target. The upper limit is 10% higher than the target. If you want to represent the equivalence limits in absolute terms, rather than as percentages, simply enter the actual values for your equivalence limits and don't check the Multiply by Target box.
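The arithmetic behind the percentage option is simple enough to sketch. The snippet below assumes the limits are entered as multipliers of -0.1 and 0.1; the function name is made up for illustration and is not part of Minitab.

```python
def equivalence_limits(target, lower_multiplier, upper_multiplier):
    """Turn multipliers of the target into absolute limits for the difference."""
    return lower_multiplier * target, upper_multiplier * target

target = 4.2  # Newtons, from the snack bag example
lel, uel = equivalence_limits(target, -0.1, 0.1)

print(f"Limits for the difference: ({lel:.2f}, {uel:.2f})")
print(f"Acceptable mean force: {target + lel:.2f} N to {target + uel:.2f} N")
```

For a 4.2 N target, a 10% band works out to a difference of 0.42 N in either direction.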

When you click OK, Minitab displays the following results:

One-Sample Equivalence Test: Force

Difference: Mean(Force) - Target

Difference        SE        95% CI      Equivalence Interval
   0.14270  0.067559  (0, 0.25487)     (-0.42, 0.42)

CI is within the equivalence interval. Can claim equivalence.

Null hypothesis:         Difference ≤ -0.42 or Difference ≥ 0.42
Alternative hypothesis:  -0.42 < Difference < 0.42
α level:                 0.05

Null Hypothesis     DF  T-Value  P-Value
Difference ≤ -0.42  99   8.3290    0.000
Difference ≥ 0.42   99  -4.1046    0.000

The greater of the two P-Values is 0.000. Can claim equivalence.

Because the confidence interval for the difference falls completely within the equivalence limits, you can reject the null hypothesis that the difference between the mean and the target lies outside the equivalence limits. You can claim that the mean and the target are equivalent.
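If you want to see where those numbers come from, the two one-sided tests can be sketched from the summary values alone. This is a minimal sketch, not Minitab's implementation: the t distribution is evaluated by simple numerical integration to avoid third-party libraries, and the confidence interval is assumed to be the one-sided-α t interval stretched to include 0, an assumption that happens to reproduce the interval shown above. All function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def t_quantile(p, df):
    """Quantile for 0.5 < p < 1, found by bisection on the CDF."""
    lo, hi = 0.0, 20.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def tost_one_sample(diff, se, df, lel, uel, alpha=0.05):
    """Two one-sided t-tests for LEL < difference < UEL."""
    t_lower = (diff - lel) / se  # tests H0: difference <= LEL
    t_upper = (diff - uel) / se  # tests H0: difference >= UEL
    p_lower = max(0.0, 1 - t_cdf(t_lower, df))  # clamp tiny integration error
    p_upper = max(0.0, t_cdf(t_upper, df))
    t_crit = t_quantile(1 - alpha, df)
    # Assumed interval form: one-sided-alpha t interval, extended to include 0.
    ci = (min(0.0, diff - t_crit * se), max(0.0, diff + t_crit * se))
    equivalent = max(p_lower, p_upper) < alpha
    return t_lower, t_upper, p_lower, p_upper, ci, equivalent

# Summary values taken from the equivalence test output for Force.
result = tost_one_sample(0.14270, 0.067559, 99, -0.42, 0.42)
print(result)
```

Equivalence is claimed only when both one-sided tests reject, which is why the decision rests on the greater of the two p-values.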

Notice that if you had used a standard 1-sample t-test to analyze these data, the output would show a statistically significant difference between the mean and the target (at a significance level of 0.05):

One-Sample T: Force

Test of μ = 4.2 vs ≠ 4.2
Variable    N    Mean   StDev  SE Mean       95% CI          T      P
Force     100  4.3427  0.6756   0.0676  (4.2086, 4.4768)  2.11  0.037

These two sets of results aren't really contradictory, though.

The equivalence test has simply defined "equality" between the mean and the target in broader terms, using the values you entered for the equivalence zone. The standard t-test has no knowledge of what "practically significant" means. So it can only evaluate the difference from the target in terms of statistical significance.
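One way to see that the two results can coexist is to place the standard confidence interval for the mean next to the zone of equivalence. The sketch below uses the summary values from the 1-sample t output; the critical value 1.9842 is an assumed t-table entry for 99 degrees of freedom, and the final comparison is an informal check, not the equivalence test itself.

```python
from math import sqrt

n, mean, stdev, target = 100, 4.3427, 0.6756, 4.2  # from the 1-sample t output
t_crit_two_sided = 1.9842  # assumed table value: t(0.975) with df = 99

se = stdev / sqrt(n)
t_stat = (mean - target) / se
ci = (mean - t_crit_two_sided * se, mean + t_crit_two_sided * se)

# Statistical view: the target falls outside the interval for the mean.
significant = not (ci[0] <= target <= ci[1])

# Practical view: the whole interval still sits inside target +/- 10%.
zone = (0.9 * target, 1.1 * target)
inside_zone = zone[0] < ci[0] and ci[1] < zone[1]

print(f"T = {t_stat:.2f}, CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"significant: {significant}, inside zone {zone}: {inside_zone}")
```

The mean is detectably different from 4.2 N, yet every plausible value for it is still comfortably within the range the company cares about.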

In this way, an equivalence test is "naturally smarter" than a standard t-test. But it's your knowledge of the process or product that allows an equivalence test to evaluate the practical significance of a difference, in addition to its statistical significance.

Learn More about Equivalence Testing

There are four types of equivalence tests newly available in Minitab. To learn more about each test, choose Help > Help. Click the Index tab, scroll down to Equivalence testing, and click Overview.


Name: Jared • Thursday, April 24, 2014

Hi, great post. I often run into situations where I am attempting to establish that something is the same as OR better than some baseline condition.

Can the equivalence test handle this situation?


Name: Patrick • Friday, April 25, 2014

Hi Jared,

Absolutely. You can perform what’s called a noninferiority test to test that the mean is not less than a target value (or another mean).

For example, suppose you want to test whether a mean is the same as or better than the baseline value of 5 units. In Minitab 17, choose Stat > Equivalence Tests > 1-Sample. In Sample, enter the column of data. In Target, enter the baseline value.

From What do you want to determine (Alternative hypothesis), choose Test mean > target. If you want to make sure that the alternative hypothesis captures your baseline value, just enter a target value that’s just slightly smaller than the baseline. For example, you can use a target of 4.9999 to make sure that a mean value of 5 will be interpreted as “noninferior” for your test.

You could also set it up this way. Enter the baseline value of 5 in Target. Then, for Alternative hypothesis, choose “Test mean – target > lower limit”. For lower limit, enter 0 (or a value slightly lower than 0, such as -0.0001, to capture your target value). Using either setup, you should get the same result.
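A quick sketch shows why the two setups agree. It assumes the one-sided test statistic has the form (difference - limit) / SE, which matches the T-values in the post's equivalence output; the data values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

data = [5.3, 4.9, 5.6, 5.1, 5.4, 5.0, 5.7, 5.2]  # hypothetical measurements
baseline = 5.0
se = stdev(data) / sqrt(len(data))

# Setup 1: target just below the baseline, alternative "Test mean > target".
t_setup1 = (mean(data) - 4.9999) / se

# Setup 2: target = baseline, alternative "Test mean - target > lower limit".
lower_limit = -0.0001
t_setup2 = ((mean(data) - baseline) - lower_limit) / se

print(f"Setup 1: T = {t_setup1:.4f}")
print(f"Setup 2: T = {t_setup2:.4f}")
```

Shifting the target down by 0.0001 and shifting the lower limit down by 0.0001 are the same subtraction, so the test statistic (and therefore the p-value) is identical.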

To set up a nonsuperiority test, use a similar strategy: for the alternative hypothesis, choose Test mean < target or Test mean – target < upper limit.

Using a 2-sample equivalence or paired equivalence test, you can use a similar strategy to test whether the mean of one population is noninferior (or nonsuperior) to the mean of another population.

Thanks for the great question!

Name: George • Monday, May 5, 2014

Why does the 95%CI of the difference give a different value from the 2-t test vs. the 2-sample equivalence test? Aren't both supposed to be statements of the same thing?

Name: Patrick • Tuesday, May 6, 2014

Thanks for your comment. It was very observant of you to notice the disparity between the 95% confidence interval (CI) for the difference between two population means when using the 2-sample t test and the 95% CI for the difference obtained when using the 2-sample equivalence test.

To see the CI formula used for the 2-sample t test, choose Help > Methods and Formulas > Statistics > Basic Statistics > 2-sample t > Confidence interval.

To see the CI formula used for the 2-sample equivalence test, choose Help > Methods and Formulas > Statistics > Equivalence tests > 2-sample Equivalence Tests > Confidence interval.

The reason for this disparity is that the construction of the confidence interval for the equivalence of two population means uses the additional information of the lower and upper limits of the equivalence interval for the two means. This (1-alpha)x100% confidence interval is specifically derived for the alpha-level equivalence test of two population means, which has different null and alternative hypotheses from the standard 2-sample t test. Because it uses this additional information, the (1-alpha)x100% confidence interval for an equivalence test is in most cases tighter than the (1-alpha)x100% confidence interval for the standard 2-sample t test. For more statistical details, please refer to the following two articles:

Hsu, J.C., Hwang, J.T.G., Liu, H. K., and Ruberg, S. J. (1994). Confidence Intervals Associated with Tests for Bioequivalence. Biometrika 81, 103-114.

Berger, R.L. and Hsu, J.C. (1996). Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets. Statistical Science 11, 283-319.
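As a rough illustration of how the two intervals can differ, the sketch below compares a standard two-sided interval with an interval of the form described in the articles above, as I read them: a one-sided-alpha interval stretched to include 0. It uses normal quantiles in place of t quantiles for simplicity, the difference and standard error are hypothetical, and the exact formulas Minitab uses are the ones in Methods and Formulas.

```python
from statistics import NormalDist

def standard_ci(diff, se, alpha=0.05):
    """Usual two-sided interval: alpha is split between the two tails."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff - z * se, diff + z * se

def equivalence_ci(diff, se, alpha=0.05):
    """Assumed equivalence form: alpha in each tail, extended to include 0."""
    z = NormalDist().inv_cdf(1 - alpha)
    return min(0.0, diff - z * se), max(0.0, diff + z * se)

diff, se = 0.1, 0.2  # hypothetical difference between two means and its SE
std = standard_ci(diff, se)
eqv = equivalence_ci(diff, se)

print(f"standard:    ({std[0]:.3f}, {std[1]:.3f})")
print(f"equivalence: ({eqv[0]:.3f}, {eqv[1]:.3f})")
```

In this example the equivalence interval is the narrower of the two because it uses a smaller critical value. When the one-sided interval does not already contain 0, stretching it to include 0 can offset some of that gain.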

Thanks for the great question....and thanks to Dr. Yanling Zuo, senior research and design statistician at Minitab, for her assistance with this response!

Name: Marleny • Wednesday, September 10, 2014

Hi, great post!
I need a little help with one question... Can you perform a non-inferiority/equivalence test in a retrospective study, to compare 2 groups regarding the same surgical intervention, but one different previous variable (prior to that intervention)? If not, should I perform a superiority test anyway (even if it was not the initial goal?)?
Thanks for your help!

Name: Patrick • Tuesday, September 16, 2014

Hi, thanks for your comment! If I understand what you're asking, you want to compare surgery outcomes for 2 groups undergoing the same procedure using equivalence tests. You certainly can do that--but how useful the results will be depends on how your study was set up. For example, is the variable that differs between the two groups prior to the intervention another condition, in addition to the variable that defines them as separate groups?

For example, let's say one group is women and one group is men, and this additional variable is having undergone some pre-surgical nutritional supplementation (for the women). You can perform an equivalence test to compare the surgery outcomes for men and women--but the problem is you won't be able to interpret your results clearly. If the outcomes are not equivalent, there may be gender effects, nutritional therapy effects, or both--you won't know which. If the results show equivalence, you can't say the surgery has equivalent outcomes for men and women, but only that it has equivalent outcomes for men and for women who undergo nutritional therapy.

So, if I understand your question correctly, the issue is not whether you can perform an equivalence test. It's really that you need to be mindful of that additional variable when you interpret your results.

Thanks for reading and commenting!
