True or false: When comparing a parameter for two sets of measurements, you should always use a hypothesis test to determine whether the difference is statistically significant.
The answer? (drumroll...) True! And false!
To understand this paradoxical answer, you need to keep in mind the difference between samples and populations, and between descriptive and inferential statistics.
Descriptive Statistics and Populations
Consider the fictional countries of Glumpland and Dolmania.
The population of Glumpland is 8,442,012. The population of Dolmania is 6,977,201. For each country, the age of every citizen (to the nearest tenth) is recorded in a cell of a Minitab worksheet.
Using Stat > Basic Statistics > Display Descriptive Statistics we can quickly calculate the mean age of each country.
It looks like Dolmanians are, on average, more youthful than Glumplanders. But is this difference in means statistically significant?
To find out, we might be tempted to evaluate these data using a 2-sample t-test.
Except for one thing: there's absolutely no point in doing that.
That's because these calculated means are the means of the entire populations. So we already know that the population means differ.
Another example. Suppose a baseball player gets 213 hits in 680 at bats in 2015, and 178 hits in 532 at bats in 2016.
Would you need a 2-proportions test to determine whether the difference in batting averages (.313 vs .335) is statistically significant? Of course not.
You've already calculated the proportions using all the data for the entire two seasons. There's nothing more to extrapolate. And yet you often see a hypothesis test applied in this type of situation, in the mistaken belief that if there's no p-value, the results aren't "solid" or "statistical" enough.
But if you've collected every possible piece of data for a population, that's about as solid as you can get!
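To make the batting example concrete, here is a small Python sketch of the arithmetic. It uses only the hit and at-bat totals quoted above; the variable names are just illustrative. Notice that nothing in it estimates anything: each proportion is computed from every at bat in the season.

```python
# Season totals quoted in the text: (hits, at bats)
seasons = {2015: (213, 680), 2016: (178, 532)}

# A batting average is simply hits divided by at bats.
averages = {year: hits / at_bats for year, (hits, at_bats) in seasons.items()}
for year, avg in averages.items():
    print(f"{year}: {avg:.3f}")

# The difference is a known quantity, not an estimate of one.
difference = averages[2016] - averages[2015]
print(f"difference: {difference:.3f}")
```

Because both seasons are completely recorded, the difference that comes out is the actual difference. There is no sampling error left for a hypothesis test to account for.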
Inferential Statistics and Random Samples
Now suppose that draconian budget cuts have made it infeasible to track and record the age of every resident in Glumpland and Dolmania. What can they do?
Quite a lot, actually. They can apply inferential statistics, which is based on random sampling, to make reliable estimates without those millions of data values they don't have.
To see how it works, use Calc > Random Data > Sample from columns in Minitab. Randomly sample 50 values from the 8,442,012 values in column C1, which contains the ages of the entire population of Glumpland. Then use descriptive statistics to calculate the mean of the sample.
Here are the results for one random sample of 50:
Descriptive Statistics: GPLND (50)
The sample mean, 52.37, is slightly less than the true mean age of 53 for the entire population of Glumpland. What about another random sample of 50?
Descriptive Statistics: GPLND (50)
Hmm. This sample mean of 54.11 slightly overshoots the true population mean of 53.
Even though the sample estimates are in the ballpark of the true population mean, we're seeing some variation. How much variation can we expect? Using descriptive statistics alone, we have no inkling of how "close" a sample estimate might be to the truth.
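If you don't have Minitab handy, the same experiment can be roughly sketched in Python. The population here is an assumption: a synthetic stand-in of 100,000 made-up ages with a mean near 53 and an arbitrary spread, not the actual Glumpland data. Only the sample size of 50 comes from the text.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# ASSUMPTION: synthetic stand-in population, far smaller than Glumpland's,
# with an arbitrary spread (standard deviation 15) around a mean of 53.
# Negative draws are clipped to 0 so every "age" is non-negative.
population = [max(0.0, round(rng.gauss(53, 15), 1)) for _ in range(100_000)]
population_mean = statistics.fmean(population)

# Draw several independent random samples of 50 (without replacement)
# and compare each sample mean with the known population mean.
sample_means = []
for _ in range(5):
    sample = rng.sample(population, 50)
    sample_means.append(statistics.fmean(sample))

print(f"population mean: {population_mean:.2f}")
for m in sample_means:
    print(f"sample mean: {m:.2f}  (off by {m - population_mean:+.2f})")
```

Each sample mean should land near the population mean, but not exactly on it, and each by a different amount. The descriptive statistics of any single sample say nothing about how large that miss is likely to be.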
Enter...the Confidence Interval
To quantify the precision of a sample estimate for the population, we can use a powerful tool in inferential statistics: the confidence interval.
Suppose you take random samples of size 5, 10, 20, 50, and 100 from Glumpland and Dolmania using Calc > Random Data > Sample from columns. Then use Graph > Interval Plot > Multiple Ys to display the 95% confidence intervals for the mean of each sample.
Here's what the interval plots look like for the random samples in my worksheet.
Your plots will look different based on your random samples, but you should notice a similar pattern: The sample mean estimates (the blue dots) tend to vary more from the population mean as the sample sizes decrease. To compensate for this, the intervals "stretch out" more and more, to ensure the same 95% overall probability of "capturing" the true population mean.
The larger samples produce narrower intervals. In fact, using only 50-100 data values, we can closely estimate the mean of over 8.4 million values, and get a general sense of how precise the estimate is likely to be. That's the incredible power of random sampling and inferential statistics!
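Here is a rough Python sketch of how those intervals might be computed by hand. It assumes a synthetic stand-in population (not the real worksheet data) and uses the usual t-based interval for a mean: sample mean plus or minus a critical t value times the standard error. The critical values are hardcoded from a standard t-table for the five sample sizes in the text, and `mean_ci_95` is a hypothetical helper name.

```python
import math
import random
import statistics

rng = random.Random(7)

# ASSUMPTION: synthetic stand-in population (mean 53, SD 15, clipped at 0).
population = [max(0.0, rng.gauss(53, 15)) for _ in range(100_000)]

# Two-sided 95% critical values of Student's t for df = n - 1,
# copied from a standard t-table, keyed by sample size n.
T_CRIT_95 = {5: 2.776, 10: 2.262, 20: 2.093, 50: 2.010, 100: 1.984}

def mean_ci_95(sample):
    """Return (low, high) for a 95% t-interval around the sample mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    margin = T_CRIT_95[n] * std_err
    return mean - margin, mean + margin

intervals = {}
for n in T_CRIT_95:
    intervals[n] = mean_ci_95(rng.sample(population, n))
    low, high = intervals[n]
    print(f"n={n:>3}: 95% CI ({low:.1f}, {high:.1f}), width {high - low:.1f}")
```

Two things push the interval wider for small samples: the standard error divides by the square root of n, and the critical t value itself grows as the degrees of freedom shrink. Any single run can buck the trend, since the sample standard deviation also varies, but over repeated runs the widths should generally shrink as n grows.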
To display side-by-side confidence intervals of the mean estimates for Glumpland and Dolmania, you can use an interval plot with groups.
Now, you might be tempted to use these results to infer whether there's a statistically significant difference in the mean age of the populations of Glumpland and Dolmania. But don't. Confidence intervals can be misleading for that purpose.
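One way to see why: two 95% intervals can overlap even when a test of the difference is significant at the 0.05 level. The Python sketch below uses made-up numbers (they are not from the Glumpland or Dolmania samples) and assumes samples large enough that a normal approximation can stand in for the t-distribution.

```python
import math
from statistics import NormalDist

# ASSUMPTION: illustrative means and standard errors only, with samples
# large enough to use a normal (z) approximation.
mean_a, mean_b = 50.0, 53.2
std_err_a = std_err_b = 1.0

z = NormalDist().inv_cdf(0.975)  # about 1.96
ci_a = (mean_a - z * std_err_a, mean_a + z * std_err_a)
ci_b = (mean_b - z * std_err_b, mean_b + z * std_err_b)
intervals_overlap = ci_a[1] > ci_b[0] and ci_b[1] > ci_a[0]

# Test the difference directly: its standard error combines both samples.
std_err_diff = math.sqrt(std_err_a**2 + std_err_b**2)
z_stat = (mean_b - mean_a) / std_err_diff
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"CI A: ({ci_a[0]:.2f}, {ci_a[1]:.2f})")
print(f"CI B: ({ci_b[0]:.2f}, {ci_b[1]:.2f})")
print(f"intervals overlap: {intervals_overlap}")
print(f"p-value for the difference: {p_value:.4f}")
```

The two intervals overlap, yet the p-value for the difference falls below 0.05. The standard error of a difference is smaller than the sum of the two individual standard errors, so eyeballing a pair of separate intervals tends to understate the evidence for a difference.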
For that, we need another powerful tool of inferential statistics...
Enter...the Hypothesis Test and P-Value
The 2-sample t-test is used to determine whether there is a statistically significant difference in the means of the populations from which the two random samples were drawn. The following table shows the t-test results for each pair of same-sized samples from Glumpland and Dolmania. As the sample size increases, notice what happens to the p-value and the confidence interval for the difference between the population means.
Again, the confidence intervals tend to get wider as the samples get smaller. With smaller samples, we're less certain of the precision of the estimate for the difference.
In fact, only for the two largest random samples (N=50 and N=100) is the p-value less than the 0.05 level of significance, allowing us to conclude that the mean ages of Glumplanders and Dolmanians are statistically different. For the three smallest samples (N=20, N=10, N=5), the p-value is greater than 0.05, and the confidence interval for each of these small samples includes 0. Therefore, we cannot conclude that there is a difference in the population means.
But remember, we already know that the true population means actually do differ by 5.4 years. We just can't statistically "prove" it with the small samples. That's why statisticians bristle when someone says, "The p-value is not less than 0.05. Therefore, there's no significant difference between the groups." There might very well be. So it's safer to say, especially with small samples, "we don't have enough evidence to conclude that there's a significant difference between the groups."
It's not just a matter of nit-picky semantics. It's simply the truth, as you can see when you take random samples of various sizes from the same known populations and test them for a difference.
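You can try a version of this experiment yourself in Python. The sketch below makes several assumptions: the two populations are synthetic stand-ins whose means differ by the 5.4 years mentioned above, the spread is made up, and the interval for the difference uses a simple pooled-variance formula for equal sample sizes, which may not match the exact test Minitab ran. The critical t values are hardcoded from a standard t-table, and `diff_ci_95` is a hypothetical helper name.

```python
import math
import random
import statistics

rng = random.Random(11)

# ASSUMPTION: synthetic stand-in populations. The 5.4-year gap in means
# comes from the text; the sizes, spread (SD 15), and shape are made up.
glumpland = [max(0.0, rng.gauss(53.0, 15)) for _ in range(100_000)]
dolmania = [max(0.0, rng.gauss(47.6, 15)) for _ in range(100_000)]

# Two-sided 95% critical t values for df = 2n - 2, from a standard
# t-table, keyed by the per-group sample size n.
T_CRIT_95 = {5: 2.306, 10: 2.101, 20: 2.024, 50: 1.984, 100: 1.972}

def diff_ci_95(a, b):
    """95% CI for mean(a) - mean(b); pooled variance, equal sample sizes."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    std_err = math.sqrt(pooled_var * 2 / n)
    diff = statistics.fmean(a) - statistics.fmean(b)
    margin = T_CRIT_95[n] * std_err
    return diff - margin, diff + margin

for n in T_CRIT_95:
    low, high = diff_ci_95(rng.sample(glumpland, n), rng.sample(dolmania, n))
    # An interval that excludes 0 corresponds to p < 0.05 for this test.
    verdict = "significant" if low > 0 or high < 0 else "not significant"
    print(f"n={n:>3}: difference CI ({low:.1f}, {high:.1f}) -> {verdict}")
```

The populations in this sketch differ by construction, so every "not significant" verdict is a case of too little evidence rather than no difference. With these made-up settings, the smallest samples will often fail to reach significance; change the seed and rerun to see how much the verdicts move around.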
If you have a random sample, you should always accompany estimates of statistical parameters with a confidence interval and p-value, whenever possible. Without them, there's no way to know whether you can safely extrapolate to the entire population. But if you already know every value of the population, you're good to go. You don't need a p-value, a t-test, or a CI, any more than you need a clue to determine what's inside a box if you already know what's in it.