Data analysis gives you the keys to how to manufacture the best product, provide the best services, or answer an academic research question. I’ll share practical tidbits that may help you do just that. Continue Reading »
Data mining can be helpful in the exploratory phase of an
analysis. If you're in the early stages and you're just figuring
out which predictors are potentially correlated with your response
variable, data mining can help you identify candidates. However,
there are problems associated with using data mining to select
In my previous post, we used data mining to settle on
the following... Continue Reading
mining uses algorithms to explore correlations in data sets. An
automated procedure sorts through large numbers of variables and
includes them in the model based on statistical significance alone.
No thought is given to whether the variables and the signs and
magnitudes of their coefficients make theoretical sense.
We tend to think of data mining in the context of big data, with
its huge... Continue Reading
performed multiple linear regression and have settled on a model
which contains several predictor variables that are statistically
significant. At this point, it’s common to ask, “Which variable is
This question is more complicated than it first appears. For one
thing, how you define “most important” often depends on your
subject area and goals. For another, how you collect... Continue Reading
Analysis of variance (ANOVA) can determine whether the means of
three or more groups are different. ANOVA uses F-tests to
statistically test the equality of means. In this post, I’ll show
you how ANOVA and F-tests work using a one-way ANOVA example.
But wait a minute...have you ever stopped to wonder why you’d
use an analysis of variance to determine whether
means are different? I'll also show how... Continue Reading
In statistics, t-tests are a type of hypothesis test that allows
you to compare means. They are called t-tests because each t-test
boils your sample data down to one number, the t-value. If you
understand how t-tests calculate t-values, you’re well on your way
to understanding how these tests work.
In this series of posts, I'm focusing on concepts rather than
equations to show how t-tests work.... Continue Reading
T-tests are handy hypothesis tests in statistics when you want to
compare means. You can compare a sample mean to a hypothesized or
target value using a one-sample t-test. You can compare the means
of two groups with a two-sample t-test. If you have two groups with
paired observations (e.g., before and after measurements), use the
How do t-tests work? How do t-values fit in? In this... Continue Reading
Likert scales are commonly associated with surveys and are used in
a wide variety of settings. You’ve run into the Likert scale if
you’ve ever been asked whether you strongly agree, agree, neither
agree or disagree, disagree, or strongly disagree about something.
The worksheet to the right shows what five-point Likert data look
like when you have two groups.
Because Likert item data are... Continue Reading
P values have been around for nearly a century and they’ve been
the subject of criticism since their origins. In recent years, the
debate over P values has risen to a fever pitch. In particular,
there are serious fears that P values are misused to such an extent
that it has actually damaged science.
In March 2016, spurred on by the growing concerns, the American
Statistical Association (ASA) did... Continue Reading
In statistics, there are things you need to do so you can trust
your results. For example, you should check the sample size, the
assumptions of the analysis, and so on. In regression analysis, I
always urge people to check their residual plots.
In this blog post, I present one more thing you should do so you
can trust your regression results in certain
circumstances—standardize the continuous... Continue Reading
In the world of linear models, a hierarchical model contains all
lower-order terms that comprise the higher-order terms that also
appear in the model. For example, a model that includes the
interaction term A*B*C is hierarchical if it includes these terms:
A, B, C, A*B, A*C, and B*C.
Fitting the correct regression model can be as
much of an art as it is a science. Consequently, there's not always
a... Continue Reading
If you perform linear regression analysis, you might need to
compare different regression lines to see if their constants and
slope coefficients are different. Imagine there is an established
relationship between X and Y. Now, suppose you want to determine
whether that relationship has changed. Perhaps there is a new
context, process, or some other qualitative change, and you want to
determine... Continue Reading
Control charts are a fantastic tool. These charts plot your
process data to identify common cause and special cause variation.
By identifying the different causes of variation, you can take
action on your process without over-controlling it.
Assessing the stability of a process can help you determine
whether there is a problem and identify the source of the problem.
Is the mean too high, too low,... Continue Reading
approaches, you are probably taking the necessary steps to protect
yourself from the various ghosts, goblins, and witches that are prowling
around. Monsters of all sorts are out to get you, unless they’re
sufficiently bribed with candy offerings!
I’m here to warn you about a ghoul that all statisticians and
data scientists need to be aware of: phantom degrees of freedom.
These phantoms... Continue Reading
Speaker John Boehner resigning, Kevin McCarthy quitting before the
vote for him to be Speaker, and a possible government shutdown in
the works, the Freedom Caucus has certainly been in the news
frequently! Depending on your political bent, the Freedom Caucus
has caused quite a disruption for either good or bad.
Who are these politicians? The Freedom Caucus is a group of
approximately 40... Continue Reading
An exciting new study sheds light on the relationship between P
values and the replication of experimental results. This study
highlights issues that I've emphasized repeatedly—it is crucial to
interpret P values correctly, and significant
results must be replicated to be trustworthy.
The study also supports my disagreement with the decision
by the Journal of Basic and Applied Social Psychology to
b... Continue Reading
Repeated measures designs don’t fit our impression of a typical
experiment in several key ways. When we think of an experiment, we
often think of a design that has a clear distinction between the
treatment and control groups. Each subject is in one, and only one,
of these non-overlapping groups. Subjects who are in a treatment
group are exposed to only one type of treatment. This is the... Continue Reading
analysis, overfitting a model is a real problem. An overfit model
can cause the regression coefficients, p-values, and R-squared to be misleading. In this post,
I explain what an overfit model is and how to detect and avoid this
An overfit model is one that is too complicated for your data
set. When this happens, the regression model becomes tailored to
fit the quirks and... Continue Reading
Minitab is the leading provider of software and services for quality
improvement and statistics education. More than 90% of Fortune 100 companies
use Minitab Statistical Software, our flagship product, and more students
worldwide have used Minitab to learn statistics than any other package.
Minitab Inc. is a privately owned company headquartered in State College,
Pennsylvania, with subsidiaries in the United Kingdom, France, and
Australia. Our global network of representatives serves more than 40
countries around the world.