# Statistics Help

Blog posts and articles that offer tips about the statistics used in Lean and Six Sigma quality improvement projects.

If you need to assess process performance relative to some specification limit(s), then process capability is the tool to use. You collect some accurate data from a stable process, enter those measurements in Minitab, and then choose Stat > Quality Tools > Capability Analysis/Sixpack or Assistant > Capability Analysis. Now, what about sorting the data? I’ve been asked “why does Cpk change when I... Continue Reading
In the world of linear models, a hierarchical model contains all lower-order terms that comprise the higher-order terms that also appear in the model. For example, a model that includes the interaction term A*B*C is hierarchical if it includes these terms: A, B, C, A*B, A*C, and B*C. Fitting the correct regression model can be as much of an art as it is a science. Consequently, there's not always a... Continue Reading
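The definition of a hierarchical model in the teaser above can be checked mechanically. Here is a minimal Python sketch, not taken from the original post: the function names `lower_order_terms` and `is_hierarchical` are hypothetical, and it assumes each term is written as factor names joined by `*` (for example `A*B*C`).

```python
from itertools import combinations


def lower_order_terms(term):
    """Return every lower-order term implied by a term such as 'A*B*C'.

    Assumption: a term is a string of factor names joined by '*'.
    """
    factors = sorted(term.split("*"))
    implied = set()
    # Every proper, non-empty subset of the factors is a lower-order term.
    for size in range(1, len(factors)):
        for combo in combinations(factors, size):
            implied.add("*".join(combo))
    return implied


def is_hierarchical(model_terms):
    """Check that each term's lower-order terms also appear in the model."""
    # Normalize factor order so that 'B*A' and 'A*B' count as the same term.
    normalized = {"*".join(sorted(t.split("*"))) for t in model_terms}
    return all(lower_order_terms(t) <= normalized for t in normalized)


full_model = ["A", "B", "C", "A*B", "A*C", "B*C", "A*B*C"]
reduced_model = ["A", "B", "A*B*C"]

print(sorted(lower_order_terms("A*B*C")))
print(is_hierarchical(full_model))
print(is_hierarchical(reduced_model))
```

For `A*B*C`, the implied terms are the six listed in the teaser (A, B, C, A*B, A*C, and B*C), so the first model passes the check and the second, which omits C and all two-way interactions, does not.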

If you perform linear regression analysis, you might need to compare different regression lines to see if their constants and slope coefficients are different. Imagine there is an established relationship between X and Y. Now, suppose you want to determine whether that relationship has changed. Perhaps there is a new context, process, or some other qualitative change, and you want to determine... Continue Reading
When you work in data analysis, you quickly discover an irrefutable fact: a lot of people just can't stand statistics. Some people fear the math, some fear what the data might reveal, some people find it deadly dull, and others think it's bunk. Many don't even really know why they hate statistics—they just do. Always have, probably always will.  Problem is, that means we who analyze data need to com... Continue Reading
Back when I was an undergrad in statistics, I unfortunately spent an entire semester of my life taking a class, diligently crunching numbers with my TI-82, before realizing 1) that I was actually in an Analysis of Variance (ANOVA) class, 2) why I would want to use such a tool in the first place, and 3) that ANOVA doesn’t necessarily tell you a thing about variances. Fortunately, I've had a lot more... Continue Reading
Control charts are a fantastic tool. These charts plot your process data to identify common cause and special cause variation. By identifying the different causes of variation, you can take action on your process without over-controlling it. Assessing the stability of a process can help you determine whether there is a problem and identify the source of the problem. Is the mean too high, too low,... Continue Reading
As Halloween approaches, you are probably taking the necessary steps to protect yourself from the various ghosts, goblins, and witches that are prowling around. Monsters of all sorts are out to get you, unless they’re sufficiently bribed with candy offerings! I’m here to warn you about a ghoul that all statisticians and data scientists need to be aware of: phantom degrees of freedom. These phantoms... Continue Reading
In Part 3 of our series, we decided to test our 4 experimental factors, Club Face Tilt, Ball Characteristics, Club Shaft Flexibility, and Tee Height in a full factorial design because of the many advantages of that data collection plan. In Part 4 we concluded that each golfer should replicate their half fraction of the full factorial 5 times in order to have a high enough power to detect... Continue Reading
I read trade publications that cover everything from banking to biotech, looking for interesting perspectives on data analysis and statistics, especially where it pertains to quality improvement. Recently I read a great blog post from Tony Taylor, an analytical chemist with a background in pharmaceuticals. In it, he discusses the implications of the FDA's updated guidance for industry analytical... Continue Reading
An exciting new study sheds light on the relationship between P values and the replication of experimental results. This study highlights issues that I've emphasized repeatedly—it is crucial to interpret P values correctly, and significant results must be replicated to be trustworthy. The study also supports my disagreement with the decision by the Journal of Basic and Applied Social Psychology to b... Continue Reading
Repeated measures designs don’t fit our impression of a typical experiment in several key ways. When we think of an experiment, we often think of a design that has a clear distinction between the treatment and control groups. Each subject is in one, and only one, of these non-overlapping groups. Subjects who are in a treatment group are exposed to only one type of treatment. This is the... Continue Reading
When I started out on the blog, I spent some time showing some data sets that make it easy to illustrate statistical concepts. It’s easier to show someone how something works with something familiar than with something they’ve never thought about before. Need a quick illustration to share with someone about how to summarize a variable in Minitab? See if they have a magazine on their desk, and... Continue Reading
Whatever industry you're in, you're going to need to buy supplies. If you're a printer, you'll need to purchase inks, various types of printing equipment, and paper. If you're in manufacturing, you'll need to obtain parts that you don't make yourself.  But how do you know you're making the right choice when you have multiple suppliers vying to fulfill your orders?  How can you be sure you're... Continue Reading
In regression analysis, overfitting a model is a real problem. An overfit model can cause the regression coefficients, p-values, and R-squared to be misleading. In this post, I explain what an overfit model is and how to detect and avoid this problem. An overfit model is one that is too complicated for your data set. When this happens, the regression model becomes tailored to fit the quirks and... Continue Reading
Statisticians say the darndest things. At least, that's how it can seem if you're not well-versed in statistics.  When I began studying statistics, I approached it as a language. I quickly noticed that compared to other disciplines, statistics has some unique problems with terminology, problems that don't affect most scientific and academic specialties.  For example, dairy science has a highly... Continue Reading
If you've read the first two parts of this tale, you know it started when I published a post that involved transforming data for capability analysis. When an astute reader asked why Minitab didn't seem to transform the data outside of the capability analysis, it revealed an oversight that invalidated the original analysis.  I removed the errant post. But to my surprise, the reader who helped me... Continue Reading
Last time, I told you how I had double-checked the analysis in a post that involved running the Johnson transformation on a set of data before doing normal capability analysis on it. A reader asked why the transformation didn't work on the data when you applied it outside of the capability analysis.  I hadn't tried transforming the data that way, but if the transformation worked when performed as... Continue Reading
Every now and then I’ll test my Internet speed at home using such sites as http://speedtest.comcast.net.  My need to perform these tests could stem from the cool-looking interfaces they employ on their site, as they display the results using analog speedometers and RPM meters. They could also stem from the validation that I need in "getting what I am paying for," although I realize that there are... Continue Reading
Last month the ESPN series Outside the Lines reported on major league pitchers suffering serious injuries from being struck in the head by line drives, and efforts MLB is making towards having protective gear developed for pitchers. You can view the report here if you'd like: A couple of things jump out at me from the clip: The overwhelming majority of pitchers are not interested in wearing... Continue Reading
Previously, I’ve written about how to interpret regression coefficients and their individual P values. I’ve also written about how to interpret R-squared to assess the strength of the relationship between your model and the response variable. Recently I've been asked, how does the F-test of the overall significance and its P value fit in with these other statistics? That’s the topic of this post! In... Continue Reading