Adventures in Statistics

Thanks to my desire to understand the deeper mechanics that lie behind what we observe in the world, I suppose it’s natural that I love data analysis. Observation is great, but I can only observe a small slice of reality. I really want to understand the larger picture and know how it all works. Data analysis gives you the keys to do just this, whether you are studying how to manufacture the best product or provide the best services, or answering an academic research question.

I’m Jim Frost and I came to Minitab with a background in a wide variety of academic research. My role was the “data/stat...

Expanding the Role of Statistics to Areas Traditionally Dominated by Expert Judgment

Should this doctor consult a regression model?

In a previous post, I wrote about how the field of statistics is more important now than ever before due to the modern deluge of data. Because you’re reading Minitab's statistical blog, I’ll assume that we’re in agreement that statistics allows you to use data to understand reality. However, I’d also bet that you’re picturing important but “typical” statistical studies, such as studies where Six Sigma analysts determine which factors affect product quality. Or perhaps medical studies, like determining the effectiveness of flu shots.

In this post,...

What Are the Effects of Multicollinearity and When Can I Ignore Them?

Multicollinearity is a problem that you can run into when you’re fitting a regression model or another linear model. It refers to predictors that are correlated with other predictors in the model. Unfortunately, the effects of multicollinearity can feel murky and intangible, which makes it unclear whether it’s important to fix.

My goal in this blog post is to bring the effects of multicollinearity to life with real data! Along the way, I’ll show you a simple tool that can remove multicollinearity in some cases.


When Should I Use Confidence Intervals, Prediction Intervals, and Tolerance Intervals?

In statistics, we use a variety of intervals to characterize the results. The most well-known of these are confidence intervals. However, confidence intervals are not always appropriate. In this post, we’ll take a look at the different types of intervals that are available in Minitab, their characteristics, and when you should use them.

I’ll cover confidence intervals, prediction intervals, and tolerance intervals. Because tolerance intervals are the least-known, I’ll devote extra time to explaining how they work and when you’d want to use them.

What are Confidence Intervals?

A confidence...

Great Presidents Revisited: Does History Provide a Different Perspective?

Recently, Patrick Runkel blogged about using regression models to explain how historians ranked the U.S. presidents. Given both that I love regression and that I’ve written about using regression to predict U.S. presidential elections, I wanted to take Patrick up on his challenge to improve upon his model.

My goal isn’t merely to predict the eventual ranking for any President. Instead, I’m much more interested in a fascinating question behind this analysis. Is the public’s contemporary assessment of the president consistent with the historical perspective, or do they differ?

With this in mind,...

When is Easter . . . for the next 2086 years?

Spring is in the air, and Easter is coming up soon! Easter occurs on March 31, 2013, and I’ve heard people exclaim that it’s early this year. I never really remember the date of Easter from one year to the next, but I had vague memories of it being in March not too long ago. Like any good statistician, I started wondering about the distribution of Easter dates. What dates are more common and which are less common? Is Easter in March really that unusual?
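One way to explore the distribution of Easter dates is to compute them directly. The sketch below is only an illustration, not necessarily the method used in the post: it assumes the Gregorian (Western) Easter and uses the well-known anonymous Gregorian algorithm (often credited to Meeus, Jones, and Butcher). The function name `easter_date` and the year range 2013 to 2099 are hypothetical choices made for the example.

```python
from collections import Counter
from datetime import date


def easter_date(year):
    """Gregorian Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day_index = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day_index + 1)


# Tally how often Easter falls in March versus April over an
# illustrative range of years (the range is an assumption).
month_counts = Counter(easter_date(y).strftime("%B") for y in range(2013, 2100))
print(easter_date(2013))  # 2013-03-31, matching the date given above
print(month_counts)
```

Tallying by full date instead of by month (for example, `Counter(easter_date(y).strftime("%B %d") ...)`) would show which specific dates are more or less common.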

Even after reading the official definition of when Easter occurs, I still wasn’t clear about the date range. Easter occurs on the Sunday that...

Using Data Analysis to Assess Fatality Rates in Star Trek: The Original Series

I’m a Star Trek fan and a statistics fan. So, I’m thrilled to finally have the opportunity to combine the two into a blog post! In the original Star Trek series with Captain Kirk, the crew members of the U.S.S. Enterprise who wear red shirts have a reputation for dying more frequently than those who wear blue or gold shirts. Wearing a red shirt appears to be the kiss of death! In this blog, we’ll conduct several hypothesis tests to determine whether this is true.

Matthew Barsalou published an article in Significance that studies this from a statistical perspective. Barsalou is also a guest...

Why Statistics Is Important

"There are three kinds of lies: lies, damned lies, and statistics."

I’m sure you’ve heard this most vile expression, which was popularized by Mark Twain among others. This dastardly phrase impugns the reputation of statistics. The implication is that statistics can bolster a weak argument, or that statistics can be used to prove anything.

I’ve had enough of this expression, and here’s the rebuttal! In fact, I’ll make the case that statistics is not the problem, but the solution!

Mistakes Can Happen

First, let’s stipulate that an unscrupulous person can intentionally manipulate the results to favor...

Flu Shot Followup: Assessing the Long-Term Benefits of Flu Vaccination

In my last post, I wrote about the 60% effectiveness rate for flu shots that news media commonly report. The effectiveness is actually a relative measure of the reduction in your flu risk if you’re vaccinated. Relative measures are hard to interpret without additional information. With that in mind, I reanalyzed the data to put it in absolute terms. I found that if you get a flu shot, your average annual risk of getting the flu drops from 7.0% to 1.9%, which is a reduction of 5.1 percentage points.
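To make the difference between relative and absolute measures concrete, here is a minimal sketch of the arithmetic using only the two risk figures quoted above. The function name `risk_reduction` is hypothetical. Note that the relative reduction implied by these two averages is not the same number as the 60% headline figure; this excerpt does not say how the two are reconciled, so treat the code as an illustration of the definitions only.

```python
def risk_reduction(unvaccinated_risk, vaccinated_risk):
    """Return (absolute, relative) risk reduction as proportions."""
    absolute = unvaccinated_risk - vaccinated_risk
    relative = absolute / unvaccinated_risk
    return absolute, relative


# Average annual flu risks quoted in the text: 7.0% and 1.9%.
absolute, relative = risk_reduction(0.070, 0.019)
print(f"Absolute reduction: {absolute * 100:.1f} percentage points")
print(f"Relative reduction: {relative:.1%}")
```

The absolute figure answers "how much does my own risk change?", while the relative figure depends on the baseline risk and is harder to interpret without it.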

I’ve received several requests to look at this over a longer timeframe. After all, flu shots aren’t a one-time thing....

How Effective Are Flu Shots?

This flu season has been worse than normal. The Centers for Disease Control and Prevention (CDC) data show that the flu has struck early and hard. Influenza cases shot up during December rather than the more usual January or February, and 47 states report widespread influenza cases.

I get a flu shot every year even though I know they’re not perfect. I figure they’re a relatively easy and inexpensive way to reduce the chance of having a miserable week.

I’ve heard on various news media that their effectiveness is about 60%. But what does 60% effectiveness mean, exactly? How much does this...

The Monty Hall Problem and the Importance of Checking Your Assumptions

The "Monty Hall Problem" in statistics comes from the old TV game show Let’s Make a Deal. In the game, Monty asks you to guess which door a prize is hidden behind. This problem is interesting because the solution is so counter-intuitive that it's tripped up some of the world’s leading mathematicians! In fact, this problem's mind-bending quality reminds me of the nature of optical illusions.
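A quick simulation is one way to check the counter-intuitive answer for yourself. The sketch below assumes the standard version of the problem, which the paragraph above does not spell out: three doors, one prize, Monty knows where the prize is, always opens a non-prize door other than your pick, and always offers the switch. The function names `play` and `win_rate` are hypothetical.

```python
import random


def play(switch, rng):
    """Play one game; return True if the final pick wins the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumption: Monty opens a door that is neither the pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the win probability for a stay or switch strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print(f"Stay:   {win_rate(switch=False):.3f}")
print(f"Switch: {win_rate(switch=True):.3f}")
```

Under these assumptions, switching wins about two-thirds of the time and staying about one-third. Changing the assumptions (for example, letting Monty open a door at random) changes the result, which is why checking them matters.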

I’ve always been fascinated by optical illusions because some of them remain very compelling even after you know the truth about them. For example, even though I know square A is the same shade of grey as...

How to Test Your Discrete Distribution

In my last post we looked at different discrete distributions and how you can use them. This time, I’ll show you how to determine whether your data follow a specific discrete distribution. (Read here to see how to identify the distribution of your continuous data.)

Before we start testing discrete distributions, we need to distinguish between two general cases. In some cases, it is more important to:

  • Check the assumptions (binary data)
  • Perform a goodness-of-fit test

Checking Assumptions for Distributions that Use Binary Data

For the distributions of binary data, you primarily need to determine...

Understanding and Using Discrete Distributions

Previously, I’ve written about how to use Minitab to identify the distribution of your continuous data. That blog post prompted several questions about how to use and identify discrete distributions.

If you are a quality improvement analyst who works with counts of defects or pass/fail inspections, you may be particularly interested in these types of discrete distributions. In this blog, I’ll show you how to use discrete distributions in Minitab statistical software. My next blog will show you how to determine whether your data follow a specific discrete distribution.

Continuous versus Discrete...

Predicting the U.S. Presidential Election: Evaluating Two Models (Part Two)

Yesterday, I presented a model that uses Dow Jones data to predict the winner in Presidential elections that have an incumbent. Today, I test a model that uses S&P 500 data. (Here are the data for today's blog that you can use in Minitab Statistical Software.)

Model 2: The Three Month Change in the S&P 500

The second model is presented by Sam Stovall, Chief Equity Strategist at S&P Capital IQ in his paper, “The Presidential Predictor: Stock Price Performances Have Typically Presaged Victors.” Unlike the Dow Jones study, this paper was written vaguely and presented unhelpful statistics. Also, the...

Predicting the U.S. Presidential Election: Evaluating Two Models (Part One)

You may have read about statistical models that claim to predict the outcome of the upcoming Presidential election. It’s easy to imagine that these models are complicated and contain many demographic, sociological, economic, and political factors. However, I was surprised to read in this article that two simple models supposedly generate accurate predictions.

Both of these models use stock market data. One model is based on the Dow Jones and the other on the S&P 500. Statistics are best when they are a hands-on experience, so while neither study included the data, I obtained both the stock...

How to Be a Ghost Hunter with a Statistical Mindset

I’m very much a data, empirical, science type of guy. So, it might be a surprise to learn that I’ve gone ghost hunting a half-dozen times over the past 3 years. Now, I’m not a paranormal enthusiast. I’m definitely a skeptic. However, in my view, being skeptical about something does not preclude collecting data about it. I also have friends I trust completely who are sure they’ve experienced paranormal activity. Plus, I don’t need much of an excuse to try something new and unusual!

Three of us skeptical ghost hunters have spent the night by ourselves in a variety of supposedly haunted prisons...

U.S. Job Growth: Assessing the Numbers and Making Predictions

On Friday, October 5, the Bureau of Labor Statistics (BLS) released the September Employment Situation Summary, and it was a doozy! Right before the Presidential election, the unemployment rate dropped dramatically to below 8% for the first time since January 2009. Because the report was startling and contained contradictory information, these job numbers were received with some skepticism.

Previously, I’ve blogged about changes in the quarterly GDP and how important it is to understand the larger context of the inherent variability of the changes, the imprecise nature of the...

Using Statistics to Analyze Words: Digging Deeper

In my last blog, I showed how it’s possible to statistically assess the structure of a message and determine its capacity to convey information. We saw how my own words fit the patterns that are present in communications that are optimized for conveying information. However, these were fairly rough assessments to illustrate the fundamentals of information theory. 

In this post, I’ll use more sophisticated analyses to more precisely determine whether my blog content fits the ideal distribution. Along the way, we’ll have some interesting discussions about the vagaries of dolphin, human,...

Using Statistics to Analyze Words: Detecting the Signature of Information

Science television shows are the main reason that we have cable TV in my house! We recently saw a show in which researchers recorded dolphin squeaks to determine whether their sounds are a real language. The researchers claimed that word usage in all human languages follows a specific distribution, and they were going to determine whether dolphin sounds follow the same distribution.

It turns out that they do.

This led the scientists to conclude that the dolphin's language has the capacity to efficiently convey information, just like human language. Upon further research, I found that some SETI...

Why Anecdotal Evidence is Unreliable: The case against raspberry ketones and Dr. Oz

My wife occasionally watches The Dr. Oz Show and I’ve gotten drawn into it. I’ve noticed how Mehmet Oz frequently recommends dietary supplements for many conditions. And, he mentions that the show has a research department that checks everything out.

Previously, I’ve written about how scientists have a hard enough time assessing the benefits and risks of vitamins. So, I wondered, how was it possible to draw conclusions about supplements? After all, many of them haven’t had clinical trials with humans.

The final straw was when Dr. Oz called raspberry ketones "the number 1 miracle in a bottle to...

The Veepstakes Revisited: Using the Solution Desirability Matrix to Understand Why Romney Chose Paul Ryan

In my blog post from about 3 weeks ago, I used the Solution Desirability Matrix in Quality Companion to simulate how Mitt Romney might choose his VP candidate. This past weekend, Mitt Romney ultimately chose Paul Ryan, the Wisconsin U.S. Representative. In this blog, I’ll take a look at how the previous analysis fared and see what we can learn from it.

At the time of my initial analysis, there were about 20 potential candidates and I included the top 12 in the matrix. The top 5 were:

  • Tim Pawlenty
  • Rob Portman
  • Pat Toomey
  • Paul Ryan
  • Bobby Jindal

Paul Ryan came in at #4, and I’m actually pretty happy with...