So what stops eligible U.S. voters from showing up at the polls on election day? I’m not going to be able to tease out all the possible factors associated with variability in voter turnout in one blog post.

But you can often be forgiven in statistics if you clearly state at the outset that your main objective is a preliminary exploration rather than a final, conclusive analysis. So let’s explore.

## Does temperature affect voter turnout?

Weather is sometimes suggested as a possible influence on voter turnout. Can you imagine the impact on the election if Hurricane Sandy had hit just one week later than it did?

But does it take the storm of a century to influence turnout? Might even more subtle weather variations, such as temperature drops, be associated with changes in turnout?

To find out, I collected data on the mean election day temperature in the largest city of each state and the state’s voter turnout rate (using VEP) for each U.S. presidential election from 1980 through 2008. Here’s a Minitab Statistical Software scatterplot with a regression line based on data from two U.S. states:

Look at the trend…it seems that a lower mean temperature on election day might be associated with a higher turnout. That makes sense, right? When you get cold, you make an extra effort to vote because those polling stations are sooo warm and toasty inside. Sometimes they even have free coffee or hot chocolate!

I hope you’re jumping out of your chair and pulling your hair out now.

Not just because I implied that correlation equals causation, not just because the data don’t hug that fit line closely, but also because by lumping data for two states (NC and MN) together on one scatterplot, I did something even more misleading than your average political attack ad.

To see why, look what happens when I display the same data using a categorical variable to differentiate the data for each state (choose Graph > Scatterplot > With Regression and Groups).

Hmm…now we’ve got the opposite trend shown by the same data! Lower temperature seems to be loosely associated with lower turnout. Which scatterplot would you vote for—with or without groups?

## In data analysis, beware of influence from special interest groups

Why does overlooking a group effect make such a big difference in these trends?

First, the predictor. Having lived in both states, I can say that a 50-degree day in November is a much different input in North Carolina than it is in Minnesota. And it's likely to have a different effect: Minnesotans might get out their beach towels; North Carolinians might get out their down jackets.

Second, the average turnout response for each state over this period is very different: about 72% for Minnesota, about 51% for North Carolina. Conglomerating the data without using a grouping variable masks the real trends in the relationship between temperature and turnout rate.
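To see how pooling two groups can flip the sign of a trend, here is a minimal Python sketch. The numbers are invented for illustration (they are not the election data used in this post), and `slope` is a hypothetical helper written for this example, not a Minitab function. It fits a least-squares slope to each group separately and then to the combined data.

```python
def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented values: one colder group with high turnout,
# one warmer group with low turnout.
cold_temp, cold_turnout = [30, 35, 40, 45], [70, 71, 73, 74]
warm_temp, warm_turnout = [50, 55, 60, 65], [49, 50, 52, 53]

# Trend within each group, then for the data lumped together.
slope_cold = slope(cold_temp, cold_turnout)
slope_warm = slope(warm_temp, warm_turnout)
slope_pooled = slope(cold_temp + warm_temp, cold_turnout + warm_turnout)

print(f"cold group slope: {slope_cold:+.2f}")
print(f"warm group slope: {slope_warm:+.2f}")
print(f"pooled slope:     {slope_pooled:+.2f}")
```

With these made-up values, both within-group slopes come out positive while the pooled slope comes out negative, because the gap between the two group averages dominates the combined fit.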

In this case, it was clear from the outset that the data were from two different states. But a hidden group effect can be much more insidious, because you might not even be aware of groups in your data.

For example, suppose your company tracks the defective rate of a product in response to different inputs. If the data come from different facilities, each with its own conditions that affect the inputs and its own historical defective rate, you could fall into the same trap by analyzing all of the data together without a grouping variable for facility.

One thing that should make you suspicious that hidden groups might be lurking beneath the surface is the presence of separate clusters of data in the scatterplot—as you can see in the first (top) scatterplot above.

## How can I standardize data to account for a group effect?

So I have a group effect. Do I have to display 50 separate scatterplots, one for each state, to evaluate the voter turnout vs mean temperature data for the U.S.?

That sounds like a hassle. And it doesn't give me nearly enough data in each plot to confidently identify a trend. Any outlier (like the high value in the scatterplot for North Carolina) will have an unduly large influence on the regression line.

There's another option. To account for the differences between the two groups, I can use Minitab's Calc > Standardize tool to standardize the data. For each state, I choose the option that subtracts that state's mean temperature across all of the election days from its temperature on each election day. That gives me a measure of how abnormally cold or warm it was on that day for that location. Similarly, for the response, each state's mean VEP turnout rate across all of the election days is subtracted from its turnout rate (VEP) for each election day to indicate whether turnout was relatively low or high for that year.
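Outside Minitab, the same idea (subtracting each group's own mean from its values) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are invented numbers, not the data for this post, and `center_by_group` and `slope` are hypothetical helper names.

```python
from collections import defaultdict


def center_by_group(rows):
    """Subtract each group's own mean temperature and mean turnout.

    rows: list of (group, temperature, turnout) tuples.
    Returns a list of (group, temp_deviation, turnout_deviation) tuples.
    """
    by_group = defaultdict(list)
    for group, temp, turnout in rows:
        by_group[group].append((temp, turnout))
    means = {
        g: (sum(t for t, _ in v) / len(v), sum(r for _, r in v) / len(v))
        for g, v in by_group.items()
    }
    return [(g, t - means[g][0], r - means[g][1]) for g, t, r in rows]


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented rows: group "A" is colder with higher turnout than group "B".
rows = [
    ("A", 30, 70), ("A", 35, 71), ("A", 40, 73), ("A", 45, 74),
    ("B", 50, 49), ("B", 55, 50), ("B", 60, 52), ("B", 65, 53),
]

centered = center_by_group(rows)
raw_slope = slope([t for _, t, _ in rows], [r for _, _, r in rows])
centered_slope = slope([t for _, t, _ in centered], [r for _, _, r in centered])

print(f"pooled slope, raw data:      {raw_slope:+.2f}")
print(f"pooled slope, centered data: {centered_slope:+.2f}")
```

In this toy example the raw pooled slope is negative, but after centering each group on its own means the pooled slope is positive, matching the direction of the trend inside each group.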

After the data are standardized to account for differences between the states, here's what they look like on a Minitab scatterplot:

Now the trend for the combined data is consistent with the trend for each group. An increase in temperature from the mean seems to be associated with an increase in voter turnout. Notice the two separate clusters of data are gone.

It's still not much data to go on. But if I standardize the data for every state to account for the group effect, I can display the data for all 50 states on one scatterplot to examine the trend.

Unfortunately I don't have time to do that right now; it's election day and I need to check the outside temperature.

It won't affect whether I vote. But it might help me decide whether to grab a beach towel or a down coat before I head to the polls.

Note: If you'd like to follow along and create these graphs in Minitab, the data sets for this post are here.