Why Do So Many U.S. Voters Go MIA on Election Day? (Part II)

So what stops eligible U.S. voters from showing up at the polls on election day? I can’t tease out all the possible factors associated with variability in voter turnout in one blog post.

But you can often be forgiven in statistics if you clearly state at the outset that your main objective is a preliminary exploration rather than a final, conclusive analysis. So let’s explore.

Does temperature affect voter turnout?

Weather is sometimes suggested as a possible influence on voter turnout. Can you imagine the impact on the election if Hurricane Sandy had hit just one week later than it did?

But does it take the storm of a century to influence turnout? Might even more subtle weather variations, such as temperature drops, be associated with changes in turnout? 

To find out, I collected data on the mean election day temperature in the largest city of each state and the state’s voter turnout rate (as a percentage of the voting-eligible population, or VEP) for each U.S. presidential election from 1980 through 2008. Here’s a Minitab Statistical Software scatterplot with a regression line based on data from two U.S. states:

scatterplot no groups

Look at the trend…it seems that a lower mean temperature on election day might be associated with a higher turnout. That makes sense, right? When you get cold, you make an extra effort to vote because those polling stations are sooo warm and toasty inside. Sometimes they even have free coffee or hot chocolate!

I hope you’re jumping out of your chair and pulling your hair out now.

Not just because I implied that correlation equals causation, not just because the data don’t hug that fit line closely, but also because by lumping data for two states (NC and MN) together on one scatterplot, I did something even more misleading than your average political attack ad.

To see why, look at what happens when I display the same data using a categorical variable to differentiate the data for each state (choose Graph > Scatterplot > With Regression and Groups).

scatterplot with groups

Hmm…now we’ve got the opposite trend shown by the same data! Lower temperature seems to be loosely associated with lower turnout. Which scatterplot would you vote for—with or without groups?

In data analysis, beware of influence from special interest groups

Why does overlooking a group effect make such a big difference in these trends?

First, the predictor. Having lived in both states, I can say that a 50-degree day in November is a much different input in North Carolina than it is in Minnesota. And it's likely to have a different effect: Minnesotans might get out their beach towels; North Carolinians might get out their down jackets.

Second, the average turnout response for each state over this period is very different: about 72% for Minnesota, about 51% for North Carolina. Conglomerating the data without using a grouping variable masks the real trends in the relationship between temperature and turnout rate.
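To see how pooling two groups can flip a trend, here is a minimal Python sketch. The numbers are made up for illustration (they are not the election data from this post), and ols_slope is a hypothetical helper name, not a Minitab function.

```python
from statistics import mean

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Synthetic values, NOT the post's data: one colder group with high
# turnout and one warmer group with low turnout.
cold_temp = [30, 35, 40, 45]
cold_turnout = [70, 71, 73, 74]
warm_temp = [50, 55, 60, 65]
warm_turnout = [49, 50, 52, 53]

# Slope within each group, then slope with the groups lumped together.
slope_cold = ols_slope(cold_temp, cold_turnout)
slope_warm = ols_slope(warm_temp, warm_turnout)
slope_pooled = ols_slope(cold_temp + warm_temp, cold_turnout + warm_turnout)

print("cold group slope:", round(slope_cold, 3))
print("warm group slope:", round(slope_warm, 3))
print("pooled slope:    ", round(slope_pooled, 3))
```

Within each group the slope is positive, but the pooled slope comes out negative, because the gap between the two group averages dominates the fit.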

In this case, it was clear the data were from two different states at the outset. But a hidden group effect can be much more insidious, because you might not even be aware of groups in your data.

For example, suppose your company tracks the defective rate of a product in response to different inputs. If the data come from different facilities, each with its own conditions that affect the inputs and its own historical defective rate, you could fall into the same trap by analyzing all of the data without a grouping variable for facility.

One thing that should make you suspicious that hidden groups might be lurking beneath the surface is the presence of separate clusters of data in the scatterplot—as you can see in the first (top) scatterplot above.

How can I standardize data to account for a group effect?

So I have a group effect. Do I have to display 50 separate scatterplots, one for each state, to evaluate the voter turnout vs mean temperature data for the U.S.?

That sounds like a hassle. And it doesn't give me nearly enough data in each plot to confidently identify a trend. Any outlier (like the high value in the scatterplot for North Carolina) will have an unduly large influence on the regression line.

There's another option. To account for the differences between the two groups, I can use Minitab's Calc > Standardize tool to standardize the data. For each location, I choose the option to subtract that location's mean temperature across all of the election days from the temperature on each election day. That gives me a measure of how abnormally cold or warm it was on that day. Similarly, for the response, each location's mean VEP turnout rate across all of the election days is subtracted from its turnout rate (VEP) for each election day to indicate whether the turnout was relatively low or high for that year.
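Outside Minitab, the same idea (subtract each group's own mean, then combine the groups) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the values are synthetic rather than the post's data, and center and ols_slope are hypothetical helper names.

```python
from statistics import mean

def center(values):
    """Subtract the group's own mean from every value in the group."""
    m = mean(values)
    return [v - m for v in values]

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Synthetic values, NOT the post's data, keyed by a made-up group label.
groups = {
    "cold_state": {"temp": [30, 35, 40, 45], "turnout": [70, 71, 73, 74]},
    "warm_state": {"temp": [50, 55, 60, 65], "turnout": [49, 50, 52, 53]},
}

# Center temperature and turnout within each group, then pool the results.
pooled_temp, pooled_turnout = [], []
for data in groups.values():
    pooled_temp += center(data["temp"])
    pooled_turnout += center(data["turnout"])

slope_after_centering = ols_slope(pooled_temp, pooled_turnout)
print("pooled slope after centering:", round(slope_after_centering, 3))
```

Because each group is now measured as a deviation from its own average, the combined slope reflects the within-group trend instead of the gap between the groups.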

After the data is standardized to account for differences between the states, here's what it looks like on a Minitab scatterplot:

scatterplot of normalized temps

Now the trend for the combined data is consistent with the trend for each group. An increase in temperature from the mean seems to be associated with an increase in voter turnout. Notice the two separate clusters of data are gone.

It's still not much data to go on. But if I standardize the data for every state to account for the group effect, I can display the data for all 50 states on one scatterplot to examine the trend.

Unfortunately I don't have time to do that right now; it's election day and I need to check the outside temperature.

It won't affect whether I vote. But it might help me decide whether to grab a beach towel or a down coat before I head to the polls.

Note: If you'd like to follow along and create these graphs in Minitab, the data sets for this post are here.


Name: tamoghna • Tuesday, November 6, 2012

Hi Patrick, lots of useful ideas in this article. Would love to follow along with the data sets you used (P1 + P2). Can you please make them available for download?

Name: Patrick • Tuesday, November 6, 2012

Hi Tamoghna,
Happy to hear it's useful info--thanks for reading! I added a Note with a link to the Minitab project that contains the data sets and the graphs.

Just in case you also want instructions for creating the graphs:

Using the Original Data Worksheet:

Scatterplot 1: Graph > Scatterplot > With Regression. Enter % turnout for Y and Mean Temp for X.

Scatterplot 2: Graph > Scatterplot > With Regression and Groups. Enter % turnout for Y and Mean Temp for X. In Categorical variables for grouping, enter State.

Using the Standardize Data Worksheet:
Choose Calc > Standardize. Check Subtract Mean. Input the column Mean Temp_1 (for North Carolina). Store results in separate column. Do the same thing for % Turnout. Then repeat for the Minnesota temperature and turnout columns.

Using the Combined Standardized Data Worksheet:
Scatterplot 3: Graph > Scatterplot > With Regression. Enter % turnout for Y and Mean Temp for X.

Name: tamoghna • Tuesday, November 6, 2012

Woohoo!! Thank you so much for the data set and detailed instructions, Patrick. It's such an important concept to understand. Will be looking forward to your next blog post.

best regards

Name: tamoghna • Tuesday, November 6, 2012

Another question Patrick!

When should we use the other available options to standardize, such as:

subtract mean divided by SD
divide by SD
subtract first value and divided by second

Name: Patrick • Wednesday, November 7, 2012

An excellent question. I hope I can explain this clearly in a comment.

OPTION 1 Subtract mean and divide by SD (the default) gives you the classic standardization of a variable (it’s what’s typically thought of when you say “standardize” in statistics). It gives you the difference from the mean represented in number of standard deviations. For example, a standardized value of -2 means the value is 2 standard deviations below the mean. A value of 1.5 means the value is 1.5 standard deviations above the mean, and so on. It’s also called a Z-score or Z value and is used to compare differences in means using a Z test. That is, it allows you to directly compare two values from different normal distributions by representing both values in terms of a standard normal distribution with mean 0 and standard deviation 1. So basically it provides a way to standardize values so they can be compared directly. There are lots of other reasons it’s used as well. You can also easily calculate probabilities based on Z scores…which in turn allows you to calculate p-values. So this option standardizes a variable based on both differences in mean and variability.

OPTION 3 The option of dividing by the standard deviation gives you a way to standardize variability. For example, suppose I’m tracking variability of gold and silver prices annually over a 10-year period. Gold fluctuates more wildly than silver, so this option allows me to compare how much the price of each precious metal has fluctuated in relation to its average variation over the years. So this option helps me to identify “abnormal” fluctuations in variability for each group. In regression, raw residuals represent the distance of each data value from the fit line. Standardized residuals are residual values divided by an estimate of their standard deviation, which gives a better sense of each value’s “unusualness” as an outlier.

OPTION 4 Subtract first value and divide by second is basically the same as the first (default) option. But it allows you to standardize by using a mean and standard deviation other than that calculated in your sample data. Maybe you want to use a historical mean and standard deviation value rather than the value in your sample data.

OPTION 5 Transforms and rescales the data based on the scale range that you specify. You might use this if you want to compare scores on different scales by rescaling them to use the same scale.
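For readers who think in code, the five options can be sketched roughly as the Python functions below. This is an illustration, not Minitab's implementation: the function names are made up, and I'm assuming the sample standard deviation (n - 1 in the denominator) and a simple linear rescale for option 5.

```python
from statistics import mean, stdev

def subtract_mean_divide_sd(values):
    """Option 1: classic Z-score (assumes the sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def subtract_mean(values):
    """Option 2: difference from the mean, in the original units."""
    m = mean(values)
    return [v - m for v in values]

def divide_by_sd(values):
    """Option 3: rescale by the standard deviation without centering."""
    s = stdev(values)
    return [v / s for v in values]

def subtract_then_divide(values, first, second):
    """Option 4: like option 1, but with values you supply yourself,
    for example a historical mean and standard deviation."""
    return [(v - first) / second for v in values]

def rescale_to_range(values, start, end):
    """Option 5 (assumed behavior): map the minimum to start and the
    maximum to end. Fails if all values are identical."""
    lo, hi = min(values), max(values)
    return [start + (v - lo) * (end - start) / (hi - lo) for v in values]

sample = [2.0, 4.0, 6.0]
z_scores = subtract_mean_divide_sd(sample)
centered = subtract_mean(sample)
scaled = divide_by_sd(sample)
custom = subtract_then_divide(sample, 4.0, 2.0)
ranged = rescale_to_range(sample, 0.0, 1.0)
```

Note that options 1 and 4 give identical results whenever the values you supply to option 4 happen to equal the sample mean and standard deviation.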

By the way, I could have used the first (default) option to standardize temperature (and turnout) in these examples. By doing so, I’d be saying that I think the temperature in each state needed to be standardized in terms of its variability as well, not just by its difference from the mean. That might make sense if one state’s November 6 temps were so much more variable than another’s that a given temperature change would be perceived differently by the people in each state. I preferred to use option 2 (subtract the mean only) to keep the scatterplot scale more intuitive and because I didn’t think this variability was an important factor in how U.S. voters might respond to temperature changes on election day. You could argue it might matter for states like Hawaii, which shows much less variability in temperature in November, so smaller fluctuations in its temperature might be perceived differently by its voters than in other states. But in the examples in this post, if you use option 1 instead of option 2 to standardize the temperature and turnout values for these 2 states and then graph the results on a scatterplot (it’s a good exercise, so try it!), you’ll get pretty much the same results. So in this instance it wasn’t important to standardize the variability as well.

Sorry this explanation is sooooo long. Hope it helps.

Name: tamoghna • Thursday, November 8, 2012

Oh wow!!

Thank you so much for the detailed explanation. I would love to play around with all the available options and see how each one affects the model.

Thank you so much Patrick!

Your passion is contagious.
