Statistical Model Predicts The Hunger Games DVD Sales. Real or Not Real?

If you're one of the gazillion people who have read The Hunger Games, then you’re quite familiar with “Real or Not Real?” And if you haven't read it, I'm guessing you've at least heard about this best-selling trilogy.

The Hunger Games movie, like the book, has been a huge success, grossing over $400 million domestically. I recently saw an ad for the DVD to be released on August 18 and it got me thinking. Could I use statistics to predict DVD sales? If my model below is as good as it looks, then the answer is "yes."

Or should I say, “real”?

Creating a Statistical Model: Where to Begin?

I rarely read books – statistics books excluded, of course – and I rarely go to the movies, yet I've read all the Harry Potter and Twilight books and seen the films, and now The Hunger Games too. Presuming I'm not an outlier, I decided to focus specifically on movies that are based on books. I also limited my data collection to the last 10 years because I didn’t want changes in DVD prices over time to skew my results (and I used data to justify this decision). Plus, DVD movies weren’t available in the U.S. until 1996, and it took a while for DVD players to become common in the average household.

So we’re looking to predict initial DVD sales (defined as those in the first 4 weeks after release) for books-turned-movies released in the last decade. Once I defined the scope of my data collection, I identified the following potentially important factors:

  • Movie domestic gross
  • Movie rating (PG, PG-13, R)
  • Tomato meter – % of rottentomatoes.com critics who gave movie a positive review
  • Tomato audience - % of rottentomatoes.com users who gave movie 3.5 stars or higher
  • Whether or not the movie is a sequel
  • Number of theaters showing the movie at widest release
  • Movie budget
  • Number of book reviews on Amazon.com
  • Percentage of 5-star ratings for book reviews on Amazon.com

In addition to using www.rottentomatoes.com and www.amazon.com, I used www.boxofficemojo.com to collect data on Movie Domestic Gross, Theaters and Movie Budget, and www.the-numbers.com to collect data on DVD sales.

Finding Significant Factors in the Model

A common approach for finding the best statistical model is to include all of the factors and interactions you think might be important, analyze the data, remove the term with the highest p-value, then repeat the process until you're left with only significant factors and interactions. This is called "reducing the model."

I used Minitab's Stat > ANOVA > General Linear Model (or you could use Stat > Regression > General Regression) to run the analysis, which included two-way interactions (e.g., Theaters*Budget) that I thought might be important. To reduce the model, I started by removing the insignificant two-way interactions followed by the main effects.

Here’s the model I ended up with. As you can see, all p-values but one are less than an alpha-level of 0.05 and thus are significant. The Theaters p-value is not significant, but I left it in the model since it's part of interactions that are.
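For readers who want to see the reduction logic outside of Minitab, here is a minimal Python sketch of the loop described above. It is only an illustration under stated assumptions: the function names, the term names, and every p-value are hypothetical, and the fitting step is a stand-in that returns fixed numbers, whereas a real analysis would refit the model and recompute p-values after each removal. The rule that protects a main effect while an interaction containing it remains is one common way to handle terms like Theaters, not necessarily what Minitab does internally.

```python
from typing import Callable, Dict, List


def main_effects_in(term: str) -> List[str]:
    """Return the main effects inside an interaction such as 'Theaters*Budget'."""
    return term.split("*") if "*" in term else []


def reduce_model(
    terms: List[str],
    fit_pvalues: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,
) -> List[str]:
    """Drop the least significant removable term, refit, and repeat."""
    kept = list(terms)
    while kept:
        pvals = fit_pvalues(kept)  # refit using only the terms that remain
        # A main effect stays as long as some remaining interaction contains it.
        protected = {m for t in kept for m in main_effects_in(t)}
        removable = [t for t in kept if t not in protected and pvals[t] > alpha]
        if not removable:
            break
        kept.remove(max(removable, key=lambda t: pvals[t]))
    return kept


# Hypothetical p-values, invented for illustration only (not the post's results).
MADE_UP_P = {
    "Gross": 0.001,
    "Theaters": 0.30,
    "Budget": 0.02,
    "Sequel": 0.60,
    "Theaters*Budget": 0.01,
    "Gross*Sequel": 0.45,
}


def fake_fit(terms: List[str]) -> Dict[str, float]:
    """Stand-in for a real model fit: looks up the fixed, made-up p-values."""
    return {t: MADE_UP_P[t] for t in terms}


final_terms = reduce_model(list(MADE_UP_P), fake_fit)
print(final_terms)
```

With these made-up numbers, the Gross*Sequel interaction is removed first, then Sequel, and Theaters survives despite its high p-value because Theaters*Budget is still in the model.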

Wow! R-squared is very high at 92.77%, which is great. However, I noticed that Minitab flagged an unusual observation with a standardized residual of -3.30. Wary of any such values below -3 or above +3, I took a closer look at this row of data: it's for Twilight: Breaking Dawn (Part 1).
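If you're curious how a flag like this can be computed, here is a minimal Python sketch for the simplest case, a straight-line fit with one predictor. It assumes the standardized residual is the raw residual divided by its estimated standard deviation, s·sqrt(1 − leverage). The data are hypothetical toy numbers with one low value planted on purpose, not the movie data, and the function name is my own.

```python
import math
from typing import List


def standardized_residuals(xs: List[float], ys: List[float]) -> List[float]:
    """Standardized residuals for a one-predictor least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))  # residual standard error
    result = []
    for x, e in zip(xs, resid):
        leverage = 1 / n + (x - x_bar) ** 2 / sxx
        result.append(e / (s * math.sqrt(1 - leverage)))
    return result


# Hypothetical toy data: a clean line with small alternating noise.
xs = [float(x) for x in range(1, 21)]
ys = [2 * x + 1 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]
ys[10] -= 15  # plant one unusually low observation

std_res = standardized_residuals(xs, ys)
flagged = [i for i, r in enumerate(std_res) if abs(r) > 3]
print(flagged)
```

Only the planted observation lands beyond ±3 here. The same idea extends to models with several factors, where the leverage comes from the full design matrix instead of the one-predictor formula.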

The first Twilight DVD made over $124 million in the first 4 weeks, the second DVD brought in over $142 million, and the third made over $128 million, but this fourth DVD made only $81 million. Could it be that all of the Twihards are so over Bella and Edward by now that DVD sales have suffered? Or are they waiting for the last movie to come out so they can buy the extra-special super-duper deluxe edition that includes both Breaking Dawn films (and maybe even a novelty set of Edward Cullen fangs)?

I’ll go with the latter (the waiting part, not the fangs) since it seems unlikely that any Twihards who have stuck with it this long would bail out right before the grand finale. I decided to remove this observation from the analysis, leaving me with this final final model:

Look at that R-squared now! 

Had I been using this model to predict the outcome for some vital company process, I would have started from scratch with the full model and reduced it all over again. But since we’re only talking about predicting the success of The Hunger Games DVD, I think I’ll save myself the time and stick with this model, especially since all p-values are still significant and the R-squared statistic is so high.

Using the Statistical Model to Predict DVD Sales

And now it’s time to make my prediction. I collected The Hunger Games data for all of the significant factors:

  • Movie Domestic Gross = $406,079,037
  • Tomato Meter = 85%
  • Theaters = 4137
  • Budget = $78 million
  • Amazon.com Book Reviews = 8183
  • Movie Rating = PG-13

Then I plugged these values into Minitab. Per the Fit value, The Hunger Games DVD sales should exceed $330 million in the first four weeks after its release! And I can be 95% confident that the actual value will fall somewhere between $295 million and $370 million.
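To give a feel for where a fit and a 95% prediction interval come from, here is a minimal Python sketch for a one-predictor line. It is not the multi-factor model from this post: the data are hypothetical toy numbers, the function name is my own, and the t critical value (2.228 for 10 degrees of freedom) is passed in by hand from a t-table rather than computed.

```python
import math
from typing import List, Tuple


def prediction_interval(
    xs: List[float], ys: List[float], x_new: float, t_crit: float
) -> Tuple[float, float, float]:
    """Return (fit, lower, upper) for one new observation at x_new."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # residual standard error
    fit = intercept + slope * x_new
    # The leading 1 accounts for the scatter of a single new observation.
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    return fit, fit - half_width, fit + half_width


# Hypothetical toy data (12 points, so n - 2 = 10 degrees of freedom).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
ys = [4.1, 6.8, 10.2, 12.9, 16.3, 18.7, 22.4, 24.8, 28.1, 31.2, 33.9, 37.0]
T_CRIT_95_DF10 = 2.228  # two-sided 95% value from a t-table, 10 df

fit, lower, upper = prediction_interval(xs, ys, 13.0, T_CRIT_95_DF10)
print(round(fit, 2), round(lower, 2), round(upper, 2))
```

The interval is centered on the fit and widens as the new point moves away from the average of the observed x values. Its width also rests on the assumption that the new observation behaves like the data used to build the model.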

In summary, we started with a bunch of factors, whittled down the model to include the important ones, and then made our prediction. Now all we have to do is wait and see if, to coin a phrase, "the odds are ever in my favor."




Name: James • Wednesday, August 15, 2012

You "whittled down" the model, not "widdled down."

Name: Michelle Paret • Thursday, August 16, 2012

James, thank you for catching this and for reading so closely! It's now fixed.

Name: Dave • Friday, October 19, 2012

Came across this blog posting today. This was a great article about how to build predictive models. But I didn't see a follow up now that it is past the first 4 weeks of DVD sales.

I went to your resource site, the-numbers.com, to get the DVD sales and saw that the first four weeks had sales of $97 million. This is well short of your prediction of $330 million with a 95% CI of $295 to $370 million.

This may be a good example of listening to your outliers. You took out Twilight Breaking Dawn Part 1 because you had a theory that people were waiting for Part 2 before buying this one. However, this outlier may have helped better predict the Hunger Games DVD release.

In looking at the-numbers.com, they have the following note, "Sales tracking currently includes only the DVD versions of the movie. Blu-ray sales are not included at this time." I wonder if DVD sales fell way short because more people are buying Blu-Ray or digital copies. I would be curious to find out if Blu-Ray sales numbers are negatively correlated with DVD sales. Perhaps including a factor such as the average number of Blu-Ray players per household at time of release may account for the Twilight (and now Hunger Games) outliers.

However, great job in trying to predict the Hunger Games DVD sales. It's always fun to go back and see if your predictions were right or wrong.

Name: Michelle • Monday, October 22, 2012

Dave, thanks for reading. Unfortunately, the predicted DVD sales for The Hunger Games were well above the actual sales. So the model was not as good as I initially thought.

I like your idea about including Blu-Ray players per household in the model. And as you point out, this may also be a good example of listening to the outliers. Only two Harry Potter movies and the Twilight movies made more in DVD sales in the first 4 weeks. So The Hunger Games DVD did sell very well (not including Blu-Ray sales), just not as well as either the prediction or some Harry Potters and the Twilight DVDs.

I also think it's fun to go back and see if the predictions were right or wrong - it's just always more amusing when they were right!

Name: Lori • Thursday, December 13, 2012

I enjoy your blog and have learned from you. You make it interesting and fun. I am not familiar with Minitab, so I was wondering how you executed the last step of your analysis in Minitab: applying the predictive model to the known factors of budget, gross receipts, star rating, etc.

Name: Michelle • Friday, December 14, 2012

Lori, I'm glad you enjoyed the post. After coming up with the final model, I went to Stat > ANOVA > GLM > Options and entered the Hunger Games values under 'Prediction intervals for new observations' to get the predicted value. (Or, you could use Stat > Regression > General Regression > Prediction.)

If you have Minitab questions in the future and want immediate assistance, please don't hesitate to take advantage of our free tech support. Visit minitab.com/contacts for contact info.
