If you're one of the gazillion people who have read The Hunger Games, then you’re quite familiar with “Real or Not Real?” And if you haven't read it, I'm guessing you've at least heard about this best-selling trilogy.
The Hunger Games movie, like the book, has been a huge success, grossing over $400 million domestically. I recently saw an ad for the DVD to be released on August 18 and it got me thinking. Could I use statistics to predict DVD sales? If my model below is as good as it looks, then the answer is "yes."
Or should I say, “real”?
Creating a Statistical Model: Where to Begin?
I rarely read books – statistics books excluded, of course – and I rarely go to the movies, yet I've read the books and seen the films for all of Harry Potter and Twilight, and now The Hunger Games. Presuming I'm not an outlier, I decided to focus specifically on movies that are based on books. I also limited my data collection to the last 10 years because I didn’t want changes in DVD prices over time to skew my results (I used the data to justify this decision). Plus, DVD movies weren’t available in the U.S. until 1996, and it took a while for DVD players to become a known quantity in the average household.
So we’re looking to predict initial DVD sales (defined as those in the first 4 weeks after release) for books-turned-movies released in the last decade. Once I defined the scope of my data collection, I identified the following potentially important factors:
- Movie domestic gross
- Movie rating (PG, PG-13, R)
- Tomato meter – % of rottentomatoes.com critics who gave the movie a positive review
- Tomato audience – % of rottentomatoes.com users who gave the movie 3.5 stars or higher
- Whether or not the movie is a sequel
- Number of theaters showing the movie at widest release
- Movie budget
- Number of book reviews on Amazon.com
- Percentage of 5-star ratings for book reviews on Amazon.com
In addition to using www.rottentomatoes.com and www.amazon.com, I used www.boxofficemojo.com to collect data on Movie Domestic Gross, Theaters and Movie Budget, and www.the-numbers.com to collect data on DVD sales.
Finding Significant Factors in the Model
A common approach for finding the best statistical model is to include all of the factors and interactions you think might be important, analyze the data, remove the term with the highest p-value, then repeat the process until you're left with only significant factors and interactions. This is called "reducing the model."
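For readers who want to try this outside Minitab, the reduce-the-model loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses plain least squares via NumPy, it approximates each p-value with the normal distribution rather than the t distribution Minitab would use, and the column names and toy numbers are made up rather than taken from the DVD data.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Least-squares fit of y on X; returns (coefficients, approximate p-values)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)           # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # covariance of the estimates
    t = beta / np.sqrt(np.diag(cov))
    # Assumption: two-sided p-value from the normal approximation, not the exact t
    p = np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])
    return beta, p

def backward_eliminate(columns, y, alpha=0.05):
    """columns: dict of name -> 1-D array. Drop the highest-p term until all p < alpha."""
    kept = list(columns)
    while kept:
        X = np.column_stack([np.ones(len(y))] + [columns[name] for name in kept])
        _, p = ols_p_values(X, y)
        p_terms = p[1:]                        # ignore the intercept
        worst = int(np.argmax(p_terms))
        if p_terms[worst] < alpha:
            break                              # everything left is significant
        kept.pop(worst)                        # remove the least significant term
    return kept

# Made-up toy data: y depends on "signal" only; "noise" is unrelated to y.
signal = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
noise = np.array([1, 1, -1, -1, -1, -1, 1, 1], dtype=float)
error = 0.1 * np.array([1, -1, -1, 1, 1, -1, -1, 1], dtype=float)
y = 2 + 3 * signal + error
print(backward_eliminate({"signal": signal, "noise": noise}, y))  # prints ['signal']
```

With only a few dozen movies, the normal approximation will make p-values look slightly smaller than the exact ones, so treat the cutoff as a rough guide rather than a match for Minitab's output.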
I used Minitab's Stat > ANOVA > General Linear Model (or you could use Stat > Regression > General Regression) to run the analysis, which included two-way interactions (e.g., Theaters*Budget) that I thought might be important. To reduce the model, I started by removing the insignificant two-way interactions followed by the main effects.
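Removing interactions before main effects keeps the model hierarchical: a main effect stays as long as an interaction that contains it is still in the model. A small sketch of that rule follows; the function name and the term list are hypothetical, and the `A*B` notation simply mirrors the Theaters*Budget example above.

```python
def removable_terms(terms):
    """Return the terms that can be dropped without breaking model hierarchy.

    A main effect is protected while any interaction containing it remains.
    Interactions are written 'A*B'.
    """
    protected = set()
    for term in terms:
        factors = [f.strip() for f in term.split("*")]
        if len(factors) > 1:
            protected.update(factors)          # both parents of the interaction
    return [t for t in terms if t.strip() not in protected]

# Hypothetical term list, for illustration only
terms = ["Gross", "Theaters", "Budget", "Theaters*Budget"]
print(removable_terms(terms))  # prints ['Gross', 'Theaters*Budget']
```

Once the interaction itself is removed, its parent main effects become candidates for removal on the next pass.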
Here’s the model I ended up with. As you can see, all p-values but one are less than an alpha-level of 0.05 and thus are significant. The exception is Theaters: its p-value is not significant, but I left it in the model since it's part of interactions that are.
Wow! R-squared is very high at 92.77%, which is great. However, I noticed that Minitab flagged an unusual observation with a standardized residual of -3.30. Wary of any such values below -3 or above +3, I took a closer look at this row of data: it's for Twilight: Breaking Dawn (Part 1).
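If you're curious what sits behind a flag like that, here is a rough sketch of the usual calculation: each residual is divided by its estimated standard deviation, which depends on the row's leverage. It assumes an ordinary least-squares fit and the common "internally studentized" definition; the function names are made up for this example, and Minitab's output is the authority on what it actually reported.

```python
import numpy as np

def standardized_residuals(X, y):
    """Internally studentized residuals: e_i / (s * sqrt(1 - h_ii))."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s = np.sqrt(resid @ resid / (n - k))        # residual standard deviation
    hat = X @ np.linalg.inv(X.T @ X) @ X.T      # hat (projection) matrix
    leverage = np.diag(hat)                     # h_ii for each row
    return resid / (s * np.sqrt(1.0 - leverage))

def flag_unusual(X, y, cutoff=3.0):
    """Row indices whose standardized residual is below -cutoff or above +cutoff."""
    r = standardized_residuals(X, y)
    return [int(i) for i in np.flatnonzero(np.abs(r) > cutoff)]
```

Here `X` is the design matrix, including a column of ones for the intercept, and `y` holds the observed DVD sales. The cutoff of 3 matches the rule of thumb in the paragraph above.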
The first Twilight DVD made over $124 million in the first 4 weeks, the second DVD brought in over $142 million, and the third made over $128 million, but this fourth DVD made only $81 million. Could it be that all of the Twihards are so over Bella and Edward by now that DVD sales have suffered? Or are they waiting for the last movie to come out so they can buy the extra-special super-duper deluxe edition that includes both Breaking Dawn films (and maybe even a novelty set of Edward Cullen fangs)?
I’ll go with the latter (the waiting part, not the fangs) since it seems unlikely that any Twihards who have stuck with it this long would bail out right before the grand finale. I decided to remove this observation from the analysis, leaving me with this final final model:
Look at that R-squared now!
Had I been using this model to predict the outcome for some vital company process, I would have started from scratch with the full model and reduced it all over again. But since we’re only talking about predicting the success of The Hunger Games DVD, I think I’ll save myself the time and stick with this model, especially since all p-values are still significant and the R-squared statistic is so high.
Using the Statistical Model to Predict DVD Sales
And now it’s time to make my prediction. I collected The Hunger Games data for all of the significant factors:
- Movie Domestic Gross = $406,079,037
- Tomato Meter = 85%
- Theaters = 4137
- Budget = $78 million
- Amazon.com Book Reviews = 8183
- Movie Rating = PG-13
Then I plugged these values into Minitab. Per the Fit value, The Hunger Games DVD sales should exceed $330 million in the first four weeks after its release! And I can be 95% confident that the actual value will fall somewhere between $295 and $370 million.
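For reference, a range like this is typically a 95% prediction interval for a single new observation. Assuming that's what the software reported here (the output itself isn't reproduced in this post), the textbook form for a linear model is:

```latex
\hat{y}_0 \;\pm\; t_{0.975,\,n-p}\; s\,\sqrt{1 + x_0^{\top}\,(X^{\top}X)^{-1}\,x_0}
```

In this notation, \hat{y}_0 is the Fit value, s is the residual standard deviation, n is the number of movies in the data set, p is the number of estimated coefficients, X is the design matrix used to fit the model, and x_0 is the row of predictor values for The Hunger Games (including the 1 for the intercept and any interaction products). The "1 +" under the square root is what makes this interval wider than a confidence interval for the mean.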
In summary, we started with a bunch of factors, whittled down the model to include the important ones, and then made our prediction. Now all we have to do is wait and see if, to borrow a phrase, "the odds are ever in my favor."