Opening Ceremonies for Bubble Plots and Poisson Regression
Minitab Statistical Software includes a graphical analysis called the Bubble Plot.
This exploratory tool is great for visualizing the relationships among three variables on a single plot.
To see how it works, consider the total medal count by country from the recently completed 2014 Olympic Winter Games. Suppose I want to explore whether there might be an association between the number of medals a country won and its maximum elevation. For that, I could use a simple scatterplot, right?
But say I want to throw a third variable into the mix, such as GDP per capita. (After all, a new bobsled costs tens of thousands of dollars--a bit more than the dented plastic saucer I used as a kid. Even a top-of-the-line curling stone can set you back over $10,000. So maybe, just maybe, the wealth of a country may relate to its total medal count.)
To show these three variables simultaneously on the same plot, I'll choose Graph > Bubble Plot > Simple. Then I'll indicate which variables will be displayed on the plot as X, Y, and the bubble size.
When I click OK, Minitab displays the bubble plot below:
Tip: Depending on your data, you might want to change the relative size of the bubbles and add jitter (offset the points) to increase the legibility of the plot. The bubbles above have been slightly reduced in size and jittered.
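If you're curious what those two adjustments amount to outside of Minitab, here's a minimal Python sketch. It is not Minitab's method: the function names bubble_areas and jitter, the area range, and the sample values are all hypothetical, chosen only to illustrate the idea of rescaling the third variable to a bubble size and nudging overlapping points apart.

```python
import random

def bubble_areas(values, min_area=20.0, max_area=400.0):
    """Map the bubble-size variable to areas by linear rescaling (an assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: give every bubble the midpoint area
        return [(min_area + max_area) / 2.0 for _ in values]
    span = max_area - min_area
    return [min_area + span * (v - lo) / (hi - lo) for v in values]

def jitter(coords, amount, seed=0):
    """Offset each coordinate by a uniform random amount in [-amount, amount]."""
    rng = random.Random(seed)
    return [c + rng.uniform(-amount, amount) for c in coords]

# Hypothetical placeholder values, NOT the Sochi data
elevation_m = [900.0, 3000.0, 5000.0, 8800.0]
medals = [5, 5, 12, 12]

areas = bubble_areas(elevation_m)             # smallest value -> 20, largest -> 400
medals_jittered = jitter(medals, amount=0.3)  # tied counts no longer overlap exactly
```

Lists like these could then be handed to a plotting library; matplotlib's scatter function, for example, accepts marker areas through its s argument. Shrinking max_area plays the role of reducing the relative bubble size.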
How Do I Interpret All That Suds?
Interpret the X and Y variables on the horizontal and vertical scales as you would a scatterplot. On this plot, you can see the bubbles rising as you move from left to right. So it appears that as a country's GDP per capita increases, its total medal count increases.
There's one data point (bubble) that seems to buck this trend. Using the brushing feature, I can easily identify it in the worksheet: it's Russia, the top medal winner in the games but with a relatively low GDP per capita.
To explore how maximum elevation, which is the bubble-size variable, relates to the two other variables, look for a consistent change in the size of the bubbles as you move along either the vertical (Y) or the horizontal (X) axis.
As you move from left to right, the size of the bubbles seems fairly random. This suggests there's no strong relationship between the GDP per capita of a country and its maximum elevation, which makes perfect sense.
But as you move up the vertical scale, the bubbles generally seem to get slightly larger. Most of the smaller bubbles fall below the 10-medal mark. I've brushed the 3 bubbles that buck this trend--they're "outliers" by virtue of their different sizes compared to their neighbors.
The small bubble near the top right represents the Netherlands--a relatively flat country with a maximum elevation of only 887 meters--yet nevertheless a top medal winner. The two larger bubbles at the bottom left are China and Kazakhstan, whose tallest peaks, at 8850 m and 7010 m, respectively, are higher than those in any of the other medal-winning countries.
The bubble plot shows some interesting descriptive trends for this set of data. But suppose this data had been a random sample collected from a larger population. To analyze these relationships statistically, and determine whether they hold for the entire population, you’d need to perform a more rigorous analysis.
That is, are the associations between GDP per capita and medal count, and maximum elevation and medal count, statistically significant in a regression analysis?
Caveat: Because regression is an inferential analysis, it requires a random sample of data from a population. For the sake of illustration, we'll pretend the Sochi Olympic data is a representative sample of all the modern Winter Olympic games. We don't know that that's the case, obviously, so consider the results speculative, at best.
New in 17: Poisson Regression Analysis
Notice that the response variable in this case is total medal count, which, strictly speaking, is not continuous data. (You can't win 3.37 Olympic medals.) That means that standard linear regression, which is typically performed on a continuous response variable, is not the best tool to analyze these data.
Luckily, Release 17 of Minitab now includes Poisson regression analysis, which is specifically designed to evaluate the relationship between explanatory variables and a count response.
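For a rough idea of what fitting such a model involves, here's a small pure-Python sketch of a Poisson regression with a log link, fit by Newton's method. This is a textbook approach, not a description of Minitab's internals. The function names (solve, fit_poisson) and the tiny data set are hypothetical and have nothing to do with the Sochi worksheet.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_poisson(rows, y, iterations=25):
    """Fit log(mean count) = b0 + b1*x1 + b2*x2 + ... by Newton's method."""
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    p = len(X[0])
    # Start from the intercept-only model: b0 = log(mean of y)
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)
    for _ in range(iterations):
        mu = [math.exp(sum(b * v for b, v in zip(beta, xi))) for xi in X]
        # Score vector X'(y - mu) and information matrix X' diag(mu) X
        score = [sum(xi[j] * (yi - mi) for xi, yi, mi in zip(X, y, mu))
                 for j in range(p)]
        info = [[sum(xi[j] * xi[k] * mi for xi, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Hypothetical predictors (two per row) and counts, for illustration only
predictors = [(1.0, 0.5), (2.0, 1.0), (3.0, 0.7), (4.0, 2.0), (5.0, 1.5), (6.0, 3.0)]
counts = [2, 2, 4, 5, 7, 10]

beta_hat = fit_poisson(predictors, counts)
print(beta_hat)  # [intercept, coefficient for x1, coefficient for x2]
```

Because of the log link, each coefficient acts multiplicatively: a one-unit increase in a predictor multiplies the expected count by exp(coefficient). The sketch runs a fixed number of iterations with no convergence check, which is fine for a toy example but not for real work.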
To evaluate how maximum elevation and GDP per capita relate to total medal count, choose Stat > Regression > Poisson Regression > Fit Poisson Model. Minitab produces the following output:
Poisson Regression Analysis: Medals versus Max Elevation (m), GDP/capita

Source               DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression            2    78.61    39.306       78.61    0.000
  Max Elevation (m)   1    21.46    21.457       21.46    0.000
  GDP/capita          1    60.45    60.446       60.45    0.000
Error                23   112.72     4.901
Total                25   191.33

  R-Sq  R-Sq(adj)     AIC
41.09%     40.04%  219.32

It appears that both maximum elevation and GDP per capita are significant predictors of medal count, with the model explaining about 41% of the deviance in medal count (about 40% after adjusting for the number of predictors).
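You can check the R-Sq figures against the deviance table yourself. The short sketch below uses only numbers from the output above. The adjusted formula is my assumption (it penalizes the error deviance by the number of predictors), though it does reproduce the reported value.

```python
# Values copied from the deviance table in this post
regression_dev = 78.61
error_dev = 112.72
total_dev = 191.33
n_predictors = 2

# Deviance R-squared: share of the total deviance the model accounts for
r_sq = 1 - error_dev / total_dev

# Assumed adjustment: add the number of predictors to the error deviance
r_sq_adj = 1 - (error_dev + n_predictors) / total_dev

print(f"R-Sq = {r_sq:.2%}, R-Sq(adj) = {r_sq_adj:.2%}")
# prints: R-Sq = 41.09%, R-Sq(adj) = 40.04%
```

Note that the regression and error deviances (78.61 + 112.72) add up to the total of 191.33, much as sums of squares do in ordinary regression.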
Unfortunately, due to overdispersion, this Poisson model shows a significant lack of fit. To improve the model fit, you need to identify additional significant predictors and add them to the model, which you can do by being a kid again--and chasing the trail of floating bubbles on the bubble plot.
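One quick, informal way to spot overdispersion is to divide the error deviance by its degrees of freedom. Under a well-fitting Poisson model that ratio should sit somewhere near 1; this is a common rule of thumb rather than a formal test, and the helper name dispersion_ratio is just something I made up for the sketch.

```python
def dispersion_ratio(error_deviance, error_df):
    """Error deviance per degree of freedom; values well above 1 hint at overdispersion."""
    return error_deviance / error_df

# Error row of the deviance table in this post
ratio = dispersion_ratio(112.72, 23)
print(round(ratio, 3))  # prints 4.901, matching the Adj Mean for Error
```

A ratio near 5 means the medal counts vary far more than the model expects from these two predictors alone.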
Try It Yourself
If you'd like to experiment with the bubble plot, Poisson regression, and other features in Minitab, download the free 30-day trial. After downloading the trial version, click here to get the data used in this post. Have fun!