# Opening Ceremonies for Bubble Plots and Poisson Regression

Minitab Statistical Software includes a graphical analysis called the Bubble Plot.

This exploratory tool is great for visualizing the relationships among three variables on a single plot.

To see how it works, consider the total medal count by country from the recently completed 2014 Olympic Winter Games. Suppose I want to explore whether there is an association between the number of medals a country won and its maximum elevation. For that, I could use a simple scatterplot, right?

But say I want to throw a third variable into the mix, such as GDP per capita. (After all, a new bobsled costs tens of thousands of dollars--a bit more than the dented plastic saucer I used as a kid. Even a top-of-the-line curling stone can set you back over \$10,000. So maybe, just maybe, the wealth of a country may relate to its total medal count.)

To show these three variables simultaneously on the same plot, I'll choose Graph > Bubble Plot > Simple. Then I'll indicate which variables will be displayed on the plot as X, Y, and the bubble size.

When I click OK, Minitab displays the bubble plot below:

Tip: Depending on your data, you might want to change the relative size of the bubbles and add jitter (offset the points) to increase the legibility of the plot. The bubbles above have been slightly reduced in size and jittered.
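If you'd like to see the idea behind that tip outside of Minitab, here's a rough Python sketch. It maps a third variable to bubble areas and nudges overlapping points apart. The function names `bubble_areas` and `jitter`, the area range, and the linear scaling rule are my own assumptions for illustration; this is not how Minitab sizes its bubbles. The elevations are the three quoted later in this post, and the overlapping positions are made-up placeholders.

```python
import random


def bubble_areas(values, min_area=20.0, max_area=400.0):
    """Linearly rescale values so the smallest gets min_area and the largest max_area."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: give every bubble the same mid-sized area.
        return [(min_area + max_area) / 2.0 for _ in values]
    scale = (max_area - min_area) / (hi - lo)
    return [min_area + (v - lo) * scale for v in values]


def jitter(values, amount, seed=0):
    """Offset each value by a random amount in [-amount, +amount]."""
    rng = random.Random(seed)  # fixed seed so the offsets are repeatable
    return [v + rng.uniform(-amount, amount) for v in values]


# Maximum elevations (m) for the Netherlands, China, and Kazakhstan.
elevations_m = [887, 8850, 7010]
areas = bubble_areas(elevations_m)

# Hypothetical overlapping positions, not real data.
x_positions = [10.0, 10.0, 10.0]
x_jittered = jitter(x_positions, amount=0.5)

print([round(a, 1) for a in areas])
print(x_jittered)
```

If you have matplotlib installed, the resulting list can be passed as the `s` argument of `scatter(x, y, s=areas)`, which interprets it as marker areas.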

## How Do I Interpret All That Suds?

Interpret the X and Y variables on the horizontal and vertical scales as you would a scatterplot. On this plot, you can see the bubbles rising as you move from left to right on the plot. So it appears that as the country's GDP per capita increases, the total medal count increases.

There's one data point (bubble) that seems to buck this trend. Using the brushing feature allows me to easily identify it in the worksheet--it's Russia, the top medal winner in the games, but with a relatively low GDP per capita.

To explore how maximum elevation, which is the bubble-size variable, relates to the two other variables, look for a consistent change in the size of the bubbles as you move along either the vertical (Y) or the horizontal (X) axis.

As you move from left to right, the size of the bubbles seems fairly random. This suggests there's no strong relationship between the GDP per capita of a country and its maximum elevation, which makes perfect sense.

But as you move up the vertical scale, the bubbles do seem to get slightly larger. Most of the smaller bubbles fall below the 10-medal mark. I've brushed the 3 bubbles that buck this trend--they're "outliers" by virtue of their different sizes compared to their neighbors.

The small bubble near the top right represents the Netherlands--a relatively flat country with a maximum elevation of only 887 meters--yet nevertheless a top medal winner. The two larger bubbles at the bottom left are China and Kazakhstan, whose tallest peaks, at 8850 m and 7010 m, respectively, are higher than those in any of the other medal-winning countries.

The bubble plot shows some interesting descriptive trends for this set of data. But suppose this data had been a random sample collected from a larger population. To analyze these relationships statistically, and determine whether they hold for the entire population, you’d need to perform a more rigorous analysis.

That is, are the associations between GDP per capita and medal count, and maximum elevation and medal count, statistically significant in a regression analysis?

Caveat: Because regression is an inferential analysis, it requires a random sample of data from a population. For the sake of illustration, we'll pretend the Sochi Olympic data is a representative sample of all the modern Winter Olympic games. We don't know that that's the case, obviously, so consider the results speculative, at best.

## New in 17: Poisson Regression Analysis

Notice that the response variable in this case is total medal counts which, strictly speaking, is not continuous data. (You can't win 3.37 Olympic medals.) That means that standard linear regression, which is typically performed on a continuous response variable, is not the best tool to analyze these data.

Luckily, Release 17 of Minitab now includes Poisson regression analysis, which is specifically designed to evaluate the relationship between explanatory variables and a count response.
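For readers who like to see the model written out, here is the usual textbook form of a Poisson regression with these two predictors. I'm assuming the natural log link, which is the standard choice for count data; the subscripted predictor names are just labels I've made up for the two worksheet columns.

$$
Y_i \sim \mathrm{Poisson}(\mu_i), \qquad \ln \mu_i = \beta_0 + \beta_1\,\text{MaxElevation}_i + \beta_2\,\text{GDPperCapita}_i
$$

Here $Y_i$ is the medal count for country $i$ and $\mu_i$ is its expected medal count. Because the model is linear on the log scale, a one-unit increase in a predictor multiplies the expected count by $e^{\beta}$ rather than adding a fixed amount.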

To evaluate the relationship of maximum elevation and GDP per capita to total medal count, choose Stat > Regression > Poisson Regression > Fit Poisson Model. Minitab produces the following output:

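-----------------------------------------

Poisson Regression Analysis: Medals versus Max Elevation (m), GDP/capita

Deviance Table

| Source | DF | Adj Dev | Adj Mean | Chi-Square | P-Value |
|---|---:|---:|---:|---:|---:|
| Regression | 2 | 78.61 | 39.306 | 78.61 | 0.000 |
| Max Elevation (m) | 1 | 21.46 | 21.457 | 21.46 | 0.000 |
| GDP/capita | 1 | 60.45 | 60.446 | 60.45 | 0.000 |
| Error | 23 | 112.72 | 4.901 | | |
| Total | 25 | 191.33 | | | |

Model Summary

| Deviance R-Sq | Deviance R-Sq(adj) | AIC |
|---:|---:|---:|
| 41.09% | 40.04% | 219.32 |

-------------------------------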

It appears that both maximum elevation and GDP per capita are significant predictors of medal count, with the model explaining about 40% of the deviance in medal counts.
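If you want to check those figures by hand, here's a small Python sketch that recomputes them from the deviance values in the output. Two caveats: the adjusted formula (adding the number of predictors to the error deviance) is my assumption, chosen because it reproduces the reported 40.04%, and the helper `chi2_sf` is a hypothetical name that only handles 1 and 2 degrees of freedom, where the chi-square tail has a simple closed form.

```python
import math

# Deviance values copied from the Deviance Table in the output.
regression_dev = 78.61
error_dev = 112.72
total_dev = 191.33
n_predictors = 2

# Share of the total deviance accounted for by the model.
r_sq = regression_dev / total_dev

# Assumed adjustment: penalize the error deviance by the number of predictors.
r_sq_adj = 1 - (error_dev + n_predictors) / total_dev


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, for df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch only covers df = 1 and df = 2")


p_regression = chi2_sf(78.61, 2)
p_elevation = chi2_sf(21.46, 1)
p_gdp = chi2_sf(60.45, 1)

print(f"Deviance R-Sq:      {100 * r_sq:.2f}%")
print(f"Deviance R-Sq(adj): {100 * r_sq_adj:.2f}%")
print(p_regression, p_elevation, p_gdp)
```

All three p-values come out far below 0.0005, which is why they display as 0.000 at three decimal places.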

Unfortunately, due to overdispersion, this Poisson model shows a significant lack of fit. To improve the model fit, you need to identify additional significant predictors and add them to the model. You can do that by being a kid again--and chasing the trail of floating bubbles on the bubble plot.
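One quick way to see the overdispersion in the output is to divide the error deviance by its degrees of freedom. This is a common rule-of-thumb check rather than a formal test, and the variable names are my own. When a Poisson model fits well, that ratio should land somewhere near 1.

```python
# Error row of the Deviance Table: deviance and degrees of freedom.
error_dev = 112.72
error_df = 23

# Rough dispersion estimate; values well above 1 point to overdispersion.
dispersion = error_dev / error_df

print(f"Deviance / DF = {dispersion:.3f}")
```

The result matches the Adj Mean reported for the Error row, and it is nearly five times what a well-fitting Poisson model would give.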

## Try It Yourself

Name: Sudipto • Monday, March 3, 2014

I ran the analysis and found that the residual vs order plot shows a downward trend. How should I adjust the model?

Name: Patrick • Tuesday, March 4, 2014

Glad to hear you tried out the new bubble plot and the Poisson regression analysis using the sample data! And good for you for checking the assumptions of the regression analysis by displaying the residual plots.

Remember, as explained in the caveat of the post, the data set is not actually a random sample. Therefore, the residual vs order plot is showing an order effect--as it should. If you look in the worksheet, you'll see that the data is ordered by number of total medals won, from greatest to least. That's what's creating the order effect you see.

If you wanted to use the data to illustrate a Poisson model that did not violate the randomness assumption, simply reorder the values in the worksheet so they are not listed by the size of the medal count.

Here's an easy way to do that using the Sort command. With the worksheet active, choose Data > Sort. In Sort column(s), enter columns C1-C4. In By column, enter C1 (Country). Near the bottom, select Store sorted data in Original column(s). Then click OK.

Minitab sorts the rows alphabetically by country name. Because the rows are no longer listed by total medals, the order effect disappears. Now re-run the Poisson regression analysis. The residuals vs order plot will no longer show the downward trend you originally saw.
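If you're working outside Minitab, the same reordering takes a couple of lines of Python. This is only a sketch: the `rows` list is a stand-in for the worksheet, holding just the three countries and maximum elevations quoted in the post.

```python
# (country, maximum elevation in meters), in the worksheet's original order.
rows = [
    ("Netherlands", 887),
    ("China", 8850),
    ("Kazakhstan", 7010),
]

# Sort whole rows by country name so each elevation stays with its country.
sorted_rows = sorted(rows, key=lambda row: row[0])

for country, elevation_m in sorted_rows:
    print(country, elevation_m)
```

Sorting the rows as tuples keeps every value attached to its country, which is what entering all four columns in Sort column(s) does in Minitab.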

Thanks for reading, commenting and trying this out!