Using Marginal Plots, aka "Stuffed-Crust Charts"
In my last post, we took the red pill and dove deep into the unarguably fascinating and uncompromisingly compelling world of the matrix plot. I've stuffed this post with information about a topic of marginal interest...the marginal plot.
Margins are important. Back in my English composition days, I recall that margins were particularly prized for the inverse linear relationship they maintained with the number of words that one had to string together to complete an assignment. Mathematically, that relationship looks something like this:
Bigger margins = fewer words
In stark contrast to my concept of margins as information-free zones, the marginal plot actually utilizes the margins of a scatterplot to provide timely and important information about your data. Think of the marginal plot as the stuffed-crust pizza of the graph world. Only, instead of extra cheese, you get to bite into extra data. And instead of filling your stomach with carbs and cholesterol, you're filling your brain with data and knowledge. And instead of arriving late and cold because the delivery driver stopped off to canoodle with his girlfriend on his way to your house (even though he's just not sure if the relationship is really working out: she seems distant lately and he's not sure if it's the constant cologne of consumables about him, or the ever-present film of pizza grease on his car seats, on his clothes, in his ears?)
...anyway, unlike a cold, late pizza, marginal plots are always fresh and hot, because you bake them yourself, in Minitab Statistical Software.
I tossed some randomly generated data around and came up with this half-baked example. Like the pepperonis on a hastily prepared pie, the points on this plot are mostly piled in the middle, with only a few slices venturing to the edges. In fact, some of those points might be outliers.
If only there were an easy, interesting, and integrated way to assess the data for outliers when we make a scatterplot.
Boxplots are a useful way to look for outliers. You could make separate boxplots of each variable, like so:
It's fairly easy to relate the boxplot of C1 to the values plotted on the y-axis of the scatterplot. But it's a little harder to relate the boxplot of C2 to the scatterplot, because the y-axis on the boxplot corresponds to the x-axis on the scatterplot. You can transpose the scales on the boxplot to make the comparison a little easier. Just double-click one of the axes and select Transpose value and category scales:
That's a little better. The only thing that would be even better is if you could put each boxplot right up against the scatterplot...if you could stuff the crust of the scatterplot with boxplots, so to speak. Well, guess what? You can! Just choose Graph > Marginal Plot > With Boxplots, enter the variables and click OK:
Not only are the boxplots nestled right up next to the scatterplot, but they also share the same axes as the scatterplot. For example, the outlier (asterisk) on the boxplot of C2 corresponds to the point directly below it on the scatterplot. Looks like that point could be an outlier, so you might want to investigate further.
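If you're curious how a boxplot decides which points earn an asterisk, here is a minimal Python sketch of the common 1.5 x IQR rule: a point is flagged when it sits more than 1.5 interquartile ranges beyond the edge of the box. The data values and the function name boxplot_outliers are made up for illustration, and the quartile method is an assumption, so the cutoffs may differ slightly from what your software reports.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return the values flagged by the k * IQR boxplot rule."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    # Assumption: Python's default "exclusive" quartile method is used here.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical data: mostly piled in the middle, one slice at the edge.
c2 = [9.1, 9.8, 10.0, 10.2, 10.4, 10.5, 10.9, 11.3, 11.6, 17.5]
print(boxplot_outliers(c2))
```

A flagged point is only a candidate for investigation, not proof that anything is wrong with it.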
Marginal plots can also help alert you to other important complexities in your data. Here's another half-baked example. Unlike our pizza delivery guy's relationship with his girlfriend, it looks like the relationship between the fake response and the fake predictor represented in this scatterplot really is working out:
In fact, if you use Stat > Regression > Fitted Line Plot, the fitted line appears to fit the data nicely. And the regression analysis is highly significant:
Regression Analysis: Fake Response versus Fake Predictor

The regression equation is
Fake Response = 2.151 + 0.7723 Fake Predictor

S = 2.12304   R-Sq = 50.3%   R-Sq(adj) = 49.7%

Analysis of Variance
Source      DF       SS       MS      F      P
Regression   1  356.402  356.402  79.07  0.000
Error       78  351.568    4.507
Total       79  707.970
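The summary statistics in that output all come from the sums of squares and degrees of freedom in the Analysis of Variance table. As a quick check, this short Python sketch recomputes them using the standard textbook formulas. The numbers are copied from the table; the variable names are mine.

```python
import math

# Values copied from the Analysis of Variance table above.
ss_regression, df_regression = 356.402, 1
ss_error, df_error = 351.568, 78
ss_total, df_total = 707.970, 79

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

f_statistic = ms_regression / ms_error           # F = MS regression / MS error
s = math.sqrt(ms_error)                          # S = square root of MS error
r_sq = ss_regression / ss_total                  # share of variation explained
r_sq_adj = 1 - ms_error / (ss_total / df_total)  # adjusted for model size

print(f"F = {f_statistic:.2f}")
print(f"S = {s:.5f}")
print(f"R-Sq = {r_sq:.1%}")
print(f"R-Sq(adj) = {r_sq_adj:.1%}")
```

The results agree with the reported F, S, R-Sq, and R-Sq(adj) to the precision shown in the output.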
But wait. If you create a marginal plot instead, you can augment your exploration of these data with histograms and/or dotplots, as I have done below. Looks like there's trouble in paradise:
Like the poorly made pepperoni pizza, the points on our plot are distributed unevenly. There appear to be two clumps of points. The distribution of values for the fake predictor is bimodal: that is, it has two distinct peaks. The distribution of values for the response may also be bimodal.
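To make "two distinct peaks" concrete, here is a rough Python sketch that prints a text histogram, which is essentially what the marginal histogram shows for one variable. The values below are invented for illustration (they are not the data from this post), and text_histogram is a hypothetical helper name.

```python
import math
from collections import Counter


def text_histogram(values, bin_width=1.0):
    """Return one line of asterisks per bin, labeled by the bin's lower edge."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    lines = []
    for b in range(min(counts), max(counts) + 1):
        # Counter returns 0 for empty bins, so gaps show up as blank rows.
        lines.append(f"{b * bin_width:5.1f} | {'*' * counts[b]}")
    return lines


# Hypothetical predictor values with one clump near 8 and another near 13.
predictor = [6.2, 7.1, 7.4, 7.8, 8.0, 8.3, 8.5, 8.9, 9.4, 10.6,
             11.5, 12.2, 12.7, 13.0, 13.1, 13.4, 13.8, 14.3, 15.1]
for line in text_histogram(predictor):
    print(line)
```

The tallest rows sit at 8 and 13 with a dip between them, which is the bimodal shape to watch for in the margins.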
Why is this important? Because the two clumps of toppings may suggest that you have more than one metaphorical cook in the metaphorical pizza kitchen. For example, it could be that Wendy, who is left-handed, started placing the pepperonis carefully on the pie and then got called away, leaving Jimmy, who is right-handed, to quickly and carelessly complete the covering of cured meats. In other words, it could be that the two clumps of points represent two very different populations.
When I tossed and stretched the data for this example, I took random samples from two different populations. I used 40 random observations from a normal distribution with a mean of 8 and a standard deviation of 1.5, and 40 random observations from a normal distribution with a mean of 13 and a standard deviation of 1.75. The two clumps of data are truly from two different populations. To illustrate, I separated the two populations into two different groups in this scatterplot:
This is a classic conundrum that can occur when you do a regression analysis. The regression line tries to pass through the center of the data. And because there are two clumps of data, the line tries to pass through the center of each clump. This looks like a relationship between the response and the predictor, but it's just an illusion created by the gap between the two populations. If you separate the clumps and analyze each population separately, you discover that there is no significant relationship in either one (p = 0.446 and p = 0.741):
Regression Analysis: Fake Response 1 versus Fake Predictor 1

The regression equation is
Fake Response 1 = 9.067 - 0.1600 Fake Predictor 1

S = 1.64688   R-Sq = 1.5%   R-Sq(adj) = 0.0%

Analysis of Variance
Source      DF       SS       MS     F      P
Regression   1    1.609  1.60881  0.59  0.446
Error       38  103.064  2.71221
Total       39  104.673

Regression Analysis: Fake Response 2 versus Fake Predictor 2

The regression equation is
Fake Response 2 = 12.09 + 0.0532 Fake Predictor 2

S = 1.62074   R-Sq = 0.3%   R-Sq(adj) = 0.0%

Analysis of Variance
Source      DF       SS       MS     F      P
Regression   1    0.291  0.29111  0.11  0.741
Error       38   99.818  2.62679
Total       39  100.109
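If you want to watch the illusion appear for yourself, here is a rough Python sketch that re-creates the experiment in spirit. It assumes that, within each population, the predictor and the response are drawn independently from the same normal distribution (mean 8, standard deviation 1.5 for the first; mean 13, standard deviation 1.75 for the second). That detail is my assumption, and the seed is arbitrary, so the numbers will not match the output above exactly. The helper names r_squared and simulate are hypothetical.

```python
import random


def r_squared(x, y):
    """R-squared of a simple least-squares line (the squared correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def simulate(n_per_group=40, seed=1):
    """Return [(x, y), (x, y)]: one unrelated predictor/response pair per population."""
    rng = random.Random(seed)
    groups = []
    for mean, sd in [(8, 1.5), (13, 1.75)]:
        # Assumption: predictor and response are independent within a population.
        x = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        y = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        groups.append((x, y))
    return groups


groups = simulate()
pooled_x = groups[0][0] + groups[1][0]
pooled_y = groups[0][1] + groups[1][1]
print(f"pooled R-sq:  {r_squared(pooled_x, pooled_y):.3f}")
for i, (x, y) in enumerate(groups, start=1):
    print(f"group {i} R-sq: {r_squared(x, y):.3f}")
```

Under these assumptions, the pooled R-squared should typically land somewhere near one half, while each population on its own should show an R-squared close to zero. The only "relationship" is the distance between the two clumps.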
If only our unfortunate pizza delivery technician could somehow use a marginal plot to help him assess the state of his own relationship. But alas, I don't think a marginal plot is going to help with that particular analysis. Where is that guy anyway? I'm getting hungry.