dcsimg
 

So Why Is It Called "Regression," Anyway?

Did you ever wonder why statistical analyses and concepts often have such weird, cryptic names?

One conspiracy theory points to the workings of a secret committee called the ICSSNN. The International Committee for Sadistic Statistical Nomenclature and Numerophobia was formed solely to befuddle and subjugate the masses. Its mission: To select the most awkward, obscure, and confusing name possible for each statistical concept.

A whistle-blower recently released the following transcript of a secretly recorded ICSSNN meeting:

"This statistical analysis seems pretty straightforward…"

“What does it do?”

“It describes the relationship between one or more 'input' variables and an 'output' variable. It gives you an equation to predict values for the 'output' variable, by plugging in values for the input variables."

“Oh dear. That sounds disturbingly transparent.”

“Yes. We need to fix that—call it something grey and nebulous. What do you think of 'regression'?”

“What’s 'regressive' about it? 

“Nothing at all. That’s the point!”

Re-gres-sion. It does sound intimidating. I’d be afraid to try that alone.”

“Are you sure it’s completely unrelated to anything?  Sounds a lot like 'digression.' Maybe it’s what happens when you add up umpteen sums of squares…you forget what you were talking about.”

“Maybe it makes you regress and relive your traumatic memories of high school math…until you  revert to a fetal position?”

“No, no. It’s not connected with anything concrete at all.”

“Then it’s perfect!”

 “I don’t know...it only has 3 syllables. I’d feel better if it were at least 7 syllables and hyphenated.”

“I agree. Phonetically, it’s too easy…people are even likely to pronounce it correctly. Could we add an uvular fricative, or an interdental retroflex followed by a sustained turbulent trill?”

The Real Story: How Regression Got Its Name

Conspiracy theories aside, the term “regression” in statistics was probably not a result of the workings of the ICSSNN. Instead, the term is usually attributed to Sir Francis Galton.

Galton was a 19th century English Victorian who wore many hats: explorer, inventor, meteorologist, anthropologist, and—most important for the field of statistics—an inveterate measurement nut. You might call him a statistician’s statistician. Galton just couldn’t stop measuring anything and everything around him.

During a meeting of the Royal Geographical Society, Galton devised a way to roughly quantify boredom: he counted the number of fidgets of the audience in relation to the number of breaths he took (he didn’t want to attract attention using a timepiece). Galton then converted the results on a time scale to obtain a mean rate of 1 fidget per minute per person. Decreases or increases in the rate could then be used to gauge audience interest levels. (That mean fidget rate was calculated in 1885. I’d guess the mean fidget rate is astronomically higher today—especially if glancing at an electronic device counts as a fidget.)

Galton also noted the importance of considering sampling bias in his fidget experiment:

“These observations should be confined to persons of middle age. Children are rarely still, while elderly philosophers will sometimes remain rigid for minutes.”

But I regress…

Galton was also keenly interested in heredity. In one experiment, he collected data on the heights of 205 sets of parents with adult children. To make male and female heights directly comparable, he rescaled the female heights, multiplying them by a factor 1.08. Then he calculated the average of the two parents' heights (which he called the “mid-parent height”) and divided them into groups based on the range of their heights. The results are shown below, replicated on a Minitab graph.

For each group of parents, Galton then measured the heights of their adult children and plotted their median heights on the same graph.

Galton fit a line to each set of heights, and added a reference line to show the average adult height (68.25 inches).

Like most statisticians, Galton was all about deviance. So he represented his results in terms of deviance from the average adult height.

Based on these results, Galton concluded that as heights of the parents deviated from the average height (that is as they became taller or shorter than the average adult), their children tended to be less extreme in height. That is, the heights of the children regressed to the average height of an adult.

He calculated the rate of regression as 2/3 of the deviance value. So if the average height of the two parents was, say, 3 inches taller than the average adult height, their children would tend to be (on average) approximately 2/3*3 = 2 inches taller than the average adult height.

Galton published his results in a paper called “Regression towards Mediocrity in Hereditary Stature.

So here’s the irony: The term regression, as Galton used it, didn't refer to the statistical procedure he used to determine the fit lines for the plotted data points. In fact, Galton didn’t even use the least-squares method that we now most commonly associate with the term “regression.” (The least-squares method had already been developed some 80 years previously by Gauss and Legendre, but wasn’t called “regression” yet.) In his study, Galton just "eyeballed" the data values to draw the fit line.

For Galton, “regression” referred only to the tendency of extreme data values to "revert" to the overall mean value. In a biological sense, this meant a tendency for offspring to revert to average size ("mediocrity") as their parentage became more extreme in size. In a statistical sense, it meant that, with repeated sampling, a variable that is measured to have an extreme value the first time tends to be closer to the mean when you measure it a second time. 

Later, as he and other statisticians built on the methodology to quantify correlation relationships and to fit lines to data values, the term “regression” become associated with the statistical analysis that we now call regression. But it was just by chance that Galton's original results using a fit line happened to show a regression of heights. If his study had showed increasing deviance of childrens' heights from the average compared to their parents, perhaps we'd be calling it "progression" instead.

So, you see, there’s nothing particularly “regressive” about a regression analysis.

And that makes the ICSSNN very happy.

Don't Regress....Progress

Never let intimidating terminology deter you from using a statistical analysis. The sign on the door is often much scarier than what's behind it. Regression is an intuitive, practical statistical tool with broad and powerful applications.

If you’ve never performed a regression analysis before, a good place to start is the Minitab Assistant. See Jim Frost’s post on using the Assistant to perform a multiple regression analysis. Jim has also compiled a helpful compendium of blog posts on regression.

And don’t forget Minitab Help. In Minitab, choose Help > Help. Then click Tutorials > Regression, or  Stat Menu >  Regression.

Sources

Bulmer, M. Francis Galton: Pioneer or Heredity and Biometry. Johns Hopkins University Press, 2003.

Davis, L. J. Obsession: A History. University of Chicago Press, 2008.

Galton, F. “Regression towards Mediocrity in Hereditary Stature.”  http://galton.org/essays/1880-1889/galton-1886-jaigi-regression-stature.pdf

Gillham, N. W. A  Life of Sir Francis Galton. Oxford University Press, 2001.

Gould, S. J. The Mismeasure of Man. W. W. Norton, 1996.

7 Deadly Statistical Sins Even the Experts Make

Do you know how to avoid them?

Get the facts >

Comments

blog comments powered by Disqus