How Olympic Judging Stacks Up, Part I

Minitab Blog Editor 10 August, 2012

You may have read my recent article applying statistical analysis to how judges evaluated performances at two previous Olympic events. If so, perhaps you found yourself wondering how other events stack up…

Anticipating a desire to see the “cleanest” and “dirtiest” judging performances, I pulled all of the data I could find on every event from the 2008 Beijing Olympics that is judged on a continuous scale. In this post, I will examine which events* showed judging bias, and quantify that bias to provide some comparison between sports.

* Data on the individual judging scores could not be found for the synchronized diving events, Men’s and Women’s Vault, Men’s and Women’s All-Around Gymnastics, or any Rhythmic Gymnastics events. I would be very interested if anyone has this data.

Before we dive in, remember that R-squared is the percentage of the Total Sum of Squares that is accounted for by the terms in the model. Or, put more loosely, it's the percentage of variation in the data explained by the model. I'll use R-squared to state what percentage of the variation in scores for an event is accounted for by judging bias by taking the Sum of Squares for any significant judging terms and dividing by the Total Sum of Squares. I’ll call that number “Judging Contribution”.
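
For readers who want to reproduce that calculation outside of Minitab (where this analysis was actually done), here is a minimal sketch in Python using pandas and statsmodels. It assumes hypothetical long-format data with one row per score and columns named Athlete, Judge, and Score; the file name and column names are illustrative, not from the original analysis.

```python
# A minimal sketch of the Judging Contribution calculation, assuming
# long-format data with columns Athlete, Judge, and Score (hypothetical
# names; the original analysis was run in Minitab, not Python).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

scores = pd.read_csv("event_scores.csv")  # hypothetical input file

# Model each score by athlete, judge, and their interaction. With only one
# score per judge-athlete pair the interaction cannot be tested -- those are
# the NA entries in the table below.
model = ols("Score ~ C(Athlete) + C(Judge) + C(Athlete):C(Judge)",
            data=scores).fit()
anova = sm.stats.anova_lm(model, typ=2)

total_ss = anova["sum_sq"].sum()  # Total Sum of Squares (model + residual)

# Sum the Sum of Squares for judging terms significant at alpha = 0.05.
judge_terms = ["C(Judge)", "C(Athlete):C(Judge)"]
judge_ss = sum(anova.loc[t, "sum_sq"] for t in judge_terms
               if t in anova.index and anova.loc[t, "PR(>F)"] < 0.05)

print(f"Judging Contribution: {100 * judge_ss / total_ss:.1f}%")
```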

Below is a table of all events analyzed, with the following information:

  • Judging Contribution
  • P-value for the “Judge” term if it was significant (an asterisk indicates it was not significant)
  • P-value for the “Judge*Athlete” interaction if it was significant (an asterisk indicates not significant; NA means the data do not allow that term to be tested)
  • R-Sq (adj) for the model

| Event | Judging Contribution | Judge p-value | Judge*Athlete p-value | R-Sq (adj) |
| --- | --- | --- | --- | --- |
| Diving - Men's 10M Platform | 0.0% | * | * | 89.63% |
| Diving - Women's 10M Platform | 0.0% | * | * | 94.36% |
| Diving - Men's 3M Springboard | 0.9% | 0.000 | * | 86.03% |
| Diving - Women's 3M Springboard | 2.4% | 0.002 | 0.003 | 92.27% |
| Gym - Men's Horizontal Bar | 0.0% | * | NA | 76.33% |
| Gym - Women's Beam | 0.0% | * | NA | 91.68% |
| Gym - Men's Parallel Bars | 0.0% | * | NA | 87.28% |
| Gym - Women's Uneven Bars | 4.1% | 0.040 | NA | 85.38% |
| Gym - Men's Rings | 13.4% | 0.002 | NA | 74.01% |
| Gym - Women's Floor | 2.2% | 0.050 | NA | 91.56% |
| Gym - Men's Pommel | 0.0% | * | NA | 87.38% |
| Gym - Men's Floor | 0.0% | * | NA | 92.79% |
| Synch Swimming - Team TM | 0.0% | * | NA | 95.91% |
| Synch Swimming - Team AI | 0.0% | * | NA | 97.66% |
| Synch Swimming - Duet TM | 0.0% | * | NA | 94.60% |
| Synch Swimming - Duet AI | 0.0% | * | NA | 93.02% |
| Trampoline - Men's | 0.0% | * | NA | 80.64% |
| Trampoline - Women's | 0.0% | * | NA | 98.39% |
| Equestrian - Individual Dressage Free | 0.0% | * | NA | 79.68% |
| Equestrian - Individual Dressage GP | 1.5% | 0.000 | NA | 92.15% |
| Equestrian - Individual Dressage GP Special | 0.0% | * | NA | 85.59% |
There’s plenty of information there to digest, but here are a few high-level takeaways from these results…

  • For the most part, Olympic judging is pretty good, considering it involves human beings subjectively evaluating often-unique performances without the benefit of multiple viewing angles or replay. Just think of how quickly an Olympic dive happens or, conversely, how long a gymnastics routine lasts and how difficult it is to judge accurately. And yet most events show no evidence of judging bias and have very high R-squared values, indicating that little variation is left for factors missing from the model.
  • It's troubling, however, that outside of Diving the interaction between Judge and Athlete could not be evaluated, and this term represents the most scandalous of judging biases – one in which particular judges favor (or disfavor) specific athletes.
  • Although only one of the three dressage events showed a significant judging bias, it should be noted that dressage judges are placed at fixed positions around the ring, so the judge effect is completely confounded with, and mathematically indistinguishable from, a “position” effect (see the small illustration after this list).
  • The Men’s Rings event in Gymnastics is worrisome: not only did judging differences account for 13.4% of the variation in the data (the next-highest event was 4.1%), but the overall model R-squared value was also the lowest of any event at 74.01%. I'll examine this in more detail in a future post.
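
To see what that confounding means in practice, here is a tiny, purely hypothetical illustration in Python: when each judge always scores from the same position, the dummy-coded columns for Judge span exactly the same space as those for Position, so no model fit can separate the two effects.

```python
# Toy demonstration that a judge effect is inseparable from a position
# effect when each judge always sits at the same spot (hypothetical data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Judge":    ["Smith", "Jones", "Lee"] * 4,
    "Position": ["E",     "H",     "B"]   * 4,  # each judge at a fixed letter
})

X_judge = pd.get_dummies(df["Judge"], dtype=float).to_numpy()
X_both = pd.get_dummies(df[["Judge", "Position"]], dtype=float).to_numpy()

# Adding the Position columns adds no new information: the rank of the
# design matrix is unchanged, so the two effects are perfectly confounded.
print(np.linalg.matrix_rank(X_judge), np.linalg.matrix_rank(X_both))  # 3 3
```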

So there you have it…6 out of the 21 events examined showed judging bias in the 2008 Olympics, and that bias accounted for as little as 0.9% or as much as 13.4% of the total variation in scores. Although there were certainly plenty of events that appear well-judged, there is definitely room for improvement in the 2012 London Games!