
How Olympic Judging Stacks Up, Part I

You may have read my recent article applying statistical analysis to how judges did in evaluating performances at two previous Olympic events. If so, perhaps you found yourself wondering how other events stack up…

Anticipating a desire to see the “cleanest” and “dirtiest” judging performances, I pulled all of the data I could find on every event from the 2008 Beijing Olympics that is judged on a continuous scale. In this post, I will examine which events* showed judging bias, and quantify that bias to provide some comparison between sports.

* Data on the individual judging scores could not be found for the synchronized diving events, Men’s and Women’s Vault, Men’s and Women’s All-Around Gymnastics, or any Rhythmic Gymnastics events. I would be very interested if anyone has this data.

Before we dive in, remember that R-squared is the percentage of the Total Sum of Squares that is accounted for by the terms in the model. Or, put more loosely, it's the percentage of variation in the data explained by the model. To quantify judging bias for each event, I'll take the Sum of Squares for any significant judging terms and divide it by the Total Sum of Squares, giving the percentage of variation in scores accounted for by judging. I'll call that number "Judging Contribution".
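If you'd like to try this kind of analysis on your own data, here is a minimal sketch in Python using pandas and statsmodels rather than Minitab. The data frame, column names (Athlete, Judge, Score), simulated scores, and the 0.05 significance cutoff are all hypothetical stand-ins for one event's judging sheet, not the exact setup used for the results below.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical judging sheet for a single event: every judge scores every athlete once.
rng = np.random.default_rng(1)
athletes = [f"A{i}" for i in range(8)]
judges = [f"J{i}" for i in range(7)]
rows = [(a, j, 8.0 + 0.3 * i + rng.normal(0, 0.2))
        for i, a in enumerate(athletes) for j in judges]
scores = pd.DataFrame(rows, columns=["Athlete", "Judge", "Score"])

# Model the scores with Athlete and Judge as categorical factors.
model = ols("Score ~ C(Athlete) + C(Judge)", data=scores).fit()
anova = sm.stats.anova_lm(model, typ=2)

# Total Sum of Squares, computed directly from the data.
total_ss = ((scores["Score"] - scores["Score"].mean()) ** 2).sum()

judge_ss = anova.loc["C(Judge)", "sum_sq"]
judge_p = anova.loc["C(Judge)", "PR(>F)"]

# "Judging Contribution": share of total variation from the Judge term,
# counted only when the term is significant (alpha = 0.05 assumed here).
judging_contribution = judge_ss / total_ss if judge_p < 0.05 else 0.0
print(f"Judge p-value: {judge_p:.3f}")
print(f"Judging Contribution: {judging_contribution:.1%}")
print(f"R-sq(adj): {model.rsquared_adj:.2%}")
```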

Below is a table of all events analyzed, with the following information:

  • Judging Contribution
  • P-value for the “Judge” term if it was significant
  • P-value for the “Judge*Athlete” interaction if it was significant, or NA if the data do not allow that term to be tested
  • R-Sq (adj) for the model

Event                                        | Judge Contribution | Judge p-value | Judge*Athlete p-value | R-sq(adj)
---------------------------------------------|--------------------|---------------|-----------------------|----------
Diving - Men's 10M Platform                  | 0.0%               | *             | *                     | 89.63%
Diving - Women's 10M Platform                | 0.0%               | *             | *                     | 94.36%
Diving - Men's 3M Springboard                | 0.9%               | 0.000         | *                     | 86.03%
Diving - Women's 3M Springboard              | 2.4%               | 0.002         | 0.003                 | 92.27%
Gym - Men's Horizontal Bar                   | 0.0%               | *             | NA                    | 76.33%
Gym - Women's Beam                           | 0.0%               | *             | NA                    | 91.68%
Gym - Men's Parallel Bars                    | 0.0%               | *             | NA                    | 87.28%
Gym - Women's Uneven Bars                    | 4.1%               | 0.040         | NA                    | 85.38%
Gym - Men's Rings                            | 13.4%              | 0.002         | NA                    | 74.01%
Gym - Women's Floor                          | 2.2%               | 0.050         | NA                    | 91.56%
Gym - Men's Pommel                           | 0.0%               | *             | NA                    | 87.38%
Gym - Men's Floor                            | 0.0%               | *             | NA                    | 92.79%
Synch Swimming - Team TM                     | 0.0%               | *             | NA                    | 95.91%
Synch Swimming - Team AI                     | 0.0%               | *             | NA                    | 97.66%
Synch Swimming - Duet TM                     | 0.0%               | *             | NA                    | 94.60%
Synch Swimming - Duet AI                     | 0.0%               | *             | NA                    | 93.02%
Trampoline - Men's                           | 0.0%               | *             | NA                    | 80.64%
Trampoline - Women's                         | 0.0%               | *             | NA                    | 98.39%
Equestrian - Individual Dressage Free        | 0.0%               | *             | NA                    | 79.68%
Equestrian - Individual Dressage GP          | 1.5%               | 0.000         | NA                    | 92.15%
Equestrian - Individual Dressage GP Special  | 0.0%               | *             | NA                    | 85.59%

There’s plenty of information there to be consumed, but I’ll provide some high-level overviews of what I learned from these results…

  • For the most part, Olympic judging is pretty good, considering it involves human beings subjectively evaluating often-unique performances without the benefit of multiple viewing angles or replay…just think of how quickly an Olympic dive happens or, conversely, how long a gymnastics routine lasts and how difficult it is to judge that performance accurately. And yet most events show no evidence of judging bias and have a very high R-squared value, indicating that little variation is left unexplained by factors outside the model.

  • It's troubling, however, that outside of Diving the interaction between Judge and Athlete could not be evaluated, since this term represents the most scandalous of judging biases: one in which particular judges favor (or disfavor) specific athletes. (A sketch of how that interaction can be tested when replicate scores exist follows the summary below.)

  • Although only one dressage event showed a significant judging bias, it should be noted that in dressage the judges are placed in different positions around the ring, so the judge effect is completely confounded with, and mathematically indistinguishable from, a "position" effect.

  • The Men’s Rings event in Gymnastics is worrisome: not only did judging differences account for 13.4% of the variation in the data (the next-highest event was 4.1%), but the overall model R-squared value, at 74.01%, was also the lowest of all events. I'll examine this in more detail in a future post.

So there you have it…6 of the 21 events examined showed judging bias in the 2008 Olympics, with that bias accounting for as little as 0.9% and as much as 13.4% of the total variation in scores. Although plenty of events appear to have been well judged, there is definitely room for improvement in the 2012 London Games!
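For the interaction point raised above, here is a minimal sketch of how the Judge*Athlete term can be tested when replicate scores exist, as in diving, where each athlete performs several dives. The data, column names, and effect sizes are hypothetical; this is an illustration of the model form, not the analysis behind the table above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical diving-style data: each athlete performs several dives, so
# every Judge x Athlete cell has replicates and the interaction is testable.
rng = np.random.default_rng(7)
rows = [(f"A{a}", f"J{j}", d, 7.5 + 0.2 * a + rng.normal(0, 0.3))
        for a in range(5) for j in range(7) for d in range(6)]
dives = pd.DataFrame(rows, columns=["Athlete", "Judge", "Dive", "Score"])

# Full model with the Judge*Athlete interaction included.
model = ols("Score ~ C(Athlete) * C(Judge)", data=dives).fit()
anova = sm.stats.anova_lm(model, typ=2)

# A small p-value here would suggest that particular judges score
# particular athletes differently than the rest of the panel does.
print(anova.loc["C(Athlete):C(Judge)", ["sum_sq", "PR(>F)"]])
```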

Comments

Name: Cliff • Sunday, November 4, 2012

I have a student who wants to do a statistical project on gymnastics scores. Where did you get the data on the individual judges? (I think it would be really good for her to work on the recent London Olympics data.)
I'll take a look at your other article as well to see whether I can find a good method appropriate for my student.


Name: Joel • Wednesday, November 7, 2012

Cliff - thanks for the interest! Unfortunately, in London they did not show individual judges' scores for gymnastics or pretty much any other event. I find it disturbing that they don't want people to see the actual scores, perhaps because analysis like this can expose irregularities.

If you'd like my data from past Olympics, send me an email at jsmith at minitab dot com and I'll be happy to pass it along. For the 2008 games NBC's Olympic website also has individual scores for most events and was a great resource.

- Joel


Name: Rodney Cox • Monday, July 22, 2013

I found your articles extremely interesting. I am a judo kata judge responsible for training other judges in my area. Kata is judged much like gymnastics, except it is almost completely technical merit, with a pair judged against a theoretically perfect standard. Between 15 and 21 techniques are judged by 5 judges in each kata, with each kata considered a separate contest. 40 years ago I could have handled the statistics, but I am out of practice. Could you suggest an approach to help me evaluate judges' competence, given that they can only be compared with other judges of varying competence? Thank you in advance.


Name: Joel • Tuesday, July 30, 2013

Rodney-

That would be really interesting to look at! Let's communicate by email...contact me at jsmith at Minitab's domain. Do you happen to have raw judging scores from a kata?

- Joel

