Has Figure Skating Judging Improved? What Do the Numbers Say?

In my recent article on judging in the Olympics, I included an analysis of the controversial 2002 Pairs Figure Skating results. As a result of that scandal, the International Skating Union (ISU) changed the rules for judging competitions to eliminate judging inconsistencies and prevent future scandals.

In the new system, pairs are judged on Grade of Execution (which is scored differently and is not discussed here) and Program Components. There are five Program Components, each scored by nine judges. For each skater, seven of the nine judges’ scores are randomly selected (the judges don’t know whether their scores will be selected); the high and low scores are then removed and the remaining scores are averaged. Additionally, the judges’ scores are reported to fans anonymously and sorted in random order for each skater.
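As a rough sketch of the scoring procedure just described (not the ISU's actual software), the function below draws a random panel of seven from nine scores, drops the highest and lowest, and averages the rest. The function name `component_score`, its parameters, and the example scores are all made up for illustration.

```python
import random

def component_score(judge_scores, panel_size=7, rng=None):
    """Average a random panel's scores after dropping the high and low score.

    Hypothetical helper: judge_scores holds one score per judge for a
    single skater and a single Program Component.
    """
    rng = rng or random.Random()
    # Randomly select which judges' scores count for this skater.
    panel = sorted(rng.sample(judge_scores, panel_size))
    # Remove the single lowest and single highest score, then average.
    trimmed = panel[1:-1]
    return sum(trimmed) / len(trimmed)

# Made-up scores from nine judges for one component.
example_scores = [7.25, 7.50, 7.00, 7.75, 7.50, 7.25, 8.00, 7.50, 7.25]
print(round(component_score(example_scores), 2))
```

Because the panel is redrawn for every skater, the same nine scores can produce slightly different averages from one draw to the next.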

Keeping the judges’ names anonymous and randomizing their order for each skater makes it impossible to detect irregularities by an individual judge, because we can’t compare any one judge’s scores for one skater with their scores for another (and therefore I don’t understand how this prevents further scandal). But all hope is not lost for analyzing the data, which look like this:


First, I treat Judge as a term nested within Skater, meaning that although judges may carry the same labels across skaters in the worksheet, we are only going to look at the variation among judges within each skater and not between skaters. Additionally, we will treat Judge as a random factor rather than a fixed factor; this means that we are interested in the variation among judges generally and not in how high or low specific judges score. Running a model this way, we get the following ANOVA table:


So all factors are highly significant, and with an R-Sq(adj) of 98.19%, we have explained most of the variation in the data and have a good model of the differences in scores.
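To make the idea of nesting Judge within Skater concrete, here is a minimal sketch of the sum-of-squares decomposition for a balanced layout. It assumes the model terms are Skater, Judge(Skater), and Component; the actual model may have included other terms, and the function name `nested_ss` and the tiny data set are invented for illustration.

```python
from statistics import mean

def nested_ss(scores):
    """Sums of squares for balanced data indexed scores[skater][judge][component].

    Judge is nested within Skater: each judge mean is compared only with
    the mean of that judge's own skater, never across skaters.
    """
    n_skaters = len(scores)
    n_judges = len(scores[0])
    n_comps = len(scores[0][0])

    flat = [x for sk in scores for jd in sk for x in jd]
    grand = mean(flat)
    skater_means = [mean(x for jd in sk for x in jd) for sk in scores]
    comp_means = [
        mean(sk[j][c] for sk in scores for j in range(n_judges))
        for c in range(n_comps)
    ]

    ss_total = sum((x - grand) ** 2 for x in flat)
    ss_skater = n_judges * n_comps * sum((m - grand) ** 2 for m in skater_means)
    # Nested term: judge means measured against their own skater's mean.
    ss_judge = n_comps * sum(
        (mean(jd) - skater_means[s]) ** 2
        for s, sk in enumerate(scores)
        for jd in sk
    )
    ss_comp = n_skaters * n_judges * sum((m - grand) ** 2 for m in comp_means)
    ss_error = ss_total - ss_skater - ss_judge - ss_comp
    return {
        "Skater": ss_skater,
        "Judge(Skater)": ss_judge,
        "Component": ss_comp,
        "Error": ss_error,
        "Total": ss_total,
    }

# Made-up data: 2 skaters x 2 judges x 2 components.
toy = [
    [[7.0, 7.5], [7.5, 8.0]],
    [[6.0, 6.5], [6.0, 6.5]],
]
print(nested_ss(toy))
```

In the toy data the two judges disagree only for the first skater, so all of the Judge(Skater) sum of squares comes from that skater. A statistics package would additionally compute the F-tests and variance components that follow from treating Judge as random.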

So has judging improved since the scandal? To compare, I’m going to look at what percentage of the total Sum of Squares (SS) is accounted for by judging bias or inconsistency. This is similar to what we look at with R-Sq, but using only the judging terms instead of all terms in the model:

2002 – 2.08% was due to judging bias or inconsistency

2010 – 6.72% was due to judging bias or inconsistency
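Each percentage above is the judging terms' share of the total SS. A minimal sketch of that calculation follows; the SS values are invented placeholders (not the 2002 or 2010 figures), and which terms count as "judging" terms is an assumption that depends on the model.

```python
def judging_share(ss_by_term, judging_terms):
    """Percentage of the total sum of squares carried by the judging terms.

    ss_by_term maps every model term (including Error) to its SS.
    """
    total = sum(ss_by_term.values())
    judging = sum(ss_by_term[term] for term in judging_terms)
    return 100.0 * judging / total

# Hypothetical SS values, chosen only to make the arithmetic easy to follow.
example_ss = {"Skater": 80.0, "Judge(Skater)": 5.0, "Component": 10.0, "Error": 5.0}
print(f"{judging_share(example_ss, ['Judge(Skater)']):.2f}%")
```

The calculation is the same as R-Sq except that the numerator sums only the judging terms rather than every term in the model.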

So with new rules in place to “prevent” further judging inconsistencies, the most recent Olympics had over three times the judging bias of the 2002 Olympics (6.72 / 2.08 ≈ 3.2)…which had a known judging scandal!

What would you conclude?


