Score for Score

Posts tagged "From the Academy"

Today’s article is a piece of history as much as a piece of academic research. In 1988, Ansorge and Sheer explored the extent to which gymnastics judges are nationally biased in a paper entitled "International Bias Detected in Judging Gymnastic Competition at the 1984 Olympic Games."

The 1980s were the height of corruption in gymnastics judging. The FIG turned a blind eye to numerous allegations of score fixing and shady backroom deals. But the authors of this study are gymnerds after my own heart: they weren’t convinced by the anecdotal evidence and they wanted to see the numbers. In their words, "In spite of the rhetoric, however, the existence of international bias in gymnastics remains unsubstantiated by scientific inquiry."

The Data

This study uses scores from compulsories and optionals during both the men’s and women’s team competitions in Los Angeles. Four judges for each event were selected randomly from a pool. For men, there was a pool of 26 judges including two from each country that had qualified a team to the Olympics. For women, there was a pool of 16 with one from the country of each qualifying team.

The authors take advantage of this judging arrangement. Specifically, they can look for cases where a judge is evaluating a gymnast from her own country, and compare that judge’s scores to the scores of the other three judges on that event.

The Questions

The authors are interested in two separate questions:

  1. Are judges biased towards gymnasts from their own country? Specifically, when a judge evaluates gymnasts from her own country, is her score higher than the average score given by the other three judges?

  2. Are judges biased against gymnasts from countries that are in direct competition with their own country? For each team competing, the teams ranked directly above and below that team can be seen as its direct competitors. When a judge evaluates gymnasts directly competing with her own country, is her score lower than the average score given by the other three judges?

The Results

To answer these questions, the authors count the number of cases where a judge’s score is above, below, or equal to the average of the other three judges’ scores in a relevant scenario.

They then apply a sign test to these numbers. Intuitively, for the first question, this test estimates how likely we’d be to see the results observed in Los Angeles if judges are equally inclined to score a gymnast from their home countries higher or lower than the rest of the judges’ scores. Similarly, for the second question, it estimates how likely the real results are if judges really aren’t biased against their country’s direct competition. (It’s worth noting that the data doesn’t really satisfy the assumptions that this test requires — the scores for different routines aren’t independent — but that’s never stopped social scientists before.)
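
For the curious, here is a minimal sketch of what a one-sided sign test looks like in Python. The counts are made up for illustration and are not the paper's numbers, and I'm assuming ties (cases where the judge matches the others' average exactly) are simply dropped, which is the usual convention but may not be exactly what the authors did.

```python
from math import comb

def sign_test_p_value(n_above: int, n_below: int) -> float:
    """One-sided exact sign test.

    Returns the probability of seeing at least `n_above` "above" results
    out of n_above + n_below non-tied cases if above and below were
    equally likely (a fair coin). Ties are assumed to be dropped already.
    """
    n = n_above + n_below
    # Sum the Binomial(n, 0.5) probabilities for k = n_above .. n.
    tail = sum(comb(n, k) for k in range(n_above, n + 1))
    return tail / 2 ** n

# Hypothetical counts, NOT taken from the paper: a judge scored
# home-country gymnasts above the other judges' average 15 times
# and below it 5 times.
p = sign_test_p_value(n_above=15, n_below=5)
print(round(p, 4))  # 0.0207
```

A small p-value here says that a 15-to-5 split would be unusual if the judge were equally likely to land above or below her colleagues.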

In both cases, there was strong evidence for bias. These results are in the table below.

Judges were statistically significantly more likely to overscore gymnasts from their own country, and significantly more likely to underscore gymnasts from their country’s competitors. The evidence is right there.

The Implications

In my fantasies, this sort of research immediately drew the FIG’s attention and spurred the long and ongoing fight for fairer scoring forward. In reality, I have no idea if anyone took note. One of the authors contributed a long article on scoring in men’s gymnastics to a 1991 issue of Technique, the USGF’s official magazine, highlighting national bias among other issues — so at least someone in the gymnastics world must have cared.

This study is so compelling for its simplicity. The authors asked simple questions that can be directly answered by the data. They had access to numbers from a high-stakes international competition, lending extra weight to their findings. And they didn’t make outsized claims about the implications of their results. Of course, there are numerous sources of bias that we can’t examine using this simple method of comparing one judge to the others — including the sort of handshake deals in which gymnasts’ scores were determined before a single skill was even performed. But because national bias is an important issue in and of itself, none of that matters for the purposes of this study.

While this paper is an interesting piece of history, I have to wonder whether the same levels of national bias exist today. This has actually been studied using slightly different methods (here), and the short answer is that bias still seems to exist. But if you want to hear my full thoughts on the study, you’ll have to wait for another edition of From the Academy!

Tags: From the Academy

Today, I’d like to feature an article that contains interesting academic research and interesting information about how the FIG works. This article, "Judging the Judges: A General Framework for Evaluating the Performance of International Gymnastics Judges" by Mercier and Heiniger (2018), is about the statistics behind the method that the FIG uses to evaluate judges’ performance.

An objective measure of which judges are right and which judges are wrong is enormously useful as a check against bias in scoring. When judging falls prey to bias — intentional or otherwise — the credibility of gymnastics as a sport suffers. I was vaguely aware that the FIG has some mechanism in place to address this important issue, but before reading this article, I knew very little about it.

The FIG’s Judge Evaluation Program (JEP) is designed to measure whether a judge’s execution score for a given routine is out of line with the scores from other judges. This is not a straightforward concept to measure; in particular, the authors find that "out of line" has to be defined differently for good scores and bad scores, and that it varies from event to event. Using e-score data from all the FIG competitions of last quad, the authors develop an appropriate metric and defend their methodological choices. Let’s dive into the details.

Identifying the Correct Score

The biggest challenge with measuring bias in gymnastics is the absence of a correct score: we only know which scores are wrong if we know which score is right. These authors decide that the ground truth — what they call the "control score" — is the median score of all judges who viewed a given routine.

There are some notable issues with this definition, most of which the authors discuss. Most importantly, this makes it impossible to measure errors due to biases that all judges viewing a given routine share. For example, the control score will be too low if all judges expect the first routine in team finals to be the worst, and the control score will be too high if all judges expect a flawless routine based on the gymnast’s reputation. Nevertheless, at the end of the day, we need some definition of the control score. The median is better than some and as good as any.
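
To make the shared-bias problem concrete, here is a small sketch with a made-up five-judge panel (the scores are hypothetical, not from the paper). If every judge is 0.3 too harsh, the median moves by 0.3 as well, so each judge's error relative to the control score looks exactly the same as before.

```python
from statistics import median

def control_score(e_scores):
    """Control score as described above: the median of the panel's e-scores."""
    return median(e_scores)

def errors(e_scores):
    """Each judge's error relative to the control score, rounded to hundredths."""
    control = control_score(e_scores)
    return [round(score - control, 2) for score in e_scores]

# Hypothetical panel for one routine.
panel = [8.6, 8.7, 8.7, 8.9, 9.3]

# The same panel if every judge shared a bias of -0.3.
biased_panel = [score - 0.3 for score in panel]

print(control_score(panel))                   # 8.7
print(errors(panel) == errors(biased_panel))  # True: the shared bias is invisible
```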

Measuring Divergence from the Correct Score

With the control score for each routine set, the authors set out to measure each judge’s deviation from that control score. In particular, they want to compare each judge’s error — that is, the difference between his e-score and the control score — to a measure of average variability such as the standard deviation of all errors.

The authors find that we need to be very careful in how we define average variability. As shown in Figure 1 below, there’s a lot more variability in the e-scores given to bad routines than in the e-scores given to good routines. In statistical terms, this means that the data is heteroskedastic.

This means that we want to estimate variability as a function of the control score. The authors do this by drawing a line of best fit over the relationship between the variance of judges’ errors and the control score, as shown in Figure 2 above. They do this separately for each event.

So based on the apparatus and the control score, the authors can calculate the "intrinsic judging error variability" for each routine. This number measures how much variability in judges’ e-scores we should expect based on the apparatus and the quality of the routine.
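
Here is a rough sketch of the idea of fitting variability as a function of the control score. I'm assuming a plain straight line fit by ordinary least squares, and the data points are invented; the functional form and fitting procedure the authors actually use may well differ.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data for one apparatus: control scores and the observed
# variance of judges' errors around each control score.
control_scores = [6.0, 7.0, 8.0, 9.0]
error_variances = [0.10, 0.08, 0.06, 0.04]

slope, intercept = fit_line(control_scores, error_variances)

def expected_variance(control):
    """Predicted error variance for a routine with this control score."""
    return slope * control + intercept

# A negative slope means better routines get more consistent scores.
print(round(slope, 3), round(intercept, 3))  # -0.02 0.22
```

Repeating this once per apparatus would give one fitted line per event, which matches the per-event treatment described above.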

By dividing a judge’s error by the intrinsic judging error variability, we arrive at a measure of how far off base the judge was. The authors call this the "marking score." If the marking score is zero, it means that the judge was perfectly aligned with the control score; values further from zero indicate bigger issues with the judging. Using this method, a judge who gives a routine a 9.4 instead of a 9.9 is considered more out of line than a judge who gives a routine a 6.1 instead of a 6.6, because we expect more variability in scores for a bad routine than for a good routine.
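
The 9.4-versus-6.1 comparison can be sketched in a few lines. The variability function below is entirely hypothetical (made-up coefficients, and I'm assuming the divisor is a standard deviation); only the structure, error divided by expected variability, comes from the description above.

```python
from math import sqrt

def intrinsic_variability(control: float) -> float:
    """Hypothetical expected spread of judges' errors at this control score.

    The coefficients are invented so that spread shrinks as routines get
    better; they are NOT the paper's fitted values.
    """
    variance = 0.22 - 0.02 * control
    return sqrt(variance)

def marking_score(judge_score: float, control: float) -> float:
    """Judge's error scaled by how much disagreement we'd expect anyway."""
    return (judge_score - control) / intrinsic_variability(control)

good_routine = marking_score(9.4, control=9.9)
bad_routine = marking_score(6.1, control=6.6)

# Both judges are off by half a point, but the miss on the strong routine
# is larger relative to the expected spread.
print(abs(good_routine) > abs(bad_routine))  # True
```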

Interestingly, these marking scores vary a lot by event, as shown in the plot below. The judging is least consistent on pommel horse and most consistent on women’s vault. Overall, it is much more consistent on women’s events than on men’s.

Is a Correct Ranking Good Enough?

The authors also spend some time trying to evaluate judges’ performance by measuring how well they rank the gymnasts. In particular, they use versions of Kendall’s tau, which is based on how many gymnasts you’d need to swap in one judge’s rankings to make it agree with the correct ranking. But, after ensuring that their metric abides by some intuitively compelling principles (for example, that swapping the gold and silver medalists is worse than swapping seventh place and eighth place), the authors are not satisfied with their ranking metric. They find the results to be inconsistent with the marking scores described above, and they ultimately conclude that none of the ranking methods are terribly robust.
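
As a sketch of why the authors needed modified versions, here is the plain, unweighted Kendall's tau computed by counting pairs of gymnasts that a judge puts in the wrong order. The gymnast labels are placeholders. Notice that the unweighted version treats a gold/silver swap and a seventh/eighth swap identically, which is exactly the behavior the principle above rules out.

```python
from itertools import combinations

def discordant_pairs(reference, judged):
    """Count pairs of gymnasts that `judged` orders differently from `reference`."""
    position = {gymnast: i for i, gymnast in enumerate(judged)}
    return sum(
        1 for a, b in combinations(reference, 2) if position[a] > position[b]
    )

def kendall_tau(reference, judged):
    """Plain Kendall's tau: 1 for identical rankings, -1 for a full reversal."""
    n = len(reference)
    total_pairs = n * (n - 1) // 2
    return 1 - 2 * discordant_pairs(reference, judged) / total_pairs

# Placeholder gymnasts A through H in the "correct" order.
correct = list("ABCDEFGH")
swap_top = list("BACDEFGH")     # gold and silver swapped
swap_bottom = list("ABCDEFHG")  # seventh and eighth swapped

# Unweighted tau cannot tell these two mistakes apart.
print(kendall_tau(correct, swap_top) == kendall_tau(correct, swap_bottom))  # True
```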

This was fine with the FIG officials, who were apparently "adamant that a theoretical judge who ranks all the gymnasts in the correct order but is either always too generous or too strict is not a good judge because he/she does not apply the Codes of Points properly." The fans might believe that a judge’s job is to rank gymnasts correctly, but the FIG does not.

The JEP in Practice

I still can’t find much information on how the JEP is used in practice, though there are some brief references to the program in the Rules for Judges that the FIG puts out. As far as I can tell, each judge is evaluated at the end of a meet based on the average of his marking scores for all the routines he judged during the event. This emphasis on performance throughout the meet strongly suggests that the JEP is meant to root out judges who are systematically bad at their job, rather than to flag specific routines that were scored questionably.

I could not find public data on the scores for individual judges from individual competitions. If this information were made available, external auditors (or bloggers!) could evaluate judges for bias for or against specific countries, specific individuals, specific routine constructions, and more.

This paper offers a fascinating window into the thought process behind the FIG's method for evaluating judges. Such windows are few and far between.

Tags: From the Academy

Time for another dive into the academic literature on scoring in gymnastics! Today, I’d like to talk about “The judging of artistry components in female gymnastics: A cause for concern?” by Pajek, Kovač, Pajek, and Leskošek (2014).

This paper examines agreement among judges about the artistry deductions taken at the 2011 World Championships in Tokyo. This paper is of particular interest to me because I just conducted a similar analysis of the artistry deductions taken at the Rio Olympics.

However, unlike me, these authors had access to score sheets showing not only the total artistry deductions taken but also the specific deductions taken. This adds a lot to our conversation about how artistry gets scored.

The Data

This paper looks at artistry deductions for all 194 gymnasts who went up on beam in qualifying in Tokyo. Per usual, there were scores from five FIG judges for each routine.

The artistry section of the code looked a little different last quad. Artistry deductions could be taken for insufficient variation in rhythm, sureness of performance, inappropriate gestures, or insufficient artistry of presentation (which includes a lack of creative choreography).

For each of these categories, the authors record how much of a deduction each judge took. They also record whether the deduction’s value was “missing,” though I’m not totally sure what that means — I would have assumed a missing deduction is the same as no deduction. This information is all displayed in the table below.

Even before any statistical analysis, we can see that there is not consensus amongst the judges. For example, only Judge 4 seems to make use of the “inappropriate gesture” mimic.

Having access to this data also gave the authors an unexpected insight: judges are bad at math. “Several cases were found where the judges gave artistry deductions, but calculated the sum of separate deductions in a wrong way,” they report. “The final artistry deduction was different than the sum of separate components.” Not sure if that’s hilarious or downright terrifying.

The Analysis

Armed with this information, the authors set out to examine agreement amongst the judges. They begin with some simple summary statistics, finding that some judges are stricter about artistry than others. The harshest judge took a mean deduction of 0.39 points from each gymnast, while the laxest one took just 0.18. It’s also worth noting, as I’ve said before, that these numbers are quite low when compared to the total score that gymnasts receive in the open-ended code.

The authors then compare each judge’s ranking of the gymnasts on each component of artistry. To do so, they used a metric called Kendall’s W, which measures the level of agreement among a group of people ranking the same options. It ranges from 0 to 1, with 0 indicating no agreement and 1 indicating perfect agreement.
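
Here is a minimal sketch of the textbook Kendall's W formula, assuming each judge supplies a complete ranking with no ties. Real deduction data has plenty of ties, which calls for a correction this sketch leaves out, so treat it as an illustration of the idea rather than a reproduction of the paper's calculation.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance, without a tie correction.

    `rankings` holds one list per judge; rankings[j][i] is the rank
    (1 = best) that judge j gives to gymnast i.
    """
    m = len(rankings)      # number of judges
    n = len(rankings[0])   # number of gymnasts
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three hypothetical judges who agree completely on four gymnasts.
print(kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))  # 1.0

# Two hypothetical judges with exactly opposite rankings.
print(kendalls_w([[1, 2, 3, 4], [4, 3, 2, 1]]))  # 0.0
```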

The table below has the results of this analysis. It seems that judges almost never agreed on which gymnasts had “artistry of presentation,” as evidenced by the low Kendall’s W value of 0.054. On the other hand, there was much more agreement about which gymnasts had sureness of performance.

Finally, the authors measure whether there is less consensus about artistry deductions than about other components of execution. They find that the correlation between different judges’ artistry deductions (0.60) is significantly lower than the correlation between different judges’ total other deductions (0.73), with a p-value under 0.001. This implies that the judges are in greater agreement about the ground truth of execution deductions.

In Review

I found this article to be a very valuable contribution to the literature about gymnastics scoring. Of course, more than anything else, I’m jealous that the authors got access to this data!

The title of the paper asks whether artistry judging is a cause for concern, and the authors are giving an unambiguous "yes." In particular, the lack of agreement amongst judges on each specific component of artistry suggests that not all judges are looking for the same things when considering the level of artistry displayed in a routine.

I would love to see this sort of analysis conducted again ahead of the next round of updates to the Code of Points, which will probably occur sometime this year. By looking at the specific deductions on which there is little consensus, we can understand which concepts are poorly defined in the code. The Code of Points should be written so that any two people, given sufficient training, reach the same conclusion about the same routines. The relatively high levels of disagreement about, say, “artistry of presentation” should be considered failures in the Code of Points.

Check back soon for more reviews of academic papers — there’s a lot out there and we’ve just scratched the surface!

Tags: From the Academy, On Artistry

It's time for another post in which we dive into academic research on scoring in gymnastics. Today, I want to talk about “Judging the Cross on Rings” by Plessner and Schallies (2005).

This study looks at judges’ ability to recognize “shape constancy,” which is cognitive psychology language for the ability to recognize the same form from different angles. Specifically, the authors are interested in the iron cross on rings, which is judged based on the angle between the gymnast’s arms and his body. It’s easy enough to take the right deductions when you’re facing the gymnast head on, but it's a lot harder from other angles, as illustrated in the figure below.

Of course, a gymnast’s score shouldn’t depend on where the judges are sitting, and we’d all like to think that the training process teaches judges to take the correct deduction no matter what angle they have. But do they really? That’s the question that this paper investigates: are judges better than normal people at identifying the correct angles in an iron cross when we switch up their view of the gymnast?

The Experiment

To answer this question, the authors showed pictures of gymnasts holding an iron cross to a set of gymnastics judges and a set of laypeople. Each iron cross was photographed three times — head on, from a 30 degree angle, and from a 60 degree angle — and participants were given one of the three pictures of each iron cross. Participants were asked to judge the size of the deduction - 1 to 15 degrees, 16 to 30, and 31 to 45, which correspond to the deductions in the MAG code of points.

Unfortunately, both groups’ accuracy declined as the angle of the shot got worse. While the gymnastics judges took the wrong deduction 38% of the time when shown a head on shot for one second, they were wrong 44% of the time when shown a shot from a 60 degree angle. And the size of the increase in their error rate - that is, 6 percentage points - was almost as large as that of the laypeople, who performed 7 percentage points worse on the 60 degree shots than on the head on shots. The table below shows the error rates for laypeople and judges, separated by the angle of the photograph and the length of time that the photograph was shown for.

I should note that judges’ error rate when looking at ideal head on photographs — 38% — seems pretty high to me! About four fifths of the judges who participated in this study had more than ten years of experience judging, and they still only got the deduction right 62% of the time? The more I read these studies, the more curious I am about the progress toward robot judges.

The literature also suggests, unsurprisingly, that our ability to achieve shape constancy worsens when we are paying less attention to the shape. The authors therefore consider the possibility that counting the duration of the iron cross - another key component of judging rings - might make it harder for judges to correctly identify angles. To test this, they flashed the pictures of the iron crosses for different lengths of time and asked participants to judge the duration as well as the angle. Thankfully, this hypothesis was not borne out: the professional judges, experienced in multitasking, performed no worse when evaluating the duration alongside the angle.

In Review

The evidence is in: angle matters. I’m sure any judge would be happy to back this up: we hear judges talk about angles all the time in response to fans’ critiques of the accuracy of their scores.

The FIG has guidelines for where judges ought to sit in relation to the apparatus, though both the MAG and WAG codes stipulate that "variations in the seating arrangement are possible depending on the conditions available in the competition hall." (Thanks to @romacastellini for pointing me to this part of the code.) MAG E-score judges are supposed to sit at wildly different angles with wildly different views of the gymnast - check out the following illustration from the 2017-2020 code of points.

This study tells us that such an arrangement is likely to create a situation in which different judges take different deductions for the angle of a gymnast's arms on his iron cross.

Now, if every gymnast at a given meet is being judged by the same judges sitting in the same positions, then the ranking will likely ultimately still be correct. But I could see this being a big issue in situations where the judges' seating arrangements are not so standardized, and where scores need to be compared across different meets - which sounds a lot like NCAA gymnastics. Of course, this probably isn't high on the list of things to address in NCAA judging, but it's something to keep in mind.

Anyway, that's all for this trip into academia! Stay tuned for more!

Tags: From the Academy

Today, we’re going back into the world of academia to discuss another study on how gymnastics is scored. I’d like to talk about a 2012 article by Alexandra Pizzera entitled “Gymnastic Judges Benefit From Their Own Motor Experience.”

This study tests the hypothesis that judges evaluate skills more accurately when they can perform that skill themselves - in technical terms, when they have “specific motor experience” (SME) with that skill. Because judges have to get a lot of information from a very quick movement, those who have done that movement are better prepared to do so.

This question is actually quite relevant to the ongoing conversation in the gymnastics community about who should be allowed to take the brevet judging test. USAG’s requirements are remarkably strict, and usually only former national team members are invited to take the test. I had assumed that this was largely a way to keep the elite circle small, and maybe that’s true, but it’s also worth asking whether being an elite gymnast truly does make you better qualified to judge elite gymnastics.

The Experiment

To test her hypothesis, the author asked 58 gymnastics judges to score the angle of 31 different split leaps on beam. 22 of the judges were former gymnasts who reported that they could perform the skill themselves; the rest could not.

The judges were asked to judge whether the gymnasts hit the two angles required to get credit for the leap, taking the role of a d-panel judge. They were also asked to mark the total of all the deductions that could be taken for body shape, etc., taking the role of an e-panel judge. Each of the split leaps had been scored by an expert judge in slow motion to create a “correct” reference score with which to compare all 58 participants’ scores.

The results are shown in the figure above. The left-hand side of the graph, shaded in grey, shows that judges who could perform the skill themselves correctly identified whether a skill should or shouldn’t be credited much more often. They were correct on their judgements of Angle 1 77% of the time, while judges who couldn’t perform the skill were correct just 67% of the time. Similarly, the judges who could do the skill were right about Angle 2 87% of the time, compared to 82% for other judges. These differences are statistically significant at α = 0.05.

The same pattern holds for the e-scores. The author measured the difference between each judge’s total deductions on a skill and the reference judge’s total deductions on each skill as a percent of the total reference deductions. Judges who could do the skill were closer to the reference score: they deviated by an average of 80%, while the other judges deviated by an average of 97%.
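
To make that percentage concrete, here is a small sketch of how such a deviation could be computed. The deduction values are invented, and I'm assuming the absolute difference is used and then averaged across skills; the paper's exact formula may differ in the details.

```python
def percent_deviation(judge_deduction: float, reference_deduction: float) -> float:
    """Absolute gap between judge and reference, as a percent of the reference."""
    return abs(judge_deduction - reference_deduction) / reference_deduction * 100

def mean_percent_deviation(judge_deductions, reference_deductions):
    """Average percent deviation across all the skills one judge scored."""
    deviations = [
        percent_deviation(judge, reference)
        for judge, reference in zip(judge_deductions, reference_deductions)
    ]
    return sum(deviations) / len(deviations)

# Hypothetical total deductions for three split leaps.
judge = [0.3, 0.1, 0.5]
reference = [0.2, 0.2, 0.4]

print(round(mean_percent_deviation(judge, reference), 1))  # 41.7
```

Because the gap is divided by the reference deduction, being off by a tenth on a leap that only deserved two tenths already counts as a 50% deviation. That helps explain how averages as high as 80% and 97% can arise.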

All told, this evidence is pretty strong support for the original hypothesis: judges are more accurate when they’ve performed the skill themselves.

In Review

The findings in this study make a lot of sense to me: former gymnasts understand the mechanics of a skill a little better than those of us who have only ever been observers. Of course, because there’s no way to randomize which judges can perform a split leap, these results aren’t causal. Maybe all that matters is that you spend 4 hours a day in a gym before the age of 12, even if you’re just sitting there - we can’t know for sure. But I’m basically convinced that gymnastics experience really helps a judge out.

This study also provides some evidence to back up policies that allow former elites to skip the early levels of judging on the path to becoming a brevet judge. Such policies already exist in at least the US and the UK, and other countries should consider implementing them as well.

I do have to wonder if the results are generalizable - this study doesn’t just test a single event, it tests one skill on one event. Things might look a little different when we take into account the full range of gymnastics that judges are asked to score.

Finally, I want to emphasize that these results in no way suggest that only former gymnasts should be allowed to judge the sport. There were certainly some non-gymnasts who judged more accurately than some former gymnasts in this experiment, and either way the last thing we want to do is shrink the pool of potential judges. But hopefully these results will encourage more former gymnasts to stay involved in the sport by becoming a judge. Or, just maybe, this paper will encourage some judges to try out an adult gymnastics class!

Check back next week for another review!

Tags: From the Academy