by Brina July 27, 2019
Time for another dive into the academic literature on scoring in gymnastics! Today, I’d like to talk about “The judging of artistry components in female gymnastics: A cause for concern?” by Pajek, Kovač, Pajek, and Leskošek (2014).
This paper examines agreement among judges about the artistry deductions taken at the 2011 World Championships in Tokyo. This paper is of particular interest to me because I just conducted a similar analysis of the artistry deductions taken at the Rio Olympics.
However, unlike me, these authors had access to score sheets showing not only the total artistry deductions taken but also the specific deductions taken. This adds a lot to our conversation about how artistry gets scored.
The Data
This paper looks at artistry deductions for all 194 gymnasts who went up on beam in qualifying in Tokyo. Per usual, there were scores from five FIG judges for each routine.
The artistry section of the code looked a little different last quad. Artistry deductions could be taken for insufficient variation in rhythm, sureness of performance, inappropriate gestures, or insufficient artistry of presentation (which includes a lack of creative choreography).
For each of these categories, the authors record how much of a deduction each judge took. They also record whether the deduction’s value was “missing,” though I’m not totally sure what that means — I would have assumed a missing deduction is the same as no deduction. This information is all displayed in the table below.

Even before any statistical analysis, we can see that there is no consensus amongst the judges. For example, only Judge 4 seems to make use of the “inappropriate gesture” deduction.
Having access to this data also gave the authors an unexpected insight: judges are bad at math. “Several cases were found where the judges gave artistry deductions, but calculated the sum of separate deductions in a wrong way,” they report. “The final artistry deduction was different than the sum of separate components.” Not sure if that’s hilarious or downright terrifying.
The Analysis
Armed with this information, the authors set out to examine agreement amongst the judges. They begin with some simple summary statistics, finding that some judges are stricter about artistry than others. The harshest judge took a mean deduction of 0.39 points from each gymnast, while the laxest one took just 0.18. It’s also worth noting, as I’ve said before, that these numbers are quite low when compared to the total score that gymnasts receive in the open-ended code.
The authors then compare each judge’s ranking of the gymnasts on each component of artistry. To do so, they use a metric called Kendall’s W, which measures the level of agreement among a group of people ranking the same options. It ranges from 0 to 1, with 0 indicating no agreement and 1 indicating perfect agreement.
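To make Kendall's W concrete, here is a minimal sketch of the textbook formula, W = 12S / (m^2 (n^3 - n)), where m is the number of judges, n is the number of gymnasts, and S is the sum of squared deviations of each gymnast's total rank from the mean total rank. This is not the authors' code: the rankings are made-up toy data, the function name is hypothetical, and the sketch assumes no tied ranks (real deduction data has many ties and would need a tie correction).

```python
def kendalls_w(rankings):
    """Kendall's W for m judges who each rank the same n items (no ties)."""
    m = len(rankings)       # number of judges
    n = len(rankings[0])    # number of items being ranked
    # Total rank each item received across all judges.
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    # With ranks 1..n, every judge hands out a mean rank of (n + 1) / 2.
    mean_rank_sum = m * (n + 1) / 2
    # Sum of squared deviations of the rank totals from their mean.
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Toy data (hypothetical): three judges in perfect agreement...
perfect = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
# ...and two judges who rank the gymnasts in exactly opposite order.
opposed = [[1, 2, 3, 4], [4, 3, 2, 1]]

print(kendalls_w(perfect))
print(kendalls_w(opposed))
```

Identical rankings give W = 1, and two exactly reversed rankings cancel out to W = 0, which matches the endpoints described above.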
The table below has the results of this analysis. It seems that judges almost never agreed on which gymnasts had “artistry of presentation,” as evidenced by the low Kendall’s W value of 0.054. On the other hand, there was much more agreement about which gymnasts had sureness of performance.
Finally, the authors measure whether there is less consensus about artistry deductions than about other components of execution. They find that the correlation between different judges’ artistry deductions (0.60) is significantly lower than the correlation between different judges’ total other deductions (0.73), with a p-value under 0.001. This implies that the judges are in greater agreement about the ground truth of execution deductions.
In Review
I found this article to be a very valuable contribution to the literature about gymnastics scoring. Of course, more than anything else, I’m jealous that the authors got access to this data!
The title of the paper asks whether artistry judging is a cause for concern, and the authors give an unambiguous "yes." In particular, the lack of agreement amongst judges on each specific component of artistry suggests that not all judges are looking for the same things when considering the level of artistry displayed in a routine.
I would love to see this sort of analysis conducted again ahead of the next round of updates to the Code of Points, which will probably occur sometime this year. By looking at the specific deductions on which there is little consensus, we can understand which concepts are poorly defined in the code. The Code of Points should be written so that any two people, given sufficient training, reach the same conclusion about the same routines. The relatively high levels of disagreement about, say, “artistry of presentation” should be considered failures in the Code of Points.
Check back soon for more reviews of academic papers — there’s a lot out there and we’ve just scratched the surface!
by Brina June 8, 2019
Yesterday, in Part I of my mini-series on artistry in gymnastics, I gave an overview of what artistry deductions looked like at the Rio Olympic Games. Today, I’d like to delve a little deeper and examine a different question: do the scores suggest that the Code of Points - and the judges who apply it - have a coherent understanding of artistry?
If we have a coherent understanding of artistry, then judges should be able to take artistry deductions consistently and objectively regardless of their subjective taste. There are plenty of masterpieces out there that I can recognize as great works of art even though I do not find them pleasing or moving - Picasso’s Les Demoiselles d'Avignon and Stravinsky’s Rite of Spring come to mind. In other words, they have a high level of artistry even though they are not to my taste.
Gymnastics judges are not supposed to answer the question, “Is this art to my taste?” Instead, they are simply supposed to determine, “Is this art?” Now, I’m certainly not going to venture a definition of art myself— in other words, I’m not a college junior sitting in some art history seminar. But what I am going to do is examine these artistry deductions for the sort of consistency we would expect if all the judges were working off a similar implicit understanding of what is and is not art in gymnastics.
Judges Seem to Agree About Artistry Deductions
If different judges give the same routine wildly different artistry deductions, then they clearly do not have a common understanding of each deduction. If they do have that common understanding, then we’d expect to see them all take about the same number of points.
To measure this, I took the standard deviation of all five judges’ scores for each floor routine. (I’m only going to look at floor, because floor routines tend to be more varied and controversial.) You can see the distribution of the standard deviations in the figure below.
So most of the time, the average judge takes deductions that are about half a tenth off from the average deduction across all judges. In other words, the judges pretty much agree on the total amount of artistry deductions. (Of course, the average gymnast gets two tenths off in artistry deductions, so half a tenth is not nothing - but the disagreement between judges is ultimately not really impacting her score.)
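As a rough illustration of the spread calculation, here is a minimal sketch that takes the standard deviation of five judges' artistry deductions for each routine. The routine names and numbers are hypothetical, not the Rio data, and using the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption, since the post does not say which one was used.

```python
from statistics import pstdev

# Hypothetical artistry deductions from a five-judge panel, per routine.
routines = {
    "routine_a": [0.2, 0.2, 0.3, 0.1, 0.2],  # judges roughly agree
    "routine_b": [0.4, 0.0, 0.0, 0.0, 0.0],  # one judge is far from the rest
}

# Spread of the panel's deductions for each routine (assumed: population SD).
spread = {name: pstdev(deductions) for name, deductions in routines.items()}

# The routine the judges disagreed about most.
most_contested = max(spread, key=spread.get)

for name, sd in spread.items():
    print(f"{name}: standard deviation = {sd:.3f}")
print("most contested:", most_contested)
```

A panel where one judge takes 0.4 and the others take nothing has a much larger spread than a panel clustered around two tenths, which is the pattern that flags a routine as controversial.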
Interestingly, the most controversial routine - the one with the largest spread between the judges’ artistry scores - was that of Catalina Elena Escobar Gomez of Colombia, who got injured during her routine in Rio and had to leave the floor. One judge took a full 0.4, while the rest took basically nothing, presumably out of pity. I admire that one asshole judge’s commitment to accurate scoring over feelings.
So if judges seem to have a coherent understanding of artistry, can we get any closer to figuring out what that understanding is?
Is Artistry Just Execution?
One strong possibility is that judges fall prey to the common tendency to blur artistry and execution. You see this sometimes in the discourse around certain gymnasts; for example, some people seem to think that America hasn’t had a truly artistic gymnast since Nastia Liukin.
Below, I’ve plotted the artistry deductions versus the other execution deductions on beam and floor. As you can see, the two are closely linked. The correlation between artistry deductions and execution deductions on beam is 0.71. Roughly speaking, since the share of variation explained is the square of the correlation, this means that about 50% of the variation in beam artistry deductions can be explained by the variation in beam execution deductions, which is high. On floor, the correlation is 0.50, or about 25% of the variation.
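One point worth making concrete: the share of variation that one variable explains in another is the square of the correlation coefficient, not the coefficient itself. The sketch below computes a Pearson correlation by hand on made-up deductions (hypothetical numbers, not the Rio data) and then squares the two correlations reported above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalised covariance and variances; the 1/n factors cancel out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical execution and artistry deductions for five routines.
execution = [1.5, 1.8, 2.0, 2.4, 2.9]
artistry = [0.10, 0.20, 0.15, 0.30, 0.35]
r = pearson_r(execution, artistry)
print(f"toy data: r = {r:.2f}, r^2 = {r ** 2:.2f}")

# The correlations quoted in the post, squared.
for event, value in [("beam", 0.71), ("floor", 0.50)]:
    print(f"{event}: r = {value:.2f}, r^2 = {value ** 2:.2f}")
```

Squaring 0.71 gives about 0.50 and squaring 0.50 gives 0.25, so execution deductions account for roughly half of the variation in artistry deductions on beam and roughly a quarter on floor.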
Now, it’s totally possible that artistry and execution are two totally separate concepts and, for whatever reason, gymnasts who are good at one tend to be good at the other. When you listen to the way Eythora Thorsdottir talks about gymnastics, it’s clear that her love of beauty in sport drives her both to perfect her extension and toe point and to treat us to an expressive performance. While some expressive gymnasts can’t quite figure out their landings and can’t stay on the beam, I could not think of anyone with fantastic performance quality but a lot of built-in form deductions.
But I don’t think the two are inextricably linked. While no highly artistic gymnast has poor execution, plenty of gymnasts with flawless execution have very little artistry. I’m not going to name names, but you all know the type, and there were plenty in Rio.
If that’s the case, we should expect to see a lot of points in the bottom right corner of these graphs - high artistry deductions, but low execution deductions. I don’t see a single point there at all.
All this leads me to conclude that the shared common understanding of artistry promoted by the Code of Points is “consistently lovely execution that contributes to a nice overall look of the routine.”
I don’t know exactly what we are scoring when we assign points to artistry in gymnastics. And I don’t know exactly what we should be scoring, either. But I think most gymnastics fans agree that there’s a gap between what we have and what we want.
by Brina June 7, 2019
On Wednesday night, I came across something wonderful: Matthew Rusk (@mrusskie93) had linked to the full results from the Rio Olympic Games. Among other things, this contains a breakdown of the e-score deductions for each gymnast into “execution” deductions and “artistry” deductions. And, after way more PDF scraping than I want to think about, I managed to get some delicious data out of it!
What Are Artistry Deductions?
The Code of Points specifies certain e-score deductions as “artistry deductions” on floor and balance beam, the two events where there is time and space for presentation and choreography.
On beam, one-tenth deductions can be taken for a lack of confidence, lack of personal style, insufficient variation in rhythm & tempo, and lack of fluidity. On floor, there are even more such deductions. These include lack of expressiveness, gestures that don’t correspond to the music, failure to engage the audience, and inability to show a theme or character. There are further deductions for composition and musicality.
Now, you may be wondering, “What’s personal style?” “What’s fluidity?” On these topics, the Code is silent. Judges are expected to know it when they see it.
I’ve long thought that the division of points into a difficulty score and an execution score was a bit of a clever trick on the part of the FIG. By grouping together objective deductions, like “that split did not hit 180,” and subjective deductions, like “that routine lacked expressiveness,” the Code of Points manages to mask much of the subjectivity involved in judging. But in this remarkable PDF, we have everything clearly separated out.
Artistry Deductions Are Common, But Small
Let’s start out with some summary statistics. To be honest, before looking at these scores, I had no idea whether artistry deductions actually get taken - I’ve heard that it’s very common, and I’ve also heard it’s incredibly rare.
In the analysis below, I’m just going to use the scores from qualification, since that lets us look at the largest number of gymnasts. I’m also going to use the mean deduction taken by all five e-panel judges - no dropping the highest and lowest.
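To show what that choice changes, here is a minimal sketch comparing the plain mean of all five judges with a mean that drops the highest and lowest deduction first. The panel values and function names are hypothetical and only illustrate the difference between the two averages.

```python
from statistics import mean


def mean_all(deductions):
    """Average over every judge on the panel (the approach used here)."""
    return mean(deductions)


def mean_trimmed(deductions):
    """Average after dropping the single highest and lowest deduction."""
    middle = sorted(deductions)[1:-1]
    return mean(middle)


# Hypothetical artistry deductions from five judges for one routine.
panel = [0.1, 0.2, 0.2, 0.3, 0.5]

print(f"all five judges: {mean_all(panel):.3f}")
print(f"middle three:    {mean_trimmed(panel):.3f}")
```

Keeping all five judges means a single strict or lenient judge still moves the average, which is useful when the goal is to study disagreement rather than to reproduce the final score.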
Not a single gymnast escaped without at least one judge taking an artistry deduction. However, some got close: four judges found nothing to take from Nina Derwael’s floor routine, and three judges found nothing to take from Laurie Hernandez’s beam and Marine Brevet’s floor. In fact, a lot of gymternet favorites — floor routines from Eythora Thorsdottir, Rune Hermans, and Laurie Hernandez, as well as beam routines from Simone Biles, Sanne Wevers, and Pauline Schaefer - got an average deduction of less than a tenth across the five judges.
But that’s not so common. The average gymnast lost 0.22 points on floor and 0.21 points on beam due to artistry deductions.
You can see the spread in the figure above. Most gymnasts are getting under three tenths in artistry deductions, though some are hit a little harder.
But, for perspective, there wasn’t a single routine from Olympic qualifications for which the judges thought, “Wow, the lack of artistic interpretation here is as bad as putting a hand down on the beam.” The average routine got about 2.2 points in total e-score deductions on beam, and 2.0 points deducted on floor. That means that artistry deductions usually make up about 10% of the total deductions taken. It’s pretty low.
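The "about 10%" figure follows directly from the averages quoted above. A quick check, using only those numbers (the variable names are just for illustration):

```python
# Average artistry deduction and average total e-score deduction, per event,
# as quoted in the post.
averages = {
    "beam": (0.21, 2.2),
    "floor": (0.22, 2.0),
}

# Artistry's share of all deductions taken on each event.
shares = {event: art / total for event, (art, total) in averages.items()}

for event, share in shares.items():
    print(f"{event}: {share:.1%} of deductions are artistry deductions")
```

That works out to roughly 9.5% on beam and 11% on floor.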
The total artistry deductions are low in large part because of the limited deductions available in the Code of Points. Very few single artistry deductions can be greater than one tenth. If I were a gymnast or a coach, I wouldn't find many incentives in the Code to prioritize artistry. A world in which artistry deductions are very common, but also very low, is possibly the least desirable from the perspective of promoting artistry in the sport. Under this system, we are in essence saying that the current level of artistry in the sport is quite poor, but we don't care enough to incentivize gymnasts to improve.
I’m sure you’re all as curious as I was about the least artistic routine. That dubious honor goes, perhaps unsurprisingly, to Hong Un Jong’s floor routine. We all know why she went to Rio. It wasn’t to express the pain of her nation through dance.
Some Countries Are More Artistic Than Others
Time to open up a can of worms. For each country with a full team, I calculated the average artistry deduction across all gymnasts. Remember, these plots show deductions, so lower is better.
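A per-country average like this can be sketched in a few lines. Everything below is an assumption for illustration: the country codes and deductions are placeholders rather than the Rio data, and the `MIN_ROUTINES` cutoff is a hypothetical stand-in for "countries with a full team."

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (country, mean artistry deduction) rows, one per routine.
rows = [
    ("AAA", 0.10), ("AAA", 0.20), ("AAA", 0.15),
    ("BBB", 0.30), ("BBB", 0.25), ("BBB", 0.35),
    ("CCC", 0.05),  # a lone individual qualifier, not a full team
]

# Assumed cutoff for what counts as a full team in this toy data.
MIN_ROUTINES = 3

# Group the deductions by country.
by_country = defaultdict(list)
for country, deduction in rows:
    by_country[country].append(deduction)

# Average only the countries that meet the cutoff.
team_means = {
    country: mean(deductions)
    for country, deductions in by_country.items()
    if len(deductions) >= MIN_ROUTINES
}

# Lowest average deduction first, i.e. "most artistic" first.
ranking = sorted(team_means, key=team_means.get)
print(ranking)
```

Filtering before averaging keeps a single standout individual from topping a chart that is meant to compare teams.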
That’s right, Team USA was the most artistic on balance beam, and China was the least. The Americans also had the third most artistic floor routines, while Russia was in eighth. It’s just the numbers, guys, don’t shoot me. At least we can all agree that the Netherlands deserve to be up near the top.
I should point out how poorly Asian gymnasts are scored on artistry metrics. There's plenty of subjectivity involved, but I personally do not find that Japan has the worst performance quality on floor, and I have to wonder if some stereotypes about Asian personalities are at play.
However, there are also some surprising patterns here. We all know that classically trained European gymnasts with “nice lines” are too often thought of as intrinsically more artistic than gymnasts from a different mold. I've certainly noticed this pattern in listening to gymnastics commentary. However, while this notion is still very much present in the discourse, I’m not sure if it’s really being reflected in the scores. American routines certainly don’t follow that mold, and the Belgian routines are nothing if not atypical. Meanwhile, Team Russia is not being rewarded for their beautiful wrist flicks, arguably the closest thing we saw to classical style in Rio.
So does a team’s reputation play into their artistry scores? The answer is a definitive maybe.
There’s clearly more to be done with this data, but I’m going to save that for another day. Check back soon for more!