Score for Score

Posts tagged "Fun with Score Data"

Gymnastics fans love to debate the strategic merit of adding upgrades to a routine. How fierce is the competition? How consistent is the skill? When is it worth the risk? And the most tantalizing scenario is one in which a gymnast who played it safe during qualifications decides to go all-out for the finals, risking a fall for the extra few tenths in difficulty that could end up making all the difference. How common is this? I turned to the data to find out.

There’s an important reason to talk about this now: Baku. The 2020 Baku World Cup ended in a state of limbo: between qualifications and finals, the Azerbaijani government banned all sporting events in response to COVID-19, so finals never took place. A few weeks later, the FIG announced that scores from the qualification round of the 2020 Baku World Cup would be treated as the final results. The prize money, the medals, the glory, and — most importantly — the Olympic qualification points have been awarded on the basis of qualification scores. Gymnasts like Vanessa Ferrari and Emma Spence are sort of screwed by this decision.

A lot of questions can’t be answered fairly during a global pandemic, and “What to do about the aborted Baku World Cup?” is probably one of them. But the decision to treat qualification results as final struck many as particularly unfair because of gymnasts’ tendency to upgrade for finals. Gymnasts picked their qualification routines under the assumption that those routines would not determine their final placement, so the qualifications results don’t reflect a fair ranking of each gymnast putting her best foot forward. Given the chance to upgrade for finals, we might have a different set of athletes going to Tokyo 2021.

This context lends extra weight to the question: how often do gymnasts really upgrade for finals? I pulled data from the last five event world cups, and compared the difficulty scores that gymnasts got during qualifications and finals to determine which routines were upgraded. Of course, d-scores don’t perfectly measure intentional, strategic changes in routine composition: sometimes a d-score goes up because a gymnast got her double turn around a little further, or got her hips a little closer to a laid-out position. In other words, we don’t know if a higher d-score represents strategy or better execution. But it’s the best information we have.

Here are the main results:

In these five world cups, 22.5% of the routines performed during women’s event finals were more difficult than that gymnast’s qualifying routine. That suggests a lot of upgrades — or much higher levels of performance that cause way more skills to be credited. The average routine only increased in difficulty by 0.003 points between qualifications and finals, but 14.3% of routines increased in difficulty by more than 0.2 -- which can make a big difference.
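For anyone who wants to run the same kind of comparison, here is a minimal Python sketch of the calculation. The routine records and the field names (`qual_d`, `final_d`) are invented for illustration; they are not the actual world cup scores, and the 0.2 threshold is simply the cutoff mentioned above.

```python
# Hypothetical (qualification, final) d-scores; NOT real results.
routines = [
    {"gymnast": "A", "event": "UB", "qual_d": 5.8, "final_d": 6.1},
    {"gymnast": "B", "event": "BB", "qual_d": 5.6, "final_d": 5.6},
    {"gymnast": "C", "event": "FX", "qual_d": 5.4, "final_d": 5.2},
    {"gymnast": "D", "event": "VT", "qual_d": 5.4, "final_d": 5.4},
    {"gymnast": "E", "event": "BB", "qual_d": 5.9, "final_d": 6.0},
]


def summarize_upgrades(rows, big_jump=0.2):
    """Summarize how d-scores changed between qualifications and finals."""
    # Round to 3 decimals so floating-point noise doesn't create fake changes.
    changes = [round(r["final_d"] - r["qual_d"], 3) for r in rows]
    n = len(changes)
    return {
        "share_upgraded": sum(c > 0 for c in changes) / n,
        "share_downgraded": sum(c < 0 for c in changes) / n,
        "mean_change": sum(changes) / n,
        "share_big_upgrade": sum(c > big_jump for c in changes) / n,
    }


summary = summarize_upgrades(routines)
print(summary)
```

Note that a small average change can hide a lot of movement: upgrades and downgrades cancel each other out in the mean, which is why the shares are worth reporting separately.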

I also ran the same numbers for the men’s competition. For men, 30% of the routines in event finals were more difficult. The average MAG routine increased in difficulty by 0.02 points, but a full 22.5% of routines increased in difficulty by more than 0.2. I’m not surprised that it’s way more common for men to upgrade: there’s more of a culture of “chuck it and see what happens” in MAG.

The share of downgraded routines for both genders is quite surprising. You never hear talk of gymnasts intentionally playing it safe during event finals, so it’s likely the case that most of these “downgrades” occur due to poor execution: gymnasts don’t get credit for all their attempted turns, connections, twists, or body positions. To understand this a little better, I broke down the results by event.

First, both upgrades and downgrades are way less common on vault - you can always count on Maria Paseka to whip out something crazy for finals, but she’s the exception rather than the rule. Upgrades are also less common on bars. To my mind, there are fewer unintentional upgrades on bars and vault: there are no turns, fewer ambiguous body positions, no questionable connections, etc. This suggests that a lot of the increase in d-scores on the other events is just a result of better execution rather than strategic decision-making.

For men, we see the most upgrades on pommel horse and high bar. I don’t follow MAG so closely, but this result tallies with my impression of where and when we hear gymnasts talk about planned upgrades. The level of consistency on still rings is astonishing. Could that event get any more boring? (*ducks*)

In summary, we see a lot of cases where gymnasts perform a higher difficulty routine during event finals. If the gymnasts who qualified to finals in Baku would have upgraded their routines more than a fifth of the time, then it’s absolutely absurd to treat qualifications results as finals. However, we can’t be sure if all the upgrades in the data are due to strategy or simply better execution. There’s only one way to know for sure which gymnasts truly plan to upgrade for finals: ask them.

Tags: Fun with Score Data, Olympic Talk

As we draw closer to Worlds team selection, one question is being asked more and more: "Is she consistent?"

I've spent a lot of time thinking about how to measure consistency in gymnastics, and I've developed a metric for doing so. You'll see these consistency stats at the top of the page for every gymnast in my database who has enough information.

You can find the details behind this metric here, but there are two things that are important to understand. First, lower consistency stats are better; higher stats indicate a more inconsistent gymnast. Second, the consistency stats attempt to measure how well a gymnast hits in competition. That means we're looking at execution deductions and neutral deductions (out of bounds, over time, etc.) and not difficulty.

Below, I've pulled the current all-around consistency stats for ten top US elites. These draw upon all the scores in my database from the past year for which I have d-score information.

Gymnast          Country   Consistency Stat
Aleah Finnegan   USA       0.499
Grace McCallum   USA       0.146
Jade Carey       USA       0.207
Jordan Chiles    USA       0.319
Kara Eaker       USA       0.344
Leanne Wong      USA       0.109
Morgan Hurd      USA       0.203
Riley McCusker   USA       0.327
Simone Biles     USA       0.356
Sunisa Lee       USA       0.250

But with so much conversation around consistency ahead of team selection, I wanted to understand how these data-driven consistency metrics compare to the talk going 'round the gymternet. To do this, I shared a survey on Twitter asking fans to mark which gymnasts they think have a reputation for being inconsistent. The respondents are by no means a representative sample of any population -- but they are an awesome group of 173 people who helped make this post possible.

The chart below shows the gymnasts' inconsistency reputation rating: of those respondents who answered yes or no, how many thought, "Yes, this gymnast has a reputation for being inconsistent." For the raw survey results and more details on the survey methodology, see this companion post.

Already, we can see some differences between the Score for Score consistency stats and the gymternet's inconsistency reputation rating. To visualize the differences, I plotted both metrics on the same chart.

As we'd expect, gymnasts with a higher consistency stat tend to have a greater reputation for inconsistency amongst the gymternet. The correlation is 0.38. Squaring that gives about 0.14, so roughly 14% of the variation in gymnasts' inconsistency reputation ratings can be explained by the variation in their consistency stats.
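A correlation coefficient and the share of variation explained are related but different numbers: the share explained is the square of the correlation. Here is a small Python sketch of both, written by hand so that nothing beyond the standard library is needed. The two lists are made-up values, not the survey results or the stats above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five gymnasts; NOT the real data.
consistency_stat = [0.10, 0.20, 0.30, 0.40, 0.50]
reputation_rating = [0.20, 0.10, 0.40, 0.30, 0.50]

r = pearson_r(consistency_stat, reputation_rating)
r_squared = r ** 2  # share of variation explained in a simple linear fit

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")

# The same conversion applied to a correlation of 0.38:
share_explained = 0.38 ** 2
print(f"0.38 squared is {share_explained:.2f}")
```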

The correlation is lower than I expected, largely due to two gymnasts. In particular, practically no one would ever consider Simone Biles an inconsistent gymnast. And she's not inconsistent - in terms of her win record. However, when we drill down and look at her execution scores and neutral deductions, her results are sort of all over the map. And she might have ended up atop the podium, but she did have a very off day at Worlds last year. It all shows up in her scores.

Aleah Finnegan is also statistically less consistent than her reputation would suggest. I'm guessing that that's largely because she's new on the elite scene and many fans haven't been following her performances much until recently. Plus, there's probably a "my sister is Sarah Finnegan" bump when it comes to having a reputation for consistent hits.

Of course, the reputations might be capturing important information that the scores themselves leave out. Which gymnasts came through in competition but looked scary in podium training? Which ones weren't put up in the team final because they weren't hitting in practice? The scores don't directly reflect any of this information.

But as we look towards the Worlds team selection, it's worth considering the numbers when we talk about a gymnast's consistency. You can't always believe what you read on Twitter.

Tags: Fun with Score Data, Survey Says

In some ways, the US Classic is my favorite meet of the year. It’s the first time since Worlds that we see top US elites - that is, the best gymnasts in the world. And almost every year, we’re treated to the first showing of a long-dormant gymnast making her comeback to elite. Classics might not be the best meet on the calendar, but it’s always one of the most highly anticipated.

Of course, all this creates a strong temptation to over-indulge in our analysis of the meet’s results. Starved of scores for months, our appetites whetted by blurry Instagram training videos, we fall upon the results of Classics like a manic pack of wolves.

So it’s worth asking the question: do the gymnasts who end up on the Worlds team usually do well at Classics? Or is it just way, way too early to matter?

To answer these questions, I took a look back at the scores from 2010 through 2018, covering the "Secret Classic" era, the "Covergirl Classic" era, the "GK Classic" era, and even the sad, lonely, unsponsored 2017 "US Classic." In particular, I think it's useful to focus on how World and Olympic team members did at this meet.

First and foremost, no one has made a World or Olympic team without competing something at Classics. No one. This makes it highly unlikely that we’ll see a successful last-ditch effort from, say, Jordan Bowers or Laurie Hernandez.

However, it is by no means important to compete the all-around. No more than two-thirds of the World or Olympic team has ever competed the all-around at Classics. In 2015, four of the six team members competed the all-around - all except Brenna Dowell and Madison Kocian. But in 2016, Aly Raisman was the only member of the Final Five who had all four events ready for Classics. The other years fall somewhere in between these extremes.

Before looking at the data, I would have said that there are two models of gymnasts who don’t do AA at Classics. The first are those like Simone Biles whose spots on the team are safe as long as they don’t get injured. For these gymnasts, staying healthy is the priority, so getting into competition shape on all four events a month before nationals isn’t really worthwhile. The second are event specialists like Madison Kocian and Alicia Sacramone, who don’t bother preparing routines that won’t impact their fate during team selection.

However, it’s actually somewhat common for edge-case, second-tier all-rounders to compete two or three events at Classics and still make the team. For example, Gabby Douglas didn’t compete the all-around in 2011, nor did Laurie Hernandez in 2016 or Morgan Hurd in 2017. So really, it’s not worth reading much into who has put together all four events by Classics.

On the other hand, almost every gymnast who ends up on the team is able to show at least one world-caliber routine at Classics. There has never been more than one or two gymnasts on the team who didn’t get on the podium for at least one event at Classics. (For vault, I’m including cases where the gymnast had a top three single vault score, since only one vault is counted in team competition.) And in 2012 and 2017, every single gymnast on the Worlds/Olympic team came away from Classics with an event medal.

Finally, there is basically always someone who does well in the all-around at Classics but doesn’t end up on the Worlds team. Only once, in 2015, did all three of the top all-rounders at Classics - Biles, Raisman, and Nichols - make it to Worlds. At the other extreme, in 2017, not many did the all-around. None of the top all-rounders - Alyona Shchennikova, Abby Paulson, Kalyany Steele, and Luisa Blanco - were even really in the conversation for Montreal.

Now, this year could be different. I think we can all agree that the level of competition at Classics was higher than we’ve come to expect in the past. Several gymnasts had what was arguably their best competition; Riley McCusker and Grace McCallum come to mind. With gymnasts closer to peak form, this meet might be a better signal of how things will look in October than it has been in the past.

There are, however, some early signs that Tom Forster places even less emphasis on Classics than his predecessors did. Neither McCallum nor Eaker had a good showing at Classics last year, finishing off the podium on every single event. Nevertheless, both earned Worlds spots with strong showings at Pan Ams, Nationals, and the selection camp.

Team USA is as deep as it's ever been. While the girls who came out on top this weekend could easily win a gold medal at Worlds, they may not be the top contenders come October. And while I don't expect to see anyone on the Worlds team who didn't show at least one internationally competitive routine at Classics, I would be shocked to see everyone who had a good day at Classics make it to Stuttgart.

If you’re interested in the full data, I’ve put everything in a Google Sheet. Feel free to check it out!

Tags: Fun with Score Data

Yesterday, in Part I of my mini-series on artistry in gymnastics, I gave an overview of what artistry deductions looked like at the Rio Olympic Games. Today, I’d like to delve a little deeper and examine a different question: do the scores suggest that the Code of Points - and the judges who apply it - have a coherent understanding of artistry?

If we have a coherent understanding of artistry, then judges should be able to take artistry deductions consistently and objectively regardless of their subjective taste. There are plenty of masterpieces out there that I can recognize as great works of art even though I do not find them pleasing or moving - Picasso’s Les Demoiselles d'Avignon and Stravinsky’s Rite of Spring come to mind. In other words, they have a high level of artistry even though they are not to my taste.

Gymnastics judges are not supposed to answer the question, “Is this art to my taste?” Instead, they are simply supposed to determine, “Is this art?” Now, I’m certainly not going to venture a definition of art myself— in other words, I’m not a college junior sitting in some art history seminar. But what I am going to do is examine these artistry deductions for the sort of consistency we would expect if all the judges were working off a similar implicit understanding of what is and is not art in gymnastics.

Judges Seem to Agree About Artistry Deductions

If different judges give the same routine wildly different artistry deductions, then they clearly do not have a common understanding of each deduction. If they do have that common understanding, then we’d expect to see them all take about the same number of points.

To measure this, I took the standard deviation of all five judges’ scores for each floor routine. (I’m only going to look at floor, because floor routines tend to be more varied and controversial.) You can see the distribution of the standard deviations in the figure below.

So most of the time, the average judge takes deductions that are about half a tenth off from the average deduction across all judges. In other words, the judges pretty much agree on the total amount of artistry deductions. (Of course, the average gymnast gets two tenths off in artistry deductions, so half a tenth is not nothing - but the disagreement between judges is ultimately not really impacting her score.)
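The spread measure described above can be sketched in a few lines of Python. The panel values below are invented, and I'm using the population standard deviation (`statistics.pstdev`) as an assumption about which flavor of standard deviation fits best; the sample version (`statistics.stdev`) would give slightly larger numbers.

```python
from statistics import pstdev

# Hypothetical artistry deductions from a five-judge panel; NOT real scores.
panel_deductions = {
    "Routine 1": [0.2, 0.2, 0.3, 0.2, 0.1],
    "Routine 2": [0.0, 0.0, 0.4, 0.0, 0.1],  # one judge far from the rest
    "Routine 3": [0.3, 0.3, 0.3, 0.3, 0.3],  # perfect agreement
}

# Standard deviation across the five judges, one value per routine.
spread = {name: pstdev(scores) for name, scores in panel_deductions.items()}

# The routine the judges disagreed about the most.
most_disputed = max(spread, key=spread.get)

for name, sd in spread.items():
    print(f"{name}: {sd:.3f}")
print("Most disputed:", most_disputed)
```

A single outlying judge, as in the second made-up routine, is enough to push a routine to the top of the disagreement list.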

Interestingly, the most controversial routine - the one with the largest spread between the judges’ artistry scores - was that of Catalina Elena Escobar Gomez of Colombia, who got injured during her routine in Rio and had to leave the floor. One judge took a full 0.4, while the rest took basically nothing, presumably out of pity. I admire that one asshole judge’s commitment to accurate scoring over feelings.

So if judges seem to have a coherent understanding of artistry, can we get any closer to figuring out what that understanding is?

Is Artistry Just Execution?

One strong possibility is that judges fall prey to the common tendency to blur artistry and execution. You see this sometimes in the discourse around certain gymnasts; for example, some people seem to think that America hasn’t had a truly artistic gymnast since Nastia Liukin.

Below, I’ve plotted the artistry deductions versus the other execution deductions on beam and floor. As you can see, the two are very closely linked. The correlation between artistry deductions and execution deductions on beam is 0.71. Roughly speaking, squaring that figure tells us that about 50% of the variation in beam artistry deductions can be explained by the variation in beam execution deductions, which is very high. On floor, the correlation is 0.50, which works out to about 25% of the variation.

Now, it’s totally possible that artistry and execution are two totally separate concepts and, for whatever reason, gymnasts who are good at one tend to be good at the other. When you listen to the way Eythora Thorsdottir talks about gymnastics, it’s clear that her love of beauty in sport drives her both to perfect her extension and toe point and to treat us to an expressive performance. While some expressive gymnasts can’t quite figure out their landings and can’t stay on the beam, I could not think of anyone with fantastic performance quality but a lot of built-in form deductions.

But I don’t think the two are inextricably linked. While no highly artistic gymnast has poor execution, plenty of gymnasts with flawless execution have very little artistry. I’m not going to name names, but you all know the type, and there were plenty in Rio.

If that’s the case, we should expect to see a lot of points in the bottom right corner of these graphs - high artistry deductions, but low execution deductions. I don’t see a single point there at all.

All this leads me to conclude that the shared common understanding of artistry promoted by the Code of Points is “consistently lovely execution that contributes to a nice overall look of the routine.”

I don’t know exactly what we are scoring when we assign points to artistry in gymnastics. And I don’t know exactly what we should be scoring, either. But I think most gymnastics fans agree that there’s a gap between what we have and what we want.

Tags: Fun with Score Data, Olympic Talk, On Artistry

On Wednesday night, I came across something wonderful: Matthew Rusk (@mrusskie93) had linked to the full results from the Rio Olympic Games. Among other things, this contains a breakdown of the e-score deductions for each gymnast into “execution” deductions and “artistry” deductions. And, after way more PDF scraping than I want to think about, I managed to get some delicious data out of it!

What Are Artistry Deductions?

The Code of Points specifies certain e-score deductions as “artistry deductions” on floor and balance beam, the two events where there is time and space for presentation and choreography.

On beam, one-tenth deductions can be taken for a lack of confidence, lack of personal style, insufficient variation in rhythm & tempo, and lack of fluidity. On floor, there are even more such deductions. These include lack of expressiveness, gestures that don’t correspond to the music, failure to engage the audience, and inability to show a theme or character. There are further deductions for composition and musicality.

Now, you may be wondering, “What’s personal style?” “What’s fluidity?” On these topics, the Code is silent. Judges are expected to know it when they see it.

I’ve long thought that the division of points into a difficulty score and an execution score was a bit of a clever trick on the part of the FIG. By grouping together objective deductions, like “that split did not hit 180,” and subjective deductions, like “that routine lacked expressiveness,” the Code of Points manages to mask much of the subjectivity involved in judging. But in this remarkable pdf, we have everything clearly separated out.

Artistry Deductions Are Common, But Small

Let’s start out with some summary statistics. To be honest, before looking at these scores, I had no idea whether artistry deductions actually get taken - I’ve heard that it’s very common, and I’ve also heard it’s incredibly rare.

In the analysis below, I’m just going to use the scores from qualification, since that lets us look at the largest number of gymnasts. I’m also going to use the mean deduction taken by all five e-panel judges - no dropping the highest and lowest.
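To make that averaging choice concrete, here is a short Python sketch with an invented five-judge panel. The trimmed version is included only for contrast: it is my assumption about what dropping the highest and lowest marks would look like, not a description of how any official score is calculated.

```python
def plain_mean(deductions):
    """Average of every judge's deduction (the approach used in this post)."""
    return sum(deductions) / len(deductions)


def trimmed_mean(deductions):
    """Average after dropping one highest and one lowest mark (for contrast)."""
    middle = sorted(deductions)[1:-1]
    return sum(middle) / len(middle)


# Hypothetical artistry deductions from five judges; NOT real scores.
panel = [0.0, 0.1, 0.1, 0.2, 0.4]

all_five = plain_mean(panel)
middle_three = trimmed_mean(panel)

print(f"Mean of all five judges: {all_five:.3f}")
print(f"Mean of middle three:    {middle_three:.3f}")
```

Keeping all five marks means a single harsh or generous judge still moves the average, which is useful when the goal is to see how the whole panel behaved.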

Not a single gymnast escaped without at least one judge taking an artistry deduction. However, some got close: four judges found nothing to take from Nina Derwael’s floor routine, and three judges found nothing to take from Laurie Hernandez’s beam and Marine Brevet’s floor. In fact, a lot of gymternet favorites — floor routines from Eythora Thorsdottir, Rune Hermans, and Laurie Hernandez, as well as beam routines from Simone Biles, Sanne Wevers, and Pauline Schaefer - got an average deduction of less than a tenth across the five judges.

But that’s not so common. The average gymnast lost 0.22 points on floor and 0.21 points on beam due to artistry deductions.

You can see the spread in the figure above. Most gymnasts are getting under three tenths in artistry deductions, though some are hit a little harder.

But, for perspective, there wasn’t a single routine from Olympic qualifications for which the judges thought, “Wow, the lack of artistic interpretation here is as bad as putting a hand down on the beam.” The average routine got about 2.2 points in total e-score deductions on beam, and 2.0 points deducted on floor. That means that artistry deductions usually make up about 10% of the total deductions taken. It’s pretty low.

The total artistry deductions are low in large part because of the limited deductions available in the Code of Points. Very few single artistry deductions can be greater than one tenth. If I were a gymnast or a coach, I wouldn't find many incentives in the Code to prioritize artistry. A world in which artistry deductions are very common, but also very low, is possibly the least desirable from the perspective of promoting artistry in the sport. Under this system, we are in essence saying that the current level of artistry in the sport is quite poor, but we don't care enough to incentivize gymnasts to improve.

I’m sure you’re all as curious as I was about the least artistic routine. That dubious honor goes, perhaps unsurprisingly, to Hong Un Jong’s floor routine. We all know why she went to Rio. It wasn’t to express the pain of her nation through dance.

Some Countries Are More Artistic Than Others

Time to open up a can of worms. For each country with a full team, I calculated the average artistry deduction across all gymnasts. Remember, these plots show deductions, so lower is better.
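The per-country averaging is a simple group-and-mean. Here is a minimal Python sketch of it; the country codes ("AAA", "BBB") and deduction values are placeholders, not the Rio data.

```python
from collections import defaultdict

# Hypothetical (country, event, mean artistry deduction) rows; NOT real data.
rows = [
    ("AAA", "beam", 0.10),
    ("AAA", "beam", 0.20),
    ("BBB", "beam", 0.30),
    ("BBB", "beam", 0.30),
    ("AAA", "floor", 0.25),
    ("BBB", "floor", 0.15),
]


def team_averages(rows, event):
    """Average artistry deduction per country on one event, lowest first."""
    by_country = defaultdict(list)
    for country, row_event, deduction in rows:
        if row_event == event:
            by_country[country].append(deduction)
    averages = {c: sum(v) / len(v) for c, v in by_country.items()}
    # Lower deductions are better, so sort ascending.
    return sorted(averages.items(), key=lambda item: item[1])


beam_ranking = team_averages(rows, "beam")
floor_ranking = team_averages(rows, "floor")

print(beam_ranking)
print(floor_ranking)
```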

That’s right, Team USA was the most artistic on balance beam, and China was the least. The Americans also had the third most artistic floor routines, while Russia was in eighth. It’s just the numbers, guys; don’t shoot me. At least we can all agree that the Netherlands deserve to be up near the top.

I should point out how poorly Asian gymnasts are scored on artistry metrics. There's plenty of subjectivity involved, but I personally do not find that Japan has the worst performance quality on floor, and I have to wonder if some stereotypes about Asian personalities are at play.

However, there are also some surprising patterns here. We all know that classically trained European gymnasts with “nice lines” are too often thought of as intrinsically more artistic than gymnasts from a different mold. I've certainly noticed this pattern in listening to gymnastics commentary. However, while this notion is still very much present in the discourse, I’m not sure if it’s really being reflected in the scores. American routines certainly don’t follow that mold, and the Belgian routines are nothing if not atypical. Meanwhile, Team Russia is not being rewarded for their beautiful wrist flicks, arguably the closest thing we saw to classical style in Rio.

So does a team’s reputation play into their artistry scores? The answer is a definitive maybe.

There’s clearly more to be done with this data, but I’m going to save that for another day. Check back soon for more!

Tags: Fun with Score Data, Olympic Talk, On Artistry