Score for Score

Posts tagged "Score for Score Business"

Is there racial bias in gymnastics judging? Almost certainly. Gymnastics judges fall prey to other forms of bias, and there’s racial bias in most things. But we have no evidence to prove it.

That’s because it’s really hard to study racial bias in a subjectively judged sport. And it’s even harder when you have no data on race. I can’t just sit down, crunch the numbers, and measure racial bias in gymnastics judging today. But I can describe what I would do if I had data or research funding. And I can describe what I wouldn’t do, too.

For the sake of simplicity, I’m going to talk about bias against black athletes in this article, though I want to emphasize the likelihood of racial bias against gymnasts of color in general as well as pernicious stereotypes about Asian gymnasts in particular. I should also mention that I’m not going to talk about the myriad other ways that racism impacts this sport, most of which are probably more important than the scoring. And I’m just going to talk about how to measure racial bias, not how to solve it. This is a narrow article on a narrow topic.

It’s also a wordy article, but please bear with me — because if you’re going to study racial bias, it’s important to do it right.

The Top Priority: Collect and Release Data on Race

There is currently no publicly available source of data on gymnasts’ races. Collecting and releasing this data is a necessary prerequisite to any attempt to study racial bias. You can’t analyze what you don’t measure.

Fortunately, collecting this data is theoretically straightforward. On the form a gymnast uses to register for a competition, just add a question about race. Add one about age, sexual orientation, and other sensitive characteristics while you’re at it. Whatever software is being used to store information about a gymnast’s name and country can store information about her race too. The FIG can require that such data is collected at all FIG competitions; national federations can do the same.

Releasing the data is just as important as collecting it. Wherever competition results are posted, there should be information on the demographic characteristics of gymnasts posted as well. Ideally, this data should be in a machine-readable format like an API or a CSV. But collecting the data and keeping it locked up is almost as bad as not collecting it at all. (There should always be a “prefer not to say” option on forms for gymnasts who would rather not have their information shared.)

I cannot stress this enough. Every single method I’m going to discuss below requires data on the race of gymnasts. Some require data on the race of judges as well. This data is crucial.

Defining Racial Bias in Judging

By defining racial bias in gymnastics, we can get a better sense of why it’s so hard to study.

A judge is biased if she scores a routine differently based on the race of the gymnast performing it. In an ideal world, we’d be able to ask a judge to score a routine performed by a white gymnast, then score the exact same routine with the exact same execution performed by a black gymnast. But of course, this is impossible: no two performances are ever exactly the same. Statisticians call this the fundamental problem of causal inference.

Let’s try a weaker definition of racial bias that lets us avoid this problem. A judge is biased if the discrepancies between her score and the correct score are different for gymnasts of different races. In other words, a judge who systematically overscores white athletes or systematically underscores black athletes is biased. This is a weaker definition because it does not let us prove definitively that race is the factor leading to over/underscoring. It’s possible that judges who underscore black people are really biased against something correlated with but distinct from race, such as hip hop floor routines. But even so, I’m comfortable using this weaker definition because judges are not allowed to be biased about anything. There should be absolutely no characteristic, no matter how frivolous, for which it is okay to observe a systematic discrepancy between a judge’s score and the correct score. If we measure some arbitrary form of bias that is correlated with race and leading to the underscoring of black gymnasts, we should still try to eliminate whatever form of bias that is.

But even this second-best definition is impractical. In an ideal world, we’d be able to compare the scores a judge gives in competition to the true score that routine deserves. It’s possible to identify the objectively correct set of deductions — I can freeze a video, pull out a protractor, and tell you for sure if a split hit 180 degrees — but such careful scores don’t exist for the vast majority of routines. All we have is the scores given at the competition, which may themselves be biased. (Whether there’s an objectively correct way to score artistry is also a big question, but one for a different time.)

We cannot perfectly measure bias in real world competitions. But by keeping these definitions in mind, we can get close.

Comparing Scores By Race Doesn’t Measure Bias

Now that we’ve talked about what racial bias is, I want to clarify what it is not. A judge is not biased just because she tends to give white gymnasts higher scores than black gymnasts. Without knowing anything about the quality of the routines, we simply cannot cry bias.

If we had data on gymnasts' races, it would be very tempting to simply compare the average score for black gymnasts to the average score for white gymnasts. This method, referred to elsewhere as disparate impact analysis, does not work in this context. Simply observing that black gymnasts get lower scores than white gymnasts is not evidence of racial bias. Importantly, observing that black gymnasts get higher scores than white gymnasts is also not evidence that the system is unbiased.

This principle is clear when we consider other forms of bias. The fact that the average US Worlds team outscores the average Australian Worlds team is not evidence of nationality bias against Australia. The US gymnasts are simply performing better routines. In other words, the size of the gap between the American scores and the Australian scores is not itself a measure of bias — even if that bias really exists.

Don’t fall into the trap of using disparate impact analysis. It’s not helpful.
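To make the point concrete, here is a toy sketch with made-up numbers. The group labels, the scores, and the `mean_by_group` helper are all hypothetical, and the "true" scores are assumed to be known, which is not the case for real competition data. The judging below is unbiased by construction, yet the raw gap between groups is large.

```python
from statistics import mean

# Hypothetical toy data: (group, true_score, awarded_score).
# The judging is unbiased by construction: awarded == true for every routine.
routines = [
    ("A", 8.9, 8.9),
    ("A", 8.7, 8.7),
    ("B", 8.3, 8.3),
    ("B", 8.1, 8.1),
]

def mean_by_group(rows, values):
    """Average the given per-routine values within each group."""
    groups = {}
    for (group, _, _), value in zip(rows, values):
        groups.setdefault(group, []).append(value)
    return {group: mean(vals) for group, vals in groups.items()}

awarded = [a for _, _, a in routines]
errors = [a - t for _, t, a in routines]  # awarded minus true score

awarded_means = mean_by_group(routines, awarded)
error_means = mean_by_group(routines, errors)

print("Mean awarded score by group:", awarded_means)
print("Mean scoring error by group:", error_means)
```

Group A outscores group B, but the mean scoring error is zero for both groups. The gap in awarded scores reflects routine quality here, not bias.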

What Would Work

So what would good research actually look like? I’ve got ideas.

First, we could measure over- and underscoring by race. To do this, we’d take routines from a given competition and ask a separate set of judges to score them as perfectly as possible, using video replay and as much time as needed. (This method rests on the assumption that it’s possible for judges to arrive at an objective score at all — an assumption intrinsic to the Code of Points.) Then we’d compare those scores to the scores actually given on the day of competition, and calculate the average discrepancy by race. This is intuitively simple and thus compelling.
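As a rough sketch of that comparison, assuming the careful re-scores already exist: the records, field names, and the `mean_discrepancy_by_race` helper below are all hypothetical, and the numbers are invented purely to show the arithmetic.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: the score awarded on the day ("competition") and the
# score from a careful video re-score ("reference") for each routine.
routines = [
    {"race": "black", "competition": 13.10, "reference": 13.30},
    {"race": "black", "competition": 12.90, "reference": 13.00},
    {"race": "white", "competition": 13.40, "reference": 13.30},
    {"race": "white", "competition": 12.80, "reference": 12.80},
]

def mean_discrepancy_by_race(rows):
    """Average (competition - reference) per race.

    Negative values mean the group was underscored on average,
    positive values mean it was overscored.
    """
    by_race = defaultdict(list)
    for row in rows:
        by_race[row["race"]].append(row["competition"] - row["reference"])
    return {race: mean(diffs) for race, diffs in by_race.items()}

discrepancies = mean_discrepancy_by_race(routines)
gap = discrepancies["white"] - discrepancies["black"]

for race, value in discrepancies.items():
    print(f"{race}: {value:+.3f}")
print(f"difference in average discrepancy: {gap:+.3f}")
```

A real analysis would also need a measure of uncertainty (for example, a confidence interval on the difference), since a handful of routines can produce a gap by chance.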

Second, we could instead treat the scores that judges give gymnasts of their own race as unbiased. In other words, we could make the assumption that black judges are not racially biased either for or against black gymnasts, and so on. We would then compare the scores that black judges give black gymnasts to scores that other judges give black gymnasts to evaluate the presence of bias. This method is nice because it scales quite easily: we can analyze scores that already exist instead of re-scoring routines. (For that reason, similar methods have been used to evaluate nationality bias.) However, it relies on the existence of a racially diverse pool of judges.
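One way this comparison could be sketched, assuming per-judge scores and judge race were available: the tuple layout, the data, and the `own_race_gap` helper are hypothetical. Comparing within each routine keeps routine quality from leaking into the estimate.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-judge scores: (routine_id, gymnast_race, judge_race, e_score).
scores = [
    ("r1", "black", "black", 8.5),
    ("r1", "black", "white", 8.3),
    ("r1", "black", "white", 8.4),
    ("r2", "black", "black", 8.0),
    ("r2", "black", "white", 7.9),
    ("r3", "white", "black", 8.6),  # no same-race judge on this panel
]

def own_race_gap(rows, race):
    """Mean of (other-race judges' average - same-race judges' average),
    computed per routine for gymnasts of the given race.

    Only routines scored by both kinds of judge are used. Returns None
    when no routine qualifies.
    """
    same, other = defaultdict(list), defaultdict(list)
    for routine_id, gymnast_race, judge_race, score in rows:
        if gymnast_race != race:
            continue
        bucket = same if judge_race == race else other
        bucket[routine_id].append(score)
    gaps = [mean(other[r]) - mean(same[r]) for r in same if r in other]
    return mean(gaps) if gaps else None

black_gap = own_race_gap(scores, "black")
white_gap = own_race_gap(scores, "white")
print("gap for black gymnasts:", black_gap)
print("gap for white gymnasts:", white_gap)
```

A negative gap would mean other-race judges scored those routines lower than same-race judges did. The `None` result for the second group shows the method's dependence on a diverse judging pool.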

If score receipts that itemize deductions were available, we could use either of these methods to see how the type of deduction taken compares to the appropriate deduction. For example, black gymnasts have been stereotyped as powerful but inelegant tumblers. Using video recordings to determine how much height a gymnast gets on a skill, we could test if black gymnasts are less likely to receive amplitude deductions than other gymnasts. We could also test if they are more likely to receive form deductions.

There’s also a whole realm of laboratory experiments that you could conduct to evaluate judges’ bias outside of a competition setting. We talked before about how you can’t really observe two different gymnasts doing the same routine, but you can if you create computer animations of gymnasts. Judges could be randomly assigned to score the same animated routine performed by either a black or a white gymnast, allowing us to measure whether the black gymnast received worse scores for an identical routine. Video editing software could also be used to alter a recorded routine to change the race of an athlete, allowing for a similar experiment.
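A minimal sketch of how such an experiment could be set up and analyzed, under strong simplifying assumptions: the judge IDs are hypothetical, the helper names are invented, and the "scores" are simulated so that the black version is scored 0.2 lower. No real data is involved.

```python
import random
from statistics import mean

def assign_versions(judge_ids, seed=0):
    """Randomly split judges in half; each half sees one version of the routine."""
    rng = random.Random(seed)
    shuffled = list(judge_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {j: ("black" if i < half else "white") for i, j in enumerate(shuffled)}

def estimated_effect(assignment, scores):
    """Mean score for the black version minus mean score for the white version."""
    black = [scores[j] for j, v in assignment.items() if v == "black"]
    white = [scores[j] for j, v in assignment.items() if v == "white"]
    return mean(black) - mean(white)

judges = ["j1", "j2", "j3", "j4", "j5", "j6"]
assignment = assign_versions(judges)

# Simulated responses: every judge who sees the black version gives 8.4,
# every judge who sees the white version gives 8.6.
simulated_scores = {j: (8.4 if v == "black" else 8.6) for j, v in assignment.items()}

effect = estimated_effect(assignment, simulated_scores)
print(f"estimated effect of the black version: {effect:+.2f}")
```

Because the version each judge sees is assigned at random, differences between the two groups of judges should average out, which is what lets a simple difference in means stand in for the causal effect.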

There are also many qualitative methods that focus on evaluating judges’ latent attitudes that may ultimately impact how they score. For example, judges could be shown routines by athletes of different races and give verbal responses to questions about those routines. We could then analyze those responses to see if different language is used to describe routines from black athletes, especially along the lines of common racial stereotypes.

I’m sure there are many more methods out there, and I’d encourage you to leave your proposals in the comments.

In Summary

I bet you can guess my number one recommendation: collect and release information on the race of gymnasts and judges.

Beyond that, researchers interested in racial bias should be sure that they have a clear operational definition of bias and a research design that addresses that definition. Using bad research to advocate for gymnasts of color simply provides ammunition for those who are resistant to change.

Of course, any conversation around racial bias in judging has to occur as part of a larger conversation of racial bias in the sport. I look forward to seeing what form that conversation takes.

Tags: Score for Score Business

I love scores. I love data-driven decision-making. I’m a data scientist and I run a website with the word “score” in the name. Twice.

But I think we need to talk about the limits of using scores to select teams.

As the US selected its 2019 Worlds team, I saw more discussion of the highest-scoring team based on past scores than I’ve ever seen before. Gymnerd after gymnerd came to the table with quantitative evidence supporting their team predictions. I think this is fantastic, and I hope that the tools on my site helped fuel these conversations in some small way.

But in gymnastics, as in all other fields, there is a downside to arguments based on data. We all share a strong tendency to treat numbers as objective. This is a bad tendency. Numbers mask the value judgements that lie beneath all — and I do mean all — data-driven arguments.

And when we start believing that our subjective, values-driven arguments are actually the cold, hard, truth, we cling to them more strongly and become less tolerant of good-faith debates about who should make a team.

So let’s take a minute to remind ourselves of all the reasons why data-driven team selection is not an objective exercise. While I’m sure each and every one of you knows these reasons already, it’s worth reviewing them time and time again to strengthen the instinct for skepticism of data analysis.

First, scores come from judges. Judges are human. Humans are biased. Enough said.

Second, scores measure an infinitesimally small percentage of the total gymnastics that a gymnast does in any given year. A score can tell you whether a gymnast crashed her vault, but it can’t tell you whether she does that half the time in training or if this was a fluke caused by weird arena lighting. Of course, if I had to pick between information on how a gymnast does in a competition setting and information on how she does in a practice setting, I’d pick the competition information — but ideally I’d like both.

Third, gymnasts’ scores are intentionally varied throughout the season, making past performance an imperfect predictor of future performance. Most gymnasts intentionally wait until later in the year to roll out their upgrades and get in top shape, but they all do so to a different extent. We don’t have systematic information on any of this.

Finally, and most importantly, there are countless ways to cut the data. Here is a small fraction of the justifiable ways to pick the US team using data:

  • Take each gymnast’s mean score across all competitions this year, and pick five individuals who have the highest three-up, three-count score.

  • Do that same thing dropping the low score and the high score from the mean.

  • Do that same thing using medians instead of means.

  • Do that same thing using each gymnast’s best score all year instead of medians.

  • Do that same thing using only scores from FIG competitions.

  • Take each gymnast’s mean score across all competitions this year, and pick the team that scores the highest when you replace the top-scoring routine on each event with the 4th-best routine on each event.

  • Do that using any of the alternative metrics listed above.

  • Use execution scores to determine whether each routine was “hit” or not, and limit the team contenders to gymnasts with hit rates above a certain threshold.

  • Look at each gymnast’s best all-around score from this year, and pick the top five best all-rounders.

  • Compare each gymnast’s top score on each event to the top scores of other international gymnasts, and pick the team that is capable of making the most event finals as long as they have three routines of any quality to use during team finals.

  • Take a weighted average of each gymnast’s scores, giving more weight to more recent competitions. Then pick the best three-up, three-count team.

Et cetera.

Each of these methods is equally objective — which is to say, not very objective at all. They each reflect a different subjective perspective on what is valuable in a member of the US Worlds team.
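To see how much the choice of metric matters, here is a toy sketch. The gymnasts "A" through "D", their scores, and the helper functions are all hypothetical, and the demo picks a three-person team from four candidates only to keep the data short (the functions default to a team of five).

```python
from itertools import combinations
from statistics import mean

EVENTS = ("VT", "UB", "BB", "FX")

def summarize(scores_by_event, metric):
    """Collapse a gymnast's season into one number per event."""
    return {event: metric(scores) for event, scores in scores_by_event.items()}

def three_up_three_count(team, summaries):
    """Sum the three best summarized scores on each event."""
    total = 0.0
    for event in EVENTS:
        event_scores = sorted(
            (summaries[g][event] for g in team if event in summaries[g]),
            reverse=True,
        )
        total += sum(event_scores[:3])
    return total

def best_team(season, metric, team_size=5):
    """Brute-force the team with the highest three-up, three-count total."""
    summaries = {g: summarize(events, metric) for g, events in season.items()}
    return max(
        combinations(sorted(season), team_size),
        key=lambda team: three_up_three_count(team, summaries),
    )

# Hypothetical season: A, B, C are steady; D has big highs and big lows.
season = {
    "A": {"VT": [14.0, 14.0], "UB": [13.5, 13.5], "BB": [13.0, 13.0], "FX": [13.5, 13.5]},
    "B": {"VT": [13.8, 13.8], "UB": [13.4, 13.4], "BB": [13.2, 13.2], "FX": [13.4, 13.4]},
    "C": {"VT": [13.6, 13.6], "UB": [13.2, 13.2], "BB": [13.0, 13.0], "FX": [13.2, 13.2]},
    "D": {"VT": [14.4, 12.0], "UB": [13.9, 12.5], "BB": [13.6, 11.8], "FX": [13.8, 12.4]},
}

team_by_mean = best_team(season, mean, team_size=3)
team_by_best = best_team(season, max, team_size=3)
print("Using mean scores:", team_by_mean)
print("Using best scores:", team_by_best)
```

With these invented numbers, averaging leaves the inconsistent gymnast off the team and using best scores puts her on it. Neither answer is more "correct"; each encodes a different view of what matters.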

So how should we be using scores to select teams? That’s up to you — it depends on your beliefs about how teams should be selected.

But I do have ideas about how we should talk about data-driven team selection. In particular, we should spend a bit more time articulating the values beneath our analyses.

For example, “I think that even one fall in the past season constitutes evidence of inconsistency. As a result, I’ve chosen not to drop any scores when calculating each gymnast’s average score.”

Or maybe, “I believe in giving gymnasts the benefit of the doubt about their performances early in the season, well before they need to peak. As a result, I’ve chosen to only look at scores from US Nationals and Worlds Trials and I’ve dropped Classics and Pan Ams.”

Or how about, “I don’t think that the US needs to worry about their team qualification performance, so I’m picking the team that scores the highest in a three-up, three-count team final situation even though that leaves them with a weak routine as the fourth-best in qualifying.”

Or consider, “I don’t trust judges to score consistently across different competitions, and in particular I think it’s unfair to compare domestic scores to international scores. So I’m going to consider Classics and Championships, where all the contenders competed in front of the same panel, but not Pan Ams or Jesolo.”

These underlying assumptions are things we can debate, and I bet those debates would be interesting and productive. Most importantly, they would be debates divorced from our personal feelings about individual gymnasts, feelings that enter into data-driven team predictions way more than they should.

While I'm thrilled that more people are engaging with data-driven gymnastics analysis, I worry that data will make us think that we're objectively right in our analysis. Quantitative results reflect our underlying values. Keep that in mind.

Tags: Score for Score Business

Today, I’m celebrating Score for Score’s first birthday! One year ago today, I pushed my local repo to Heroku and this website was born.

I’ve learned a lot about blogging in the past year - mostly, that it’s hard. I have a lot of ideas, but it turns out that turning an idea into a blog post takes a lot more time and effort than turning an idea into a tweet. I haven’t posted as often as I’d hoped, but I’m pretty happy with the posts I have put out.

So I wanted to take this opportunity to pull out my top ten most popular posts from the past year. New readers, I hope you’ll take a dive into the archives! My favorite posts aren’t always my most popular posts, but I hope you’ll enjoy!

  1. The Olympic Boost

    Team Japan has looked great in the lead-up to Tokyo. Is this just a coincidence, or does hosting the Olympics tend to give teams a boost?

  2. Skinner’s Path to Tokyo

    The second I saw that MyKayla Skinner had announced her return to elite, I dropped what I was doing to write up her prospects. I’m a grad student. I can do that. Since then, I’ve been impressed by the speed of her comeback. I never said her path to Tokyo would be easy, but I still believe it exists.

  3. Artistry at the Olympics: A Quantitative View (Part I)

    In this post, I took a deep dive into data from the Rio Olympics that separates out artistry deductions from other execution deductions. Armed with this data, we can finally get a sense for how judges use the artistry portion of the Code of Points.

    And if you liked this post, you’ll like Part II even more — even though that post didn’t crack the top ten!

  4. Who's Inconsistent? Reputation vs. Statistics

    In this post, I asked gym nerds which gymnasts they think are inconsistent, and compared their answers to my quantitative measure of a gymnast’s consistency (see #7 below!). The gymternet is spot on about most gymnasts… with one very notable exception!

  5. Chart of the Day: Age of Gymnasts at Worlds

    Here, I highlight a chart from the FIG on the ages of gymnasts at the 2017 World Championships. I love featuring charts of the day and I’m always on the lookout for more cool gymnastics graphs, so if you find one, pass it along!

  6. Review: Do Gymnasts Make Better Judges?

    This is the most popular post from my From the Academy series, where I review academic papers on scoring in gymnastics. In this one, the author tests whether former gymnasts judge split leaps more accurately than mere mortals.

  7. Measuring Consistency in Gymnastics

    In this post, I explored how to measure consistency in gymnastics, using everyone’s favorite heartbreaker Angelina Melnikova as a case study. Creating a quantitative method of consistency is nowhere near as straightforward as it sounds!

    The metric I presented here ultimately became the consistency stat that you see atop every gymnast’s page. (Or at least, every gymnast with enough data!)

  8. Does Your Leo Change Your Score?

    Some leos are awesome and some are not. And we all love to talk about that. But do gymnasts with better leos get higher scores? I used the scores from TheGymternet’s leo panel from the 2018 US Championships to find out!

  9. An Ode to MG Elite Wrists

    I think the title here is self-explanatory. MG Elite Wrists. You know what I mean.

  10. Dear Flo: An Open Letter to FloGymnastics

    This is by far the most popular post on my website. I wrote it just after FloGymnastics announced that the Michigan NCAA regionals would be locked behind a steep paywall, out of reach for many fans. I was seeing a lot of outrage on Twitter, but I was having a hard time picking out concrete complaints that a large corporation would respond to. In this post, I tried to distill the gymternet’s sentiments in an open letter - and I think a lot of people agreed with what I had to say.

So there’s the top ten! I’d also like to highlight my posts on robot judges — it’s some of the work I’m most proud of but it just doesn’t get the same attention as sexier topics.

What’s coming up in Year 2 of Score for Score? You’ll just have to follow along to find out!

Tags: Score for Score Business

Based on the numbers I get from Google Analytics, I’m guessing about half of you are reading this on your phone. If you’re one of those people, you may have already noticed my exciting new surprise: there is now a mobile version of the Score for Score website!

In particular, I’ve made some changes so that the menu no longer takes up over half of the page. Instead, you can click the three bars at the top to pull down a mobile version of the menu. (Fun fact: that style of menu is known as a hamburger menu!) The page content will also be wider on a mobile screen.

I know the user experience still leaves a lot to be desired when you’re on your phone, but hopefully this is a good first step. I still think the site works better on a desktop, and I still recommend that you use a computer if you’re planning to spend a lot of time with the Team Tester or the Score Selector. But given that this is the first real website I've ever created - and that I built it from scratch - I'm pretty excited that I got this to work :)

What else would you all like to see improved? Let me know in the comments or @scoreforscore on Twitter!

Tags: Score for Score Business

If you click on an individual gymnast’s page, you might notice something new at the top: consistency stats!

The consistency stats are a metric that I came up with to put a number to our general intuition about a gymnast’s consistency. The most important thing to know is that a lower consistency stat indicates a more consistent gymnast.

Here’s where the numbers for each gymnast come from.

  1. For each score, subtract the d-score from the total score, leaving the e-score minus any neutral deductions.
    This gets at what we’re interested in when we talk about consistency: how well did the gymnast hit? The difficulty that a gymnast chooses to compete on any given day shouldn’t impact her consistency stat - except through its impact on her execution.

  2. Calculate the standard deviation of the scores we calculated above on each event.
    The standard deviation measures how spread out values are. This is why a lower consistency stat indicates a more consistent gymnast; a zero is a “perfect” score. Calculating standard deviations separately on each event ensures that a gymnast doesn’t look inconsistent just because there’s a lot of spread between her beam execution and her vault execution - that’s just a feature of the code of points.

  3. Take the mean of the standard deviations for each event.
    This gives us the “overall” consistency score listed in the rightmost column of each table.
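The three steps above can be sketched roughly as follows. This is not the site's actual code: the tuple layout, the sample scores, and the `consistency_stats` function are hypothetical, and the choice of the sample standard deviation (rather than the population version) is an assumption, since the post does not say which is used. The sketch leaves "VT2" out of the overall number and averages over whichever events have a stat, which is also an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev

def consistency_stats(scores):
    """scores: iterable of (event, total_score, d_score) tuples."""
    # Step 1: total minus d-score, i.e. the e-score net of neutral deductions.
    by_event = defaultdict(list)
    for event, total, d_score in scores:
        by_event[event].append(total - d_score)

    # Step 2: standard deviation per event (needs at least two scores).
    # Assumption: sample standard deviation; statistics.pstdev would give
    # the population version instead.
    per_event = {e: stdev(v) for e, v in by_event.items() if len(v) >= 2}

    # Step 3: mean of the per-event standard deviations, skipping second vaults.
    overall_values = [sd for e, sd in per_event.items() if e != "VT2"]
    overall = mean(overall_values) if overall_values else None
    return per_event, overall

# Hypothetical scores for one gymnast: (event, total, d-score).
scores = [
    ("BB", 13.5, 5.5),
    ("BB", 13.0, 5.5),
    ("BB", 14.0, 5.5),
    ("VT", 14.5, 5.0),
    ("VT", 14.0, 5.0),
    ("FX", 13.2, 5.2),  # only one floor score, so no floor stat
]

per_event, overall = consistency_stats(scores)
for event, value in per_event.items():
    print(f"{event}: {value:.3f}")
print(f"overall: {overall:.3f}")
```

Because the d-score is subtracted first, a gymnast who upgrades mid-season is not penalized for the jump in her total score, only for any change in how cleanly she hits.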

I wrote a bit about the thought process behind this measure a while back, so check out that post if you want to know more about the origins.

How exactly do I interpret the numbers themselves?
Let’s take a gymnast with a balance beam consistency stat of 0.5. This means that her e-scores on beam typically land about 0.5 points above or below her mean beam e-score.

What time period do these statistics account for?
The consistency stats pull data from the past year - that is, 365 days before today. They’ll update automatically every time you visit a gymnast’s page, so if you go back to a page next month and the stats have changed, it’s probably because some old scores dropped out of the calculation or some new ones were added in.
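A rolling window like this could be sketched as below. The record layout, meet names, and the `scores_in_window` helper are hypothetical, and treating both ends of the window as inclusive is an assumption.

```python
from datetime import date, timedelta

def scores_in_window(scores, today, days=365):
    """Keep scores dated within `days` days before `today` (inclusive)."""
    cutoff = today - timedelta(days=days)
    return [s for s in scores if cutoff <= s["date"] <= today]

# Hypothetical score records for one gymnast.
scores = [
    {"meet": "Meet A", "date": date(2018, 8, 1), "event": "BB", "total": 13.5},
    {"meet": "Meet B", "date": date(2019, 3, 2), "event": "BB", "total": 13.9},
    {"meet": "Meet C", "date": date(2019, 8, 10), "event": "BB", "total": 14.1},
]

recent = scores_in_window(scores, today=date(2019, 9, 1))
print([s["meet"] for s in recent])
```

Running the same filter with a later `today` drops older meets, which is why a gymnast's stat can change between visits even if she hasn't competed.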

Is every score from the past year included in the calculation?
I can only include scores for which I have separate d-score information. I get all of my score data from TheGymternet, and they do a really great job of tracking down scores, so when I don’t have d-score information for a meet, it usually means that score just isn’t online. If you scroll down to the bottom of each gymnast’s page, you’ll see the data we have on d-scores.

Why doesn’t every gymnast have a consistency stat on every event?
A gymnast must have competed an event at least twice in the past year in order to receive a consistency score. If a gymnast only competes once, then we have no way of knowing how consistent she is. And even if she’s competed twice, I might not have d-scores for both competitions.

What do you do with second vault scores?
If a gymnast competes a second vault, then you’ll see her consistency stat on those vaults alone in the “VT2” column of her page. However, in calculating her overall consistency score, I simply average the stats for her first vaults, bars, beam, and floor - ignoring the second vaults. Consistency is often relevant for team or all-around competitions, so this number is a little more useful.

Why don’t you publish a consistency ranking?
The more you compete, the more different scores go into the ranking, and the harder it is to get a nice low consistency stat. This means that the rankings would basically always be topped by gymnasts who’ve competed twice on one event in the past year. I firmly believe you should look at the consistency stats in context - that’s why they’re on each gymnast’s page right on top of the full list of that gymnast’s scores. If we just looked at the numbers on their own, we’d lose track of what goes into the consistency stats.

This gymnast never seems to hit. Why are her consistency stats so good?
Consistency stats measure whether a gymnast performs at the same level every time she goes out. A gymnast who always falls in the exact same spot on her bar routine is therefore very consistent - even though she’s not hitting. Consistency stats are a guide to how much future scores might look like past scores; they don’t necessarily indicate how well a gymnast is going to do.

That stat looks weird. It’s totally off with my intuition about how this gymnast does.
I’m not surprised at all! The consistency stats are a very specific metric based on the limited information that comes from scores; there’s a lot more that goes into your impression of a gymnast! But as always, if you think there’s a bug, please e-mail me at

Got more questions? Comment below and I’ll get back to you!

Tags: Score for Score Business