Score for Score

How to Study Racial Bias in Gymnastics Judging

by Brina June 4, 2020

Is there racial bias in gymnastics judging? Almost certainly. Gymnastics judges fall prey to other forms of bias, and there’s racial bias in most things. But we have no evidence to prove it.

That’s because it’s really hard to study racial bias in a subjectively judged sport. And it’s even harder when you have no data on race. I can’t just sit down, crunch the numbers, and measure racial bias in gymnastics judging today. But I can describe what I would do if I had data or research funding. And I can describe what I wouldn’t do, too.

For the sake of simplicity, I’m going to talk about bias against black athletes in this article, though I want to emphasize the likelihood of racial bias against gymnasts of color in general as well as pernicious stereotypes about Asian gymnasts in particular. I should also mention that I’m not going to talk about the myriad other ways that racism impacts this sport, most of which are probably more important than the scoring. And I’m just going to talk about how to measure racial bias, not how to solve it. This is a narrow article on a narrow topic.

It’s also a wordy article, but please bear with me — because if you’re going to study racial bias, it’s important to do it right.

The Top Priority: Collect and Release Data on Race

There is currently no publicly available source of data on gymnasts’ races. Collecting and releasing this data is a necessary prerequisite to any attempt to study racial bias. You can’t analyze what you don’t measure.

Fortunately, collecting this data is theoretically straightforward. On the form a gymnast uses to register for a competition, just add a question about race. Add one about age, sexual orientation, and other sensitive characteristics while you’re at it. Whatever software is being used to store information about a gymnast’s name and country can store information about her race too. The FIG can require that such data is collected at all FIG competitions; national federations can do the same.

Releasing the data is just as important as collecting it. Wherever competition results are posted, information on the demographic characteristics of gymnasts should be posted as well. Ideally, this data should be available in a machine-readable form, such as a CSV file or an API. Collecting the data and keeping it locked up is almost as bad as not collecting it at all. (There should always be a “prefer not to say” option on forms for gymnasts who would rather not have their information shared.)
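To make “machine readable” concrete, here is a minimal sketch in Python of how a researcher might read such a release. Everything specific in it is an assumption for illustration: the column names (`gymnast`, `country`, `apparatus`, `score`, `race`) and the three rows are invented, and no federation currently publishes a file like this.

```python
import csv
import io

# Hypothetical results file: every column name and row is invented for illustration.
RESULTS_CSV = """gymnast,country,apparatus,score,race
Gymnast A,USA,floor,14.100,Black
Gymnast B,AUS,floor,13.650,White
Gymnast C,JPN,floor,13.900,Prefer not to say
"""

def load_results(text):
    """Parse CSV text into a list of dicts, converting scores to floats."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["score"] = float(row["score"])
    return rows

def with_known_race(rows):
    """Keep only rows where the gymnast chose to disclose her race."""
    return [r for r in rows if r["race"] != "Prefer not to say"]

rows = load_results(RESULTS_CSV)
usable = with_known_race(rows)
print(len(rows), "rows loaded;", len(usable), "usable for an analysis by race")
```

Note that the “prefer not to say” rows stay in the results themselves and are only set aside for the race analysis.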

I cannot stress this enough. Every single method I’m going to discuss below requires data on the race of gymnasts. Some require data on the race of judges as well. This data is crucial.

Defining Racial Bias in Judging

By defining racial bias in gymnastics, we can get a better sense of why it’s so hard to study.

A judge is biased if she scores a routine differently based on the race of the gymnast performing it. In an ideal world, we’d be able to ask a judge to score a routine performed by a white gymnast, then score the exact same routine with the exact same execution performed by a black gymnast. But of course, this is impossible: no two performances are ever exactly the same. Statisticians call this the fundamental problem of causal inference.

Let’s try a weaker definition of racial bias that lets us avoid this problem. A judge is biased if the discrepancies between her score and the correct score are different for gymnasts of different races. In other words, a judge who systematically overscores white athletes or systematically underscores black athletes is biased. This is a weaker definition because it does not let us prove definitively that race is the factor leading to over/underscoring. It’s possible that judges who underscore black people are really biased against something correlated with but distinct from race, such as hip hop floor routines. But even so, I’m comfortable using this weaker definition because judges are not allowed to be biased about anything. There should be absolutely no characteristic, no matter how frivolous, for which it is okay to observe a systematic discrepancy between a judge’s score and the correct score. If we measure some arbitrary form of bias that is correlated with race and leads to the underscoring of black gymnasts, we should still try to eliminate whatever form of bias that is.
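One way to write this weaker definition down, using notation that is my own shorthand rather than anything standard in the sport: let s_{jr} be the score judge j gives routine r, let s*_r be the correct score for that routine, and let d_{jr} be the discrepancy between them. Under this definition, judge j is biased if the average discrepancy differs by the race of the gymnast:

```latex
d_{jr} = s_{jr} - s^{*}_{r},
\qquad
\mathbb{E}\left[\, d_{jr} \mid \text{gymnast in } r \text{ is black} \,\right]
\neq
\mathbb{E}\left[\, d_{jr} \mid \text{gymnast in } r \text{ is white} \,\right]
```

The difference between those two averages is the size of the bias. Nothing in the expression says that race itself is the cause, which is exactly the sense in which this definition is weaker.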

But even this second-best definition is impractical. In an ideal world, we’d be able to compare the scores a judge gives in competition to the true score that routine deserves. It’s possible to identify the objectively correct set of deductions — I can freeze a video, pull out a protractor, and tell you for sure if a split hit 180 degrees — but such careful scores don’t exist for the vast majority of routines. All we have is the scores given at the competition, which may themselves be biased. (Whether there’s an objectively correct way to score artistry is also a big question, but one for a different time.)

We cannot perfectly measure bias in real world competitions. But by keeping these definitions in mind, we can get close.

Comparing Scores By Race Doesn’t Measure Bias

Now that we’ve talked about what racial bias is, I want to clarify what it is not. A judge is not biased just because she tends to give white gymnasts higher scores than black gymnasts. Without knowing anything about the quality of the routines, we simply cannot cry bias.

If we had data on gymnasts’ races, it would be very tempting to simply compare the average score for black gymnasts to the average score for white gymnasts. This method, referred to elsewhere as disparate impact analysis, does not work in this context. Simply observing that black gymnasts get lower scores than white gymnasts is not evidence of racial bias. Importantly, observing that black gymnasts get higher scores than white gymnasts is also not evidence that the system is unbiased.

This principle is clear when we consider other forms of bias. The fact that the average US Worlds team outscores the average Australian Worlds team is not evidence of nationality bias against Australia. The US gymnasts are simply performing better routines. In other words, the size of the gap between the American scores and the Australian scores is not itself a measure of bias — even if that bias really exists.

Don’t fall into the trap of using disparate impact analysis. It’s not helpful.
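A toy calculation makes the point. The sketch below is in Python and uses invented numbers for two generic groups, A and B; the “correct” scores are assumed to be known, which is never true for real competitions. In the first scenario the judge is perfectly accurate and the raw gap is large. In the second the judge underscores group B, yet the raw gap is zero.

```python
from statistics import mean

def avg(rows, group, key):
    """Average of key(row) over the rows belonging to one group."""
    return mean(key(r) for r in rows if r["group"] == group)

def raw_gap(rows):
    """Group A's average given score minus group B's (the naive comparison)."""
    return avg(rows, "A", lambda r: r["given"]) - avg(rows, "B", lambda r: r["given"])

def discrepancy_gap(rows):
    """Group A's average (given - correct) minus group B's (the bias we care about)."""
    err = lambda r: r["given"] - r["correct"]
    return avg(rows, "A", err) - avg(rows, "B", err)

# Scenario 1 (invented): an accurate judge, but group A performs better routines.
unbiased = [
    {"group": "A", "correct": 14.5, "given": 14.5},
    {"group": "A", "correct": 14.1, "given": 14.1},
    {"group": "B", "correct": 13.6, "given": 13.6},
    {"group": "B", "correct": 13.2, "given": 13.2},
]

# Scenario 2 (invented): group B performs better routines but is underscored
# by 0.3, so the two groups' average given scores come out identical.
biased = [
    {"group": "A", "correct": 14.0, "given": 14.0},
    {"group": "A", "correct": 13.8, "given": 13.8},
    {"group": "B", "correct": 14.3, "given": 14.0},
    {"group": "B", "correct": 14.1, "given": 13.8},
]

for name, rows in [("unbiased judge", unbiased), ("biased judge", biased)]:
    print(name, "| raw gap:", round(raw_gap(rows), 3),
          "| discrepancy gap:", round(discrepancy_gap(rows), 3))
```

The raw gap and the discrepancy gap move independently of each other, which is why comparing averages by group tells us nothing about bias in either direction.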

What Would Work

So what would good research actually look like? I’ve got ideas.

First, we could measure over- and underscoring by race. To do this, we’d take routines from a given competition and ask a separate set of judges to score them as perfectly as possible, using video replay and as much time as needed. (This method rests on the assumption that it’s possible for judges to arrive at an objective score at all — an assumption intrinsic to the Code of Points.) Then we’d compare those scores to the scores actually given on the day of competition, and calculate the average discrepancy by race. This is intuitively simple and thus compelling.
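The final step of that design, calculating the average discrepancy by race, could be sketched like this in Python. The field names (`race`, `competition`, `rescore`) are hypothetical, and the four records are made up solely to exercise the code; they say nothing about real judging.

```python
from collections import defaultdict
from statistics import mean

# Invented records: the score given on the day vs. a careful video re-score.
records = [
    {"race": "Black", "competition": 13.8, "rescore": 14.0},
    {"race": "Black", "competition": 13.5, "rescore": 13.6},
    {"race": "White", "competition": 14.1, "rescore": 14.0},
    {"race": "White", "competition": 13.4, "rescore": 13.4},
]

def mean_discrepancy_by_race(rows):
    """Average (competition - rescore) per race; negative means underscored."""
    by_race = defaultdict(list)
    for r in rows:
        by_race[r["race"]].append(r["competition"] - r["rescore"])
    return {race: mean(diffs) for race, diffs in by_race.items()}

for race, d in mean_discrepancy_by_race(records).items():
    print(race, round(d, 3))
```

A real analysis would also need a measure of uncertainty, since an average over a handful of routines can differ between groups by chance alone.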

Second, we could instead treat the scores that judges give gymnasts of their own race as unbiased. In other words, we could assume that black judges are not racially biased either for or against black gymnasts, and so on. We would then compare the scores that black judges give black gymnasts to the scores that other judges give those same gymnasts to evaluate the presence of bias. This method is nice because it scales quite easily: we can analyze scores that already exist instead of re-scoring routines. (For that reason, similar methods have been used to evaluate nationality bias.) However, it relies on the existence of a racially diverse pool of judges.
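A minimal Python sketch of that comparison, assuming we had each judge’s individual score and race for every routine. The data structure and the numbers are invented for illustration, and comparing within each routine is my own design choice so that routine quality cancels out.

```python
from statistics import mean

# Invented panel scores for routines performed by black gymnasts.
# Each routine maps to a list of (judge_race, execution_score) pairs.
panel_scores = {
    "routine_1": [("Black", 8.5), ("White", 8.3), ("White", 8.4), ("Black", 8.6)],
    "routine_2": [("Black", 8.0), ("White", 8.0), ("Asian", 7.9)],
}

def own_race_gap(scores, gymnast_race):
    """Average, across routines, of (other judges' mean - same-race judges' mean).

    Same-race judges are treated as the unbiased baseline, so a negative
    result means other judges score these routines lower. Routines that lack
    either kind of judge are skipped, which is why the method depends on a
    racially diverse judging pool.
    """
    gaps = []
    for judged in scores.values():
        same = [s for race, s in judged if race == gymnast_race]
        other = [s for race, s in judged if race != gymnast_race]
        if same and other:
            gaps.append(mean(other) - mean(same))
    return mean(gaps) if gaps else None

print(own_race_gap(panel_scores, "Black"))
```

If no routine has both same-race and other-race judges on its panel, the function returns `None` rather than a number, because the comparison cannot be made at all.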

If score receipts that itemize deductions were available, we could use either of these methods to see how the deductions actually taken compare to the appropriate deductions. For example, black gymnasts have been stereotyped as powerful but inelegant tumblers. Using video recordings to determine how much height a gymnast gets on a skill, we could test whether black gymnasts are less likely than other gymnasts to receive amplitude deductions for comparable height. We could also test whether they are more likely to receive form deductions.

There’s also a whole realm of laboratory experiments that you could conduct to evaluate judges’ bias outside of a competition setting. We talked before about how you can’t really observe two different gymnasts doing the same routine, but you can if you create computer animations of gymnasts. Judges could be randomly assigned to score the same animated routine performed by either a black or a white gymnast, allowing us to measure whether the black gymnast received worse scores for an identical routine. Video editing software could also be used to alter a recorded routine to change the race of an athlete, allowing for a similar experiment.
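Because judges in such an experiment are assigned a version at random, the results can be analyzed with a permutation test: if race made no difference, shuffling the version labels should produce gaps as large as the observed one fairly often. The Python sketch below enumerates every relabeling exactly, which is only practical for small samples. The scores are invented, and the choice of a permutation test is mine, not a prescribed method.

```python
from itertools import combinations
from statistics import mean

# Invented scores for one identical animated routine, by version shown.
scores_black_version = [8.1, 8.3, 8.0, 8.2]
scores_white_version = [8.4, 8.3, 8.5, 8.6]

def exact_permutation_p_value(group_a, group_b):
    """Two-sided p-value for the difference in means under random relabeling."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    # Try every way of choosing which pooled scores get the first label.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

p = exact_permutation_p_value(scores_black_version, scores_white_version)
print("p-value:", round(p, 3))
```

A small p-value would mean a gap this large rarely arises from the random assignment alone. With larger panels, sampling random relabelings would replace the exhaustive loop.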

Finally, there are many qualitative methods that focus on evaluating judges’ latent attitudes that may ultimately impact how they score. Judges could be shown routines by athletes of different races and give verbal responses to questions about those routines. We could then analyze those responses to see if different language is used to describe routines from black athletes, especially along the lines of common racial stereotypes.

I’m sure there are many more methods out there, and I’d encourage you to leave your proposals in the comments.

In Summary

I bet you can guess my number one recommendation: collect and release information on the race of gymnasts and judges.

Beyond that, researchers interested in racial bias should be sure that they have a clear operational definition of bias and a research design that addresses that definition. Using bad research to advocate for gymnasts of color simply provides ammunition for those who are resistant to change.

Of course, any conversation about racial bias in judging has to occur as part of a larger conversation about racial bias in the sport. I look forward to seeing what form that conversation takes.
