Score for Score

On Using Scores to Select Teams

by Brina, Oct. 2, 2019

I love scores. I love data-driven decision-making. I’m a data scientist and I run a website with the word “score” in the name. Twice.

But I think we need to talk about the limits of using scores to select teams.

As the US selected its 2019 Worlds team, I saw more discussion of the highest-scoring team based on past scores than I’ve ever seen before. Gymnerd after gymnerd came to the table with quantitative evidence supporting their team predictions. I think this is fantastic, and I hope that the tools on my site helped fuel these conversations in some small way.

But in gymnastics, as in all other fields, there is a downside to arguments based on data. We all share a strong tendency to treat numbers as objective. This is a bad tendency. Numbers mask the value judgements that lie beneath all — and I do mean all — data-driven arguments.

And when we start believing that our subjective, values-driven arguments are actually the cold, hard truth, we cling to them more strongly and become less tolerant of good-faith debates about who should make a team.

So let’s take a minute to remind ourselves of all the reasons why data-driven team selection is not an objective exercise. While I’m sure each and every one of you knows these reasons already, it’s worth reviewing them time and time again to strengthen the instinct for skepticism of data analysis.

First, scores come from judges. Judges are human. Humans are biased. Enough said.

Second, scores measure an infinitesimally small percentage of the total gymnastics that a gymnast does in any given year. A score can tell you whether a gymnast crashed her vault, but it can’t tell you whether she does that half the time in training or if this was a fluke caused by weird arena lighting. Of course, if I had to pick between information on how a gymnast does in a competition setting and information on how she does in a practice setting, I’d pick the competition information — but ideally I’d like both.

Third, gymnasts’ scores are intentionally varied throughout the season, making past performance an imperfect predictor of future performance. Most gymnasts intentionally wait until later in the year to roll out their upgrades and get in top shape, but they all do so to a different extent. We don’t have systematic information on any of this.

Finally, and most importantly, there are countless ways to cut the data. Here is a small fraction of the justifiable ways to pick the US team using data:

  • Take each gymnast’s mean score across all competitions this year, and pick the five-member team with the highest three-up, three-count score.

  • Do that same thing dropping the low score and the high score from the mean.

  • Do that same thing using medians instead of means.

  • Do that same thing using each gymnast’s best score all year instead of medians.

  • Do that same thing using only scores from FIG competitions.

  • Take each gymnast’s mean score across all competitions this year, and pick the team that scores the highest when you replace the top-scoring routine on each event with the 4th-best routine on each event.

  • Do that using any of the alternative metrics listed above.

  • Use execution scores to determine whether each routine was “hit” or not, and limit the team contenders to gymnasts with hit rates above a certain threshold.

  • Look at each gymnast’s best all-around score from this year, and pick the top five all-arounders.

  • Compare each gymnast’s top score on each event to the top scores of other international gymnasts, and pick the team that is capable of making the most event finals as long as they have three routines of any quality to use during team finals.

  • Take a weighted average of each gymnast’s scores, giving more weight to more recent competitions. Then pick the best three-up, three-count team.

Et cetera.
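To make the first few bullets above concrete, here is a minimal sketch of a three-up, three-count team picker. Every gymnast name and score in it is invented purely for illustration; the point is that swapping the summary statistic (mean, median, or season best) can change which team comes out on top.

```python
from itertools import combinations
from statistics import mean, median

# Hypothetical season scores (gymnast -> event -> list of scores).
# All names and numbers are invented for illustration only.
scores = {
    "Ana":  {"VT": [14.8, 15.0], "UB": [13.9, 14.1], "BB": [13.5, 13.8], "FX": [14.0, 14.2]},
    "Bea":  {"VT": [14.2, 14.4], "UB": [14.8, 14.9], "BB": [13.1, 14.2], "FX": [13.6, 13.8]},
    "Cam":  {"VT": [13.9, 14.1], "UB": [14.0, 14.3], "BB": [14.5, 14.6], "FX": [13.2, 13.5]},
    "Dara": {"VT": [14.5, 14.6], "UB": [13.2, 13.4], "BB": [13.9, 14.0], "FX": [14.6, 14.8]},
    "Elle": {"VT": [14.0, 14.2], "UB": [14.4, 12.0], "BB": [13.7, 13.9], "FX": [13.9, 14.1]},
    "Faye": {"VT": [13.8, 14.0], "UB": [13.5, 13.7], "BB": [14.2, 14.4], "FX": [13.7, 13.9]},
}

EVENTS = ["VT", "UB", "BB", "FX"]

def team_score(team, summarize):
    """Three-up, three-count: on each event, only the three best
    summarized scores among the team's members count."""
    total = 0.0
    for event in EVENTS:
        top_three = sorted((summarize(scores[g][event]) for g in team),
                           reverse=True)[:3]
        total += sum(top_three)
    return total

def best_team(summarize, size=5):
    """Brute-force the highest-scoring team of the given size."""
    return max(combinations(scores, size),
               key=lambda team: team_score(team, summarize))

# Which team is "best" depends entirely on the summary statistic you choose.
for summarize in (mean, median, max):
    team = best_team(summarize)
    print(summarize.__name__, sorted(team), round(team_score(team, summarize), 3))
```

Every other variant in the list (dropping high and low scores, weighting recent meets more heavily, restricting to FIG competitions) is just a different `summarize` function or a different filter on the input data — and each one encodes a different value judgement.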

Each of these methods is equally objective — which is to say, not very objective at all. They each reflect a different subjective perspective on what is valuable in a member of the US Worlds team.

So how should we be using scores to select teams? That’s up to you — it depends on your beliefs about how teams should be selected.

But I do have ideas about how we should talk about data-driven team selection. In particular, we should spend a bit more time articulating the values beneath our analyses.

For example, “I think that even one fall in the past season constitutes evidence of inconsistency. As a result, I’ve chosen not to drop any scores when calculating each gymnast’s average score.”

Or maybe, “I believe in giving gymnasts the benefit of the doubt about their performances early in the season, well before they need to peak. As a result, I’ve chosen to only look at scores from US Nationals and Worlds Trials and I’ve dropped Classics and Pan Ams.”

Or how about, “I don’t think that the US needs to worry about their team qualification performance, so I’m picking the team that scores the highest in a three-up, three-count team final situation even though that leaves them with a weak fourth routine in qualifying.”

Or consider, “I don’t trust judges to score consistently across different competitions, and in particular I think it’s unfair to compare domestic scores to international scores. So I’m going to consider Classics and Championships, where all the contenders competed in front of the same panel, but not Pan Ams or Jesolo.”

These underlying assumptions are things we can debate, and I bet those debates would be interesting and productive. Most importantly, they would be debates divorced from our personal feelings about individual gymnasts — feelings that enter into data-driven team predictions way more than they should.

While I'm thrilled that more people are engaging with data-driven gymnastics analysis, I worry that data will make us think that we're objectively right in our analysis. Quantitative results reflect our underlying values. Keep that in mind.

Tags: Score for Score Business