Goalkeepers and their heights

Author

Cameron West and Stéphane Guillou

Published

June 23, 2025

Set up code

library(ggplot2)
library(dplyr)
library(plotly)
library(knitr)

players_raw <- read.csv("../../data_sources/Players2024.csv")

A glimpse at the dataset

We’ll begin by taking a glimpse at the dataset:

kable(head(players_raw))

name	birth_date	height_cm	positions	nationality	age	club
James Milner	1986-01-04	175	Midfield	England	38	Brighton and Hove Albion Football Club
Anastasios Tsokanis	1991-05-02	176	Midfield	Greece	33	Volou Neos Podosferikos Syllogos
Jonas Hofmann	1992-07-14	176	Midfield	Germany	32	Bayer 04 Leverkusen Fußball
Pepe Reina	1982-08-31	188	Goalkeeper	Spain	42	Calcio Como
Lionel Carole	1991-04-12	180	Defender	France	33	Kayserispor Kulübü
Ludovic Butelle	1983-04-03	188	Goalkeeper	France	41	Stade de Reims

Cleaning the data

The data has a few issues, as the following plot shows:

ggplot(players_raw, aes(x = positions, y = height_cm)) +
  geom_boxplot()

It looks like some of the players’ positions and heights were recorded incorrectly. To clean the data, let’s remove the “Missing” positions and keep only heights above 100 cm:

# Remove missing position and ensure reasonable heights
players <- players_raw %>% filter(positions != "Missing", height_cm > 100)
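Before trusting a filter like this, it can help to count how many rows each rule would drop. The following is a minimal sketch, not part of the original analysis: `count_removed` is a hypothetical helper, and the `toy` data frame contains invented rows purely for illustration. It assumes the column names `positions` and `height_cm` shown in the table above.

```r
library(dplyr)

# Hypothetical helper: tally rows that fail each cleaning rule.
# The 100 cm threshold mirrors the filter used above.
count_removed <- function(df, min_height_cm = 100) {
  df %>%
    summarise(
      missing_position   = sum(positions == "Missing", na.rm = TRUE),
      implausible_height = sum(height_cm <= min_height_cm, na.rm = TRUE),
      total_rows         = n()
    )
}

# Invented rows for illustration only (not taken from Players2024.csv)
toy <- data.frame(
  name      = c("Player A", "Player B", "Player C", "Player D"),
  positions = c("Goalkeeper", "Missing", "Defender", "Midfield"),
  height_cm = c(188, 180, 18, 176)
)

count_removed(toy)
```

On the real data, the same call would be `count_removed(players_raw)`. Note that `filter()` also drops rows where a condition evaluates to `NA`, which this tally does not count.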

To confirm, let’s plot the removed outliers in a different colour:

# Identify outliers
outliers <- anti_join(players_raw, players)

# Plot
ggplot(players, aes(x = positions, y = height_cm)) +
  geom_boxplot() + 
  geom_point(data = outliers, colour = "red")

Visualising the players’ heights

After cleaning the data, we can analyse the players’ heights to see if there are differences between positions. Let’s make the boxplot without the outliers:

ggplot(players, aes(x = positions, y = height_cm)) +
  geom_boxplot() +
  labs(x = "Position", y = "Height (cm)")

ggsave("tb.png")

It looks like goalkeepers are taller than the rest!
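To put numbers next to the boxplot, one option is a per-position summary. This is a sketch rather than part of the original analysis: `height_by_position` is a hypothetical helper, and the `sample_rows` data frame only reproduces the six rows shown in the glimpse of the dataset, so its output says nothing about the full dataset.

```r
library(dplyr)

# Hypothetical helper: median and mean height per position, tallest first
height_by_position <- function(df) {
  df %>%
    group_by(positions) %>%
    summarise(
      n                = n(),
      median_height_cm = median(height_cm),
      mean_height_cm   = mean(height_cm)
    ) %>%
    arrange(desc(median_height_cm))
}

# The six rows from the glimpse of the dataset (positions and heights only)
sample_rows <- data.frame(
  positions = c("Midfield", "Midfield", "Midfield",
                "Goalkeeper", "Defender", "Goalkeeper"),
  height_cm = c(175, 176, 176, 188, 180, 188)
)

height_by_position(sample_rows)
```

On the cleaned data, the call would be `height_by_position(players)`.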

Let’s throw the age variable into the mix, to see if players’ heights allow them to compete longer.

p <- ggplot(players, aes(x = age, y = height_cm, colour = positions, label = name, label2 = nationality)) + 
  geom_point() + 
  facet_wrap(vars(positions)) + 
  labs(x = "Age", colour = "Position", y = "Height (cm)")

ggplotly(p)

It doesn’t look like there’s a relationship between height and age, but height is clearly related to position!
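One way to check the visual impression numerically is a Pearson correlation between age and height within each position. This is a sketch under assumptions: `age_height_cor` is a hypothetical helper, the `toy` rows are invented for illustration, and a correlation only captures linear association, so it complements the plot rather than replacing it.

```r
library(dplyr)

# Hypothetical helper: Pearson correlation of age and height per position
age_height_cor <- function(df) {
  df %>%
    group_by(positions) %>%
    summarise(
      n           = n(),
      correlation = cor(age, height_cm, use = "complete.obs")
    )
}

# Invented rows for illustration only (not taken from Players2024.csv)
toy <- data.frame(
  positions = rep(c("Defender", "Goalkeeper"), each = 3),
  age       = c(20, 25, 30, 20, 25, 30),
  height_cm = c(180, 185, 190, 190, 185, 180)
)

age_height_cor(toy)
```

On the cleaned data, `age_height_cor(players)` would return one row per position; values near zero would be consistent with the scatter plots above.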

Global spread

We haven’t looked at the nationality column yet. Let’s draw up a map using plotly to see where the players come from.

# Change country names to match plotly reference
players <- players %>% 
  mutate(nationality = case_match(nationality,
                                  "England" ~ "United Kingdom",
                                  "Türkiye" ~ "Turkey",
                                  "Cote d'Ivoire" ~ "Ivory Coast",
                                  "Northern Ireland" ~ "United Kingdom",
                                  "Wales" ~ "United Kingdom",
                                  .default = nationality))
    
# Make the country count
countries <- players %>%
  group_by(nationality) %>%
  summarise(n = n())

# Make the plot
countries %>% 
  plot_ly(type = "choropleth", 
          locations = ~nationality, 
          locationmode = "country names", 
          z = ~n) %>%
  colorbar(title = "# of Players")