Goalkeepers and their heights

Author

Cameron West and Stéphane Guillou

Published

June 23, 2025

Set up code
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import seaborn as sns

df_raw = pd.read_csv("../../data_sources/Players2024.csv")

A glimpse at the dataset

We’ll begin by taking a glimpse at the first ten rows of the dataset:

Show code
df_raw.head(10)
   name                 birth_date  height_cm  positions   nationality  age  club
0  James Milner         1986-01-04  175.0      Midfield    England      38   Brighton and Hove Albion Football Club
1  Anastasios Tsokanis  1991-05-02  176.0      Midfield    Greece       33   Volou Neos Podosferikos Syllogos
2  Jonas Hofmann        1992-07-14  176.0      Midfield    Germany      32   Bayer 04 Leverkusen Fußball
3  Pepe Reina           1982-08-31  188.0      Goalkeeper  Spain        42   Calcio Como
4  Lionel Carole        1991-04-12  180.0      Defender    France       33   Kayserispor Kulübü
5  Ludovic Butelle      1983-04-03  188.0      Goalkeeper  France       41   Stade de Reims
6  Daley Blind          1990-03-09  180.0      Defender    Netherlands  34   Girona Fútbol Club S. A. D.
7  Craig Gordon         1982-12-31  193.0      Goalkeeper  Scotland     41   Heart of Midlothian Football Club
8  Dimitrios Sotiriou   1987-09-13  185.0      Goalkeeper  Greece       37   Omilos Filathlon Irakliou FC
9  Alessio Cragno       1994-06-28  184.0      Goalkeeper  Italy        30   Associazione Calcio Monza

Cleaning the data

The data had a few issues, as the following plot shows:

Show code
sns.catplot(df_raw, x = "positions", y = "height_cm")

It looks like some of the players’ positions and heights were recorded incorrectly. To clean the data, let’s remove the rows with a “Missing” position and keep only heights above 100 cm:

Show code
df = df_raw.copy()

# Remove missing position
df = df[df["positions"] != "Missing"]

# Ensure reasonable heights
df = df[df["height_cm"] > 100]
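To see what each filter removes, it can help to count rows before and after. The following is a minimal sketch on a small made-up frame (the rows in `toy` are invented for illustration and are not taken from Players2024.csv). Note that the `> 100` comparison also drops rows whose height is missing, because a comparison against NaN evaluates to False.

```python
import numpy as np
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "positions": ["Goalkeeper", "Missing", "Defender", "Attack", "Midfield"],
    "height_cm": [190.0, 180.0, 18.0, np.nan, 176.0],
})

# Same two conditions as the cleaning step, kept as named masks
has_position = toy["positions"] != "Missing"
plausible_height = toy["height_cm"] > 100  # NaN > 100 is False

toy_clean = toy[has_position & plausible_height]

# How many rows fail each condition, and how many survive both
removed = pd.Series({
    "missing position": int((~has_position).sum()),
    "implausible or missing height": int((~plausible_height).sum()),
    "rows kept": len(toy_clean),
})
print(removed)
```

Keeping the masks as separate variables makes it easy to report each cause of removal on its own.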

To confirm, let’s plot the outliers in a different colour:

Show code
# Identify outliers: rows that are in df_raw but were removed from df
outliers = pd.concat([df_raw, df]).drop_duplicates(keep = False)

sns.catplot(df, x = "positions", y = "height_cm")
sns.stripplot(outliers, x = "positions", y = "height_cm", color = "r")
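One caveat with the `concat` and `drop_duplicates(keep = False)` approach: it relies on every raw row being unique, since a removed row that appears twice in the raw data would also be dropped from `outliers`. A possible alternative, sketched below on invented data, selects the removed rows by index instead. It assumes the cleaned frame still carries the raw frame's index, which holds when rows are only filtered.

```python
import pandas as pd

# Invented rows, for illustration only; "B" is deliberately duplicated
raw = pd.DataFrame({
    "name": ["A", "B", "B", "C"],
    "positions": ["Goalkeeper", "Missing", "Missing", "Defender"],
    "height_cm": [190.0, 180.0, 180.0, 18.0],
})
clean = raw[(raw["positions"] != "Missing") & (raw["height_cm"] > 100)]

# Value-based approach: the duplicated "B" rows cancel each other out
by_duplicates = pd.concat([raw, clean]).drop_duplicates(keep=False)

# Index-based approach: every row whose index is absent from the cleaned frame
by_index = raw[~raw.index.isin(clean.index)]

print(len(by_duplicates), len(by_index))
```

Whether this matters for the players dataset depends on whether it contains exact duplicate rows, which is not checked here.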

Visualising the players’ heights

After cleaning the data, we can now analyse the players’ heights to see if there are differences between positions. A box plot can show the distribution of heights:

Show code
sns.catplot(data = df, x = "positions", y = "height_cm", kind = "box", order = ["Goalkeeper", "Defender", "Midfield", "Attack"])
plt.xlabel("Position")
plt.ylabel("Height (cm)")
plt.savefig("tb.png")
plt.show()

A box plot of the relationship between height and position.

It looks like goalkeepers are taller than the rest!
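The visual impression can be backed up with a numeric summary per position. The sketch below uses invented heights in a frame called `toy`; with the real data, the same `groupby` call could be applied to `df` instead.

```python
import pandas as pd

# Invented heights, for illustration only
toy = pd.DataFrame({
    "positions": ["Goalkeeper", "Goalkeeper", "Defender", "Defender",
                  "Midfield", "Midfield", "Attack", "Attack"],
    "height_cm": [190.0, 194.0, 184.0, 186.0, 176.0, 180.0, 181.0, 183.0],
})

# Same position order as the box plot
order = ["Goalkeeper", "Defender", "Midfield", "Attack"]

# Count, median and mean height for each position
summary = (
    toy.groupby("positions")["height_cm"]
       .agg(["count", "median", "mean"])
       .reindex(order)
)
print(summary)
```

The median is the line drawn inside each box, so this table lines up directly with the plot.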

Let’s throw the age variable into the mix, to see if players’ heights allow them to compete longer.

Show code
px.scatter(data_frame = df, x = "age", y = "height_cm", color = "positions",
           facet_col = "positions", facet_col_wrap = 2, hover_name = "name",
           hover_data = "nationality", labels = {"height_cm": "Height (cm)",
                                                 "positions": "Position"})

There doesn’t appear to be a relationship between height and age, but height clearly differs between positions!
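One way to put a number on the height and age relationship is Pearson's correlation coefficient, overall and within each position. This is only a sketch on invented values (the `toy` frame is hypothetical). Values of r near 0 suggest little linear relationship, though they do not rule out other patterns.

```python
import pandas as pd

# Invented ages and heights, for illustration only
toy = pd.DataFrame({
    "positions": ["Goalkeeper"] * 4 + ["Midfield"] * 4,
    "age": [22, 28, 33, 40, 21, 25, 30, 35],
    "height_cm": [191.0, 188.0, 193.0, 190.0, 176.0, 179.0, 175.0, 178.0],
})

# Correlation across all players
overall_r = toy["age"].corr(toy["height_cm"])

# Correlation within each position
per_position_r = {
    position: group["age"].corr(group["height_cm"])
    for position, group in toy.groupby("positions")
}

print(round(overall_r, 2))
print({position: round(r, 2) for position, r in per_position_r.items()})
```

Computing r within each position avoids mixing the between-position height differences into the age comparison.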

Global spread

We haven’t looked at the nationality column yet. Let’s draw up a map using plotly to see where the players come from.

Show code
# Change country names to match plotly reference
# Scotland is mapped too: it appears in the data alongside the other home nations
df["nationality"] = df["nationality"].replace({"England": "United Kingdom",
                                               "Scotland": "United Kingdom",
                                               "Wales": "United Kingdom",
                                               "Northern Ireland": "United Kingdom",
                                               "Türkiye": "Turkey",
                                               "Cote d'Ivoire": "Ivory Coast"})

# Make the count
countries = df.value_counts("nationality")

# Make the plot
px.choropleth(locations = countries.index, locationmode = "country names", color = countries,
              labels = {"locations": "Country", "color": "# of players"})

Looks like most players are from Europe. Pan and zoom to see the finer details.
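The renaming and counting steps behind the map can be checked on a small example. The sketch below uses an invented list of nationalities and a hypothetical `name_map` dictionary. It shows that the home nations collapse into a single "United Kingdom" count, and adds each country's share of players.

```python
import pandas as pd

# Invented nationalities, for illustration only
nationality = pd.Series(
    ["England", "Scotland", "Wales", "France", "France", "Türkiye"]
)

# Hypothetical mapping from dataset names to map-friendly names
name_map = {
    "England": "United Kingdom",
    "Scotland": "United Kingdom",
    "Wales": "United Kingdom",
    "Northern Ireland": "United Kingdom",
    "Türkiye": "Turkey",
}

recoded = nationality.replace(name_map)

# Players per country, and each country's share of the total
counts = recoded.value_counts()
share = (counts / counts.sum()).round(2)

print(counts)
print(share)
```

Any name that the map does not recognise would be left blank without an error, so comparing `counts.index` against the countries actually drawn is a useful sanity check.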