Statistics

This session is intended as an overview of how to perform some statistical modelling with Python. It is a Python workshop, not a statistics workshop - if you’d like to better understand the statistical models, or need help deciding what’s best for your data, please consult a statistics resource or contact a statistician.

In this session, we’ll cover descriptive statistics, inferential statistics and some more complex statistical modelling.

We’ll use three new modules:

  • numpy
  • scipy.stats
  • statsmodels
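
As a rough preview, these are the import aliases used later in the session (numpy isn’t imported explicitly in the examples below, but np is its conventional alias and it appears in the printed output):

import numpy as np
import scipy.stats as stats
import statsmodels.formula.api as smf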

We’ll be working from our “Players2024” dataset again. To bring it in and clean it up,

import pandas as pd

df = pd.read_csv("data/Players2024.csv")
df = df[df["positions"] != "Missing"]
df = df[df["height_cm"] > 100]

Descriptive Statistics

We’ll start with sample size. Most descriptive statistics are available as dataframe methods right off the bat, which we access via the . operator.

To calculate the number of non-empty observations in a column, say the numeric variable df["height_cm"], we use the .count() method

df["height_cm"].count()
np.int64(5932)

Measures of central tendency

We can compute measures of central tendency similarly. The average value is given by

df["height_cm"].mean()
np.float64(183.04130141604855)

the median by

df["height_cm"].median()
np.float64(183.0)

and the mode by

df["height_cm"].mode()
0    185.0
Name: height_cm, dtype: float64

.mode() returns a Series rather than a single value, as there can be multiple most frequent values.
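
For example, with a small hypothetical series where two heights tie for most frequent, both are returned:

pd.Series([178, 178, 183, 183, 190]).mode()
0    178
1    183
dtype: int64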

Measures of variance

We can also compute measures of variance. The minimum and maximum are as expected

df["height_cm"].min()
df["height_cm"].max()
np.float64(206.0)

The range is the difference between the maximum and minimum

df["height_cm"].max() - df["height_cm"].min()
np.float64(46.0)

Quantiles are given by .quantile(...) with the fraction inside. The inter-quartile range (IQR) is the difference between 25% and 75%.

q1 = df["height_cm"].quantile(0.25)
q3 = df["height_cm"].quantile(0.75)
IQR = q3 - q1

A column’s standard deviation and variance are given by

df["height_cm"].std()
df["height_cm"].var()
np.float64(46.7683158241558)

And the standard error of the mean (SEM) with

df["height_cm"].sem()
np.float64(0.08879229764682213)
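
The SEM is just the standard deviation divided by the square root of the sample size, so - as a quick sanity check, and a chance to use numpy directly - the following should give essentially the same number:

import numpy as np
df["height_cm"].std() / np.sqrt(df["height_cm"].count())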

You can calculate the skewness and kurtosis (heaviness of the tails) of a sample with

df["height_cm"].skew()
df["height_cm"].kurt()
np.float64(-0.4338044567190438)

Altogether, you can see a nice statistical summary with

df["height_cm"].describe()
count    5932.000000
mean      183.041301
std         6.838736
min       160.000000
25%       178.000000
50%       183.000000
75%       188.000000
max       206.000000
Name: height_cm, dtype: float64

Measures of correlation

If you’ve got two numeric variables, you might want to examine covariance and correlation. These indicate how strongly the variables are linearly related. We’ll need to use the df["age"] variable as well.

The covariance between “height_cm” and “age” is

df["height_cm"].cov(df["age"])
np.float64(0.5126608276592355)

The .cov() function compares the column it’s attached to (here df["height_cm"]) with the column you input (here df["age"]). This means we could swap the columns without issue:

df["age"].cov(df["height_cm"])

Similarly, we can find the Pearson correlation coefficient between two columns.

df["height_cm"].corr(df["age"])
np.float64(0.01682597901197298)

You can also specify “kendall” or “spearman” for their respective correlation coefficients

df["height_cm"].corr(df["age"], method = "kendall")
df["height_cm"].corr(df["age"], method = "spearman")
np.float64(0.007604345289158663)
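
If you want all the pairwise correlations at once, you can instead call .corr() on a dataframe of numeric columns, which returns a correlation matrix (Pearson by default):

df[["height_cm", "age"]].corr()
           height_cm       age
height_cm   1.000000  0.016826
age         0.016826  1.000000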

Reminder about groupbys

Before we move to inferential statistics, it’s worth reiterating the power of groupbys discussed in the second workshop.

To group by a specific variable, like “positions”, we use

gb = df.groupby("positions")

By applying our statistics to the gb object, we’ll apply them to every variable for each position. Note that we should specify numeric_only = True, because these statistics won’t work for non-numeric variables

gb.mean(numeric_only = True)
             height_cm        age
positions
Attack      180.802673  25.061108
Defender    184.193269  25.716471
Goalkeeper  190.668508  26.587017
Midfield    180.497017  25.201671
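
You can also pick out a single column from the grouped object and request several statistics at once with .agg(); a minimal sketch:

gb["height_cm"].agg(["count", "mean", "std"])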

Inferential Statistics

Inferential statistics requires using the module scipy.stats, which we’ll bring in with

import scipy.stats as stats

Simple linear regressions

Least-squares regression for two sets of measurements can be performed with the function stats.linregress()

stats.linregress(x = df["age"], y = df["height_cm"])
LinregressResult(slope=np.float64(0.025827494764561896), intercept=np.float64(182.38260451315895), rvalue=np.float64(0.01682597901197298), pvalue=np.float64(0.19506275453364344), stderr=np.float64(0.019930266529602007), intercept_stderr=np.float64(0.5159919571772644))

If we store this as a variable, we can access the different values with the . operator. For example, the p-value is

lm = stats.linregress(x = df["age"], y = df["height_cm"])
lm.pvalue
np.float64(0.19506275453364344)

Plotting it

Naturally, you’d want to plot this. We’ll need to use the overlaying techniques from the visualisation session. Let’s import seaborn and matplotlib

import seaborn as sns
import matplotlib.pyplot as plt

Start by making a scatterplot of the data,

sns.relplot(data = df, x = "age", y = "height_cm")

Then, you’ll need to plot the regression as a line. For reference,

y = slope × x + intercept

So

sns.relplot(data = df, x = "age", y = "height_cm")

x_lm = df["age"]
y_lm = lm.slope*x_lm + lm.intercept
sns.lineplot(x = x_lm, y = y_lm, color = "r")
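
Alternatively, seaborn can fit and overlay a least-squares line for you; this uses its own internal fit rather than our lm object, but should look essentially the same:

sns.lmplot(data = df, x = "age", y = "height_cm")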

t-tests

We can also perform t-tests with the scipy.stats module. Typically, this is done to examine the statistical significance of a difference between two samples’ means. Let’s examine whether that earlier groupby result is accurate for heights; specifically, are goalkeepers taller than non-goalkeepers?

Let’s start by separating the goalkeepers from the non-goalkeepers into two variables

goalkeepers = df[df["positions"] == "Goalkeeper"]
non_goalkeepers = df[df["positions"] != "Goalkeeper"]

The t-test for the means of two independent samples is given by

stats.ttest_ind(goalkeepers["height_cm"], non_goalkeepers["height_cm"])
TtestResult(statistic=np.float64(35.2144964816995), pvalue=np.float64(7.551647917141636e-247), df=np.float64(5930.0))

This yields a p-value of roughly 8 × 10⁻²⁴⁷, which is extremely strong evidence against the null hypothesis (that the mean heights are the same).
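
Note that ttest_ind assumes equal variances by default; if you don’t want to make that assumption, scipy.stats can perform Welch’s t-test instead via the equal_var argument:

stats.ttest_ind(goalkeepers["height_cm"], non_goalkeepers["height_cm"], equal_var = False)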

ANOVAs

What about the means of the other three? We could use an ANOVA to examine them. We use the stats.f_oneway() function for this. However, it requires us to pass in a separate sample for each group, so we should split out the three remaining positions.

defender = df[df["positions"] == "Defender"].height_cm
midfield = df[df["positions"] == "Midfield"].height_cm
attack = df[df["positions"] == "Attack"].height_cm

We can then perform the ANOVA on these samples

stats.f_oneway(defender, midfield, attack)
F_onewayResult(statistic=np.float64(199.74794987909772), pvalue=np.float64(2.6216423274516227e-84))

With p ≈ 3 × 10⁻⁸⁴, it looks like the mean heights of the three positions are not all the same.
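
The ANOVA only tells us that the group means aren’t all equal, not which groups differ. One rough follow-up (ignoring any multiple-comparison correction, which a statistician may well want you to apply) is pairwise t-tests:

stats.ttest_ind(defender, midfield)
stats.ttest_ind(defender, attack)
stats.ttest_ind(midfield, attack)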

χ² tests

χ² tests are useful for examining the relationship between categorical variables by comparing the frequencies of each category. Often, you’d use this if you can make a contingency table.

We only have one useful categorical variable here, “positions” (the others have too many unique values), so we’ll need to create another. Let’s see if there’s a relationship between players’ positions and names with the letter “a”.

Make a binary column for players with the letter “a” in their names. To do this, we apply a string method to the “name” column as follows

df["a_in_name"] = df["name"].str.contains("a")

Let’s cross tabulate positions with this new column

a_vs_pos = pd.crosstab(df["positions"],df["a_in_name"])
print(a_vs_pos)
a_in_name   False  True 
positions               
Attack        291   1280
Defender      355   1606
Goalkeeper    149    575
Midfield      312   1364

The χ² test’s job is to examine whether players’ positions depend on the presence of “a” in their name. To evaluate it we need to send the contingency table in:

stats.chi2_contingency(a_vs_pos)
Chi2ContingencyResult(statistic=np.float64(2.1808405074930404), pvalue=np.float64(0.5357320466340114), dof=3, expected_freq=array([[ 293.17211733, 1277.82788267],
       [ 365.9519555 , 1595.0480445 ],
       [ 135.10923803,  588.89076197],
       [ 312.76668914, 1363.23331086]]))
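
As with the other scipy.stats functions, you can store the result and pull out individual pieces with the . operator (in recent versions of scipy the fields are named attributes; older versions return a plain tuple):

chi2_res = stats.chi2_contingency(a_vs_pos)
chi2_res.pvalue
np.float64(0.5357320466340114)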

More complex modelling

If you need to do more advanced statistics, particularly if you need more sophisticated regression models, you’ll likely need to turn to a different package: statsmodels. It is designed specifically for statistical modelling.

We’ll go through three examples

  1. Simple linear regressions (like before)
  2. Multiple linear regressions
  3. Logistic regressions

What’s nice about statsmodels is that it gives an R-like interface and summaries.

To start with, let’s import the tools. We’ll use the formula interface, which offers us an R-like way of creating models.

import statsmodels.formula.api as smf

Simple linear regressions revisited

Let’s perform the same linear regression as before, looking at the “age” and “height_cm” variables. Our thinking is that players’ heights dictate how long they can play, so we’ll make x = height_cm and y = age.

The first step is to set up the model. We’ll use the function smf.ols() for ordinary least squares. It takes two inputs:

  • The formula string, in the form y ~ X1 + X2 ...
  • The data

We create the model and compute the fit

mod = smf.ols("age ~ height_cm", df)
res = mod.fit()

Done! Let’s take a look at the results

res.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    age   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.679
Date:                Fri, 16 May 2025   Prob (F-statistic):              0.195
Time:                        05:12:16   Log-Likelihood:                -17279.
No. Observations:                5932   AIC:                         3.456e+04
Df Residuals:                    5930   BIC:                         3.457e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     23.4973      1.549     15.165      0.000      20.460      26.535
height_cm      0.0110      0.008      1.296      0.195      -0.006       0.028
==============================================================================
Omnibus:                      204.256   Durbin-Watson:                   0.269
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              206.613
Skew:                           0.427   Prob(JB):                     1.36e-45
Kurtosis:                       2.671   Cond. No.                     4.91e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.


That’s a lot nicer than with scipy. We can also make a plot by getting the model’s y values with res.fittedvalues

sns.relplot(data = df, x = "height_cm", y = "age")
sns.lineplot(x = df["height_cm"], y = res.fittedvalues, color = "black")
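
The formula interface also makes the second example on our list, multiple linear regressions, almost identical: just add more terms to the right-hand side of the formula. As a sketch (the mod_multi and res_multi names are just illustrative), we could add “positions” as an extra predictor; statsmodels dummy-codes categorical variables automatically:

mod_multi = smf.ols("age ~ height_cm + positions", df)
res_multi = mod_multi.fit()
res_multi.summary()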

Generalised linear models

The statsmodels module has lots of advanced statistical models available. We’ll take a look at one more: Generalised Linear Models. The distributions they include are

  • Binomial
  • Poisson
  • Negative Binomial
  • Gaussian (Normal)
  • Gamma
  • Inverse Gaussian
  • Tweedie

We’ll use the binomial option to create logistic regressions.

Logistic regressions model binary outcome data. For us, we can compare the heights of goalkeepers vs non-goalkeepers again. Let’s make a new column which is 1 for goalkeepers and 0 for non-goalkeepers:

df["gk"] = (df["positions"] == "Goalkeeper")*1

Multiplying by 1 turns True into 1 and False into 0.
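
An equivalent, arguably more explicit, way to do the same conversion is .astype(int):

df["gk"] = (df["positions"] == "Goalkeeper").astype(int)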

Now, we can model this column with height. Specifically,

gk ~ height_cm

Start by making the model with the function smf.glm(). We need to specify the family of distributions; these live in sm.families, so we also need to import statsmodels.api:

import statsmodels.api as sm
mod = smf.glm("gk ~ height_cm", data = df, family = sm.families.Binomial())

Next, evaluate the results

res = mod.fit()

Let’s have a look at the summary:

res.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                     gk   No. Observations:                 5932
Model:                            GLM   Df Residuals:                     5930
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -1583.5
Date:                Fri, 16 May 2025   Deviance:                       3167.0
Time:                        05:12:17   Pearson chi2:                 4.02e+03
No. Iterations:                     7   Pseudo R-squ. (CS):             0.1879
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -53.2336      1.927    -27.622      0.000     -57.011     -49.456
height_cm      0.2745      0.010     26.938      0.000       0.255       0.294
==============================================================================
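
One way to interpret the height_cm coefficient, which is on the log-odds scale, is to exponentiate it into an odds ratio: each extra centimetre of height multiplies the odds of being a goalkeeper by about exp(0.2745) ≈ 1.32. A quick way to do this for all the parameters:

import numpy as np
np.exp(res.params)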

Finally, we can plot the result like before. Here the fitted values are the model’s predicted probabilities of being a goalkeeper at each height.

sns.relplot(data = df, x = "height_cm", y = "gk")
sns.lineplot(x = df["height_cm"], y = res.fittedvalues, color = "black")