Statistics

This session is intended as an overview of how to perform some statistical modelling with Python. It is a Python workshop, not a statistics workshop - if you’d like to better understand the statistical models, or need help deciding what’s best for your data, please consult a statistics resource or contact a statistician.

In this session, we’ll cover descriptive statistics, inferential statistics and some more complex statistical modelling.

We’ll use three new modules:

  • numpy
  • scipy.stats
  • statsmodels
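
As a rough preview, these are the import aliases used later in the session (numpy isn’t imported explicitly in the examples below, but np is its conventional alias and it appears in the printed output):

import numpy as np
import scipy.stats as stats
import statsmodels.formula.api as smf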

We’ll be working from our “Players2024” dataset again. To bring it in and clean it up,

import pandas as pd

df = pd.read_csv("data/Players2024.csv")
df = df[df["positions"] != "Missing"]
df = df[df["height_cm"] > 100]

Descriptive Statistics

We’ll start with sample size. Most descriptive statistics are available as dataframe methods right off the bat, which we access via the . operator.

To calculate the number of non-empty observations in a column, say the numeric variable df["height_cm"], we use the .count() method

df["height_cm"].count()
np.int64(5932)

Measures of central tendency

We can compute measures of central tendency similarly. The average value is given by

df["height_cm"].mean()
np.float64(183.04130141604855)

the median by

df["height_cm"].median()
np.float64(183.0)

and the mode by

df["height_cm"].mode()
0    185.0
Name: height_cm, dtype: float64

.mode() returns a Series rather than a single value, as there can be multiple most frequent values.
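
For example, with a small hypothetical series where two heights tie for most frequent, both are returned:

pd.Series([178, 178, 183, 183, 190]).mode()
0    178
1    183
dtype: int64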

Measures of variance

We can also compute measures of variance. The minimum and maximum are as expected

df["height_cm"].min()
df["height_cm"].max()
np.float64(206.0)

The range is the difference between the maximum and minimum

df["height_cm"].max() - df["height_cm"].min()
np.float64(46.0)

Quantiles are given by .quantile(...) with the fraction inside. The inter-quartile range (IQR) is the difference between 25% and 75%.

q1 = df["height_cm"].quantile(0.25)
q3 = df["height_cm"].quantile(0.75)
IQR = q3 - q1

A column’s standard deviation and variance are given by

df["height_cm"].std()
df["height_cm"].var()
np.float64(46.7683158241558)

And the standard error of the mean (SEM) with

df["height_cm"].sem()
np.float64(0.08879229764682213)
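
The SEM is just the standard deviation divided by the square root of the sample size, so - as a quick sanity check, and a chance to use numpy directly - the following should give essentially the same number:

import numpy as np
df["height_cm"].std() / np.sqrt(df["height_cm"].count())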

You can calculate the skewness and kurtosis (heaviness of the tails) of a sample with

df["height_cm"].skew()
df["height_cm"].kurt()
np.float64(-0.4338044567190438)

Altogether, you can see a nice statistical summary with

df["height_cm"].describe()
count    5932.000000
mean      183.041301
std         6.838736
min       160.000000
25%       178.000000
50%       183.000000
75%       188.000000
max       206.000000
Name: height_cm, dtype: float64

Measures of correlation

If you’ve got two numeric variables, you might want to examine covariance and correlation. These indicate how strongly the variables are linearly related. We’ll need to use the df["age"] variable as well.

The covariance between “height_cm” and “age” is

df["height_cm"].cov(df["age"])
np.float64(0.5126608276592355)

The .cov() function compares the column it’s attached to (here df["height_cm"]) with the column you input (here df["age"]). This means we could swap the columns without issue:

df["age"].cov(df["height_cm"])

Similarly, we can find the Pearson correlation coefficient between two columns.

df["height_cm"].corr(df["age"])
np.float64(0.01682597901197298)

You can also specify “kendall” or “spearman” for their respective correlation coefficients

df["height_cm"].corr(df["age"], method = "kendall")
df["height_cm"].corr(df["age"], method = "spearman")
np.float64(0.007604345289158663)
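
If you want all the pairwise correlations at once, you can instead call .corr() on a dataframe of numeric columns, which returns a correlation matrix (Pearson by default):

df[["height_cm", "age"]].corr()
           height_cm       age
height_cm   1.000000  0.016826
age         0.016826  1.000000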

Reminder about groupbys

Before we move to inferential statistics, it’s worth reiterating the power of groupbys discussed in the second workshop.

To group by a specific variable, like “positions”, we use

gb = df.groupby("positions")

By applying our statistics to the gb object, we’ll apply them to every variable for each position. Note that we should specify numeric_only = True, because these statistics won’t work for non-numeric variables

gb.mean(numeric_only = True)
             height_cm        age
positions
Attack      180.802673  25.061108
Defender    184.193269  25.716471
Goalkeeper  190.668508  26.587017
Midfield    180.497017  25.201671
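
You can also pick out a single column from the grouped object and request several statistics at once with .agg(); a minimal sketch:

gb["height_cm"].agg(["count", "mean", "std"])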

Inferential Statistics

Inferential statistics requires using the module scipy.stats, which we’ll bring in with

import scipy.stats as stats

Simple linear regressions

Least-squares regression for two sets of measurements can be performed with the function stats.linregress()

stats.linregress(x = df["age"], y = df["height_cm"])
LinregressResult(slope=np.float64(0.025827494764561896), intercept=np.float64(182.38260451315895), rvalue=np.float64(0.01682597901197298), pvalue=np.float64(0.19506275453364344), stderr=np.float64(0.019930266529602007), intercept_stderr=np.float64(0.5159919571772644))

If we store this as a variable, we can access the different values with the . operator. For example, the p-value is

lm = stats.linregress(x = df["age"], y = df["height_cm"])
lm.pvalue
np.float64(0.19506275453364344)

Plotting it

Naturally, you’d want to plot this. We’ll need to use the overlaying techniques from the visualisation session. Let’s import seaborn and matplotlib

import seaborn as sns
import matplotlib.pyplot as plt

Start by making a scatterplot of the data,

sns.relplot(data = df, x = "age", y = "height_cm")

Then, you’ll need to plot the regression as a line. For reference,

y = slope × x + intercept

So

sns.relplot(data = df, x = "age", y = "height_cm")

x_lm = df["age"]
y_lm = lm.slope*x_lm + lm.intercept
sns.lineplot(x = x_lm, y = y_lm, color = "r")
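
Alternatively, seaborn can fit and overlay a least-squares line for you; this uses its own internal fit rather than our lm object, but should look essentially the same:

sns.lmplot(data = df, x = "age", y = "height_cm")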

t-tests

We can also perform t-tests with the scipy.stats module. Typically, this is done to examine the statistical significance of a difference between two samples’ means. Let’s examine whether that earlier groupby result is accurate for heights; specifically, are goalkeepers taller than non-goalkeepers?

Let’s start by separating the goalkeepers from the non-goalkeepers into two variables

goalkeepers = df[df["positions"] == "Goalkeeper"]
non_goalkeepers = df[df["positions"] != "Goalkeeper"]

The t-test for the means of two independent samples is given by

stats.ttest_ind(goalkeepers["height_cm"], non_goalkeepers["height_cm"])
TtestResult(statistic=np.float64(35.2144964816995), pvalue=np.float64(7.551647917141636e-247), df=np.float64(5930.0))

This yields a p-value of roughly 8 × 10⁻²⁴⁷, which is extremely strong evidence against the null hypothesis (that the mean heights are the same).
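
Note that ttest_ind assumes equal variances by default; if you don’t want to make that assumption, scipy.stats can perform Welch’s t-test instead via the equal_var argument:

stats.ttest_ind(goalkeepers["height_cm"], non_goalkeepers["height_cm"], equal_var = False)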

ANOVAs

What about the means of the other three? We could use an ANOVA to examine them. We use the stats.f_oneway() function for this. However, it requires us to pass in a separate sample for each group, so we should split out the three remaining positions.

defender = df[df["positions"] == "Defender"].height_cm
midfield = df[df["positions"] == "Midfield"].height_cm
attack = df[df["positions"] == "Attack"].height_cm

We can then perform the ANOVA on these samples

stats.f_oneway(defender, midfield, attack)
F_onewayResult(statistic=np.float64(199.74794987909772), pvalue=np.float64(2.6216423274516227e-84))

With p ≈ 3 × 10⁻⁸⁴, it looks like the mean heights of the three positions are not all the same.
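
The ANOVA only tells us that the group means aren’t all equal, not which groups differ. One rough follow-up (ignoring any multiple-comparison correction, which a statistician may well want you to apply) is pairwise t-tests:

stats.ttest_ind(defender, midfield)
stats.ttest_ind(defender, attack)
stats.ttest_ind(midfield, attack)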

χ² tests

χ² tests are useful for examining the relationship between categorical variables by comparing the frequencies of each category. Often, you’d use this if you can make a contingency table.

We only have one useful categorical variable here, “positions” (the others have too many unique values), so we’ll need to create another. Let’s see if there’s a relationship between players’ positions and names with the letter “a”.

Make a binary column for players with the letter “a” in their names. To do this, we apply a string method to the “name” column as follows

df["a_in_name"] = df["name"].str.contains("a")

Let’s cross tabulate positions with this new column

a_vs_pos = pd.crosstab(df["positions"],df["a_in_name"])
print(a_vs_pos)
a_in_name   False  True 
positions               
Attack        291   1280
Defender      355   1606
Goalkeeper    149    575
Midfield      312   1364

The χ² test’s job is to examine whether players’ positions depend on the presence of “a” in their name. To evaluate it we need to send the contingency table in:

stats.chi2_contingency(a_vs_pos)
Chi2ContingencyResult(statistic=np.float64(2.1808405074930404), pvalue=np.float64(0.5357320466340114), dof=3, expected_freq=array([[ 293.17211733, 1277.82788267],
       [ 365.9519555 , 1595.0480445 ],
       [ 135.10923803,  588.89076197],
       [ 312.76668914, 1363.23331086]]))
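
As with the other scipy.stats functions, you can store the result and pull out individual pieces with the . operator (in recent versions of scipy the fields are named attributes; older versions return a plain tuple):

chi2_res = stats.chi2_contingency(a_vs_pos)
chi2_res.pvalue
np.float64(0.5357320466340114)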

More complex modelling

If you need to do more advanced statistics, particularly if you need more sophisticated regression models, you’ll likely need to turn to a different package: statsmodels. It is designed specifically for statistical modelling.

We’ll go through three examples

  1. Simple linear regressions (like before)
  2. Multiple linear regressions
  3. Logistic regressions

What’s nice about statsmodels is that it gives an R-like interface and summaries.

To start with, let’s import the tools. We’ll use the formula interface, which offers us an R-like way of creating models.

import statsmodels.formula.api as smf

Simple linear regressions revisited

Let’s perform the same linear regression as before, looking at the “age” and “height_cm” variables. Our thinking is that players’ heights dictate how long they can play, so we’ll make x = height_cm and y = age.

The first step is to set up the model. We’ll use the function smf.ols() for ordinary least squares. It takes two inputs:

  • The formula string, in the form y ~ X1 + X2 ...
  • The data

We create the model and compute the fit

mod = smf.ols("age ~ height_cm", df)
res = mod.fit()

Done! Let’s take a look at the results

res.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    age   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.679
Date:                Fri, 16 May 2025   Prob (F-statistic):              0.195
Time:                        05:12:16   Log-Likelihood:                -17279.
No. Observations:                5932   AIC:                         3.456e+04
Df Residuals:                    5930   BIC:                         3.457e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     23.4973      1.549     15.165      0.000      20.460      26.535
height_cm      0.0110      0.008      1.296      0.195      -0.006       0.028
==============================================================================
Omnibus:                      204.256   Durbin-Watson:                   0.269
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              206.613
Skew:                           0.427   Prob(JB):                     1.36e-45
Kurtosis:                       2.671   Cond. No.                     4.91e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.


That’s a lot nicer than with scipy. We can also make a plot by getting the model’s y values with res.fittedvalues

sns.relplot(data = df, x = "height_cm", y = "age")
sns.lineplot(x = df["height_cm"], y = res.fittedvalues, color = "black")
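
The formula interface also makes the second example on our list, multiple linear regressions, almost identical: just add more terms to the right-hand side of the formula. As a sketch (the mod_multi and res_multi names are just illustrative), we could add “positions” as an extra predictor; statsmodels dummy-codes categorical variables automatically:

mod_multi = smf.ols("age ~ height_cm + positions", df)
res_multi = mod_multi.fit()
res_multi.summary()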

Generalised linear models

The statsmodels module has lots of advanced statistical models available. We’ll take a look at one more: Generalised Linear Models. The distributions they include are

  • Binomial
  • Poisson
  • Negative Binomial
  • Gaussian (Normal)
  • Gamma
  • Inverse Gaussian
  • Tweedie

We’ll use the binomial option to create logistic regressions.

Logistic regressions model binary outcome data. For us, we can compare the heights of goalkeepers vs non-goalkeepers again. Let’s make a new column which is 1 for goalkeepers and 0 for non-goalkeepers:

df["gk"] = (df["positions"] == "Goalkeeper")*1

Multiplying by 1 turns True into 1 and False into 0.
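
An equivalent, arguably more explicit, way to do the same conversion is .astype(int):

df["gk"] = (df["positions"] == "Goalkeeper").astype(int)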

Now, we can model this column with height. Specifically,

gk ~ height_cm

Start by making the model with the function smf.glm(). We need to specify the family of distributions; these live in sm.families, so we also need to import statsmodels.api:

import statsmodels.api as sm
mod = smf.glm("gk ~ height_cm", data = df, family = sm.families.Binomial())

Next, evaluate the results

res = mod.fit()

Let’s have a look at the summary:

res.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                     gk   No. Observations:                 5932
Model:                            GLM   Df Residuals:                     5930
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -1583.5
Date:                Fri, 16 May 2025   Deviance:                       3167.0
Time:                        05:12:17   Pearson chi2:                 4.02e+03
No. Iterations:                     7   Pseudo R-squ. (CS):             0.1879
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -53.2336      1.927    -27.622      0.000     -57.011     -49.456
height_cm      0.2745      0.010     26.938      0.000       0.255       0.294
==============================================================================
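
One way to interpret the height_cm coefficient, which is on the log-odds scale, is to exponentiate it into an odds ratio: each extra centimetre of height multiplies the odds of being a goalkeeper by about exp(0.2745) ≈ 1.32. A quick way to do this for all the parameters:

import numpy as np
np.exp(res.params)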

Finally, we can plot the result like before. Here the fitted values are the model’s predicted probabilities of being a goalkeeper at each height.

sns.relplot(data = df, x = "height_cm", y = "gk")
sns.lineplot(x = df["height_cm"], y = res.fittedvalues, color = "black")