import pandas as pd
df = pd.read_csv("data/Players2024.csv")
df = df[df["positions"] != "Missing"]
df = df[df["height_cm"] > 100]
Statistics
This session is aimed as an overview of how to perform some statistical modelling with Python. It is a Python workshop, not a statistics workshop - if you’d like to better understand the statistical models, or need help deciding what’s best for you, please consult a statistics resource or contact a statistician.
In this session, we’ll cover
- Descriptive statistics
  - Measures of central tendency
  - Measures of variability
  - Measures of correlation
- Inferential statistics
  - Linear regressions
  - Calculating confidence intervals
  - t-tests
  - ANOVAs
  - χ² tests
We’ll use three new modules:
- numpy
- scipy.stats
- statsmodels
We’ll be working from our “Players2024” dataset again. To bring it in and clean it up, we use the import and filtering code shown at the top of this page.
Descriptive Statistics
We’ll start with sample size. All dataframes have most descriptive statistics functions available right off the bat, which we access via the . operator.
To calculate the number of non-empty observations in a column, say the numeric variable df["height_cm"], we use the .count() method:
"height_cm"].count() df[
np.int64(5932)
Measures of central tendency
We can compute measures of central tendency similarly. The average value is given by
"height_cm"].mean() df[
np.float64(183.04130141604855)
the median by
"height_cm"].median() df[
np.float64(183.0)
and the mode by
"height_cm"].mode() df[
0 185.0
Name: height_cm, dtype: float64
.mode() returns a Series with the most frequent values, since there can be more than one.
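To see what “more than one” looks like, here is a quick illustrative sketch on a small made-up series (not from our dataset): both values are tied for most frequent, so both are returned.
pd.Series([178, 178, 185, 185]).mode()   # returns both 178 and 185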
Measures of variance
We can also compute measures of variance. The minimum and maximum are as expected
"height_cm"].min()
df["height_cm"].max() df[
np.float64(206.0)
The range is the difference
"height_cm"].min() - df["height_cm"].max() df[
np.float64(-46.0)
Quantiles are given by .quantile(...) with the fraction inside. The inter-quartile range (IQR) is the difference between the 25% and 75% quantiles.
= df["height_cm"].quantile(0.25)
q1 = df["height_cm"].quantile(0.75)
q3 = q3 - q1 IQR
A column’s standard deviation and variance are given by
"height_cm"].std()
df["height_cm"].var() df[
np.float64(46.7683158241558)
And the standard error of the mean (SEM) with
"height_cm"].sem() df[
np.float64(0.08879229764682213)
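The session overview mentions confidence intervals, so here is a minimal sketch of one way to get a 95% confidence interval for the mean height. It assumes a normal approximation and uses scipy.stats, which we import properly in the inferential statistics section below.
import scipy.stats as stats
mean = df["height_cm"].mean()
sem = df["height_cm"].sem()
stats.norm.interval(0.95, loc = mean, scale = sem)   # (lower, upper) bounds of the 95% CI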
You can calculate the skewness and kurtosis (variation of tails) of a sample with
"height_cm"].skew()
df["height_cm"].kurt() df[
np.float64(-0.4338044567190438)
All together, you can see a nice statistical summary with
"height_cm"].describe() df[
count 5932.000000
mean 183.041301
std 6.838736
min 160.000000
25% 178.000000
50% 183.000000
75% 188.000000
max 206.000000
Name: height_cm, dtype: float64
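If you want the same summary for every column at once, .describe() also works on the whole dataframe (by default it only summarises the numeric columns):
df.describe()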
Measures of correlation
If you’ve got two numeric variables, you might want to examine covariance and correlation. These indicate how strongly the variables are linearly related. We’ll need to use the df["age"] variable as well.
The covariance between “height_cm” and “age” is
"height_cm"].cov(df["age"]) df[
np.float64(0.5126608276592355)
The .cov() method compares the column it’s attached to (here df["height_cm"]) with the column you input (here df["age"]). This means we could swap the columns without issue:
df["age"].cov(df["height_cm"])
Similarly, we can find the Pearson correlation coefficient between two columns.
"height_cm"].corr(df["age"]) df[
np.float64(0.01682597901197298)
You can also specify “kendall” or “spearman” for their respective correlation coefficients
"height_cm"].corr(df["age"], method = "kendall")
df["height_cm"].corr(df["age"], method = "spearman") df[
np.float64(0.007604345289158663)
Reminder about groupbys
Before we move to inferential statistics, it’s worth reiterating the power of groupbys discussed in the second workshop.
To group by a specific variable, like “positions”, we use
gb = df.groupby("positions")
By applying our statistics to the gb object, we’ll apply them to every variable for each position. Note that we should specify numeric_only = True, because these statistics won’t work for non-numeric variables.
gb.mean(numeric_only = True)
| positions | height_cm | age |
|---|---|---|
| Attack | 180.802673 | 25.061108 |
| Defender | 184.193269 | 25.716471 |
| Goalkeeper | 190.668508 | 26.587017 |
| Midfield | 180.497017 | 25.201671 |
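The same pattern works for the other descriptive statistics we saw above; for example, the spread of each numeric variable within each position:
gb.std(numeric_only = True)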
Inferential Statistics
Inferential statistics requires the module scipy.stats, which we’ll bring in with
import scipy.stats as stats
Simple linear regressions
Least-squares regression for two sets of measurements can be performed with the function stats.linregress():
= df["age"], y = df["height_cm"]) stats.linregress(x
LinregressResult(slope=np.float64(0.025827494764561896), intercept=np.float64(182.38260451315895), rvalue=np.float64(0.01682597901197298), pvalue=np.float64(0.19506275453364344), stderr=np.float64(0.019930266529602007), intercept_stderr=np.float64(0.5159919571772644))
If we store this as a variable, we can access the different values with the . operator. For example, the p-value is
lm = stats.linregress(x = df["age"], y = df["height_cm"])
lm.pvalue
np.float64(0.19506275453364344)
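The other numbers in the result are accessed the same way; for example, the slope and the correlation coefficient:
lm.slope
lm.rvalue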
Plotting it
Naturally, you’d want to plot this. We’ll need to use the overlaying techniques from the visualisation session. Let’s import seaborn and matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
Start by making a scatterplot of the data,
= df, x = "age", y = "height_cm") sns.relplot(data
Then, you’ll need to plot the regression as a line. For reference, a straight line follows the equation y = slope * x + intercept, and our stored lm result contains both the slope and the intercept. Putting it all together:
= df, x = "age", y = "height_cm")
sns.relplot(data
= df["age"]
x_lm = lm.slope*x_lm + lm.intercept
y_lm = x_lm, y = y_lm, color = "r") sns.lineplot(x
t-tests
We can also perform t-tests with the scipy.stats module. Typically, these are performed to examine the statistical significance of a difference between two samples’ means. Let’s examine whether the earlier groupby result is accurate for heights; specifically, are goalkeepers taller than non-goalkeepers?
Let’s start by separating the goalkeepers from the non-goalkeepers in two variables
= df[df["positions"] == "Goalkeeper"]
goalkeepers = df[df["positions"] != "Goalkeeper"] non_goalkeepers
The two-sample t-test is performed with the stats.ttest_ind() function:
"height_cm"], non_goalkeepers["height_cm"]) stats.ttest_ind(goalkeepers[
TtestResult(statistic=np.float64(35.2144964816995), pvalue=np.float64(7.551647917141636e-247), df=np.float64(5930.0))
Yielding a p-value of roughly 7.6e-247, which is effectively zero: the difference in mean heights between goalkeepers and non-goalkeepers is statistically significant.
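Note that stats.ttest_ind() is two-sided by default. Since our question (“are goalkeepers taller?”) only looks in one direction, you could also run a one-sided version by passing the alternative argument (available in recent versions of scipy):
stats.ttest_ind(goalkeepers["height_cm"], non_goalkeepers["height_cm"], alternative = "greater")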
ANOVAs
What about the means of the other three? We could use an ANOVA to examine them. We use the stats.f_oneway() function for this. However, this requires us to pass in a separate sample for each group, so we should separate the three positions.
= df[df["positions"] == "Defender"].height_cm
defender = df[df["positions"] == "Midfield"].height_cm
midfield = df[df["positions"] == "Attack"].height_cm attack
We can then perform the ANOVA on this list of samples
stats.f_oneway(defender, midfield, attack)
F_onewayResult(statistic=np.float64(199.74794987909772), pvalue=np.float64(2.6216423274516227e-84))
With a p-value of roughly 2.6e-84, there is a statistically significant difference between the mean heights of the three positions.
χ² tests
χ² tests examine whether two categorical variables are related. We only have one useful categorical variable here, “positions” (the others have too many unique values), so we’ll need to create another. Let’s see if there’s a relationship between players’ positions and names with the letter “a”.
Make a binary column for players with the letter “a” in their names. To do this, we apply a string method to every value in the “name” column as follows
"a_in_name"] = df["name"].str.contains("a") df[
Let’s cross tabulate positions with this new column
a_vs_pos = pd.crosstab(df["positions"], df["a_in_name"])
print(a_vs_pos)
a_in_name False True
positions
Attack 291 1280
Defender 355 1606
Goalkeeper 149 575
Midfield 312 1364
The χ² test is performed with the stats.chi2_contingency() function, which takes the contingency table as its input:
stats.chi2_contingency(a_vs_pos)
Chi2ContingencyResult(statistic=np.float64(2.1808405074930404), pvalue=np.float64(0.5357320466340114), dof=3, expected_freq=array([[ 293.17211733, 1277.82788267],
[ 365.9519555 , 1595.0480445 ],
[ 135.10923803, 588.89076197],
[ 312.76668914, 1363.23331086]]))
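With a p-value of roughly 0.54, there is no evidence of a relationship between position and having an “a” in the name. If you store the result, the individual values can be pulled out with the . operator, just like before:
chi2_res = stats.chi2_contingency(a_vs_pos)
chi2_res.pvalue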
More complex modelling
If you need to do more advanced statistics, particularly more complex regressions, you’ll likely need to turn to a different package: statsmodels. It is designed specifically for statistical modelling.
We’ll go through three examples
- Simple linear regressions (like before)
- Multiple linear regressions
- Logistic regressions
What’s nice about statsmodels is that it gives an R-like interface and summaries.
To start with, let’s import the tools. We’ll use the formula interface, which offers us an R-like way of creating models.
import statsmodels.formula.api as smf
Simple linear regressions revisited
Let’s perform the same linear regression as before, looking at the “age” and “height_cm” variables. Our thinking is that players’ heights dictate how long they can play, so we’ll make “age” the response variable and “height_cm” the predictor.
The first step is to set up the model. We’ll use the function smf.ols() for ordinary least squares. It takes in two inputs:
- The formula string, in the form y ~ X1 + X2 + ...
- The data
We create the model and compute the fit
mod = smf.ols("age ~ height_cm", df)
res = mod.fit()
Done! Let’s take a look at the results
res.summary()
| Dep. Variable: | age | R-squared: | 0.000 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.000 |
| Method: | Least Squares | F-statistic: | 1.679 |
| Date: | Fri, 16 May 2025 | Prob (F-statistic): | 0.195 |
| Time: | 05:12:16 | Log-Likelihood: | -17279. |
| No. Observations: | 5932 | AIC: | 3.456e+04 |
| Df Residuals: | 5930 | BIC: | 3.457e+04 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 23.4973 | 1.549 | 15.165 | 0.000 | 20.460 | 26.535 |
| height_cm | 0.0110 | 0.008 | 1.296 | 0.195 | -0.006 | 0.028 |

| Omnibus: | 204.256 | Durbin-Watson: | 0.269 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 206.613 |
| Skew: | 0.427 | Prob(JB): | 1.36e-45 |
| Kurtosis: | 2.671 | Cond. No. | 4.91e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
That’s a lot nicer than with scipy. We can also make a plot by getting the model’s fitted values with res.fittedvalues:
= df, x = "height_cm", y = "age")
sns.relplot(data = df["height_cm"], y = res.fittedvalues, color = "black") sns.lineplot(x
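The fitted results object also exposes the individual numbers from the summary directly, which is handy if you need them in later code; for example:
res.params      # intercept and slope as a pandas Series
res.pvalues     # p-value for each coefficient
res.rsquared    # the R-squared value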
Generalised linear models
The statsmodels module has lots of advanced statistical models available. We’ll take a look at one more: Generalised Linear Models. The distributions they include are
- Binomial
- Poisson
- Negative Binomial
- Gaussian (Normal)
- Gamma
- Inverse Gaussian
- Tweedie
We’ll use the binomial option to create logistic regressions.
Logistic regressions examine the distribution of binary data. For us, we can compare the heights of goalkeepers vs non-goalkeepers again. Let’s make a new column which is 1 for goalkeepers and 0 for non-goalkeepers:
"gk"] = (df["positions"] == "Goalkeeper")*1 df[
Multiplying by 1 turns True into 1 and False into 0.
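As a quick sanity check (optional), you can count how many players ended up in each group:
df["gk"].value_counts()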
Now, we can model this column with height. Specifically, we’ll model the probability that a player is a goalkeeper as a function of their height.
Start by making the model with the function smf.glm(). We need to specify the family of distributions; they all live in sm.families, which comes from a different submodule that we should import:
import statsmodels.api as sm
mod = smf.glm("gk ~ height_cm", data = df, family = sm.families.Binomial())
Next, evaluate the results
res = mod.fit()
Let’s have a look at the summary:
res.summary()
| Dep. Variable: | gk | No. Observations: | 5932 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 5930 |
| Model Family: | Binomial | Df Model: | 1 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -1583.5 |
| Date: | Fri, 16 May 2025 | Deviance: | 3167.0 |
| Time: | 05:12:17 | Pearson chi2: | 4.02e+03 |
| No. Iterations: | 7 | Pseudo R-squ. (CS): | 0.1879 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -53.2336 | 1.927 | -27.622 | 0.000 | -57.011 | -49.456 |
| height_cm | 0.2745 | 0.010 | 26.938 | 0.000 | 0.255 | 0.294 |
Finally, we can plot the result like before
= df, x = "height_cm", y = "gk")
sns.relplot(data = df["height_cm"], y = res.fittedvalues, color = "black") sns.lineplot(x
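If you want predicted probabilities at particular heights rather than a plot, the fitted model can also be used with .predict(). A minimal sketch, where the heights 170 cm and 200 cm are just illustrative values:
new_heights = pd.DataFrame({"height_cm": [170, 200]})
res.predict(new_heights)   # predicted probability of being a goalkeeper at each height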