library(dplyr)
<- read.csv("data/Players2024.csv")
players <- players %>% filter(positions != "Missing", height_cm > 100) players
Statistics
This session is aimed as an overview of how to perform some statistical modelling with R. It is an R workshop, not a statistics workshop - if you’d like to better understand the statistical models, or need help deciding what’s best for you, please consult a statistics resource or contact a statistician.
In this session, we’ll cover
- Descriptive statistics
- Measures of central tendancy
- Measures of variability
- Measures of correlation
- Inferential statistics
- Linear regressions
- Calculating confidence intervals
- T-tests
- \(\chi^2\) test
- ANOVAs
We’ll be working from our “Players2024” dataset. After downloading it and putting it in your data folder, to bring it in and clean it up,
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Descriptive Statistics
We’ll start with sample size. To calculate the number of non-empty observations in a column, say the numeric variable players$height_cm
, we use the length()
function
length(players$height_cm)
[1] 5932
We can compute measures of central tendancy similarly. The average value is given by
mean(players$height_cm)
[1] 183.0413
and the median by
median(players$height_cm)
[1] 183
Measures of variance
We can also compute measures of variance. The minimum and maximum are as expected
min(players$height_cm)
[1] 160
max(players$height_cm)
[1] 206
The function range()
yields both
range(players$height_cm)
[1] 160 206
So the actual range, i.e. the difference, is
diff(range(players$height_cm))
[1] 46
Quartiles are given by quantile()
and the inter-quartile range (IQR) by IQR()
:
quantile(players$height_cm)
0% 25% 50% 75% 100%
160 178 183 188 206
IQR(players$height_cm)
[1] 10
A column’s standard deviation and variance are given by
sd(players$height_cm)
[1] 6.838736
var(players$height_cm)
[1] 46.76832
All together, you can see a nice statistical summary with
summary(players$height_cm)
Min. 1st Qu. Median Mean 3rd Qu. Max.
160 178 183 183 188 206
Measures of correlation
If you’ve got two numeric variables, you might want to examine covariance and correlation. These indicate how strongly the variables are linearly related. We’ll need to use the players$age
variable as well.
The covariance between “height_cm” and “age” is
cov(players$height_cm, players$age)
[1] 0.5126608
Similarly, we can find the Pearson correlation coefficient between two columns.
cor(players$height_cm, players$age)
[1] 0.01682598
You can also specify “kendall” or “spearman” for their respective correlation coefficients
cor(players$height_cm, players$age, method = "kendall")
[1] 0.005417946
cor(players$height_cm, players$age, method = "spearman")
[1] 0.007604345
Reminder about groupbys
Before we move to inferential statistics, it’s worth reiterating the power of groupbys discussed in the second workshop.
To group by a specific variable, like “positions”, we use
%>%
players group_by(positions)
# A tibble: 5,932 × 7
# Groups: positions [4]
name birth_date height_cm positions nationality age club
<chr> <chr> <dbl> <chr> <chr> <int> <chr>
1 James Milner 1986-01-04 175 Midfield England 38 Bright…
2 Anastasios Tsokanis 1991-05-02 176 Midfield Greece 33 Volou …
3 Jonas Hofmann 1992-07-14 176 Midfield Germany 32 Bayer …
4 Pepe Reina 1982-08-31 188 Goalkeeper Spain 42 Calcio…
5 Lionel Carole 1991-04-12 180 Defender France 33 Kayser…
6 Ludovic Butelle 1983-04-03 188 Goalkeeper France 41 Stade …
7 Daley Blind 1990-03-09 180 Defender Netherlands 34 Girona…
8 Craig Gordon 1982-12-31 193 Goalkeeper Scotland 41 Heart …
9 Dimitrios Sotiriou 1987-09-13 185 Goalkeeper Greece 37 Omilos…
10 Alessio Cragno 1994-06-28 184 Goalkeeper Italy 30 Associ…
# ℹ 5,922 more rows
By applying our statistics to the group_by
object, we’ll apply them to every variable for each position.
%>%
players group_by(positions) %>%
summarise(mean_height = mean(height_cm))
# A tibble: 4 × 2
positions mean_height
<chr> <dbl>
1 Attack 181.
2 Defender 184.
3 Goalkeeper 191.
4 Midfield 180.
Inferential Statistics
While descriptive statistics describes the data definitively, inferential statistics aim to produce models for extrapolating conlusions.
Simple linear regressions
Least-squares regression for two sets of measurements can be performed with the function lm
. Recall that linear regressions have the mathematical form
\[ Y = β_1 X + β_0 \]
and we use the regression tool to estimate the parameters \(β_0\,,β_1\). We can equivalently say that \(Y \sim X\), which is what R takes in
lm("height_cm ~ age", players)
Call:
lm(formula = "height_cm ~ age", data = players)
Coefficients:
(Intercept) age
182.38260 0.02583
If we store this as a variable, we can then produce a summary of the results
<- lm("height_cm ~ age", players)
model summary(model)
Call:
lm(formula = "height_cm ~ age", data = players)
Residuals:
Min 1Q Median 3Q Max
-23.028 -5.028 0.075 4.978 23.075
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 182.38260 0.51599 353.460 <2e-16 ***
age 0.02583 0.01993 1.296 0.195
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.838 on 5930 degrees of freedom
Multiple R-squared: 0.0002831, Adjusted R-squared: 0.0001145
F-statistic: 1.679 on 1 and 5930 DF, p-value: 0.1951
If you want to get specific parameters out, we can index with $
:
summary(model)$r.squared
[1] 0.0002831136
That’s a pretty shocking fit.
Plotting it
Naturally, you’d want to plot this. We’ll need to use techniques from the visualisation session. Let’s import ggplot2
library(ggplot2)
Start by making a scatterplot of the data,
ggplot(players, aes(x = height_cm, y = age)) +
geom_point()
Then, you’ll need to plot the regression as a line. For reference,
\[ y = \text{slope}\times x + \text{intercept}\]
So
<- model$coefficients[1]
b0 <- model$coefficients[2]
b1
ggplot(players, aes(x = age, y = height_cm)) +
geom_point() +
geom_abline(intercept = b0, slope = b1)
\(t\)-tests
We can also perform \(t\)-tests. Typically, these are performed to examine the statistical signficance of a difference between two samples’ means. Let’s examine whether that earlier groupby result for is accurate for heights, specifically, are goalkeepers taller than non-goalkeepers?
Let’s start by creating a new column with the values
FALSE |
Non-goalkeeper |
TRUE |
Goalkeeper |
<- players %>%
players mutate(gk = positions == "Goalkeeper")
The \(t\)-test’s goal is to check whether \(\text{height\_cm}\) depends on \(\text{gk}\), so the formula is \(\text{height\_cm}\sim\text{gk}\). This is given to the t.test
function:
t.test(height_cm ~ gk, data = players)
Welch Two Sample t-test
data: height_cm by gk
t = -48.817, df = 1274.4, p-value < 2.2e-16
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
-9.036644 -8.338391
sample estimates:
mean in group FALSE mean in group TRUE
181.9810 190.6685
Yielding a p-value of \(p<2.2\times10^{-16}\), indicating that the null-hypothesis (heights are the same) is extremely unlikely.
To visualise this result, it might be helpful to produce a histogram of the heights
ggplot(players,
aes(x = height_cm, fill = gk)) +
geom_histogram(bins = 24)
ANOVAs
What about the means of the other three? We could use an ANOVA to examine them. We use the aov()
function for this.
Let’s start by making a new dataset without goalkeepers
<- players %>% filter(gk == FALSE) no_gk
Next, we save the analysis of variance results
<- aov(height_cm ~ positions, data = no_gk) res_aov
And examine them with summary()
summary(res_aov)
Df Sum Sq Mean Sq F value Pr(>F)
positions 2 15470 7735 199.7 <2e-16 ***
Residuals 5205 201552 39
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Even without goalkeepers included, it looks like their positions are not all independent of height.
\(\chi^2\) tests
\(χ^2\) tests are useful for examining the relationship of categorical variables by comparing the frequencies of each. Often, you’d use this if you can make a contingency table.
We only have one useful categorical variable here, “positions” (the others have too many unique values), so we’ll need to create another. Let’s see if there’s a relationship between players’ positions and names with the letter “a”.
Make a binary column for players with the letter “a” in their names. To do this, we need to apply a string method to all the columns in the dataframe as follows
<- players %>%
players mutate(a_in_name = grepl("a", name))
The
grepl
function perform pattern matching: it checks if the pattern"a"
is inside the values inname
.
Let’s cross tabulate positions with this new column
table(players$positions, players$a_in_name)
FALSE TRUE
Attack 291 1280
Defender 355 1606
Goalkeeper 149 575
Midfield 312 1364
The \(χ^2\) test’s job is to examine whether players’ positions depend on the presence of “a” in their name. To evaluate it we need to send the contingency table in:
chisq.test(table(players$positions, players$a_in_name))
Pearson's Chi-squared test
data: table(players$positions, players$a_in_name)
X-squared = 2.1808, df = 3, p-value = 0.5357
As expected, there is no signifcant relationship. A simple bar plot can help us here
ggplot(players,
aes(x = positions, fill = a_in_name)) +
geom_bar()
If we use the position = "fill"
parameter to geom_bar
, we’ll see each as a proportion
ggplot(players,
aes(x = positions, fill = a_in_name)) +
geom_bar(position = "fill")
It looks as though the proportions are much the same.
Generalised linear models
We’ll finish by looking at Generalised Linear Models. The distributions they include are
- Binomial
- Poisson
- Gaussian (Normal)
- Gamma
- Inverse Gaussian
- A few quasi options
We’ll use the binomial option to create logistic regressions.
Logistic regressions examine the distribution of binary data. For us, we can compare the heights of goalkeepers vs non-goalkeepers again.
Now, we can model this column with height. We’ll do the same as our \(t\)-test case, but this time we need to specify that family = binomial
to ensure we’ll get a logistic:
<- glm(gk ~ height_cm, family = binomial, data = players) res_logistic
We can take a look at the results with
summary(res_logistic)
Call:
glm(formula = gk ~ height_cm, family = binomial, data = players)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -53.23360 1.92714 -27.62 <2e-16 ***
height_cm 0.27448 0.01019 26.94 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4401.4 on 5931 degrees of freedom
Residual deviance: 3167.0 on 5930 degrees of freedom
AIC: 3171
Number of Fisher Scoring iterations: 6
And we can then visualise it with ggplot2
. We need to make another variable, because we need to replace TRUE
\(\rightarrow\) 1
and FALSE
\(\rightarrow\) 0
for the plot.
<- players %>% mutate(gk_numeric = as.numeric(gk)) players
Now we can plot the logistic regression. The fitted values (on the \(y\)-axis) are stored in res_logistic$fitted.values
, but there are no provided \(x\)-values - these come from the players
dataset. Use geom_point()
for the data and geom_line()
for the fit:
ggplot(players, aes(x = height_cm, y = gk_numeric)) +
geom_point() +
geom_line(aes(y = res_logistic$fitted.values))