Getting started

Project scaffold

You’re welcome to dive in and work as you please, but if you’re not sure where to begin, follow the scaffold below. Don’t forget to start from our template and look at the example report.

Step 0: Pick a dataset

We have nine datasets for you to choose from. We recommend saving your data inside your project.

Dataset	Description	Source
World populations	A summary of world populations and corresponding statistics	Data from a Tidy Tuesday post on 2014 CIA World Factbook data
Soccer players	A summary of approx. 6000 soccer players from 2024	Data from a Kaggle submission.
Coffee survey	A survey of blind coffee tasting results	Data from a Kaggle submission
Gapminder	GDP and life expectancy data by country	Data from the Research Bazaar’s R novice tutorial, sourced from Gapminder.
Melbourne housing data	A collection of houses for sale in Melbourne.	Data from a Kaggle submission
Goodreads books	A summary of books on Goodreads.	Data from a Kaggle submission
Queensland hospitals	Queensland emergency department statistics.	Data from the Queensland Government’s Open Data Portal.
Queensland fuel prices	Fuel prices by the pump in Queensland	Data from the Queensland Government’s Open Data Portal
Aeroplane bird strikes	Aeroplane bird strike incidents from the 90s	Data from a Tidy Tuesday post sourced from an FAA database

Remember, to load the data into R we need to use the read.csv() function.

dataset <- read.csv("path_to_data")
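As a minimal sketch of the loading step, the example below writes a tiny made-up table to a temporary CSV file and reads it back with read.csv(). The column names and values are hypothetical and only stand in for whichever dataset you pick. In your project you would pass the path to your saved file instead.

```r
# Hypothetical toy data, standing in for a real dataset
toy <- data.frame(
  country = c("Australia", "Brazil", "Canada"),
  population = c(26, 216, 40)  # made-up values, in millions
)

# Write it to a temporary CSV so the example is self-contained
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Load it the same way you would load your chosen dataset
dataset <- read.csv(path)
head(dataset)  # show the first few rows
```

If the file lives inside your project, a relative path is usually enough.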

Step 1: Understand the data

The datasets are varied with respect to variable types and content. The first exercise you should complete is an overview of the data. Use the following techniques to do so.

Your goal: identify which variables are discrete (categorical) and which are continuous.

Viewing the data structure

Use the following functions to view your data and the underlying data types.

names(dataset)
str(dataset)
summary(dataset)
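One possible way to start sorting variables into discrete and continuous is to check which columns are numeric. The sketch below uses a hypothetical toy data frame (the column names and values are made up), so swap in your own dataset.

```r
# Hypothetical toy data with a mix of variable types
dataset <- data.frame(
  country = c("Australia", "Brazil", "Canada", "Australia"),
  year = c(2020L, 2020L, 2020L, 2021L),
  life_exp = c(83.2, 75.3, 81.7, 83.3)  # made-up values
)

str(dataset)  # data type of each column

# TRUE for numeric columns, FALSE otherwise
is_num <- sapply(dataset, is.numeric)
names(dataset)[is_num]   # candidates for continuous variables
names(dataset)[!is_num]  # candidates for discrete (categorical) variables
```

Treat this as a first pass only: a numeric column such as year may still be better thought of as discrete, so check each variable against what it actually measures.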

Picking out individual columns

To view the contents of particular columns, you can select them with the $ operator:

dataset$column_name
unique(dataset$column_name)
summary(dataset$column_name)

You can also apply other statistics to the column, like max().
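Here is a short sketch of column-level summaries on a hypothetical toy data frame (names and values are made up). It also shows one thing to watch for: statistics like max() return NA when the column contains missing values, unless you ask for them to be ignored.

```r
# Hypothetical toy data with one missing value
dataset <- data.frame(
  country = c("Australia", "Brazil", "Canada", "Australia"),
  life_exp = c(83.2, 75.3, 81.7, NA)
)

unique(dataset$country)              # distinct categories
table(dataset$country)               # how often each category appears
max(dataset$life_exp)                # NA, because one value is missing
max(dataset$life_exp, na.rm = TRUE)  # ignores the missing value
```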

Step 2: Taking a subset

The datasets have lots of observations for lots of variables. To draw meaningful results, it’s often useful to take a subset of those.

Your goal: filter by a condition, or group by a particular variable and aggregate over it.

First, load the dplyr library, which provides the data manipulation functions used in this step.

library(dplyr)

Filtering

Recall that filtering looks like indexing. If you only want to examine the observations that meet some condition, the following code isolates them:

filtered <- dataset %>% filter(condition)

where condition depends on the columns. For example, country == "Australia".

Hint: we’ve used the pipe operator %>% here. Writing dataset %>% filter(condition) is equivalent to filter(dataset, condition).
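To make the filtering pattern concrete, here is a minimal sketch on a hypothetical toy data frame. The column names and values are made up for illustration, so adapt the conditions to the columns in your own dataset.

```r
library(dplyr)

# Hypothetical toy data
dataset <- data.frame(
  country = c("Australia", "Australia", "Brazil", "Canada"),
  year = c(2020, 2021, 2020, 2021),
  population = c(25.7, 25.9, 213.2, 38.2)  # made-up values, in millions
)

# Keep only the rows for one country
australia <- dataset %>% filter(country == "Australia")

# Conditions separated by commas must all hold
recent_large <- dataset %>% filter(year == 2021, population > 30)
```

Here australia keeps the two Australian rows, and recent_large keeps only the 2021 row for Canada.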

Grouping

If you want to aggregate over a particular variable, you need to group by it. This answers questions like “what is the average \(x\) for every \(y\)?”

aggregated <- dataset %>%
  group_by(variable_to_group_by) %>%
  summarise(summary_1 = ..., summary_2 = ..., ...)

The summarise function aggregates by applying some statistic to a particular column for every unique value in the grouping variable. For example, summarise(avg_pop = mean(population)) makes a column in the summary table for the average population for each value of the grouped variable.
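As a sketch of grouping and summarising, the example below computes the average population per continent on a hypothetical toy data frame (names and values are made up). Note that the grouping column is passed to group_by() as a bare name, without quotes.

```r
library(dplyr)

# Hypothetical toy data
dataset <- data.frame(
  continent = c("Oceania", "Oceania", "Americas", "Americas"),
  country = c("Australia", "New Zealand", "Brazil", "Canada"),
  population = c(26, 5, 216, 40)  # made-up values, in millions
)

aggregated <- dataset %>%
  group_by(continent) %>%        # unquoted column name
  summarise(
    avg_pop = mean(population),  # average per continent
    n_countries = n()            # number of rows per continent
  )

aggregated  # one row per continent
```

The result has one row for each continent, with avg_pop of 128 for Americas and 15.5 for Oceania.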

Step 3: Visualise the relationship between variables

With your summary dataset, you can now try to visualise your variables.

Your goal: create a visualisation of one to three variables in your summary data.

For visualisation, we use the ggplot2 library.

library(ggplot2)

Next, you’ll need to identify the variables to visualise. Using ggplot(), we then specify the data, the mappings and the graphical elements:

ggplot(data = aggregated,
       mapping = aes(x = ..., y = ..., ...)) +
  geom_...()
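One way to fill in this template is a simple bar chart of a summary table. The sketch below uses a small hypothetical summary (made-up continents and averages) and geom_col(). Your own choice of geom will depend on whether your variables are discrete or continuous.

```r
library(ggplot2)

# Hypothetical summary table, standing in for your aggregated data
aggregated <- data.frame(
  continent = c("Americas", "Oceania"),
  avg_pop = c(128, 15.5)  # made-up averages, in millions
)

p <- ggplot(data = aggregated,
            mapping = aes(x = continent, y = avg_pop)) +
  geom_col() +  # one bar per continent
  labs(x = "Continent", y = "Average population (millions)")

p  # printing the object draws the plot
```

For two continuous variables, swapping geom_col() for geom_point() gives a scatter plot instead.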

Step 4: Looking ahead

Now that you’ve performed your first analysis and visualisation of the dataset, use these results to inform your next analysis!

Below you’ll find some general tips which can help. There are dataset-specific tips too, so check them out. Otherwise, feel free to ask if you have any other questions.