Datasets and tips
Datasets
We have nine datasets for you to choose from. We recommend saving your data inside your project.
Dataset | Description | Source |
---|---|---|
World populations | A summary of world populations and corresponding statistics | Data from a Tidy Tuesday post on 2014 CIA World Factbook data |
Soccer players | A summary of approx. 6000 soccer players from 2024 | Data from a Kaggle submission. |
Coffee survey | A survey of blind coffee tasting results | Data from a Kaggle submission |
Gapminder | GDP and life expectancy data by country | Data from the Research Bazaar’s R novice tutorial, sourced from Gapminder. |
Melbourne housing data | A collection of houses for sale in Melbourne. | Data from a Kaggle submission |
Goodreads books | A summary of books on Goodreads. | Data from a Kaggle submission |
Queensland hospitals | Queensland emergency department statistics. | Data from the Queensland Government’s Open Data Portal. |
Queensland fuel prices | Fuel prices by the pump in Queensland | Data from the Queensland Government’s Open Data Portal |
Aeroplane bird strikes | Aeroplane bird strike incidents from the 90s | Data from a Tidy Tuesday post sourced from an FAA database |
Tips
Here are a few general tips. In addition, we strongly recommend using popular cheatsheets, which give a quick and easy reference for common packages and functions, and from Data to Viz, which guides you through choosing a visualisation.
Hotkeys
Code or hotkey | Description |
---|---|
F9 (or Fn + F9) | Run current line |
# %% or Ctrl + 2 | New cell (only in Spyder) |
Ctrl+Enter | Run current cell (when in Script) |
Ctrl+C | Cancel current operation (when in Console) |
Data manipulation
Use the pandas package to analyse your data:
import pandas as pd
Importing and exporting data
In case you’ve forgotten, use the pd.read_csv() function to import data:
df = pd.read_csv("data/dataset.csv")
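As a quick check that an import worked, it can help to look at the first few rows and the shape of the result. The sketch below writes a tiny, made-up CSV file to a temporary folder and reads it back; the file contents and column names are assumptions for illustration only, not one of the workshop datasets.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical file contents; a real dataset would already be saved in your project
csv_text = "country,population\nAustralia,26\nFiji,1\n"

# Use a temporary folder so nothing in your project is touched
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset.csv"
    path.write_text(csv_text)

    # Import the file into a DataFrame
    df = pd.read_csv(path)

print(df.head())  # first few rows
print(df.shape)   # (number of rows, number of columns)
```

With your own data, you would skip the temporary folder and pass the path of your saved file, as in the line above.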
If you’d like to export a DataFrame from Python to a “.csv” file, use the .to_csv() method:
df.to_csv("data/output_name.csv")
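By default, .to_csv() also writes the row index as an extra first column, which is often not wanted. Here is a minimal sketch of exporting without it, using made-up data and a temporary folder (both are assumptions for illustration).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"country": ["Australia", "Fiji"], "population": [26, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "output_name.csv"

    # index=False leaves out the row index column
    df.to_csv(path, index=False)

    # Read the raw text back to see exactly what was written
    written = path.read_text()

print(written)
```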
Initial exploration
You’ll want to explore the data to start with - below are a few functions to get started.
Function | Description |
---|---|
df.columns | Returns the variable names |
df.info() | Returns the structure of the dataset (variable names, counts and types) |
df["variable"] | Returns a specific column |
pd.unique(df["variable"]) | Returns the unique values of a variable |
df.describe() or df["variable"].describe() | Returns a statistical summary of the dataset or a variable |
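To see how these fit together, here is a short sketch run on a small, made-up DataFrame. The column names "position" and "age" are hypothetical and only stand in for the variables in your chosen dataset.

```python
import pandas as pd

# Hypothetical data standing in for one of the datasets
df = pd.DataFrame({
    "position": ["Forward", "Defender", "Forward", "Goalkeeper"],
    "age": [24, 31, 28, 35],
})

print(df.columns)                  # variable names
df.info()                          # structure: names, non-null counts and types
print(pd.unique(df["position"]))   # unique values of one variable

# Statistical summary of a single numeric variable
summary = df["age"].describe()
print(summary)
```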
Removing NaNs
We can remove NaNs by filtering with the condition df["variable"].notna():
df = df[df["variable"].notna()]
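It is worth counting the missing values before dropping them, so you know how much data you are losing. The sketch below uses made-up housing data (the column names are hypothetical) and also shows df.dropna(subset = ...), a pandas method that gives the same result as the filter.

```python
import pandas as pd

# Hypothetical data with one missing price
df = pd.DataFrame({
    "suburb": ["Carlton", "Fitzroy", "Richmond"],
    "price": [850000.0, float("nan"), 1200000.0],
})

# Count the missing values first
n_missing = df["price"].isna().sum()
print(n_missing)

# Keep only the rows where price is present
df_clean = df[df["price"].notna()]

# An equivalent alternative using dropna
df_clean_alt = df.dropna(subset=["price"])

print(df_clean)
```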
Time series data
If you’ve picked a dataset with time-series data (e.g. a “date” variable), you should transform that variable so that it visualises better:
df["variable"] = pd.to_datetime(df["variable"])
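Once a variable is stored as a datetime, you can pull out parts of the date with the .dt accessor. This is a minimal sketch on made-up fuel price data; the column names and the date format are assumptions.

```python
import pandas as pd

# Hypothetical data with dates stored as text
df = pd.DataFrame({
    "date": ["2024-01-15", "2024-02-20", "2024-03-05"],
    "price": [1.85, 1.92, 1.79],
})

# Convert the text to proper datetimes
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dtype)

# Extract the month number into a new variable
df["month"] = df["date"].dt.month
print(df)
```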
Categorical and ordered data
If you’re dealing with categorical data, it can be helpful to tell Python by converting the variable to the category type:
df["variable"] = df["variable"].astype("category")
To manually specify the order of categories, use the df["variable"].cat.reorder_categories() method with the ordered = True parameter:
df["variable"] = df["variable"].cat.reorder_categories(["cat1", "cat2", ...], ordered = True)
This is particularly useful for the Coffee survey dataset.
If you’re dealing with categorical data, look at the pandas guide for inspiration and help.
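Ordering matters because, without it, categories are sorted alphabetically. The sketch below uses a hypothetical "expertise" variable with made-up levels to show how an ordered category changes sorting and allows comparisons.

```python
import pandas as pd

# Hypothetical survey responses
df = pd.DataFrame({"expertise": ["high", "low", "medium", "low"]})

# Convert to a categorical variable (categories start in alphabetical order)
df["expertise"] = df["expertise"].astype("category")

# Specify the meaningful order, from lowest to highest
df["expertise"] = df["expertise"].cat.reorder_categories(
    ["low", "medium", "high"], ordered=True
)

# Sorting now follows the category order rather than the alphabet
df_sorted = df.sort_values("expertise")
print(df_sorted)

# Ordered categories can also be compared
is_above_low = df["expertise"] > "low"
print(is_above_low)
```

Plotting functions generally respect this order too, so axes and legends appear in a sensible sequence.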
Renaming variables
Some datasets have cumbersome names for their variables. We can change variable names with df.rename(), sending a dictionary to the columns = parameter:
df = df.rename(columns = {"old_name": "new_name"})
This is particularly useful for the World populations dataset.
A dictionary is a Python variable with key-value pairs. The structure is key: value, so above we have a dictionary with one key, "old_name", and corresponding value, "new_name". Dictionaries are created as follows:
example_dictionary = {"key1": "value1",
                      "key2": "value2",
                      "key3": "value3",
                      ...}
Note that multiple lines are used purely for readability; you could just as well do this on one line.
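Because the dictionary can hold several key-value pairs, you can rename many variables in one go. Here is a minimal sketch; the long column names and the values are made up for illustration and are not taken from any of the datasets.

```python
import pandas as pd

# Hypothetical data with cumbersome variable names
df = pd.DataFrame({
    "Population, total (millions)": [26.0, 0.9],
    "Life expectancy at birth": [83.2, 67.4],
})

# One dictionary entry per variable: old name on the left, new name on the right
new_names = {
    "Population, total (millions)": "population",
    "Life expectancy at birth": "life_expectancy",
}

df = df.rename(columns=new_names)
print(df.columns)
```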
Visualisation
You can make simple visualisations with seaborn’s relplot(), catplot() and displot() functions:
import seaborn as sns
sns.relplot(data = df, x = "variable_x", y = "variable_y", hue = "variable_colour", ...)
We can add plot elements easily with matplotlib.pyplot:
import seaborn as sns
import matplotlib.pyplot as plt
sns.relplot(data = df, x = "variable_x", y = "variable_y", hue = "variable_colour", ...)
plt.xlabel("x axis label")
plt.ylabel("y axis label")
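Putting the pieces together, here is a small end-to-end sketch with made-up data. The variable names and values are hypothetical, loosely in the style of the Gapminder dataset, and the plot is stored in a variable (g) only so that it can be inspected afterwards.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical values for illustration only
df = pd.DataFrame({
    "gdp_per_capita": [1000, 5000, 20000, 45000],
    "life_expectancy": [55, 68, 76, 82],
    "continent": ["Africa", "Asia", "Americas", "Europe"],
})

# Scatter plot with one colour per continent
g = sns.relplot(data=df, x="gdp_per_capita", y="life_expectancy", hue="continent")

# Replace the default axis labels (the variable names) with readable ones
plt.xlabel("GDP per capita")
plt.ylabel("Life expectancy (years)")
```

Swapping relplot() for catplot() or displot() follows the same pattern, with the variables chosen to suit a categorical or distribution plot.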