Tips

Things are changing…

As we prepare for the next round of intensives, this material may change.

Book now for 3rd-5th Feb 2026

Here’s a few general tips. In addition, we strongly recommend using the pandas cheatsheets, which give a quick and easy reference for common packages and functions, and from Data to Viz, which guides you through choosing a visualisation.

Hotkeys

Code	Hotkey	Description
	`F9` (or `Fn` + `F9`)	Run current line
`# %%`	`Ctrl` + `2`	New cell (only in Spyder)
	`Ctrl`+`Enter`	Run current cell (when in Script)
	`Ctrl`+`C`	Cancel current operation (when in Console)

Data manipulation

Use the pandas package to analyse your data:

import pandas as pd

Importing and exporting data

In case you’ve forgotten, use the read.csv() function to import data:

df = pd.read_csv("data/dataset.csv")

If you’d like to export any files from Python to “.csv”, use the .to_csv() method

df.to_csv("data/output_name.csv")

Initial exploration

You’ll want to explore the data to start with - below are a few functions to get started.

Function	Example	Description
`df.columns`		Returns the variable names
`df.info()`		Returns the structure of the dataset (variable names, counts and types)
`df["variable"]`		Returns a specific column
`df["variable"].unique()`		Returns the unique values of a variable
`df.describe()` or `df["variable"].describe()`	Returns a statistical summary of the dataset or a variable

Removing `nan`s

We can remove nans by filtering with the condition df["variable"].notna():

df = df[df["variable"].notna()]

Time series data

If you’ve picked a dataset with time-series data (e.g. a “date” variable), you should transform that variable so that it visualises better:

df["variable"] = pd.to_datetime(df["variable"])

Categorical and ordered data

If you’re dealing with categorical data, it can be helpful to tell Python

df["variable"] = df["variable"].astype("category")

To manually specify the order of categories, use the df["variable"].cat.reorder_categories() function and use the ordered = True parameter

df["variable"] = df["variable"].cat.reorder_categories(["cat1", "cat2", ...], ordered = True)

This is particularly useful for the Coffee survey dataset.

If you’re dealing with categorical data, look at the pandas guide for inspiration and help.

Renaming variables

Some datasets have cumbersome names for their variables. We can change variable names with df.rename(), sending a dictionary to the columns = parameter:

df = df.rename(columns = {"old_name": "new_name"})

This is particularly useful for the World population dataset.

A dictionary is a Python variable with key-value pairs. The structure is key: value, so above we have a dictionary with one key, "old_name" and corresponding value "new_name". They are created as follows:
example_dictionary = {"key1": "value1",
                      "key2": "value2",
                      "key3": "value3",
                      ...}
Note that multiple lines are used purely for readability, you could just as well do this on one line.

Visualisation

You can make simple visualisations with seaborn’s relplot(), catplot() and displot() functions

import seaborn as sns

sns.relplot(data = df, x = "variable_x", y = "variable_y", hue = "variable_colour", ...)

We can add plot elements easily with matplotlib.pyplot

import seaborn as sns
import matplotlib.pyplot as plt

sns.relplot(data = df, x = "variable_x", y = "variable_y", hue = "variable_colour", ...)
plt.xlabel("x axis label")
plt.ylabel("y axis label")

Hotkeys

Data manipulation

Importing and exporting data

Initial exploration

Removing nans

Time series data

Categorical and ordered data

Renaming variables

Visualisation

Removing `nan`s