Plotting data - one numerical variable

Plotting data

Matplotlib library
The Python programming language itself does not have functions for plotting graphs. We have to use an additional library to do this. Matplotlib is a popular Python library for plotting graphs. The official matplotlib website is http://matplotlib.org, which has a gallery of possible graph types.

To use matplotlib we must include the following code once in each notebook.

%matplotlib inline
import matplotlib.pyplot as plt

The line %matplotlib inline allows us to display matplotlib-generated graphs within Jupyter notebooks.

The line import matplotlib.pyplot as plt loads matplotlib's pyplot module so we can use its plotting functions. It also gives the module the shorter alias plt for convenience (otherwise we would have to write matplotlib.pyplot every time we wanted to change something in the graph).

One numerical variable: histograms

Histograms are the main method of displaying a numerical (quantitative) variable.

If you haven't done so already, watch the first half of this video on histograms (you can ignore the second half on stemplots as they are rarely used nowadays).

To plot a histogram of Alaskan salmon masses we use the plot.hist() method like so:

salmon_masses['mass'].plot.hist()

Run the following code cell to see how this is done.

In [0]:

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Read the alaskan_salmon data set into a DataFrame called "salmon_masses".
salmon_masses = pd.read_csv('alaskan_salmon.csv')

# Create a histogram of the salmon masses
salmon_masses['mass'].plot.hist()

Notice that this histogram shows two distinct peaks. Such a distribution is described as bimodal, as in having two modes (peaks). In contrast, a distribution that has just one peak is called unimodal.
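To see the difference between the two shapes side by side, here is a minimal sketch using synthetic data. The numbers are randomly generated for illustration and are not the salmon data set; the variable names (low, high, unimodal, bimodal) and the choice of 30 bins are assumptions made for this example only.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (NOT the salmon data set): two normal distributions
# centred on different values, with a fixed seed so the result is repeatable.
rng = np.random.default_rng(seed=0)
low = rng.normal(loc=1.75, scale=0.2, size=300)
high = rng.normal(loc=3.0, scale=0.2, size=150)

# One group on its own gives one peak; mixing both groups gives two peaks.
unimodal = pd.Series(low)
bimodal = pd.Series(np.concatenate([low, high]))

# Draw the two histograms next to each other for comparison.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))
unimodal.plot.hist(bins=30, ax=ax_left, title='Unimodal (one peak)')
bimodal.plot.hist(bins=30, ax=ax_right, title='Bimodal (two peaks)');
```

In this sketch the second peak is lower simply because fewer values were generated around it.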

Can you think of a reason why these salmon have a bimodal distribution in mass, and why the peak at 3 kg is lower than the peak at 1.75 kg?

The piece of code

salmon_masses['mass']

looks like we're accessing the value of the key 'mass' in a Python dictionary called salmon_masses. We're not. salmon_masses is a DataFrame, not a dictionary. The syntax is the same but the effect is different: salmon_masses['mass'] selects the whole 'mass' column, which pandas returns as a Series containing all of the masses. We can see this if we print it.

In [0]:

print(salmon_masses['mass'])
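The difference between a dictionary lookup and a DataFrame column selection can be checked directly. The following is a small sketch with made-up values; as_dict and as_frame are hypothetical names used only here.

```python
import pandas as pd

# Made-up values for illustration only (not taken from alaskan_salmon.csv).
as_dict = {'mass': [1.6, 1.8, 3.1]}
as_frame = pd.DataFrame(as_dict)

# The square-bracket syntax is identical, but the results differ in type.
print(type(as_dict['mass']))   # a plain Python list
print(type(as_frame['mass']))  # a pandas Series (one column of the DataFrame)

# A Series has methods that a list does not, such as mean() and plot.hist().
print(as_frame['mass'].mean())
```

This is why we can call plot.hist() on salmon_masses['mass']: the method belongs to the Series that the selection returns.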

Note
Placing a semicolon at the end of the last plotting command in a code cell like so:

salmon_masses['mass'].plot.hist();

suppresses the text description of the plot object that the notebook would otherwise print before the graph, making the output cleaner. Try it in the above code.

Label your graphs

As with all graphs, the one we plotted above needs to be labelled fully and clearly so that someone else can look at it and know immediately what it is presenting. We need the following:

Labels on the $x$ and $y$ axes
A title

We add $x$- and $y$-axis labels with the functions

plt.xlabel('Mass (kg)')
plt.ylabel('Number of salmon')

and a title with the function

plt.title('Masses of Alaskan sockeye salmon');

It's worth pointing out that the unit of mass is included in the $x$-axis label. This means a reader immediately knows what units the masses are in. If the units were missing, the reader would have to guess whether the masses are in grams, kilograms or even pounds or ounces. Try to make life as easy as possible for other people to understand what you are presenting by including relevant information in your graphs and tables.
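Putting the three labelling functions together, a fully labelled histogram might look like the following sketch. It uses a handful of made-up masses in a hypothetical DataFrame called example_masses rather than the real data set, so the shape of the graph is not meaningful.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up masses in kg, standing in for the real alaskan_salmon.csv data.
example_masses = pd.DataFrame({'mass': [1.5, 1.7, 1.8, 1.8, 2.0, 2.9, 3.0, 3.2]})

# Draw the histogram, then label the axes (with units) and add a title.
ax = example_masses['mass'].plot.hist()
plt.xlabel('Mass (kg)')
plt.ylabel('Number of salmon')
plt.title('Masses of Alaskan sockeye salmon (made-up data)');
```

The labelling calls must run in the same code cell as the plotting command so that they apply to the same graph.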

Add axis labels and a title to the above histogram.

Exercise Notebook

Plotting data - one numerical variable

Next Notebook

One categorical variable
