Project: Charles Farrant - 2020-21/EDA workshop 1 Group B (Thursday)

Path: EDA workshop 1 / Self-study Notebooks Reference / 01 - Exploring data with Python.ipynb

Exploring data with Python

Displaying data

The human eye is a natural pattern detector, adept at spotting trends and exceptions in visual displays. For this reason, biologists spend hours creating and examining visual summaries of their data - graphs and, to a lesser extent, tables. Effective graphs enable visual comparisons of measurements between groups, and they expose relationships between different variables. They are also the principal means of communicating results to a wider audience.

Florence Nightingale (1858) was one of the first people to put graphs to good use. In her famous wedge diagrams, redrawn in the figure above, she visualised the causes of death of British troops during the Crimean War. The number of cases is indicated by the area of a wedge, and the cause of death by colour. The diagrams showed convincingly that disease, not wounds or other causes, was the main cause of soldier deaths during the war. With these vivid graphs, she successfully campaigned for military and public health measures that saved many lives.

Effective graphs are a prerequisite for good data analysis, revealing general patterns in the data that bare numbers cannot. Therefore, the first step in any data analysis is to graph the data and look at it. Humans are a visual species, with brains evolved to process visual information. Take advantage of millions of years of evolution, and look at visual representations of your data before doing anything else.

Graphs are vital tools for analysing data. They are also used to communicate patterns in data to a wider audience in the form of reports, slide shows, and web content. The two purposes, analysis and presentation, are largely coincident because the most revealing displays will be the best both for identifying patterns in the data and for communicating these patterns to others. Both purposes require displays that are clear, honest, and efficient.

Over the next few notebooks we will look at how different types of datasets are commonly displayed in tables and graphs. The types of datasets we will look at are:

A single numerical variable
A single categorical variable
Two numerical variables
Two categorical variables
A categorical variable and a numerical variable
A categorical variable and two numerical variables

But before we do that we need to discuss data analysis software.

Data analysis in Python

There are many software tools and packages for plotting and analysing data. You've probably come across Microsoft Excel, a favourite of biologists. R is another commonly used package, and it is free to download. All have their pros and cons, and for basic analysis and plotting it doesn't really matter which one you use. But in this course, because we have already been learning Python, it seems sensible to carry on using Python to do data analysis.

Python by itself though doesn't do data analysis and plotting. We have to import modules into Python to help us do these things.

The main aims of this and the next notebooks are:

To introduce you to pandas, a Python module designed for performing data analysis.
To show you how to set up Jupyter notebooks for plotting graphs in Python using the modules matplotlib and seaborn.

We'll use a simple dataset of body masses of Alaskan sockeye salmon to demonstrate these two aims.

Body mass of Alaskan sockeye salmon

In the file alaskan_salmon.csv are the body masses (in kg) of 228 female sockeye salmon sampled from Pick Creek in Alaska (Hendry et al. 1999).

If you click on the link to this file you can download it and examine it in an Excel spreadsheet.

The first few lines of alaskan_salmon.csv look like this:

mass
3.09
2.91
3.06
2.69
... and so on

The first line is called a header and contains the name of the variable. In this case the variable is called 'mass'. Each line thereafter contains the mass (in kg) of an individual salmon.

CSV (comma-separated values) files are a common file type for storing data. They are particularly useful because they are human-readable: they are written in plain text, so you can open them in any text editor and edit them, for example to correct mistakes or change the name of a variable.
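To see that a CSV file really is just plain text, here is a minimal sketch that writes a tiny file and reads it back with ordinary Python, without pandas. The file name salmon_sample.csv is made up for illustration, and the contents are only the header and the first four values quoted above, not the full dataset.

```python
import tempfile
from pathlib import Path

# Stand-in for alaskan_salmon.csv: the header plus the first four values quoted above.
csv_text = "mass\n3.09\n2.91\n3.06\n2.69\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "salmon_sample.csv"  # hypothetical file name
    csv_path.write_text(csv_text)

    # Read the file back as ordinary text, one line per row.
    lines = csv_path.read_text().splitlines()

header = lines[0]                              # the variable name
values = [float(line) for line in lines[1:]]   # one mass (in kg) per line

print(header)
print(values)
```

Because the file is plain text, no special software is needed to inspect or correct it.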

Pandas: data analysis library

In this course we are going to use a Python package specifically designed for data analysis called pandas. Pandas provides lots of functions for reading in, analysing, manipulating and describing data. The official pandas website is http://pandas.pydata.org. This website provides a lot more information on the use of this library than can be covered in these workshops.

To use pandas we must include the following code once in each notebook.

import pandas as pd

This imports the pandas library and gives it the shorthand name pd.

Reading a dataset from file

The first two things we have to do are to load, or read in, the Alaskan salmon file and to call the dataset something sensible. We use the pandas function read_csv() like so:

salmon_masses = pd.read_csv('alaskan_salmon.csv')

In the following code cell:

import pandas
read in the dataset from the file alaskan_salmon.csv and call the dataset salmon_masses

In [0]:

Examining the dataset

salmon_masses is a Python variable of type DataFrame, in the same sense that a = 7 is an integer, b = 'Hello, World' is a string and c = [1, 2, 3] is a list. You can't go wrong if you think of a DataFrame as a table of data in which each column represents a measurable variable and each row an individual that has been measured.

In this case salmon_masses is a DataFrame with one column called mass containing the measured masses of 228 Alaskan salmon.
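The comparison with other Python types can be checked directly. The sketch below is an illustration rather than the workshop's own code: so that it runs without the data file, it reads a four-row stand-in (the values quoted earlier) from an in-memory text buffer, and the name salmon_sample is made up for this example.

```python
import io
import pandas as pd

# Stand-in for alaskan_salmon.csv: the header plus the first four values only.
sample_csv = io.StringIO("mass\n3.09\n2.91\n3.06\n2.69\n")
salmon_sample = pd.read_csv(sample_csv)  # read_csv also accepts file-like objects

a = 7
b = 'Hello, World'
c = [1, 2, 3]

# Every Python variable has a type; a DataFrame is just another one.
print(type(a).__name__)              # int
print(type(b).__name__)              # str
print(type(c).__name__)              # list
print(type(salmon_sample).__name__)  # DataFrame

# The single column takes its name from the header line of the file.
print(list(salmon_sample.columns))   # ['mass']
```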

Having read in a dataset to a DataFrame, it's always a good idea to have a look at it to see how it is structured. To print the DataFrame, type

print(salmon_masses)

or, to display it in a nicer format, just type

salmon_masses

Try these in the above code cell.

There are two columns. The first column is an index, or row number, for each salmon. Recall that Python indices start from 0 and not from 1. The second column contains the masses of individual salmon.

As this is a very long DataFrame, pandas has printed only the first and last 30 rows; the middle rows (from index 30 to index 197) have been omitted. The exact number of rows shown depends on the pandas version and its display settings.

Also notice that the shape of the DataFrame has been printed at the bottom, in this case 228 rows and 1 column (the index column is ignored).
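The difference between the index and a data column can be seen in a short sketch. It again assumes a four-row stand-in for the real file, and salmon_sample is a name made up for this example.

```python
import io
import pandas as pd

# Stand-in for alaskan_salmon.csv: the header plus the first four values only.
salmon_sample = pd.read_csv(io.StringIO("mass\n3.09\n2.91\n3.06\n2.69\n"))

# The index holds the row labels; by default they count up from 0.
print(list(salmon_sample.index))       # [0, 1, 2, 3]

# The index is not one of the data columns.
print(list(salmon_sample.columns))     # ['mass']

# Row label 0 picks out the first salmon.
first_mass = salmon_sample.loc[0, 'mass']
print(first_mass)
```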

If you wanted to print just the first 10 lines, say, use the head() method like so

print(salmon_masses.head(10))

or the last 7 lines, say, use the tail() method like so

print(salmon_masses.tail(7))

Try these in the above code cell.
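As a rough illustration of head() and tail(), the sketch below uses a four-row stand-in for the real file (salmon_sample is a made-up name). It also shows that asking for more rows than exist simply returns everything.

```python
import io
import pandas as pd

# Stand-in for alaskan_salmon.csv: the header plus the first four values only.
salmon_sample = pd.read_csv(io.StringIO("mass\n3.09\n2.91\n3.06\n2.69\n"))

first_two = salmon_sample.head(2)   # rows with index 0 and 1
last_one = salmon_sample.tail(1)    # row with index 3
print(first_two)
print(last_one)

# Asking for more rows than the DataFrame holds is not an error.
print(len(salmon_sample.head(10)))  # 4
```

Both methods return a new, smaller DataFrame, so the original index labels are kept in the output.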

To print the shape of the DataFrame only (i.e., the number of rows and columns) use

print(salmon_masses.shape)

In addition

print(len(salmon_masses))

prints just the number of rows.
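Since shape is an ordinary Python tuple, it can be unpacked into separate variables. This is a small sketch on a four-row stand-in for the real file; the names salmon_sample, n_rows and n_cols are made up for this example.

```python
import io
import pandas as pd

# Stand-in for alaskan_salmon.csv: the header plus the first four values only.
salmon_sample = pd.read_csv(io.StringIO("mass\n3.09\n2.91\n3.06\n2.69\n"))

# shape is a tuple of (number of rows, number of columns).
n_rows, n_cols = salmon_sample.shape
print(n_rows, n_cols)                # 4 1

# len() gives the same number of rows as the first entry of shape.
print(len(salmon_sample) == n_rows)  # True
```

With the full dataset, the same code should give 228 rows and 1 column.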

If you want to print the variable names (i.e., the column headers), you can use the columns attribute:

print(salmon_masses.columns.values)

Try all of these in the above code cell to see how they work.
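columns.values returns the names as an array rather than a Python list. If a plain list is more convenient, one option is to convert it, as in this sketch on a four-row stand-in for the real file (salmon_sample is a made-up name).

```python
import io
import pandas as pd

# Stand-in for alaskan_salmon.csv: the header plus the first four values only.
salmon_sample = pd.read_csv(io.StringIO("mass\n3.09\n2.91\n3.06\n2.69\n"))

column_array = salmon_sample.columns.values   # array of column names
column_list = salmon_sample.columns.tolist()  # the same names as a plain list

print(column_array)
print(column_list)   # ['mass']
```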

References

Nightingale, F. (1858). Notes on Matters Affecting the Health, Efficiency and Hospital Administration of the British Army. London, Harrison and Sons.

Hendry, A. P., et al. (1999). Condition dependence and adaptation-by-time: breeding date, life history, and energy allocation in a population of salmon. Oikos 85:499-514.

Exercise Notebook

Exploring data with Python

Next Notebook

Plotting data - one numerical variable
