Lab 5 - Paired Data with pandas


In previous labs, the data you worked with was given to you in the form of lists inside the assignment notebook. This works well for simple data but becomes messy with larger or more complex datasets. It's also not how the data you may work with in real life comes packaged.


In this lab, you will learn the basics of using pandas, a common and powerful Python data analysis library that works well with Seaborn and Numpy. The usual abbreviation for importing pandas is "pd".


In [2]:

import pandas as pd
import numpy as np
import seaborn as sns


One of the most useful things we can do with pandas is read files into a Jupyter notebook (or other Python code). This produces a table-like object called a *pandas data frame*.

Our data is stored in a common format called CSV (comma-separated values), so we will use the pandas `read_csv` function to read it. The basic syntax is `pd.read_csv("filename")`.
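As a minimal sketch of the `read_csv` call, the example below reads CSV text from an in-memory buffer rather than a file so that it runs on its own. The column names `Exposed` and `Control` and all of the values are made up for illustration; they are not taken from blood_lead.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content; the real blood_lead.csv may use
# different column names and values.
csv_text = "Exposed,Control\n38,16\n23,18\n41,18\n"

# read_csv accepts a file name or a file-like object; here we pass a buffer.
df = pd.read_csv(io.StringIO(csv_text))

# In a notebook, ending a cell with the bare name renders a formatted table.
df
```

With a real file, the call would be `pd.read_csv("blood_lead.csv")`, assuming the file sits in the same folder as the notebook.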

In this lab, you will look at a dataset of blood lead concentrations in children whose fathers worked in a lead-using industry, compared to individually matched controls. That is, each child with an exposed father was paired with a similar child whose father did not work in such an industry.


- Read the blood_lead.csv file into your notebook and assign it to a variable. View the resulting object. HINT: Don't use `print`; it makes the data frame look worse.


In [ ]:

#TODO


As usual, we want to start by visualizing the data. Pandas works well with Seaborn plotting tools. For example, to make side-by-side dot plots of a data frame called `df`, you just need to enter `sns.stripplot(data=df)`.


- Make a beeswarm plot of the blood lead data. Make sure the y-axis is appropriately labeled.


In [ ]:

#TODO


Notice that Seaborn automatically labels the categories using the column labels from your original data frame. You can change these if you want, but we'll stay with the originals.

A good figure should have a title. To set one, use the syntax `p.set_title("Title")` (where `p` is the name of your plot). You can also use the `fontsize` option to set the title font size.


- Add a descriptive title to your plot. Make it reasonably large.


In [9]:

#TODO


So far, we have used the default Seaborn colors. However, there are plenty of other options. We can set colors manually or, better, choose one of the palettes that Seaborn makes available. Since this dataset is about blood, we might want to use shades of red. To do this, just put `palette="Reds"` into your plot command.
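To see what a palette name stands for, the sketch below asks Seaborn for two colors from the "Reds" palette using `sns.color_palette`. Requesting two colors is an assumption that matches a plot with two groups.

```python
import seaborn as sns

# Ask for two shades from the "Reds" palette, one per group.
reds = sns.color_palette("Reds", 2)

# Each entry is an (r, g, b) tuple with components between 0 and 1.
for color in reds:
    print(color)
```

Passing `palette="Reds"` to a plotting function lets Seaborn pick shades like these for you.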


- Change the palette used by your plot. You can find a list at https://chrisalbon.com/python/data_visualization/seaborn_color_palettes/ and a detailed explanation (for those who are interested) at https://seaborn.pydata.org/tutorial/color_palettes.html.


In [ ]:

#TODO


- Make a different type of visualization of this data. Label the axes, make a title, and use a color scheme you like.


In [ ]:

#TODO


While the side-by-side dot plots are useful, they don't show how each exposed child compares with their matched control. There is no built-in Seaborn or matplotlib function for making such a paired plot, so you can use the custom function below.


In [2]:

# Purpose: Plot dot plots for 2 groups of data and lines connecting pairs
# Inputs: dataset is required, but you can also include labels for the 2 groups
#         and colors for the lines and points (otherwise, defaults are given)
# Outputs: dot plots for 2 groups of data and lines connecting pairs
def slopePlot(data, labels=["", ""], line_color="gray", point_color="black"):
    import matplotlib.pyplot as plt
    from numpy import array

    dataArr = array(data)
    fig, ax = plt.subplots(figsize=(4, 3))

    x1 = 0.8
    x2 = 1.2
    n = dataArr.shape[0]

    # Draw one line per pair
    for i in range(n):
        ax.plot([x1, x2], [dataArr[i, 0], dataArr[i, 1]], color=line_color)

    # Plot the points
    ax.scatter(n*[x1-0.01], dataArr[:, 0], color=point_color, s=25, label=labels[0])
    ax.scatter(n*[x2+0.01], dataArr[:, 1], color=point_color, s=25, label=labels[1])

    # Fix the axes and labels
    ax.set_xticks([x1, x2])
    _ = ax.set_xticklabels(labels, fontsize='x-large')

    return ax


- Use the `slopePlot` function to plot the data. Be sure to label the axes and the two groups and to add a title. Describe what you see.


In [3]:

#TODO


The two groups appear to have different blood lead levels. We want to find out what the difference is and find the associated p-value. We will do this by resampling, but first a few technicalities require our attention.

The data in question is in a pandas data frame. For now, we will make the two columns into 1-D Numpy arrays and proceed as in earlier labs.


To access a column in a pandas data frame, index the data frame with the column's label. For example, to access the column "Col 1" in the data frame `df`, use `df["Col 1"]`. We can then use the `np.array` function to convert the column into a 1-D Numpy array.
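A brief sketch of column access followed by conversion, using a made-up data frame. The column labels "Col 1" and "Col 2" follow the example in the text, and the values are placeholders.

```python
import numpy as np
import pandas as pd

# Placeholder data frame for illustration.
df = pd.DataFrame({"Col 1": [1.5, 2.0, 3.5], "Col 2": [4.0, 5.5, 6.0]})

# Indexing with a column label returns a pandas Series.
col = df["Col 1"]

# np.array turns the Series into a plain 1-D Numpy array.
arr = np.array(col)

print(type(col).__name__, arr.shape)
```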


- Make each column into an array, assigning each to a variable. View the arrays.


In [ ]:

#TODO


The core idea in analyzing paired data is to focus on the difference between the paired values rather than the values themselves.
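To make the idea concrete, here is a small sketch with invented numbers. The arrays `exposed` and `control` are hypothetical, and position `i` in each array is assumed to belong to the same matched pair.

```python
import numpy as np

# Hypothetical paired values; entry i of each array belongs to pair i.
exposed = np.array([38, 23, 41, 18, 37])
control = np.array([16, 18, 18, 24, 19])

# Numpy subtracts element by element, giving one difference per pair.
diffs = exposed - control

# Summarize the paired differences with their median.
obs_median = np.median(diffs)

print(diffs, obs_median)
```

A median difference far from zero suggests that one member of each pair tends to have the higher value.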


- Subtract one array from the other to get an array of differences. Then, find the median difference for your observed sample.


In [22]:

#TODO


- Make an array of randomly chosen +1 and -1 values, which we'll call the sign array. This array should be of the same length as the actual data.


In [1]:

#TODO


- Multiply the array of differences and the sign array. (With Numpy arrays, unlike lists, you can do the whole calculation at once.) Then, find the median difference for this new array.
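One possible way to carry out a single sign flip is sketched below with hypothetical differences. It uses a seeded generator from `np.random.default_rng` so the result is repeatable; that is a choice made for this sketch, and `np.random.choice` would work similarly.

```python
import numpy as np

# Seeded random generator, so this sketch gives the same result each run.
rng = np.random.default_rng(0)

# Hypothetical paired differences, for illustration only.
diffs = np.array([22, 5, 23, -6, 18])

# One random sign (+1 or -1) per difference.
signs = rng.choice([-1, 1], size=len(diffs))

# Element-wise multiplication flips the sign of some differences.
flipped = diffs * signs

# Median of the sign-flipped differences: one simulated value under the null.
sim_median = np.median(flipped)
```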


In [ ]:

#TODO


- Perform this procedure 10,000 times. Plot a histogram of the simulated median differences with the observed median difference marked on it to check that your simulation results make sense, and find a two-tailed p-value.
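The whole procedure might look like the sketch below, which repeats the sign flip 10,000 times on hypothetical differences and computes a two-tailed p-value. The data, the seed, and the variable names are assumptions, and the histogram is left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Hypothetical paired differences, for illustration only.
diffs = np.array([22, 5, 23, -6, 18, 10, 14, -2])
obs_median = np.median(diffs)

n_sims = 10_000
sim_medians = np.zeros(n_sims)
for i in range(n_sims):
    # Under the null hypothesis, each difference is equally likely
    # to be positive or negative, so flip signs at random.
    signs = rng.choice([-1, 1], size=len(diffs))
    sim_medians[i] = np.median(diffs * signs)

# Two-tailed p-value: the fraction of simulated medians that are
# at least as far from zero as the observed median.
p_value = np.mean(np.abs(sim_medians) >= abs(obs_median))

print(obs_median, p_value)
```

Comparing absolute values covers both tails at once, which is reasonable here because the sign-flipped medians are spread symmetrically around zero.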


In [ ]:

#TODO


- Write a sentence interpreting the p-value in the context of the study.


In [23]:

#TODO
