
# Lab 5: Python Tricks and More Two Groups Comparisons

In previous labs, the data you worked with was given to you as lists inside the assignment notebook. This works well for simple data but becomes messy with larger or more complex datasets. It's also not how real-world data usually comes packaged.

In this lab, you will learn the basics of using pandas, a common and powerful Python data analysis library that works well with Seaborn and Numpy. The usual abbreviation for importing pandas is "pd".

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns


One of the most useful things we can do with pandas is read files into a Jupyter notebook (or other Python code). This produces a table-like object called a pandas data frame.

Our data is stored in a common format called CSV (comma-separated values), so we will use the pandas read_csv function to read it. The basic syntax is pd.read_csv("filename").
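As a minimal sketch of the call, the example below reads CSV text from an in-memory buffer instead of a file, so it runs anywhere. The column names and values are made up for illustration and are not the blood lead data; `csv_text` and `example_df` are hypothetical names.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (NOT the blood lead data)
csv_text = "Group A,Group B\n1.5,2.0\n3.0,2.5\n4.5,5.0\n"

# pd.read_csv accepts a filename or any file-like object
example_df = pd.read_csv(io.StringIO(csv_text))

# Ending a cell with the variable name displays the data frame as a table
example_df
```

With a real file, the call would simply take the filename as a string in place of the buffer.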

In this lab, you will look at a dataset of blood lead concentrations in children whose fathers worked in a lead-using industry, compared to individually matched controls. That is, each child with an exposed father was paired with a similar child whose father did not work in such an industry.

1. Read the blood_lead.csv file into your notebook and assign it to a variable. View the resulting object. HINT: Don't use print. It makes the data frame look worse.
In [ ]:
#TODO


## Making Pretty Plots

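As usual, we want to start by visualizing the data. Pandas works well with Seaborn plotting tools. For example, to make side-by-side dot plots of a data frame called df, you just need to enter sns.stripplot(data=df). The related function sns.swarmplot takes the same argument and spreads the points out so they don't overlap, which gives a beeswarm plot.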

1. Make a beeswarm plot of the blood lead data. Make sure the y-axis is appropriately labeled.
In [ ]:
#TODO


Notice that Seaborn automatically labels the categories using the column labels from your original data frame. You can change these if you want, but we'll stay with the originals.

A good figure should have a title. To set one, use the syntax p.set_title("Title"), where p is the variable holding your plot (for example, p = sns.stripplot(data=df)). You can also use the fontsize option to set the title font size.

1. Add a descriptive title to your plot. Make it reasonably large.
In [9]:
#TODO


So far, we have used the default Seaborn colors. However, there are plenty of other options. We can set colors manually or, better, choose one of the palettes that Seaborn makes available. Since this dataset is about blood, we might want to use shades of red. To do this, just put palette="Reds" into your plot command.
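To see what a named palette contains, sns.color_palette returns its colors as a list of RGB tuples. This is a brief sketch; asking for two colors is an arbitrary choice made here to match a two-group plot.

```python
import seaborn as sns

# Request two shades from the built-in "Reds" palette
reds = sns.color_palette("Reds", 2)

# Each entry is an (R, G, B) tuple of floats between 0 and 1
len(reds), reds[0]
```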

1. Change the palette used by your plot. You can find a list at https://chrisalbon.com/python/data_visualization/seaborn_color_palettes/ and a detailed explanation (for those who are interested) at https://seaborn.pydata.org/tutorial/color_palettes.html.
In [ ]:
#TODO

1. Make a different type of visualization of this data. Label the axes, make a title, and use a color scheme you like.
In [ ]:
#TODO


While the side-by-side dot plots are useful, they don't show the relationship between exposed and control children. There is no built-in Seaborn or matplotlib function for making such a plot, so you can use the custom function below.

In [2]:
# Purpose: Plot dot plots for 2 groups of data and lines connecting pairs
# Inputs: dataset is required, but you can also include labels for the 2 groups and colors for the lines and points (otherwise, defaults are given)
# Outputs: dot plots for 2 groups of data and lines connecting pairs

def slopePlot(data, labels=["", ""], line_color="gray", point_color="black"):
    import matplotlib.pyplot as plt
    from numpy import array

    dataArr = array(data)
    fig, ax = plt.subplots(figsize=(4, 3))

    x1=0.8
    x2=1.2
    n = dataArr.shape[0]
    for i in range(n):
        ax.plot([x1, x2], [dataArr[i,0], dataArr[i,1]], color=line_color)

    # Plot the points
    ax.scatter(n*[x1-0.01], dataArr[:,0], color=point_color, s=25, label=labels[0])
    ax.scatter(n*[x2+0.01], dataArr[:,1], color=point_color, s=25, label=labels[1])

    # Fix the axes and labels
    ax.set_xticks([x1, x2])
    _ = ax.set_xticklabels(labels, fontsize='x-large')

    return ax

1. Use the slopePlot function to plot the data. Be sure to label axes, title, and groups. Describe what you see.
In [3]:
#TODO


## Resampling

The two groups appear to have different blood lead levels. We want to find out what the difference is and find the associated p-value. We will do this by resampling, but first a few technicalities require our attention.

The data in question is in a pandas data frame. For now, we will make the two columns into 1-D Numpy arrays and proceed as in earlier labs.

To access a column in a pandas data frame, index the data frame with the column's name. For example, to access the column "Col 1" in the data frame df, use df["Col 1"]. We can then use the np.array function to convert the column into a 1-D NumPy array.
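A short sketch of this conversion, using a made-up data frame (the column names follow the "Col 1" example above; the values are invented):

```python
import numpy as np
import pandas as pd

# Made-up data frame for illustration only
toy_df = pd.DataFrame({"Col 1": [1.0, 2.0, 3.0], "Col 2": [4.0, 5.0, 6.0]})

# Select one column by name, then convert it to a 1-D NumPy array
col1 = np.array(toy_df["Col 1"])

col1.ndim, col1.shape
```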

1. Make each column into an array, assigning each to a variable. View the arrays.
In [ ]:
#TODO


The core idea in analyzing paired data is to focus on the difference between the paired values rather than the values themselves.
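As an illustration of working with paired differences, the sketch below subtracts two made-up arrays element by element. The names `first` and `second` and their values are hypothetical and unrelated to the blood lead data.

```python
import numpy as np

# Made-up paired measurements: position i in each array belongs to pair i
first = np.array([10, 12, 9, 15, 11])
second = np.array([8, 13, 7, 10, 11])

# NumPy subtracts element by element, giving one difference per pair
toy_diffs = first - second

# The median of the differences summarizes the typical within-pair change
toy_median = np.median(toy_diffs)
```

Because each difference comes from one matched pair, the order of the two arrays must stay aligned.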

1. Subtract one array from the other to get an array of differences. Then, find the median difference for your observed sample.
In [22]:
#TODO

1. Make an array of randomly chosen +1 and -1 values, which we'll call the sign array. This array should be of the same length as the actual data.
In [1]:
#TODO

1. Multiply the array of differences and the sign array. (With Numpy arrays, unlike lists, you can do the whole calculation at once.) Then, find the median difference for this new array.
In [ ]:
#TODO

1. Repeat this procedure 10,000 times, storing the median each time. Plot a histogram of the simulated medians with the observed median difference marked on it to check that your simulation results make sense, and find a two-tailed p-value.
In [ ]:
#TODO
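The sign-flipping steps above can be sketched on a small made-up array of differences. This is one possible implementation under stated assumptions, not the required solution: the values in `toy_diffs` are invented, and the seed, loop structure, and variable names are illustrative choices. The histogram is left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Made-up paired differences (NOT the blood lead data)
toy_diffs = np.array([2, -1, 2, 5, 0, 3, 4, -2, 6, 1])
obs_median = np.median(toy_diffs)

n_sims = 10_000
sim_medians = np.zeros(n_sims)
for i in range(n_sims):
    # Under the null hypothesis, each difference is equally likely to be + or -
    signs = rng.choice([-1, 1], size=len(toy_diffs))
    sim_medians[i] = np.median(signs * toy_diffs)

# Two-tailed p-value: fraction of simulated medians at least as extreme as observed
p_value = np.mean(np.abs(sim_medians) >= abs(obs_median))
p_value
```

Comparing absolute values is what makes the p-value two-tailed: simulated medians far from zero in either direction count as extreme.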

1. Write a sentence interpreting the p-value in the context of the study.
In [23]:
#TODO