Week 5 / Lab 5 - Mini-Project with pandas.ipynb
Author: Dean Neutel

# Lab 5: Python Tricks and More Two Groups Comparisons

In previous labs, the data you worked with was given to you in the form of lists inside the assignment notebook. This works well for simple data but becomes messy with larger or more complex datasets. It's also not how the data you may work with in real life comes packaged.

In this lab, you will learn the basics of using pandas, a common and powerful Python data analysis library that works well with Seaborn and Numpy. The usual abbreviation for importing pandas is "pd".

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns


One of the most useful things we can do with pandas is read files into a Jupyter notebook (or other Python code). This produces a table-like object called a pandas data frame.

Our data is stored in a common format called CSV (comma-separated values), so we will use the pandas read_csv function to read it. The basic syntax is pd.read_csv("filename"), which returns the data frame so that it can be assigned to a variable.
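
As a minimal sketch of what `pd.read_csv` returns, the example below reads a small in-memory CSV rather than a file on disk. The column names and values are made up for illustration, and `io.StringIO` stands in for a real filename.

```python
import io
import pandas as pd

# Hypothetical CSV text; in the lab this would be a file on disk instead.
csv_text = "Group A,Group B\n10,1.5\n20,2.5\n30,3.5\n"

# read_csv accepts a path or any file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels are taken from the header row
```

The first line of the CSV becomes the column labels, and each remaining line becomes one row.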

In this lab, you will look at a dataset of blood loss (in mL) in patients having one of two types of surgery.

1. Read the blood_loss.csv file into your notebook and assign it to a variable. View the resulting object. HINT: Don't use print. It makes the data frame look worse.
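In [15]:
#1
blood_loss = pd.read_csv("blood_loss.csv")  # read the CSV file into a data frame
blood_loss  # a bare name on the last line displays the frame as a formatted table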

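| | Treatment 1 | Treatment 2 |
|---|---|---|
| 0 | 550 | 185.0 |
| 1 | 160 | 209.0 |
| 2 | 250 | 410.0 |
| 3 | 600 | 627.0 |
| 4 | 720 | 90.0 |
| 5 | 680 | 110.0 |
| 6 | 350 | 225.0 |
| 7 | 695 | 142.0 |
| 8 | 720 | 46.0 |
| 9 | 820 | 60.0 |
| 10 | 490 | 127.0 |
| 11 | 790 | 90.0 |
| 12 | 120 | 1223.0 |
| 13 | 240 | 111.0 |
| 14 | 250 | 135.0 |
| 15 | 1690 | 145.0 |
| 16 | 400 | 220.0 |
| 17 | 135 | 310.0 |
| 18 | 460 | 23.0 |
| 19 | 260 | 105.0 |
| 20 | 280 | 255.0 |
| 21 | 780 | 300.0 |
| 22 | 660 | 150.0 |
| 23 | 530 | 175.0 |
| 24 | 360 | 273.0 |
| 25 | 470 | 450.0 |
| 26 | 790 | NaN |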

Notice that, at the end of the second column, there is an "NaN". This stands for "not a number" and is there because the number of observations in the two columns is not the same. We will have to deal with this later, but for now, it's not an issue.
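
To see how pandas represents columns of unequal length, the sketch below builds a small data frame from made-up values (the names `df`, `A`, and `B` are illustrative only) and counts the missing entries in each column.

```python
import numpy as np
import pandas as pd

# Made-up data: column "B" has one fewer observation than column "A".
df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, np.nan]})

missing_per_column = df.isna().sum()  # number of NaN entries in each column
observed_per_column = df.count()      # number of non-missing entries in each column

print(missing_per_column)
print(observed_per_column)
```

Because NaN is a floating-point value, a column that contains one is stored as floats, which is why Treatment 2 prints as 185.0 rather than 185.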

## Making Pretty Plots

As usual, we want to start by visualizing the data. Pandas works well with Seaborn plotting tools. For example, to make side-by-side dotplots of a dataframe called df, you just need to enter sns.stripplot(data=df).

1. Make a beeswarm plot of the blood loss data. Make sure the y-axis is appropriately labeled.
In [20]:
#2
p=sns.swarmplot(data=blood_loss)
p.set(xlabel="Treatment Group", ylabel="Blood Loss (mL)");


Notice that Seaborn automatically labels the categories using the column labels from your original data frame. You can change these if you want, but we'll stay with the originals.

A good figure should have a title. To set one, use the syntax p.set_title("Title") (where p is the name of your plot). You can also use the fontsize option to set the title font size.

1. Add a descriptive title to your plot. Make it reasonably large.
In [22]:
#3
p=sns.swarmplot(data=blood_loss)
p.set(xlabel="Treatment Group", ylabel="Blood Loss (mL)");
p.set_title("Blood Loss When Using Two Surgical Techniques", fontsize=16);


So far, we have used the default Seaborn colors. However, there are plenty of other options. We can set colors manually or, better, choose one of the palettes that Seaborn makes available. Since this dataset is about blood, we might want to use shades of red. To do this, just put palette="Reds" into your plot command.

1. Change the palette used by your plot. You can find a list at https://chrisalbon.com/python/data_visualization/seaborn_color_palettes/ and a detailed explanation (for those who are interested) at https://seaborn.pydata.org/tutorial/color_palettes.html.
In [27]:
#4
p=sns.swarmplot(data=blood_loss, palette="Reds")
p.set(xlabel="Treatment Group", ylabel="Blood Loss (mL)");
p.set_title("Blood Loss When Using Two Surgical Techniques");

1. Make a different type of visualization of this data. Label the axes, make a title, and use a color scheme you like.
In [38]:
#5
p=sns.violinplot(data=blood_loss, palette="Reds")
p.set(xlabel="Treatment Group", ylabel="Blood Loss (mL)");
p.set_title("Blood Loss When Using Two Surgical Techniques");


## Bootstrapping

The two types of surgery seem to have different levels of blood loss. We want to find out what the difference is and get a measure of the associated uncertainty. We will do this by bootstrapping, but first a few technicalities require our attention.

The data in question is in a pandas data frame. For many purposes, this is a good thing, but in this case, the NaN at the end of the second column would cause problems. Workarounds are possible but would be somewhat clumsy. We will therefore just make the two columns into lists and proceed as in earlier labs.

To access a column in a pandas data frame, index the frame with the column's label. For example, to access the column "Col 1" in the data frame df, use df["Col 1"]. We can then use the list function to convert the column into a list.

1. Make each column into a list, assigning each to a variable. View the lists.
In [66]:
#6
Treatment1 = list(blood_loss["Treatment 1"])
Treatment2 = list(blood_loss["Treatment 2"])
display("Treatment 1 List", Treatment1, "Treatment 2 List", Treatment2);

'Treatment 1 List'
[550, 160, 250, 600, 720, 680, 350, 695, 720, 820, 490, 790, 120, 240, 250, 1690, 400, 135, 460, 260, 280, 780, 660, 530, 360, 470, 790]
'Treatment 2 List'
[185.0, 209.0, 410.0, 627.0, 90.0, 110.0, 225.0, 142.0, 46.0, 60.0, 127.0, 90.0, 1223.0, 111.0, 135.0, 145.0, 220.0, 310.0, 23.0, 105.0, 255.0, 300.0, 150.0, 175.0, 273.0, 450.0, nan]

We're almost done, but the list for Treatment 2 still has that pesky NaN. Since it's at the end, the easiest way to get rid of it is to get all the other list elements and assign the new list to the same variable as the old one.

1. Make a list without the NaN. HINT: You may want to review indexing.
In [67]:
#7
Treatment2 = []
# Copy every element except the last one, which is the NaN
for i in range(len(blood_loss["Treatment 2"]) - 1):
    Treatment2.append(blood_loss["Treatment 2"][i])
Treatment2

[185.0, 209.0, 410.0, 627.0, 90.0, 110.0, 225.0, 142.0, 46.0, 60.0, 127.0, 90.0, 1223.0, 111.0, 135.0, 145.0, 220.0, 310.0, 23.0, 105.0, 255.0, 300.0, 150.0, 175.0, 273.0, 450.0]
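
Two shorter ways to drop the trailing NaN are sketched below on made-up values: slicing off the last element, which matches the indexing hint, and `Series.dropna()`, which removes missing values wherever they occur. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up column whose last entry is missing.
column = pd.Series([185.0, 209.0, 410.0, np.nan])

# Option 1: convert to a list and slice off the final element.
values = list(column)
without_last = values[:-1]

# Option 2: let pandas drop every missing value, then convert to a list.
without_nan = column.dropna().tolist()

print(without_last)
print(without_nan)
```

Slicing works here only because the NaN is known to be the last element; `dropna()` does not depend on where the missing values sit.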

We are now ready to compute confidence intervals and p-values.

1. Referring back to the visualizations you made earlier, pick a descriptor for the data and compute it for both treatments. Briefly justify your choice.
In [68]:
#8
Median_1 = np.median(Treatment1)
Median_2 = np.median(Treatment2)
display("Median of Treatment 1 List",Median_1, "Median of Treatment 2 List", Median_2)

'Median of Treatment 1 List'
490.0
'Median of Treatment 2 List'
162.5
1. Pick a measure you want to compare about the two groups. Most likely, that is a measure of central tendency or variation, but you could use something else. Find your observed difference.
In [75]:
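#9
differenceMedian = Median_1 - Median_2  # observed difference in medians
# Loop bodies inferred from the output below: absolute deviations from each group's median
absolute1 = []
absolute2 = []
for i in Treatment1:
    absolute1.append(abs(i - Median_1))
for i in Treatment2:
    absolute2.append(abs(i - Median_2))
# Median absolute deviation of each group, then the observed difference
display([np.median(absolute1)], [np.median(absolute2)], "Observed Difference", differenceMedian)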

[230.0]
[67.5]
'Observed Difference'
327.5
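
The two bracketed values in the output match the median absolute deviation (MAD) of each group, a measure of spread that pairs naturally with the median. A minimal sketch of that calculation with NumPy, using the Treatment 1 values listed earlier (the helper name `mad` is an assumption, not part of the lab):

```python
import numpy as np

def mad(values):
    """Median of the absolute deviations from the median."""
    arr = np.asarray(values, dtype=float)
    return np.median(np.abs(arr - np.median(arr)))

# Treatment 1 values from the blood loss data.
treatment1 = [550, 160, 250, 600, 720, 680, 350, 695, 720, 820, 490, 790, 120, 240,
              250, 1690, 400, 135, 460, 260, 280, 780, 660, 530, 360, 470, 790]

median1 = np.median(treatment1)
mad1 = mad(treatment1)
print(median1, mad1)
```

Neither statistic is pulled far by the single 1690 mL observation, which is the usual argument for preferring them to the mean and standard deviation when outliers are present.
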
1. Using 10,000 bootstrap replicates, find the 99% pivotal confidence interval for the difference.
In [88]:
#10
total = 10000
treatment_difference = np.zeros(total)
for i in range(total):
    # Resample each group with replacement, keeping its original size
    Random_Treatment1 = np.random.choice(Treatment1, len(Treatment1))
    Random_Treatment2 = np.random.choice(Treatment2, len(Treatment2))
    difference = np.median(Random_Treatment1) - np.median(Random_Treatment2)
    treatment_difference[i] = difference
treatment_difference.sort()
# 0.5th and 99.5th percentiles of the sorted bootstrap differences (99% interval)
M_lower = treatment_difference[49]
M_upper = treatment_difference[9949]
# The pivotal interval is reflected around the observed difference in the sample medians
M_observed = Median_1 - Median_2
M_upper_pivotal = 2*M_observed - M_lower
M_lower_pivotal = 2*M_observed - M_upper
display("M Pivotal(Lower): Red Line", M_lower_pivotal, "M Pivotal(Upper): Red Line", M_upper_pivotal)
# histplot replaces the deprecated distplot(kde=False)
p = sns.histplot(treatment_difference)
p.set(xlabel="Difference in Treatment 1 and Treatment 2 Medians", ylabel="Count")
p.axvline(M_lower_pivotal, color="red");
p.axvline(M_upper_pivotal, color="red");
p.axvline(M_observed, color="blue");

'M Pivotal(Lower): Red Line'
103.5
'M Pivotal(Upper): Red Line'
555.0
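
For reference, a pivotal interval reflects the bootstrap percentiles around the observed statistic: if q_lo and q_hi are the 0.5th and 99.5th percentiles of the bootstrap differences and d is the observed difference, the 99% interval runs from 2d - q_hi to 2d - q_lo. The sketch below packages that calculation. The function name, the seeded generator, and the toy data are assumptions for illustration, not part of the lab.

```python
import numpy as np

def pivotal_ci_median_diff(a, b, level=0.99, n_boot=10_000, seed=0):
    """Pivotal bootstrap confidence interval for median(a) - median(b)."""
    rng = np.random.default_rng(seed)  # seeded so the result is reproducible
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = np.median(a) - np.median(b)

    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choice(a, size=a.size, replace=True)
        resample_b = rng.choice(b, size=b.size, replace=True)
        diffs[i] = np.median(resample_a) - np.median(resample_b)

    alpha = 1 - level
    q_lo, q_hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    # Reflect the bootstrap percentiles around the observed difference.
    return observed, 2 * observed - q_hi, 2 * observed - q_lo

# Toy data, not the lab's dataset.
group_a = [12, 15, 9, 20, 18, 14, 11, 16]
group_b = [5, 7, 6, 9, 4, 8, 10, 6]
observed, lower, upper = pivotal_ci_median_diff(group_a, group_b, n_boot=2000)
print(observed, lower, upper)
```

Using `np.percentile` avoids hard-coding indices such as 49 and 9949, so the same function works for other confidence levels and replicate counts.
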
1. Write a sentence interpreting your effect size and confidence interval in the context of the study.

#11 The observed difference in median blood loss is 327.5 mL (Treatment 1 minus Treatment 2), and the 99% pivotal confidence interval runs from 103.5 mL to 555.0 mL. Because the interval does not contain 0, the null-hypothesis value, the difference between the two treatment medians is statistically significant at the 1% level.

We can also find a p-value for the observed difference. Since Treatment 1 seems to have considerably more variability than Treatment 2, even after excluding outliers, the two-box method makes sense.

1. Using the two-box method, find the two-sided p-value for the observed difference. HINT: There are two ways to recenter your data: you can use a for loop or convert the list to a Numpy array and just subtract.
In [92]:
#12
# Recenter each group on its own median so that both boxes satisfy the null hypothesis
Treatment1_centered = np.array(Treatment1) - Median_1
Treatment2_centered = np.array(Treatment2) - Median_2
total = 10000
treatment_difference = np.zeros(total)
differenceMedian = Median_1 - Median_2
other_limit = -differenceMedian
for i in range(total):
    # Resample each recentered group separately (two boxes)
    Random_Treatment1 = np.random.choice(Treatment1_centered, len(Treatment1_centered))
    Random_Treatment2 = np.random.choice(Treatment2_centered, len(Treatment2_centered))
    treatment_difference[i] = np.median(Random_Treatment1) - np.median(Random_Treatment2)
# histplot replaces the deprecated distplot(kde=False)
p = sns.histplot(treatment_difference)
p.set(xlabel="Difference in Treatment 1 and Treatment 2 Medians", ylabel="Count")
p.axvline(differenceMedian, color="red");
p.axvline(other_limit, color="red");
p.axvline(np.median(treatment_difference), color="blue");
p.set_title("Difference in Median Blood Loss When Using Two Surgical Techniques");
# Two-sided p-value: fraction of simulated differences at least as extreme as the observed one
pvalue = (np.sum(treatment_difference >= differenceMedian) + np.sum(treatment_difference <= other_limit))/total
display("P Value", pvalue)

'P Value'
0.0
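
The two-box procedure can be sketched as a function: recenter each group on its own median so that the null hypothesis of no difference holds, resample each group separately, and count how often the simulated difference is at least as extreme as the observed one. The function name, seed, and toy data below are assumptions for illustration.

```python
import numpy as np

def two_box_pvalue(a, b, n_boot=10_000, seed=0):
    """Two-sided bootstrap p-value for a difference in medians (two-box method)."""
    rng = np.random.default_rng(seed)  # seeded so the result is reproducible
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = np.median(a) - np.median(b)

    # Recenter each group so that both have median 0 (the null hypothesis).
    a0 = a - np.median(a)
    b0 = b - np.median(b)

    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Each group is resampled from its own box, preserving its own spread.
        resample_a = rng.choice(a0, size=a0.size, replace=True)
        resample_b = rng.choice(b0, size=b0.size, replace=True)
        diffs[i] = np.median(resample_a) - np.median(resample_b)

    # Count simulated differences at least as far from 0 as the observed one.
    extreme = np.sum(np.abs(diffs) >= abs(observed))
    return observed, extreme / n_boot

# Toy data, not the lab's dataset.
group_a = [12, 15, 9, 20, 18, 14, 11, 16]
group_b = [5, 7, 6, 9, 4, 8, 10, 6]
observed, p_value = two_box_pvalue(group_a, group_b, n_boot=2000)
print(observed, p_value)
```

With N replicates, the smallest nonzero p-value this estimate can produce is 1/N, so a count of zero is best read as p < 1/N rather than as exactly zero.
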
1. Write a sentence interpreting the p-value.

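#13 None of the 10,000 simulated differences was as extreme as the observed difference, so the p-value is below 1/10,000 = 0.0001 rather than exactly 0. Since this is below our alpha value of 0.01, the result is statistically significant and we reject the null hypothesis of no difference in median blood loss between the two treatments. The p-value is the probability of seeing a difference at least this large if the null hypothesis were true; it is not the probability that the result is due to random chance.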
