Author: Valeria Rojas
Description: Jupyter notebook Week 1/hw1-turnin.ipynb

# Homework 1: Visualizing Data

Name: Valeria Rojas

I collaborated with: Eva Guillory and Usma Rahman

In :
# Dataset: Representative dataset of the ages of first-time mothers in the USA in 1980 and 2016

ages1980 = [23,32,20,27,28,20,29,25,19,19,19,27,27,25,24,18,17,20,22,17,22,20,19,23,16,20,20,15,17,15,15,19,21,25,17,28,18,23,24,24,20,24,23,28,33,25,25,21,24,20,21,25,17,21,24,29,34,22,20,23,23,16,22,24,23,16,22,20,17,27,23,21,16,19,29,25,23,16,19,21,26,26,26,31,17,20,18,21,22,27,30,16,17,21,26,16,31,22,30,21,24,22,23,22,26,27,24,28,27,20,21,20,22,21,33,22,18,28,19,24,26,18,22,14,23,19,17,18,23,29,25,28,18,32,19,21,25,18,18,19,18,23,24,18,21,21,26,26,18,25,19,27,21,27,20,27,19,19,36,22,29,19,26,20,17,17,20,19,25,21,28,31,17,18,29,24,26,19,19,23,16,30,18,28,18,20,31,24,26,22,32,25,18,22,30,19,35,20,15,30]
ages2016 = [21,29,41,23,35,16,18,19,26,29,33,19,18,28,21,18,24,31,37,31,27,36,26,23,34,25,22,33,27,26,26,30,32,20,28,20,24,20,22,33,24,20,19,23,19,28,34,33,19,38,31,16,38,19,22,25,31,26,34,21,20,30,27,20,29,14,18,30,31,17,36,33,32,32,19,29,35,28,22,27,34,29,28,22,20,29,33,22,25,26,26,31,30,26,37,34,27,20,21,25,32,18,28,27,35,31,31,26,25,24,30,15,28,26,17,35,27,24,23,20,24,21,31,18,29,24,27,28,20,21,21,32,23,35,39,19,28,30,31,23,21,39,23,32,25,28,36,19,23,17,21,30,27,32,27,22,19,22,28,28,24,32,22,37,21,30,30,23,29,29,18,40,33,27,25,26,22,29,27,30,29,30,25,28,24,23,17,19,36,18,29,22,25,20,20,34,32,21,21,25]

In :
#Problem 1 Part A
%matplotlib inline
import seaborn as sns
p=sns.stripplot(y=ages1980, color="hotpink") #first dotplot for mothers in 1980; each dot is one mother
p.set(xlabel="", ylabel="Age (years)") #ages run along the vertical axis

In :
p1=sns.stripplot(y=ages2016, color="gold") #second dotplot for mothers in 2016; each dot is one mother
p1.set(xlabel="", ylabel="Age (years)") #ages run along the vertical axis

In :
p2=sns.swarmplot(x=ages1980, color="purple") #first beeswarm plot of mothers in 1980
p2.set(xlabel="Age (years)", ylabel="") #ages run along the horizontal axis

In :
p3=sns.swarmplot(x=ages2016, color="orange") #second beeswarm plot of mothers in 2016
p3.set(xlabel="Age (years)", ylabel="") #ages run along the horizontal axis

In :
p4=sns.histplot(x=ages1980, kde=True, stat="density", color="limegreen") #first histogram (with density curve) of mothers in 1980
p4.set(xlabel="Age (years)", ylabel="Density") #bar heights are densities, not percentages

In :
p5=sns.histplot(x=ages2016, kde=True, stat="density", color="silver") #second histogram (with density curve) of mothers in 2016
p5.set(xlabel="Age (years)", ylabel="Density") #bar heights are densities, not percentages

In :
p6=sns.violinplot(x=ages1980) #first violin plot of mothers in 1980
p6.set(xlabel="Age (years)", ylabel="") #ages run along the horizontal axis

In :
p7=sns.violinplot(x=ages2016) #second violin plot of mothers in 2016
p7.set(xlabel="Age (years)", ylabel="") #ages run along the horizontal axis

In :
pbox=sns.boxplot(y=ages1980, color="violet") #first boxplot for mothers in 1980
pbox.set(xlabel="", ylabel="Age (years)") #ages run along the vertical axis

In :
pbox1=sns.boxplot(y=ages2016, color="red") #second boxplot for mothers in 2016
pbox1.set(xlabel="", ylabel="Age (years)") #ages run along the vertical axis

In :
#Problem 1 Part B
import numpy as np
m1980mean=np.mean(ages1980) #mean age of mothers in 1980
print(m1980mean)

m1980median=np.median(ages1980)  #median age of mothers in 1980
print(m1980median)

m1980variance=np.var(ages1980)  #variance of ages in 1980 (NumPy default ddof=0, i.e. divides by n)
print(m1980variance)

m1980standard_deviation=np.std(ages1980)  #standard deviation of ages in 1980 (also ddof=0)
print(m1980standard_deviation)

22.55
22.0
20.8975
4.571378347938398
In :
import numpy as np
m2016mean=np.mean(ages2016) #mean age of mothers in 2016
print(m2016mean)

m2016median=np.median(ages2016)  #median age of mothers in 2016
print(m2016median)

m2016variance=np.var(ages2016)  #variance of ages in 2016 (NumPy default ddof=0, i.e. divides by n)
print(m2016variance)

m2016standard_deviation=np.std(ages2016)  #standard deviation of ages in 2016 (also ddof=0)
print(m2016standard_deviation)

26.31
26.0
33.6439
5.800336197152713
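The variance and standard deviation printed above come from NumPy's default settings, which divide by n (the population formulas). When the ages are treated as a sample, the n - 1 versions are often preferred. The sketch below uses Python's standard `statistics` module on a small made-up list (`sample_ages` is hypothetical and not taken from the dataset) to show how the two conventions differ.

```python
import statistics

# Hypothetical ages, chosen only to keep the arithmetic easy to follow.
sample_ages = [18, 20, 22, 24, 26]

mean_age = statistics.mean(sample_ages)

# Population variance: sum of squared deviations divided by n.
pop_var = statistics.pvariance(sample_ages)

# Sample variance: sum of squared deviations divided by n - 1.
samp_var = statistics.variance(sample_ages)

print("mean:", mean_age)
print("population variance (divide by n):", pop_var)
print("sample variance (divide by n - 1):", samp_var)
```

With only 5 values the two variances differ noticeably; with 200 values, as in the homework data, the gap would be much smaller.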
In :
#Problem 1 Part C

#I would use the histogram because it shows the distribution in two ways. The bars show how the share of mothers rises and falls across ages, and the density curve drawn over the bars summarizes the overall trend.

#Secondly, I would use the mean as the descriptive statistic because it takes every mother's age into account. The median only reports the middle value, whereas the mean reflects all of the ages in the surveyed sample.
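As a rough illustration of the difference between the two statistics, the sketch below compares the mean and the median on a small made-up list before and after adding one unusually high age. The lists `typical_ages` and `with_outlier` are hypothetical and are not part of the homework data. Because the mean uses every value, it moves more than the median when an extreme value is added.

```python
import statistics

# Hypothetical ages; the second list adds one unusually old first-time mother.
typical_ages = [19, 21, 22, 23, 25]
with_outlier = typical_ages + [45]

mean_before = statistics.mean(typical_ages)
median_before = statistics.median(typical_ages)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("without outlier: mean", mean_before, "median", median_before)
print("with outlier:    mean", round(mean_after, 2), "median", median_after)
```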

In :
#Problem 1 Part D

#The mean is slightly larger than the median in both years (22.55 vs. 22.0 in 1980 and 26.31 vs. 26.0 in 2016), which is consistent with a mild right skew. The mean also increases from 1980 to 2016.

#The histograms appear thick-tailed. Judging by the density curve drawn over the bars, the distribution looks bimodal for 2016 and unimodal for 1980.
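One rough way to check a claim about skew numerically is to compare the mean with the median and to compute a moment-based skewness coefficient. The sketch below does both on a made-up list; `right_tailed` and the helper `moment_skewness` are hypothetical names introduced here, and a mean above the median is only an indicator of right skew, not proof.

```python
import statistics

def moment_skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    m = statistics.mean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    return sum((v - m) ** 3 for v in values) / (n * sd ** 3)

# Hypothetical ages with a long tail toward higher values.
right_tailed = [18, 19, 20, 20, 21, 22, 24, 30, 38]

mean_above_median = statistics.mean(right_tailed) > statistics.median(right_tailed)
skew = moment_skewness(right_tailed)

print("mean above median:", mean_above_median)
print("skewness is positive:", skew > 0)
```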

In :
#Problem 2

pebbles_number=[0,6,12,6,6,5,2,3,3,2,1,0,2,1,1] #values for the number of pebbles, as a list
pebbles_number_median=np.median(pebbles_number) #median of the number of pebbles
print(pebbles_number_median) #median for the first y-axis (number of pebbles), referring to the bar graph

pebbles_percent=[0,1.7,4.1,5.9,7,8.1,8.3,9.2,10,10.2,10.3,10.28,11.3,11.4,11.9,11.88] #values for the % of pebbles, as a list
pebbles_percent_median=np.median(pebbles_percent) #median of the % of pebbles
print(pebbles_percent_median) #median for the second y-axis (% of pebbles), referring to the curved line

2.0
9.6
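The two medians above can be checked by hand: sort the values, then take the middle value when the count is odd or the average of the two middle values when the count is even. The helper `median_by_hand` below is a hypothetical name used only for this sketch; the two lists are copied from the cell above.

```python
def median_by_hand(values):
    """Return the middle value, or the mean of the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2

pebbles_number = [0, 6, 12, 6, 6, 5, 2, 3, 3, 2, 1, 0, 2, 1, 1]
pebbles_percent = [0, 1.7, 4.1, 5.9, 7, 8.1, 8.3, 9.2, 10, 10.2,
                   10.3, 10.28, 11.3, 11.4, 11.9, 11.88]

# 15 values: the median is the 8th smallest value.
number_median = median_by_hand(pebbles_number)
# 16 values: the median is the mean of the 8th and 9th smallest values.
percent_median = median_by_hand(pebbles_percent)

print(number_median)
print(percent_median)
```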
In [ ]:
#sketch graph of the number (#) of pebbles (separate PDF file)
#(In this folder there is a separate PDF file named "HW 1 Sketch es". The sketches for Problems 2 and 4 are in that PDF file.)

In [ ]:
#sketch graph of the percentage (%) of pebbles (separate PDF file)
#(In this folder there is a separate PDF file named "HW 1 Sketch es". The sketches for Problems 2 and 4 are in that PDF file.)

In :
#Problem 3 Part A
#The first dot (from left to right) has a value of about 42.
#The second dot has a value of approximately 50.
#The first line (the lower whisker) extends from about 63 to 84.
#The box extends from about 84 to 95.
#The final line (the upper whisker) extends from about 95 to 99.


In [ ]:
#Problem 3 Part B
#The box plot is skewed to the left: most of the values lie to the right, and the lower whisker is much longer than the upper one. The box plot also has outliers at about 42 and 50. The middle half of the values is concentrated between about 84 and 95.
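The parts of a box plot described above (box, whiskers, and outlier dots) can be reproduced numerically. The sketch below assumes the plot follows the common 1.5 x IQR rule for whiskers, which the homework figure may or may not use, and it works on made-up scores (`scores` is hypothetical, only loosely shaped like the plot described above).

```python
import statistics

# Hypothetical exam scores, loosely shaped like the box plot described above.
scores = [42, 50, 68, 75, 80, 84, 86, 87, 88, 89, 90,
          91, 92, 93, 94, 95, 96, 97, 98, 99, 99]

# method="inclusive" treats the minimum and maximum as the 0th and 100th percentiles.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as separate dots (outliers).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

# Whiskers end at the most extreme scores that are still inside the fences.
whisker_low = min(s for s in scores if s >= lower_fence)
whisker_high = max(s for s in scores if s <= upper_fence)

print("box:", q1, "to", q3, "median:", q2)
print("whiskers:", whisker_low, "to", whisker_high)
print("outliers:", outliers)
```

In this made-up example the lower whisker is longer than the upper one and both outliers sit on the low side, which is the pattern of a left-skewed distribution.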

In [ ]:
#Problem 3 Part C
#The exam was most likely easy because the median, which lies within the interquartile range, and most of the other scores fall above about 82. More than half of the class therefore did relatively well, which suggests the exam was on the easy side.

In [ ]:
#Problem 4
#sketch on separate paper, along with explanations for 1 appropriate graph and 2 inappropriate graphs for the distribution of means
#(In this folder there is a separate PDF file named "HW 1 Sketch es". The sketches for Problems 2 and 4 are in that PDF file.)

In [ ]:
#Problem 5
#Yes, at least 4 of the shells can weigh less than 200 grams.
#For example, 4 shells can weigh 100 g each (adding up to 400 g), 2 shells can weigh 50 g each (adding up to 100 g), and 4 more shells can weigh 375 g each (adding up to 1500 g). Adding the totals (400 g + 100 g + 1500 g) gives 2000 g. Dividing that by 10 (the number of shells) gives 200 g, the average weight of the 10 randomly collected shells.
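The arithmetic in this example can be checked with a few lines of Python. The weights are the ones proposed in the answer above; the variable names are introduced here only for the check.

```python
# Weights in grams from the example above: 4 x 100 g, 2 x 50 g, 4 x 375 g.
shell_weights = [100] * 4 + [50] * 2 + [375] * 4

total_weight = sum(shell_weights)
mean_weight = total_weight / len(shell_weights)

# Count the shells that are lighter than the 200 g average.
lighter_than_mean = sum(1 for w in shell_weights if w < 200)

print("number of shells:", len(shell_weights))
print("total weight (g):", total_weight)
print("mean weight (g):", mean_weight)
print("shells under 200 g:", lighter_than_mean)
```

In this example 6 of the 10 shells weigh less than the mean, so the condition of at least 4 shells under 200 g is satisfied.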

In [ ]:
#Problem 6
#One thing done well is running 3 studies to get three sets of data and comparing them. Also, this graph shows a better comparison between both groups.
#A second thing done well is explaining what the error bars represent. While the mean is not the preferred statistic, the graph at least tells the viewer what the bars represent.

#One thing that can be improved is removing the error bars, because they confine the displayed values to a narrow range. Also, although the graph states what the error bars represent, they show the standard error of the mean rather than a spread around the median (which is preferred).
#Another thing that can be improved is using a different type of graph to present the data. Bar graphs suppress the shape of the distribution of the number of words written by students, which can make it harder to differentiate between the two groups being tested. If a bar graph is still used, a dot plot should at least be included next to it to show the distribution of the data.
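The point that a bar graph hides the shape of the distribution can be illustrated with two made-up groups that would produce bars of exactly the same height. The lists `clustered` and `spread_out` and the helper `standard_error` are hypothetical and are not taken from the study in the problem.

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean: sample standard deviation divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical word counts for two groups of students.
clustered = [48, 49, 50, 50, 50, 51, 52]
spread_out = [20, 30, 40, 50, 60, 70, 80]

for name, group in [("clustered", clustered), ("spread out", spread_out)]:
    print(name,
          "mean:", statistics.mean(group),
          "min:", min(group),
          "max:", max(group),
          "SEM:", round(standard_error(group), 2))
```

Both groups have a mean of 50, so their bars would look identical; only the error bars, or a dot plot of the individual values, would reveal how differently the values are spread.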
