
© Copyright 2016 Dr Marta Milo, University of Sheffield.

Week 2 Practical - Solutions

This Notebook contains the solutions to the practical assignments for Week 2.

These are solutions to the assigned exercises, but they are not unique. All the proposed exercises can be solved in different ways, some more efficient than others, but all are correct if the requested output is obtained.

For this reason, when you compare your solutions with these, bear in mind that you might have found the solution using a different algorithm and different commands; nonetheless they are still correct. These solutions are a guide and also a source of ideas that can be merged with yours to build and improve your scripts.

To help you evaluate your notebooks, please use the following criteria:

  • Overall clarity (score = 0.25)

  • Correctness of the code (score = 0.25)

  • Exhaustive cover of required analysis (score= 0.25)

  • Interpretation of the results (score = 0.25)

These criteria are also used when peer-marking and self-marking. Please read the guidelines given to quantify the above criteria with a grade. Make sure that you score interpretation with a full mark if it is exhaustive and clear.

Data Frames

Exercise 1: Create another field, for a second match score, in the game_cards data frame and access each field separately to store them into vectors. Use the markdown cell to describe your steps. (Tip: to access a data frame field you use `$`, i.e. `game_cards$names`, as below.)

Please note that when creating data frames R turns names into factors. R does this by default when creating data frames from string vectors. We use stringsAsFactors=FALSE to avoid this.
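A quick illustrative sketch (not part of the original solution): in versions of R before 4.0, character vectors were converted to factors by default; since R 4.0 the default is already stringsAsFactors = FALSE.

df <- data.frame(x = c("a", "b"))                             # old default: x becomes a factor
class(df$x)
df2 <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)  # x stays character
class(df2$x)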

names <- c('Bob','Claire','Luisa','Matt','Marta','Mike')
score <- c(34,82,59,72,50,100)
game_cards <- data.frame(names, score, stringsAsFactors=FALSE)
game_cards
   names score
1    Bob    34
2 Claire    82
3  Luisa    59
4   Matt    72
5  Marta    50
6   Mike   100
game_cards$score2 <- c(35,70,60,70,55,100)
game_cards
   names score score2
1    Bob    34     35
2 Claire    82     70
3  Luisa    59     60
4   Matt    72     70
5  Marta    50     55
6   Mike   100    100

We can add a new field to the data structure by using a new column name with the syntax game_cards$. We can also rename the columns using colnames(). We can then use the same syntax to store the content of each column in vectors.

score1 <- game_cards$score
score2 <- game_cards$score2
score1
score2
[1] 34 82 59 72 50 100
[1] 35 70 60 70 55 100

Exercise 2: Do a simple analysis of the data frame that you have created.

  • First rename the columns into match1 and match2 (Tip: find out about colnames())

  • Calculate the dimensions of the frame, the overall minimum score and overall maximum score, and the minimum and maximum scores for match1 and match2. Who were the players associated with those?

  • Order the scores for each match using the nested command game_cards[order(game_cards$score),]. Why do we need to use the function order() in this way? What is its output? Use the markdown cell to explain.

Step 1: rename columns

colnames(game_cards) <- c("Names","match1","match2")
print(game_cards)
   Names match1 match2
1    Bob     34     35
2 Claire     82     70
3  Luisa     59     60
4   Matt     72     70
5  Marta     50     55
6   Mike    100    100

Step 2: Calculate the dimensions of the frame, the overall minimum score and overall maximum score, and the minimum and maximum scores for match1 and match2. Identify the players associated with them.

omin <- min(game_cards[,2:3])
omax <- max(game_cards[,2:3])  # min and max are only defined on numerical values, so we exclude the names column
print(omin)
print(omax)

match1_min <- min(game_cards$match1)
match1_max <- max(game_cards$match1)
print(paste("first match min:", match1_min, sep=" "))
ind <- which(game_cards[,2] == match1_min, arr.ind=TRUE)
print(paste("Player:", game_cards$Names[ind]))
print(paste("first match max:", match1_max, sep=" "))
ind <- which(game_cards[,2] == match1_max, arr.ind=TRUE)
print(paste("Player:", game_cards$Names[ind]))

match2_min <- min(game_cards$match2)
match2_max <- max(game_cards$match2)
print(paste("second match min:", match2_min, sep=" "))
ind <- which(game_cards[,3] == match2_min, arr.ind=TRUE)
print(paste("Player:", game_cards$Names[ind]))
print(paste("second match max:", match2_max, sep=" "))
ind <- which(game_cards[,3] == match2_max, arr.ind=TRUE)
print(paste("Player:", game_cards$Names[ind]))
[1] 34
[1] 100
[1] "first match min: 34"
[1] "Player: Bob"
[1] "first match max: 100"
[1] "Player: Mike"
[1] "second match min: 35"
[1] "Player: Bob"
[1] "second match max: 100"
[1] "Player: Mike"

Step 3: order scores

order_match1 <- game_cards[order(game_cards$match1),]
order_match1
   Names match1 match2
1    Bob     34     35
5  Marta     50     55
3  Luisa     59     60
4   Matt     72     70
2 Claire     82     70
6   Mike    100    100
order_match2 <- game_cards[order(game_cards$match2),]
order_match2
   Names match1 match2
1    Bob     34     35
5  Marta     50     55
3  Luisa     59     60
2 Claire     82     70
4   Matt     72     70
6   Mike    100    100

The function order() gives as output the indices of the sorted elements; for this reason we place those indices (the output of order()) as the row index of game_cards.
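To make this concrete, a small sketch (using the data frame above) of what order() returns:

order(game_cards$match1)
# [1] 1 5 3 4 2 6 -- the row indices that sort match1 in ascending order
game_cards[order(game_cards$match1), ]  # using them as a row index reorders the whole frame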

Visualise the data

To visualise the data we use graphs that can be generated in different ways. The command plot() is a general graphical function that enables data plotting.

Exercise 3: Explore plot() with the R help and in a markdown command describe its characteristics and how we can change/add features.

We can also subdivide the plotting area into different blocks to enable adjacent plots. We can do this using the command par(mfrow=c(nr,nc)), where nr is the number of rows and nc the number of columns. For example, to plot two bar charts on the same row we use par(mfrow=c(1,2)), as in the sketch below.
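A minimal sketch of this layout (illustrative only, with placeholder data):

par(mfrow = c(1, 2))   # 1 row, 2 columns of plots
plot(1:10, main = "left panel")
plot(10:1, main = "right panel")
par(mfrow = c(1, 1))   # reset to a single plotting area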

?plot()

plot() is a generic function for plotting R objects. It is a basic function used for generic X-Y plotting. A simple syntax is plot(x, y, ...), where x gives the x coordinates of the points in the plot and y the y coordinates. The ... gives room for further arguments, such as graphical parameters, the plot type, titles, and axis labels.

The parameters of the plot can be defined using the par() function, which can be used within the call to plot() or set in the environment, as described in the example.

Some of the graphical parameters are: type is used to define the plot type, for instance "p" for points, "l" for lines, and "h" for histogram-like vertical lines.

main is used to display the title of the graph (sub for subtitle).

xlab and ylab are used to label the x- and y- axes, respectively.
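As an illustration (a sketch with made-up data, not part of the original solution), combining these arguments:

x <- 1:10
y <- x^2
plot(x, y, type = "l",                       # draw a line instead of points
     main = "Example plot", sub = "a sub-title",
     xlab = "x values", ylab = "y = x squared")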

Exercise 4: The command par() implements a variety of more or less complex settings. Explore the par() command and in a markdown cell write a brief summary of settings that you think might be useful when presenting data.

There are a variety of graphical parameters that can be adjusted. Parameters can be set by passing arguments in tag = value form, or as a list of tagged values. Some useful adjustments to help display data (demonstrated in the sketch after this list) include:

adj: a value between 0 and 1 specifying how text is justified; 0 for left-justified, 0.5 for centred, 1 for right-justified.

ann: annotation. The default is to annotate axis titles and the main title; if set to FALSE, there are no annotations.

bg: to specify the colours to be used in the background.

cex, cex.axis, cex.lab, cex.main, cex.sub: magnification of plotted text, with 1 as the default magnification value. Values less than 1 make the text smaller.

col, col.axis, col.lab, col.main, col.sub: specifies the default colour of the plot. Default is black.

font, font.axis, font.lab, font.main, font.sub: specifies which font to use; is an integer. 1 = plain default text, 2 = bold, 3 = italic, 4 = bold italic.

lab: a numerical vector c(x, y, len) that modifies the default axis annotation. The values x and y give the approximate number of tick marks on the x and y axes, while len specifies the label length. Default is c(5, 5, 7).

las: style of axis labels; 0 = always parallel to axis (default); 1 = always horizontal; 2 = always perpendicular to axis; 3 = always vertical.

lty: specifies line type. Can either be an integer (0 = blank, 1 = solid (default), 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash), or as character strings blank, solid, etc.

lwd: line width. Positive number. Default = 1.

mfcol, mfrow: a vector c(nr, nc); draws subsequent figures in an nr-by-nc array, filled by column or by row respectively.

pin: The current plot dimensions, (width, height), in inches.

xlog, ylog: Defaults to FALSE (linear scale). If TRUE, uses a logarithmic scale.
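A short sketch (illustrative, with placeholder data) combining a few of the settings above:

par(mfrow = c(1, 2),       # two panels on one row
    cex.main = 1.2,        # slightly larger titles
    las = 1,               # horizontal axis labels
    col.lab = "navyblue")  # coloured axis labels
plot(1:10, type = "l", lty = 2, lwd = 2, main = "dashed, thick line")
plot(1:10, type = "p", main = "points")
par(mfrow = c(1, 1))       # reset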

Exercise 5: What is the scatter plot useful for? Plot score against itself. What do you get and why? Explore the data frame you created in Exercise 1. Explain in a markdown cell.

The scatter plot is useful to visualise the relationship between two variables; usually the independent variable is on the x axis and the dependent variable on the y axis. Plotting score against score yields points that form a 45-degree line, a perfectly linear relationship, because x = y for every point.

score <- c(34,82,59,72,50,100)
#score1 <- c(24,32,69,56,45,90)
plot(score, score)
abline(0,1)  # draws the 45-degree line
Image in a Jupyter notebook
game_cards_plot <- t(game_cards[-1])
colnames(game_cards_plot) <- game_cards$Names
game_cards_plot
barplot(game_cards_plot, beside=TRUE,
        xlab="Player", ylab="Score",
        main="Results for card games",
        ylim=c(0,130),
        legend=c("Match 1", "Match 2"),
        col=c("lightblue","red"))
       Bob Claire Luisa Matt Marta Mike
match1  34     82    59   72    50  100
match2  35     70    60   70    55  100
Image in a Jupyter notebook

The data from Exercise 1 can be presented with a grouped bar plot instead of a scatter plot, since it would allow for a visual comparison of each player's scores for both games.

To create the bar plot above, I first converted my data frame into a matrix, since height in barplot() can only be a vector or a matrix. Please note that the matrix has dimensions 2x6, as opposed to the 6x2 of the data frame: the function barplot() treats each column as a group of bars. I set beside to TRUE to obtain grouped bars instead of stacked bars. Simple labelling was achieved through main, xlab, and ylab. The legend defaults to the row names, which here distinguish the two matches. Finally, ylim was set above 100, the theoretical maximum score in my card games, to make room for the legend box.

Exercise 6: Explore the iris data available in R. In the same print area plot two scatter plots, of Sepal Length versus Petal Length and of Sepal Width versus Petal Width. What do you find?

par(mfrow=c(1,2))
plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", ylab="Petal Length")
plot(iris$Sepal.Width, iris$Petal.Width, xlab="Sepal Width", ylab="Petal Width")
Image in a Jupyter notebook

From the scatter plots it appears that both pairs of variables exhibit a positive correlation. Also, the data are separated into distinct classes that exhibit different levels of correlation. The bigger cluster is more positively correlated than the smaller cluster.

Exercise 7: Explore the iris dataset with linear regression and explain in a markdown cell your findings. Add titles and axis legends to the plots. Explore different colors and markers.

par(mfrow=c(2,2), bg="seashell", fg="navyblue")

plot(iris$Sepal.Length ~ iris$Petal.Length, xlab="Petal Length", ylab="Sepal Length", main="Sepal vs Petal Length")
reg1 <- lm(iris$Sepal.Length ~ iris$Petal.Length)
abline(reg1)

plot(iris$Sepal.Width ~ iris$Petal.Width, xlab="Petal Width", ylab="Sepal Width", main="Sepal vs Petal Width")
reg2 <- lm(iris$Sepal.Width ~ iris$Petal.Width)
abline(reg2)

plot(iris$Sepal.Length ~ iris$Sepal.Width, xlab="Sepal Width", ylab="Sepal Length", main="Sepal Length vs Width")
reg3 <- lm(iris$Sepal.Length ~ iris$Sepal.Width)
abline(reg3)

plot(iris$Petal.Length ~ iris$Petal.Width, xlab="Petal Width", ylab="Petal Length", main="Petal Length vs Width")
reg4 <- lm(iris$Petal.Length ~ iris$Petal.Width)
abline(reg4)
Image in a Jupyter notebook
summary(reg1) # gives the summary for the regression model
Call:
lm(formula = iris$Sepal.Length ~ iris$Petal.Length)

Residuals:
     Min       1Q   Median       3Q      Max
-1.24675 -0.29657 -0.01515  0.27676  1.00269

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        4.30660    0.07839   54.94   <2e-16 ***
iris$Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4071 on 148 degrees of freedom
Multiple R-squared:  0.76,  Adjusted R-squared:  0.7583
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

I plotted the four regression models on the iris data. Sepal vs Petal Length and Petal Length vs Width appeared to have a positive, linear relationship, while the other two comparisons appeared moderately linear in a negative direction. You can use summary() to quantify each fit.
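A sketch of one way to quantify this (cor() is base R; reg1 is the model fitted above):

cor(iris$Sepal.Length, iris$Petal.Length)  # Pearson correlation coefficient
summary(reg1)$r.squared                    # R-squared, as in the summary above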

Pie Charts

In 1887 the Michelson–Morley experiment attempted to find variations in the speed of light due to the Earth's motion through the aether, which was believed at the time to be the medium through which light waves travelled. The data of this experiment are stored in the dataset morley in R.

Exercise 8: Plot a pie chart as above with sample sizes for the experiments in morley data.

morley_table <- table(morley$Expt)
morley_table
labs <- paste("Experiment", names(morley_table), "\n", morley_table, sep=" ")
labs
pie(morley_table, labels=labs, main="Sample size of each experiment \n in Morley's experiments, 1887")
 1  2  3  4  5
20 20 20 20 20
[1] "Experiment 1 \n 20" "Experiment 2 \n 20" "Experiment 3 \n 20" [4] "Experiment 4 \n 20" "Experiment 5 \n 20"
Image in a Jupyter notebook

Boxplot

For the morley data we can also use a boxplot, for example:

boxplot(morley$Speed ~ morley$Expt, col='light grey',
        xlab='Experiment #', ylab="speed (km/s - 299,000)",
        main="Michelson–Morley experiment")
mtext("speed of light data")
sol = 299792.458 - 299000  # deviation of the real speed of light from the estimated 299,000 km/s
abline(h=sol, col='red')
Image in a Jupyter notebook

The default behaviour is for the whiskers to extend out to the full range of the data, showing the extremes. Unless, that is, the extremes are too far away, in which case they are considered outliers and plotted as circles. For the upper limit, 'too far' is taken as the upper quartile + 1.5 × the interquartile range. So, in this case, 'too far' would be:

quantile(morley$Speed,prob=0.75)[["75%"]] + 1.5*IQR(morley$Speed)
quantile(morley$Speed,prob=0.75)
IQR(morley$Speed)
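A quick cross-check (a sketch using base R's boxplot.stats(), which applies the same 1.5 × IQR rule):

boxplot.stats(morley$Speed)$stats  # whisker ends, hinges, and median
boxplot.stats(morley$Speed)$out    # the points drawn as outlier circles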

Exercise 9: Using these examples calculate for the morley data all the quantiles, the IQR, the mean, the standard deviation (sd()). Repeat the same for each experiment. (tip: morley$Speed[morley$Expt==1]). Discuss findings in a markdown cell. What can you conclude?

Morley_calculations <- function(x){
  Q1 <- quantile(morley$Speed[morley$Expt==x], prob=0.25)
  Q2 <- quantile(morley$Speed[morley$Expt==x], prob=0.5)
  Q3 <- quantile(morley$Speed[morley$Expt==x], prob=0.75)
  IQR <- IQR(morley$Speed[morley$Expt==x])
  Avg <- mean(morley$Speed[morley$Expt==x])
  SD <- sd(morley$Speed[morley$Expt==x])
  Q <- c(Q1, Q2, Q3, IQR, Avg, SD, SD*2, IQR*1.5, Q3+(IQR*1.5), Avg+(SD*2))
  return(Q)
}
Label <- c("Q1","Q2","Q3","IQR","Mean","SD","2SD","1.5IQR","Max_deviation_IQR","Max_deviation_SD")
Morley <- data.frame(Label,
                     Morley_calculations(1), Morley_calculations(2),
                     Morley_calculations(3), Morley_calculations(4),
                     Morley_calculations(5))
colnames(Morley) <- c("Data", "Experiment 1", "Experiment 2", "Experiment 3", "Experiment 4", "Experiment 5")
Morley
                Data Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5
1                 Q1     850.0000    800.00000    840.00000    767.50000    807.50000
2                 Q2     940.0000    845.00000    855.00000    815.00000    810.00000
3                 Q3     980.0000    885.00000    880.00000    865.00000    870.00000
4                IQR     130.0000     85.00000     40.00000     97.50000     62.50000
5               Mean     909.0000    856.00000    845.00000    820.50000    831.50000
6                 SD     104.9260     61.16414     79.10686     60.04165     54.21934
7                2SD     209.8521    122.32829    158.21371    120.08330    108.43868
8             1.5IQR     195.0000    127.50000     60.00000    146.25000     93.75000
9  Max_deviation_IQR    1175.0000   1012.50000    940.00000   1011.25000    963.75000
10  Max_deviation_SD    1118.8521    978.32829   1003.21371    940.58330    939.93868

You can write a function that returns all the information about the morley data as a single vector, and organise the outputs of this function in a data frame to summarise all 5 experiments.

In the Morley data we can notice that the IQR is larger than the SD (except in experiment 4), while the means and medians are quite comparable across all 5 experiments (well within the SD and IQR). With further calculations, we find that the maximum deviation before a data point is considered an outlier is larger under the Q3 + 1.5 IQR rule than under the mean + 2 SD rule (again, except in experiment 4). The impact of outliers on this type of data is therefore very important, which calls for careful use of these statistical tools.

Histogram

Histograms are very important plots for estimating density distributions from observed data. In the Morley data we can look at the Speed measured across all experiments and plot a histogram of the measurements.

hist(morley$Speed)

We might want to make it prettier:

par(fg=rgb(0.6,0.6,0.6))
hist(morley$Speed, prob=F, col=rgb(0.9,0.9,0.9),
     main='Michelson-Morley Experiment ',
     ylab="Frequency", xlab='Difference from Speed of Light')
par(fg='black')

We can calculate the density of the data and plot onto the histogram. We will have:

par(fg=rgb(0.6,0.6,0.6))
hist(morley$Speed, prob=F, col=rgb(0.9,0.9,0.9),
     main='Michelson-Morley Experiment ',
     ylab="Frequency", xlab='Difference from Speed of Light')
par(fg='black')
lines(density(morley$Speed))
abline(v=mean(morley$Speed), col=rgb(0.5,0.5,0.5))
abline(v=median(morley$Speed), lty=3, col=rgb(0.5,0.5,0.5))
abline(v=mean(morley$Speed)+sd(morley$Speed), lty=2, col=rgb(0.7,0.7,0.7))
abline(v=mean(morley$Speed)-sd(morley$Speed), lty=2, col=rgb(0.7,0.7,0.7))
rug(morley$Speed)

Exercise 10: Plot a similar histogram for each experiment of the morley data. What do you conclude? Discuss it in a markdown cell.

par(mfrow=c(2,3))
expr <- matrix(c(1:20, 21:40, 41:60, 61:80, 81:100), nrow=20, ncol=5)  # row indices for each experiment (20 measurements each)
for (i in 1:5){
  x <- expr[,i]
  hist(morley$Speed[x], prob=F,
       main=paste("Michelson-Morley Experiment", i),
       ylab="Frequency", xlab="Difference from Speed of Light")
  lines(density(morley$Speed[expr[,i]]), col="black")
  abline(v=mean(morley$Speed[expr[,i]]), col="tomato", lwd=2)
  abline(v=median(morley$Speed[expr[,i]]), lty=3, col="purple", lwd=2)
  abline(v=mean(morley$Speed[expr[,i]])+sd(morley$Speed[expr[,i]]), lty=2, col="navyblue", lwd=2)
  abline(v=mean(morley$Speed[expr[,i]])-sd(morley$Speed[expr[,i]]), lty=2, col="navyblue", lwd=2)
  rug(morley$Speed[expr[,i]])
}
Image in a Jupyter notebook
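As a side note, an equivalent sketch using the logical subsetting from the exercise tip (morley$Expt == i) instead of a hand-built index matrix:

par(mfrow = c(2, 3))
for (i in 1:5){
  speed_i <- morley$Speed[morley$Expt == i]  # the 20 measurements of experiment i
  hist(speed_i, prob = F, main = paste("Michelson-Morley Experiment", i),
       ylab = "Frequency", xlab = "Difference from Speed of Light")
  abline(v = mean(speed_i), col = "tomato", lwd = 2)
  abline(v = median(speed_i), lty = 3, col = "purple", lwd = 2)
}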

Common distributions

Exercise 11: In R it is possible to generate more than one set of data from known probability distributions, changing the parameters accordingly. Using the commands:

  • rnorm() -- normal distribution

  • runif() -- uniform distribution

  • rbinom() -- binomial distribution

Use these commands to generate the data and calculate all the descriptive statistics. With the help of hist() and density(), discuss what you have found.

size <- 10000
names <- c("norm","unif","binom")
norm <- rnorm(size)
unif <- runif(size)
binom <- rbinom(size, 2, 0.5)
dist <- matrix(c(norm, unif, binom), nrow=size, ncol=3)
par(mfrow=c(1,3))
for (i in 1:3){
  x <- dist[,i]
  hist(x, main=paste(names[i], " distribution"), ylab="Frequency", xlab="Value")
  lines(density(x), col="black")
  abline(v=mean(x), col="tomato", lwd=2)
  abline(v=median(x), lty=3, col="purple", lwd=2)
  abline(v=mean(x)+sd(x), lty=2, col="navyblue", lwd=2)
  abline(v=mean(x)-sd(x), lty=2, col="navyblue", lwd=2)
  rug(x)
}
Image in a Jupyter notebook

I used 10000 samples to generate each histogram, to better estimate the density function. For all three sets of data the mean and median were nearly identical, suggesting that the distributions are symmetrical. unif has bars of similar heights, as expected for a uniformly distributed population. If you increase the number of trials in the binomial distribution, it approximates a Gaussian. The rugs under each plot also give a useful visual impression of where the samples fall.
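The exercise also asks for the descriptive statistics themselves; a minimal sketch computing them column by column on the dist matrix generated above:

apply(dist, 2, mean)      # means of the norm, unif, and binom samples
apply(dist, 2, median)
apply(dist, 2, sd)
apply(dist, 2, IQR)
apply(dist, 2, quantile)  # all quartiles at once, one column per distribution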

T Test

The t-test is a way to compare sets of data that share a common variance and whose unknown distributions can be approximated by a Normal distribution. You can perform a t-test in R using the command t.test(). Explore it with the R help.

In the case of the morley data we can perform a t-test comparing Experiment 1 versus Experiment 2 using the following syntax: t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==2])

Exercise 12: Using the morley data perform a t-test for each experiment and compare the results. What can you conclude from this dataset? Use the markdown cell to explain the results and discuss your conclusions.

Expt_1_2 <- t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==2])
Expt_1_3 <- t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==3])
Expt_1_4 <- t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==4])
Expt_1_5 <- t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==5])
Expt_2_3 <- t.test(morley$Speed[morley$Expt==2], morley$Speed[morley$Expt==3])
Expt_2_4 <- t.test(morley$Speed[morley$Expt==2], morley$Speed[morley$Expt==4])
Expt_2_5 <- t.test(morley$Speed[morley$Expt==2], morley$Speed[morley$Expt==5])
Expt_3_4 <- t.test(morley$Speed[morley$Expt==3], morley$Speed[morley$Expt==4])
Expt_3_5 <- t.test(morley$Speed[morley$Expt==3], morley$Speed[morley$Expt==5])
Expt_4_5 <- t.test(morley$Speed[morley$Expt==4], morley$Speed[morley$Expt==5])
Expt_1_2
P_values <- c(Expt_1_2$p.value, Expt_1_3$p.value, Expt_1_4$p.value, Expt_1_5$p.value,
              Expt_2_3$p.value, Expt_2_4$p.value, Expt_2_5$p.value,
              Expt_3_4$p.value, Expt_3_5$p.value, Expt_4_5$p.value)
Morley_TTest <- matrix(P_values, nrow=10, ncol=1)
Morley_TTest
	Welch Two Sample t-test

data:  morley$Speed[morley$Expt == 1] and morley$Speed[morley$Expt == 2]
t = 1.9516, df = 30.576, p-value = 0.0602
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -2.419111 108.419111
sample estimates:
mean of x mean of y
      909       856
             [,1]
 [1,] 0.060200496
 [2,] 0.036157418
 [3,] 0.002658854
 [4,] 0.006537757
 [5,] 0.625755649
 [6,] 0.071762251
 [7,] 0.188155833
 [8,] 0.277349303
 [9,] 0.533261995
[10,] 0.546788293
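As a side note, the ten pairwise tests can also be generated programmatically; a sketch using base R's combn():

pairs <- combn(1:5, 2)   # all 10 pairs of experiment numbers
P_values <- apply(pairs, 2, function(p)
  t.test(morley$Speed[morley$Expt == p[1]],
         morley$Speed[morley$Expt == p[2]])$p.value)
P_values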

The p-values are not consistent across the comparisons, and therefore the hypothesis that the presence of the aether does not change the speed of light when travelling at different angles is accepted.

The experiment concluded that there was no difference between the speed of light in the direction of movement through the presumed aether and the speed at right angles to it.