
© Copyright 2016 Dr Marta Milo, University of Sheffield.

Week 2 Practical - Solutions

This Notebook contains the solutions to the practical assignments for Week 2.

These are solutions to the assigned exercises, but they are not unique. All the proposed exercises can be solved in different ways, some more efficient than others, but all are correct if the requested output is obtained.

For this reason, when you compare your solutions with these, bear in mind that you might have found the solution using a different algorithm and different commands; nonetheless they are still correct. These solutions are a guide and also a source of ideas that can be merged with yours to build and improve your scripts.

To help you evaluate your notebooks, please use the following criteria:

  • Overall clarity (score = 0.25)

  • Correctness of the code (score = 0.25)

  • Exhaustive cover of required analysis (score= 0.25)

  • Interpretation of the results (score = 0.25)

These criteria are also used when peer-marking and self-marking. Please read the guidelines given to quantify the above criteria with a grade. Make sure that you score interpretation with a full mark if it is exhaustive and clear.

Data Frames

Exercise 1: Create another field, for a second match score, in the game_cards data frame and access each field separately to store them into vectors. Use the markdown cell to describe your steps. (Tip: to access a data frame field you use `$`, i.e. `game_cards$names`, as below.)

Please note that when creating data frames R turns names into factors. R does this by default when creating data frames from string vectors. We use stringsAsFactors=FALSE to avoid this.
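A quick illustrative sketch (not part of the original solution): in versions of R before 4.0, character vectors were converted to factors by default; since R 4.0 the default is already stringsAsFactors = FALSE.

df <- data.frame(x = c("a", "b"))                             # old default: x becomes a factor
class(df$x)
df2 <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)  # x stays character
class(df2$x)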

names <- c('Bob','Claire','Luisa','Matt','Marta','Mike')
score <- c(34,82,59,72,50,100)
game_cards <- data.frame(names, score, stringsAsFactors=FALSE)
game_cards
   names score
1    Bob    34
2 Claire    82
3  Luisa    59
4   Matt    72
5  Marta    50
6   Mike   100
game_cards$score2 <- c(35,70,60,70,55,100)
game_cards
   names score score2
1    Bob    34     35
2 Claire    82     70
3  Luisa    59     60
4   Matt    72     70
5  Marta    50     55
6   Mike   100    100

We can add a new field to the data structure by using a new column name with the syntax game_cards$. We can also rename the columns using colnames(). We can then use the same syntax to store the content of each column in vectors.

score1 <- game_cards$score
score2 <- game_cards$score2
score1
score2
[1] 34 82 59 72 50 100
[1] 35 70 60 70 55 100

Exercise 2: Do a simple analysis of the data frame that you have created.

  • First rename the columns into match1 and match2 (Tip: find out about colnames())

  • Calculate the dimensions of the frame, the overall minimum score and overall maximum score, and the minimum and maximum scores for match1 and match2. Who were the players associated with those?

  • Order the scores for each match using the nested command game_cards[order(game_cards$score),]. Why do we need to use the function order() in this way? What is its output? Use the markdown cell to explain.

Step 1: rename columns

colnames(game_cards) <- c("Names","match1","match2")
print(game_cards)
   Names match1 match2
1    Bob     34     35
2 Claire     82     70
3  Luisa     59     60
4   Matt     72     70
5  Marta     50     55
6   Mike    100    100

Step 2: Calculate the dimensions of the frame, the overall minimum score and overall maximum score, and the minimum and maximum scores for match1 and match2. Identify the players associated with them.

omin <- min(game_cards[,2:3])
omax <- max(game_cards[,2:3])  # min and max are only defined on numerical values, so we exclude the names column
print(omin)
print(omax)

match1_min <- min(game_cards$match1)
match1_max <- max(game_cards$match1)
print(paste("first match min:", match1_min, sep=" "))
ind <- which(game_cards[,2] == match1_min, arr.ind=TRUE)
print(paste("Player:", game_cards$Names[ind]))
print(paste("first match max:", match1_max, sep=" "))
ind <- which(game_cards[,2] == match1_max, arr.ind=TRUE)
print(paste("Player:", game_cards$Names[ind]))

match2_min <- min(game_cards$match2)
match2_max <- max(game_cards$match2)
print(paste("second match min:", match2_min, sep=" "))
ind <- which(game_cards[,3] == match2_min, arr.ind=TRUE)
print(paste("Player:", game_cards$Names[ind]))
print(paste("second match max:", match2_max, sep=" "))
ind <- which(game_cards[,3] == match2_max, arr.ind=TRUE)
print(paste("Player:", game_cards$Names[ind]))
[1] 34
[1] 100
[1] "first match min: 34"
[1] "Player: Bob"
[1] "first match max: 100"
[1] "Player: Mike"
[1] "second match min: 35"
[1] "Player: Bob"
[1] "second match max: 100"
[1] "Player: Mike"

Step 3: order scores

order_match1 <- game_cards[order(game_cards$match1),]
order_match1
   Names match1 match2
1    Bob     34     35
5  Marta     50     55
3  Luisa     59     60
4   Matt     72     70
2 Claire     82     70
6   Mike    100    100
order_match2 <- game_cards[order(game_cards$match2),]
order_match2
   Names match1 match2
1    Bob     34     35
5  Marta     50     55
3  Luisa     59     60
2 Claire     82     70
4   Matt     72     70
6   Mike    100    100

The function order() gives as output the indices of the sorted elements; for this reason we place those indices (the output of order()) as the row index of game_cards.
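To make this concrete, a small sketch (using the data frame above) of what order() returns:

order(game_cards$match1)
# [1] 1 5 3 4 2 6 -- the row indices that sort match1 in ascending order
game_cards[order(game_cards$match1), ]  # using them as a row index reorders the whole frame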

Visualise the data

To visualise the data we use graphs that can be generated in different ways. The command plot() is a general graphical function that enables data plotting.

Exercise 3: Explore plot() with the R help and in a markdown command describe its characteristics and how we can change/add features.

We can also subdivide the plotting area into different blocks to enable adjacent plots. We can do this using the command par(mfrow=c(nr,nc)), where nr is the number of rows and nc the number of columns. For example, to plot two bar charts on the same row we use par(mfrow=c(1,2)), as in the sketch below.
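A minimal sketch of this layout (illustrative only, with placeholder data):

par(mfrow = c(1, 2))   # 1 row, 2 columns of plots
plot(1:10, main = "left panel")
plot(10:1, main = "right panel")
par(mfrow = c(1, 1))   # reset to a single plotting area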

?plot()

plot() is a generic function for plotting R objects. It is a basic function used for generic X-Y plotting. A simple syntax is plot(x, y, ...), where x gives the x coordinates of the points in the plot and y the y coordinates. The ... gives room for further arguments, such as graphical parameters, the plot type, titles, and axis labels.

The parameters of the plot can be defined using the par() function, which can be used within the call to plot() or set in the environment, as described in the example.

Some of the graphical parameters are: type is used to define the plot type, for instance "p" for points, "l" for lines, and "h" for histogram-like vertical lines.

main is used to display the title of the graph (sub for subtitle).

xlab and ylab are used to label the x- and y- axes, respectively.
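As an illustration (a sketch with made-up data, not part of the original solution), combining these arguments:

x <- 1:10
y <- x^2
plot(x, y, type = "l",                       # draw a line instead of points
     main = "Example plot", sub = "a sub-title",
     xlab = "x values", ylab = "y = x squared")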

Exercise 4: The command par() implements a variety of more or less complex settings. Explore the par() command and in a markdown cell write a brief summary of settings that you think might be useful when presenting data.

There are a variety of graphical parameters that can be adjusted. Parameters can be set by passing arguments in tag = value form, or as a list of tagged values. Some useful adjustments to help display data (demonstrated in the sketch after this list) include:

adj: a value between 0 and 1 specifying how text is justified; 0 for left-justified, 0.5 for centred, 1 for right-justified.

ann: annotation. The default is to annotate axis titles and the main title; if set to FALSE, there are no annotations.

bg: to specify the colours to be used in the background.

cex, cex.axis, cex.lab, cex.main, cex.sub: magnification of plotted text, with 1 as the default magnification value. Values less than 1 make the text smaller.

col, col.axis, col.lab, col.main, col.sub: specifies the default colour of the plot. Default is black.

font, font.axis, font.lab, font.main, font.sub: specifies which font to use; is an integer. 1 = plain default text, 2 = bold, 3 = italic, 4 = bold italic.

lab: a numerical vector c(x, y, len) that modifies the default axis annotation. The values x and y give the approximate number of tick marks on the x and y axes, while len specifies the label length. Default is c(5, 5, 7).

las: style of axis labels; 0 = always parallel to axis (default); 1 = always horizontal; 2 = always perpendicular to axis; 3 = always vertical.

lty: specifies line type. Can either be an integer (0 = blank, 1 = solid (default), 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash), or as character strings blank, solid, etc.

lwd: line width. Positive number. Default = 1.

mfcol, mfrow: a vector c(nr, nc); draws subsequent figures in an nr-by-nc array, filled by column or by row respectively.

pin: The current plot dimensions, (width, height), in inches.

xlog, ylog: Defaults to FALSE (linear scale). If TRUE, uses a logarithmic scale.
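A short sketch (illustrative, with placeholder data) combining a few of the settings above:

par(mfrow = c(1, 2),       # two panels on one row
    cex.main = 1.2,        # slightly larger titles
    las = 1,               # horizontal axis labels
    col.lab = "navyblue")  # coloured axis labels
plot(1:10, type = "l", lty = 2, lwd = 2, main = "dashed, thick line")
plot(1:10, type = "p", main = "points")
par(mfrow = c(1, 1))       # reset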

Exercise 5: What is the scatter plot useful for? Plot score against itself. What do you get and why? Explore the data frame you created in Exercise 1. Explain in a markdown cell.

The scatter plot is useful to visualise the relationship between two variables; usually the independent variable is on the x axis and the dependent variable on the y axis. Plotting score against score yields points that form a 45-degree line, a perfectly linear relationship, because x = y for every point.

score <- c(34,82,59,72,50,100)
#score1 <- c(24,32,69,56,45,90)
plot(score, score)
abline(0,1)  # draws the 45-degree line
Image in a Jupyter notebook
game_cards_plot <- t(game_cards[-1])
colnames(game_cards_plot) <- game_cards$Names
game_cards_plot
barplot(game_cards_plot, beside=TRUE,
        xlab="Player", ylab="Score",
        main="Results for card games",
        ylim=c(0,130),
        legend=c("Match 1", "Match 2"),
        col=c("lightblue","red"))
       Bob Claire Luisa Matt Marta Mike
match1  34     82    59   72    50  100
match2  35     70    60   70    55  100
Image in a Jupyter notebook

The data from Exercise 1 can be presented with a grouped bar plot instead of a scatter plot, since it would allow for a visual comparison of each player's scores for both games.

To create the bar plot above, I first converted my data frame into a matrix, since height in barplot() can only be a vector or a matrix. Please note that the matrix has dimensions 2x6, as opposed to the 6x2 of the data frame: the function barplot() treats each column as a group of bars. I set beside to TRUE to obtain grouped bars instead of stacked bars. Simple labelling was achieved through main, xlab, and ylab. The legend defaults to the row names, which here distinguish the two matches. Finally, ylim was set above 100, the theoretical maximum score in my card games, to make room for the legend box.

Exercise 6: Explore the iris data available in R. In the same print area plot two scatter plots, of Sepal Length versus Petal Length and of Sepal Width versus Petal Width. What do you find?

par(mfrow=c(1,2))
plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", ylab="Petal Length")
plot(iris$Sepal.Width, iris$Petal.Width, xlab="Sepal Width", ylab="Petal Width")
Image in a Jupyter notebook

From the scatter plots it appears that both pairs of variables exhibit a positive correlation. Also, the data are separated into distinct classes that exhibit different levels of correlation. The bigger cluster is more positively correlated than the smaller cluster.

Exercise 7: Explore the iris dataset with linear regression and explain in a markdown cell your findings. Add titles and axis legends to the plots. Explore different colors and markers.

par(mfrow=c(2,2), bg="seashell", fg="navyblue")

plot(iris$Sepal.Length ~ iris$Petal.Length, xlab="Petal Length", ylab="Sepal Length", main="Sepal vs Petal Length")
reg1 <- lm(iris$Sepal.Length ~ iris$Petal.Length)
abline(reg1)

plot(iris$Sepal.Width ~ iris$Petal.Width, xlab="Petal Width", ylab="Sepal Width", main="Sepal vs Petal Width")
reg2 <- lm(iris$Sepal.Width ~ iris$Petal.Width)
abline(reg2)

plot(iris$Sepal.Length ~ iris$Sepal.Width, xlab="Sepal Width", ylab="Sepal Length", main="Sepal Length vs Width")
reg3 <- lm(iris$Sepal.Length ~ iris$Sepal.Width)
abline(reg3)

plot(iris$Petal.Length ~ iris$Petal.Width, xlab="Petal Width", ylab="Petal Length", main="Petal Length vs Width")
reg4 <- lm(iris$Petal.Length ~ iris$Petal.Width)
abline(reg4)
Image in a Jupyter notebook
summary(reg1) # gives the summary for the regression model
Call:
lm(formula = iris$Sepal.Length ~ iris$Petal.Length)

Residuals:
     Min       1Q   Median       3Q      Max
-1.24675 -0.29657 -0.01515  0.27676  1.00269

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        4.30660    0.07839   54.94   <2e-16 ***
iris$Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4071 on 148 degrees of freedom
Multiple R-squared:  0.76,  Adjusted R-squared:  0.7583
F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

I plotted the four regression models on the iris data. Sepal vs Petal Length and Petal Length vs Width appeared to have a positive, linear relationship, while the other two comparisons appeared moderately linear in a negative direction. You can use summary() to quantify each fit.
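A sketch of one way to quantify this (cor() is base R; reg1 is the model fitted above):

cor(iris$Sepal.Length, iris$Petal.Length)  # Pearson correlation coefficient
summary(reg1)$r.squared                    # R-squared, as in the summary above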

Pie Charts

In 1887 the Michelson–Morley experiment attempted to find variations in the speed of light due to the Earth's motion through the aether, which was believed at the time to be the medium through which light waves travelled. The data of this experiment are stored in the dataset morley in R.

Exercise 8: Plot a pie chart as above with sample sizes for the experiments in morley data.

morley_table <- table(morley$Expt)
morley_table
labs <- paste("Experiment", names(morley_table), "\n", morley_table, sep=" ")
labs
pie(morley_table, labels=labs, main="Sample size of each experiment \n in Morley's experiments, 1887")
 1  2  3  4  5
20 20 20 20 20
[1] "Experiment 1 \n 20" "Experiment 2 \n 20" "Experiment 3 \n 20" [4] "Experiment 4 \n 20" "Experiment 5 \n 20"
Image in a Jupyter notebook

Boxplot

For the morley data we can also use a boxplot, for example:

boxplot(morley$Speed ~ morley$Expt, col='light grey',
        xlab='Experiment #', ylab="speed (km/s - 299,000)",
        main="Michelson–Morley experiment")
mtext("speed of light data")
sol = 299792.458 - 299000  # deviation of the real speed of light from the estimated 299,000 km/s
abline(h=sol, col='red')
Image in a Jupyter notebook

The default behaviour is for the whiskers to extend out to the full range of the data, showing the extremes. Unless, that is, the extremes are too far away, in which case they are considered outliers and plotted as circles. For the upper limit, 'too far' is taken as the upper quartile + 1.5 × the interquartile range. So, in this case, 'too far' would be:

quantile(morley$Speed,prob=0.75)[["75%"]] + 1.5*IQR(morley$Speed)
quantile(morley$Speed,prob=0.75)
IQR(morley$Speed)
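A quick cross-check (a sketch using base R's boxplot.stats(), which applies the same 1.5 × IQR rule):

boxplot.stats(morley$Speed)$stats  # whisker ends, hinges, and median
boxplot.stats(morley$Speed)$out    # the points drawn as outlier circles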

Exercise 9: Using these examples calculate for the morley data all the quantiles, the IQR, the mean, the standard deviation (sd()). Repeat the same for each experiment. (tip: morley$Speed[morley$Expt==1]). Discuss findings in a markdown cell. What can you conclude?

Morley_calculations <- function(x){
  Q1 <- quantile(morley$Speed[morley$Expt==x], prob=0.25)
  Q2 <- quantile(morley$Speed[morley$Expt==x], prob=0.5)
  Q3 <- quantile(morley$Speed[morley$Expt==x], prob=0.75)
  IQR <- IQR(morley$Speed[morley$Expt==x])
  Avg <- mean(morley$Speed[morley$Expt==x])
  SD <- sd(morley$Speed[morley$Expt==x])
  Q <- c(Q1, Q2, Q3, IQR, Avg, SD, SD*2, IQR*1.5, Q3+(IQR*1.5), Avg+(SD*2))
  return(Q)
}
Label <- c("Q1","Q2","Q3","IQR","Mean","SD","2SD","1.5IQR","Max_deviation_IQR","Max_deviation_SD")
Morley <- data.frame(Label,
                     Morley_calculations(1), Morley_calculations(2),
                     Morley_calculations(3), Morley_calculations(4),
                     Morley_calculations(5))
colnames(Morley) <- c("Data", "Experiment 1", "Experiment 2", "Experiment 3", "Experiment 4", "Experiment 5")
Morley
                Data Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5
1                 Q1     850.0000    800.00000    840.00000    767.50000    807.50000
2                 Q2     940.0000    845.00000    855.00000    815.00000    810.00000
3                 Q3     980.0000    885.00000    880.00000    865.00000    870.00000
4                IQR     130.0000     85.00000     40.00000     97.50000     62.50000
5               Mean     909.0000    856.00000    845.00000    820.50000    831.50000
6                 SD     104.9260     61.16414     79.10686     60.04165     54.21934
7                2SD     209.8521    122.32829    158.21371    120.08330    108.43868
8             1.5IQR     195.0000    127.50000     60.00000    146.25000     93.75000
9  Max_deviation_IQR    1175.0000   1012.50000    940.00000   1011.25000    963.75000
10  Max_deviation_SD    1118.8521    978.32829   1003.21371    940.58330    939.93868

You can write a function that returns all the information about the morley data as a single vector, and organise the outputs of this function in a data frame to summarise all 5 experiments.

In the Morley data we can notice that the IQR is larger than the SD (except in experiment 4), while the means and medians are quite comparable across all 5 experiments (well within the SD and IQR). With further calculations, we find that the maximum deviation before a data point is considered an outlier is larger under the Q3 + 1.5 IQR rule than under the mean + 2 SD rule (again, except in experiment 4). The impact of outliers on this type of data is therefore very important, which calls for careful use of these statistical tools.

Histogram

Histograms are very important plots for estimating density distributions from observed data. In the Morley data we can look at the Speed measured across all experiments and plot a histogram of the measurements.

hist(morley$Speed)

We might want to make it prettier:

par(fg=rgb(0.6,0.6,0.6))
hist(morley$Speed, prob=F, col=rgb(0.9,0.9,0.9),
     main='Michelson-Morley Experiment ',
     ylab="Frequency", xlab='Difference from Speed of Light')
par(fg='black')

We can calculate the density of the data and plot onto the histogram. We will have:

par(fg=rgb(0.6,0.6,0.6))
hist(morley$Speed, prob=F, col=rgb(0.9,0.9,0.9),
     main='Michelson-Morley Experiment ',
     ylab="Frequency", xlab='Difference from Speed of Light')
par(fg='black')
lines(density(morley$Speed))
abline(v=mean(morley$Speed), col=rgb(0.5,0.5,0.5))
abline(v=median(morley$Speed), lty=3, col=rgb(0.5,0.5,0.5))
abline(v=mean(morley$Speed)+sd(morley$Speed), lty=2, col=rgb(0.7,0.7,0.7))
abline(v=mean(morley$Speed)-sd(morley$Speed), lty=2, col=rgb(0.7,0.7,0.7))
rug(morley$Speed)

Exercise 10: Plot a similar histogram for each experiment of the morley data. What do you conclude? Discuss it in a markdown cell.

par(mfrow=c(2,3))
expr <- matrix(c(1:20, 21:40, 41:60, 61:80, 81:100), nrow=20, ncol=5)  # row indices for each experiment (20 measurements each)
for (i in 1:5){
  x <- expr[,i]
  hist(morley$Speed[x], prob=F,
       main=paste("Michelson-Morley Experiment", i),
       ylab="Frequency", xlab="Difference from Speed of Light")
  lines(density(morley$Speed[expr[,i]]), col="black")
  abline(v=mean(morley$Speed[expr[,i]]), col="tomato", lwd=2)
  abline(v=median(morley$Speed[expr[,i]]), lty=3, col="purple", lwd=2)
  abline(v=mean(morley$Speed[expr[,i]])+sd(morley$Speed[expr[,i]]), lty=2, col="navyblue", lwd=2)
  abline(v=mean(morley$Speed[expr[,i]])-sd(morley$Speed[expr[,i]]), lty=2, col="navyblue", lwd=2)
  rug(morley$Speed[expr[,i]])
}
Image in a Jupyter notebook
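As a side note, an equivalent sketch using the logical subsetting from the exercise tip (morley$Expt == i) instead of a hand-built index matrix:

par(mfrow = c(2, 3))
for (i in 1:5){
  speed_i <- morley$Speed[morley$Expt == i]  # the 20 measurements of experiment i
  hist(speed_i, prob = F, main = paste("Michelson-Morley Experiment", i),
       ylab = "Frequency", xlab = "Difference from Speed of Light")
  abline(v = mean(speed_i), col = "tomato", lwd = 2)
  abline(v = median(speed_i), lty = 3, col = "purple", lwd = 2)
}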

Common distributions

Exercise 11: In R it is possible to generate more than one set of data from known probability distributions, changing the parameters accordingly. Using the commands:

  • rnorm() -- normal distribution

  • runif() -- uniform distribution

  • rbinom() -- binomial distribution

Use these commands to generate the data and calculate all the descriptive statistics. With the help of hist() and density(), discuss what you have found.

size <- 10000
names <- c("norm","unif","binom")
norm <- rnorm(size)
unif <- runif(size)
binom <- rbinom(size, 2, 0.5)
dist <- matrix(c(norm, unif, binom), nrow=size, ncol=3)
par(mfrow=c(1,3))
for (i in 1:3){
  x <- dist[,i]
  hist(x, main=paste(names[i], " distribution"), ylab="Frequency", xlab="Value")
  lines(density(x), col="black")
  abline(v=mean(x), col="tomato", lwd=2)
  abline(v=median(x), lty=3, col="purple", lwd=2)
  abline(v=mean(x)+sd(x), lty=2, col="navyblue", lwd=2)
  abline(v=mean(x)-sd(x), lty=2, col="navyblue", lwd=2)
  rug(x)
}
Image in a Jupyter notebook

I used 10000 samples to generate each histogram, to better estimate the density function. For all three sets of data the mean and median were nearly identical, suggesting that the distributions are symmetrical. unif has bars of similar heights, as expected for a uniformly distributed population. If you increase the number of trials in the binomial distribution, it approximates a Gaussian. The rugs under each plot also give a useful visual impression of where the samples fall.
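The exercise also asks for the descriptive statistics themselves; a minimal sketch computing them column by column on the dist matrix generated above:

apply(dist, 2, mean)      # means of the norm, unif, and binom samples
apply(dist, 2, median)
apply(dist, 2, sd)
apply(dist, 2, IQR)
apply(dist, 2, quantile)  # all quartiles at once, one column per distribution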

T Test

The t-test is a way to compare sets of data that share a common variance and whose unknown distributions can be approximated by a Normal distribution. You can perform a t-test in R using the command t.test(). Explore it with the R help.

In the case of the morley data we can perform a t-test comparing Experiment 1 versus Experiment 2 using the following syntax: t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==2])

Exercise 12: Using the morley data perform a t-test for each experiment and compare the results. What can you conclude from this dataset? Use the markdown cell to explain the results and discuss your conclusions.

Expt_1_2 <- t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==2])
Expt_1_3 <- t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==3])
Expt_1_4 <- t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==4])
Expt_1_5 <- t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==5])
Expt_2_3 <- t.test(morley$Speed[morley$Expt==2], morley$Speed[morley$Expt==3])
Expt_2_4 <- t.test(morley$Speed[morley$Expt==2], morley$Speed[morley$Expt==4])
Expt_2_5 <- t.test(morley$Speed[morley$Expt==2], morley$Speed[morley$Expt==5])
Expt_3_4 <- t.test(morley$Speed[morley$Expt==3], morley$Speed[morley$Expt==4])
Expt_3_5 <- t.test(morley$Speed[morley$Expt==3], morley$Speed[morley$Expt==5])
Expt_4_5 <- t.test(morley$Speed[morley$Expt==4], morley$Speed[morley$Expt==5])
Expt_1_2
P_values <- c(Expt_1_2$p.value, Expt_1_3$p.value, Expt_1_4$p.value, Expt_1_5$p.value,
              Expt_2_3$p.value, Expt_2_4$p.value, Expt_2_5$p.value,
              Expt_3_4$p.value, Expt_3_5$p.value, Expt_4_5$p.value)
Morley_TTest <- matrix(P_values, nrow=10, ncol=1)
Morley_TTest
	Welch Two Sample t-test

data:  morley$Speed[morley$Expt == 1] and morley$Speed[morley$Expt == 2]
t = 1.9516, df = 30.576, p-value = 0.0602
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -2.419111 108.419111
sample estimates:
mean of x mean of y
      909       856
             [,1]
 [1,] 0.060200496
 [2,] 0.036157418
 [3,] 0.002658854
 [4,] 0.006537757
 [5,] 0.625755649
 [6,] 0.071762251
 [7,] 0.188155833
 [8,] 0.277349303
 [9,] 0.533261995
[10,] 0.546788293
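As a side note, the ten pairwise tests can also be generated programmatically; a sketch using base R's combn():

pairs <- combn(1:5, 2)   # all 10 pairs of experiment numbers
P_values <- apply(pairs, 2, function(p)
  t.test(morley$Speed[morley$Expt == p[1]],
         morley$Speed[morley$Expt == p[2]])$p.value)
P_values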

The p-values are not consistent across the comparisons, and therefore the hypothesis that the presence of the aether does not change the speed of light when travelling at different angles is accepted.

The experiment concluded that there was no difference between the speed of light in the direction of movement through the presumed aether and the speed at right angles to it.