Project: Jingyi Xie - Autumn2016/BMS353

Kernel: R

Week 2 Practical

This Notebook contains practical assignments for Week 2.

It contains guidance on how to perform some basic plotting and implement some basic statistics in R. It also contains practical tasks that you will have to implement yourself. Most of the data we will use is stored within R; any data that is not will be provided in the folder Week 2.

You are free to base your work on the examples given here but you are also welcome to use different methods if you prefer, adding and/or creating new ways of displaying the data. You will need to add descriptions of what you have done in the assigned tasks and a comment on the results obtained using Markdown cells.

You will need to create a new notebook in the Week 2 folder of your SageMathCloud account and name it username_week2.ipynb, replacing username with your own username. The notebooks will be self-marked following a set of guidelines that you will receive with a notebook that contains the solutions to the exercises. THIS is FORMATIVE feedback that you can use to improve your coding skills.

The last version of your notebook saved by the deadlines indicated on the module website will be the one considered for self-marking. It will be moved to your assignment folder, where you will find the guidelines and the solved notebook.

All the notebooks are meant to be used interactively. All the code needs to be written into code cells, which can be executed by an R kernel. The outputs are not present in this notebook, but the code cells are executable.

Reminder: You can edit a code cell by clicking into it, and execute its code by pressing SHIFT and ENTER simultaneously. You can run all code cells at once by clicking on Cell in the menu bar above and choosing Run All.

Accessing the data

For this practical we will mainly be using the data sets that are provided within R. This is because the Week 2 practical is intended to provide guidance on how to use some basic commands for statistics and for plotting. You are free to use other appropriate data from your own work, or data that you have uploaded into your R workspace. If you do so you MUST describe the data set in a markdown cell and comment on it.

To see which data sets are stored in R, you can use the command data().

In [ ]:

data()
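Once a data set has been picked from that list, a few base R functions give a quick overview of it. A minimal sketch, assuming the built-in iris data (used later in this practical) as the example:

```r
# iris ships with base R (package 'datasets'), so it is available without loading anything
head(iris)      # first six rows
str(iris)       # column names and types
dim(iris)       # number of rows and number of columns
summary(iris)   # per-column summary statistics
```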

Data Frames

Before starting to use the data sets we need to introduce the concept of data frames. Data frames are structures that we use when we do any type of data analysis in R.

Data frames are a type of object in R with a structure that accommodates both character and numerical values. We can create data frames from vectors and "force" matrices into data frames. For example, if you want to store the results of a game of cards with six players in a data frame you can use the following:

In [1]:

names<- c('Bob','Claire','Luisa','Matt','Marta','Mike')
score<- c(34,82,59,72,50,100)

game_cards<- data.frame(names,score,stringsAsFactors=FALSE)
game_cards

  names  score
Bob     34  
Claire  82  
Luisa   59  
Matt    72  
Marta   50  
Mike   100  

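Please note that in versions of R before 4.0.0, data.frame() turns string vectors such as names into factors by default. We use stringsAsFactors=FALSE to keep them as character strings. From R 4.0.0 onwards FALSE is already the default, so the argument is harmless but no longer required.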
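The effect of stringsAsFactors can be checked directly by building the same data frame both ways. This is only a sketch: the vectors player and points are hypothetical names chosen for illustration, not part of the exercises.

```r
# Hypothetical illustration vectors (not the exercise data)
player <- c('Bob', 'Claire', 'Luisa')
points <- c(34, 82, 59)

as_text   <- data.frame(player, points, stringsAsFactors = FALSE)
as_factor <- data.frame(player, points, stringsAsFactors = TRUE)

class(as_text$player)     # "character"
class(as_factor$player)   # "factor"
levels(as_factor$player)  # the distinct values stored by the factor
```

Setting the argument explicitly, as in both calls here, gives the same result in any version of R.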

Exercise 1: Add another field, holding the scores of a second match, to the game_cards data frame, then access each field separately and store them in vectors. Use a markdown cell to describe your steps (Tip: to access a data frame field you use $, i.e. `game_cards$names` as below).

In [ ]:

game_cards$names

Exercise 2: Do a simple analysis of the data frame that you have created.

First rename the score columns to match1 and match2 (Tip: find out about colnames()).
Calculate the dimensions of the data frame, the overall minimum score and the overall maximum score, and then the minimum and maximum score for match1 and for match2. Which players are associated with those scores?
Order the scores for each match using the nested command game_cards[order(game_cards$score),] (replacing score with the column you want to order by). Why do we need to use the function order() in this way? What is its output? Use a markdown cell to explain.
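As a starting point for thinking about order(), the sketch below applies it to a small hypothetical vector and data frame (x and df are illustration names, not the exercise data).

```r
x <- c(30, 10, 20)   # hypothetical values
order(x)             # positions of the elements, smallest first: 2 3 1
x[order(x)]          # indexing with those positions gives the sorted values
sort(x)              # same sorted values, but the position information is lost

df <- data.frame(id = c('a', 'b', 'c'), value = x, stringsAsFactors = FALSE)
df[order(df$value), ]                     # whole rows rearranged by value
df[order(df$value, decreasing = TRUE), ]  # largest value first
```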

Visualise the data

To visualise the data we use graphs that can be generated in different ways. The command plot() is a general graphical function that enables data plotting.

Exercise 3: Explore plot() with the R help and in a markdown cell describe its characteristics and how we can change/add features.

We can also subdivide the plotting area into blocks to place plots next to each other. We do this using the command par(mfrow=c(nr,nc)), where nr is the number of rows and nc is the number of columns. For example, to plot two bar charts on the same row we use par(mfrow=c(1,2)).

In [ ]:

par(mfrow=c(1,2))
barplot(game_cards$score, names.arg = game_cards$names) # names.arg labels the bars
barplot(game_cards$score, names.arg = game_cards$names)

Exercise 4: The command par() implements a variety of more or less complex settings. Explore the par() command and in a markdown cell write a brief summary of settings that you think might be useful when presenting data.

Another way of plotting data is to use a scatterplot, which consists of plotting one set of data against another. For example

In [ ]:

score<- c(34,82,59,72,50,100)
score1<-c(24,32,69,56,45,90)
plot(score,score1) # plot(score1~score) gives the same plot: a formula is written y~x
abline(0,1) # the line y = x (intercept 0, slope 1)

Exercise 5: What is the scatter plot useful for? Plot score against itself. What do you get and why? Explore the data frame you created in Exercise 1. Explain in a markdown cell.

Exercise 6: Explore the iris data available in R. In the same plotting area draw two scatter plots: Sepal length versus Petal length, and Sepal width versus Petal width. What do you find?

If the points in a scatter plot follow a straight line very closely, we say that there is a high linear correlation. If they follow a line only roughly, we say that there is a moderate linear correlation, and if they show no linear pattern, we say that there is no linear correlation. The equation of the line is obtained by a process called linear regression. The fitted line is the one that best fits the data, in the sense that it minimises the sum of the squared vertical distances between the points and the line (least squares). In R the model is fitted with the command lm(), and the resulting line is added to the scatter plot with abline(). For example:

In [ ]:

plot(iris$Sepal.Length~iris$Petal.Length)
reg1<-lm(iris$Sepal.Length~iris$Petal.Length)
abline(reg1)
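The fitted model also holds the numbers behind the line. The sketch below refits the same regression using the data argument of lm() and reads off the coefficients, together with the correlation coefficient that quantifies how "high" or "moderate" the linear correlation is. The names fit and r are chosen here for illustration.

```r
# Same model as above, written with the data argument instead of iris$...
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)

coef(fit)                 # intercept and slope of the fitted line
summary(fit)$r.squared    # proportion of the variance explained by the line

# Pearson correlation coefficient between the two variables
r <- cor(iris$Petal.Length, iris$Sepal.Length)
r
r^2                       # for a single predictor this equals the R-squared above
```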

Exercise 7: Explore the iris dataset with linear regression and explain your findings in a markdown cell. Add titles and axis labels to the plots. Explore different colors and markers.

Pie charts

In [ ]:

# Pie Chart from data frame with Sample Sizes
iris_table <- table(iris$Species)
lbls <- paste(names(iris_table), "\n", iris_table, sep="")
pie(iris_table, labels = lbls,
    main="Pie Chart of Species of Iris\n (sample sizes)")

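In 1887 the Michelson-Morley experiment attempted to find variations in the speed of light due to the Earth's motion through the aether. It was believed at the time that the aether was the medium through which light waves travelled. Despite its name, the dataset morley in R does not hold the data of that experiment: according to its help page (?morley) it contains Michelson's 1879 measurements of the speed of light, made in five experiments of 20 runs each.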

Exercise 8: Plot a pie chart as above with sample sizes for the experiments in morley data.

Boxplot

For the morley data we can also use a boxplot, for example:

In [ ]:

boxplot(morley$Speed ~ morley$Expt,
  col='light grey', xlab='Experiment #',
  ylab="speed (km/s - 299,000)",
  main="Michelson–Morley experiment")
mtext("speed of light data")

sol=299792.458-299000 # accepted speed of light (km/s) minus the 299,000 km/s subtracted from the data
abline(h=sol, col='red')

By default the whiskers extend out to the most extreme data points, showing the full range of the data, unless those points are too far away, in which case they are considered outliers and plotted as circles. For the upper limit, 'too far' means beyond 'the upper quartile' + 1.5*'the interquartile range'. For the speeds of all experiments taken together, 'too far' would be anything above:

In [ ]:

quantile(morley$Speed,prob=0.75)[["75%"]] + 1.5*IQR(morley$Speed)

In [ ]:

quantile(morley$Speed,prob=0.75)

In [ ]:

IQR(morley$Speed)
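The boxplot above draws one box per experiment, so each box has its own limits. The sketch below repeats the calculation for experiment 1 only and compares it with boxplot.stats(), the function that computes what boxplot() draws. The names speed1, lower_fence and upper_fence are chosen here for illustration. Note that boxplot.stats() uses hinges (see fivenum()) rather than quantile(), so the two approaches can differ slightly for some data.

```r
speed1 <- morley$Speed[morley$Expt == 1]   # the 20 runs of experiment 1

q1  <- quantile(speed1, probs = 0.25)[["25%"]]
q3  <- quantile(speed1, probs = 0.75)[["75%"]]
iqr <- IQR(speed1)

lower_fence <- q1 - 1.5 * iqr   # below this a point counts as an outlier
upper_fence <- q3 + 1.5 * iqr   # above this a point counts as an outlier

# Points outside the fences, calculated by hand
speed1[speed1 < lower_fence | speed1 > upper_fence]

# The outliers that boxplot() would plot as circles for this experiment
boxplot.stats(speed1)$out
```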

Exercise 9: Using these examples calculate for the morley data all the quantiles, the IQR, the mean, the standard deviation (sd()). Repeat the same for each experiment. (tip: morley$Speed[morley$Expt==1]). Discuss findings in a markdown cell. What can you conclude?

Histogram

Histograms are an important type of plot for estimating the distribution (density) of observed data. In the morley data we can look at the Speed measured in all experiments and plot a histogram of the measurements.

In [ ]:

hist(morley$Speed)

We might want to make it prettier:

In [ ]:

par(fg=rgb(0.6,0.6,0.6))
hist(morley$Speed, freq=TRUE,
     col=rgb(0.9,0.9,0.9),
     main='Michelson-Morley Experiment',
     ylab="Frequency", xlab='Speed (km/s - 299,000)')
par(fg='black')

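We can also estimate the density of the data and draw it on top of the histogram. The density curve is on a different scale from the counts, so the histogram has to show densities (freq=FALSE) rather than frequencies for the two to line up. We will have:

In [ ]:

par(fg=rgb(0.6,0.6,0.6))
hist(morley$Speed, freq=FALSE,
     col=rgb(0.9,0.9,0.9),
     main='Michelson-Morley Experiment',
     ylab="Density", xlab='Speed (km/s - 299,000)')
par(fg='black')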

lines(density(morley$Speed))
abline(v=mean(morley$Speed), col=rgb(0.5,0.5,0.5))
abline(v=median(morley$Speed), lty=3, col=rgb(0.5,0.5,0.5))
abline(v=mean(morley$Speed)+sd(morley$Speed), lty=2, col=rgb(0.7,0.7,0.7))
abline(v=mean(morley$Speed)-sd(morley$Speed), lty=2, col=rgb(0.7,0.7,0.7))
rug(morley$Speed)

Exercise 10: Plot a similar histogram for each experiment of the morley data. What do you conclude? Discuss it in a markdown cell.

Common distributions

Exercise 11: In R it is possible to generate more than one set of data from known probability distributions, changing the parameters accordingly. Using the commands:

rnorm() -- normal distribution
runif() -- uniform distribution
rbinom() -- binomial distribution

generate the data and calculate all the descriptive statistics. With the help of hist() and density(), discuss what you have found.
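As a starting point, the sketch below draws one sample from each of the three distributions. The seed, the sample size and all the distribution parameters are arbitrary choices made for illustration; change them to see how the statistics and the histograms respond.

```r
set.seed(1)   # arbitrary seed, only so that the draws are reproducible

n <- 1000     # sample size chosen for illustration
x_norm  <- rnorm(n, mean = 0, sd = 1)         # normal with mean 0 and sd 1
x_unif  <- runif(n, min = 0, max = 1)         # uniform on the interval [0, 1]
x_binom <- rbinom(n, size = 10, prob = 0.5)   # number of successes in 10 trials

summary(x_norm)   # minimum, quartiles, mean and maximum
sd(x_norm)        # standard deviation

par(mfrow = c(1, 3))
hist(x_norm, freq = FALSE, main = 'rnorm'); lines(density(x_norm))
hist(x_unif, freq = FALSE, main = 'runif'); lines(density(x_unif))
hist(x_binom, freq = FALSE, main = 'rbinom')
par(mfrow = c(1, 1))   # restore a single plotting area
```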

T Test

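The t-test is a way to compare the means of two sets of data whose unknown distributions can be approximated by a Normal distribution. The classical (Student's) version also assumes that the two sets share a common variance. You can perform a t-test in R using the command t.test(), which by default does not assume equal variances (Welch's test); set var.equal=TRUE for the classical version. Explore it with R help.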

In the case of the morley data we can perform a t-test comparing Experiment 1 with Experiment 2 using the following syntax: t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==2])
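The object returned by t.test() can be stored and its parts inspected individually. The sketch below runs the comparison of Experiment 1 and Experiment 2 in both the default (Welch) and the equal-variance form. The names expt1, expt2, welch and pooled are chosen here for illustration.

```r
expt1 <- morley$Speed[morley$Expt == 1]
expt2 <- morley$Speed[morley$Expt == 2]

welch  <- t.test(expt1, expt2)                    # default: var.equal = FALSE
pooled <- t.test(expt1, expt2, var.equal = TRUE)  # classical Student's t-test

welch$method       # name of the test that was run
welch$statistic    # the t statistic
welch$p.value      # p-value for the null hypothesis of equal means
welch$conf.int     # 95% confidence interval for the difference in means
welch$parameter    # Welch degrees of freedom (usually not a whole number)
pooled$parameter   # pooled degrees of freedom: 20 + 20 - 2 = 38
```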

Exercise 12: Using the morley data perform a t-test for each experiment and compare the results. What can you conclude from this dataset? Use a markdown cell to explain the results and discuss your conclusions.

In [1]:

morley

    Expt Run Speed
1     1   850 
1     2   740 
1     3   900 
1     4  1070 
1     5   930 
1     6   850 
1     7   950 
1     8   980 
1     9   980 
1    10   880 
1    11  1000 
1    12   980 
1    13   930 
1    14   650 
1    15   760 
1    16   810 
1    17  1000 
1    18  1000 
1    19   960 
1    20   960 
2     1   960 
2     2   940 
2     3   960 
2     4   940 
2     5   880 
2     6   800 
2     7   850 
2     8   880 
2     9   900 
2    10   840 
⋮   ⋮    ⋮   ⋮    
4    11  910  
4    12  920  
4    13  890  
4    14  860  
4    15  880  
4    16  720  
4    17  840  
4    18  850  
4    19  850  
4    20  780  
5     1  890  
5     2  840  
5     3  780  
5     4  810  
5     5  760  
5     6  810  
5     7  790  
5     8  810  
5     9  820  
5    10  850  
5    11  870  
5    12  870  
5    13  810  
5    14  740  
5    15  810  
5    16  940  
5    17  950  
5    18  800  
5    19  810  
5    20  870  

In [ ]: