© Copyright 2016 Dr Marta Milo, University of Sheffield.
Week 2 Practical - Solutions
This Notebook contains the solution of the practical assignments for Week 2.
These are solutions to the assigned exercises, but they are not unique. All the proposed exercises can be solved in different ways, some more efficient than others, but all are correct as long as the requested output is obtained.
For this reason, when you compare your solutions with these, bear in mind that you might have found the solution using a different algorithm and different commands; nonetheless, they are still correct. These solutions are a guide and also a source of ideas that can be merged with yours to build your scripts and improve them.
To help you evaluate your notebooks, please follow these criteria:
Overall clarity (score = 0.25)
Correctness of the code (score = 0.25)
Exhaustive cover of required analysis (score= 0.25)
Interpretation of the results (score = 0.25)
These criteria are also used when peer-marking and self-marking. Please read the guidelines given to quantify the above criteria with a grade. Make sure that you score interpretation with a full mark if it is exhaustive and clear.
Data Frames
Exercise 1: Create another field for a second match score in the game_cards
data frame and access each field separately to store them into vectors. Use the markdown cell to describe your steps (Tip: to access a data frame field you use names` as below)
Please note that when creating data frames from string vectors, R turns the strings into factors by default. We use stringsAsFactors=FALSE
to avoid this.
We can add a new field to the data structure by using a new column name with the syntax game_cards$
. We can also rename the columns using colnames(). We can then use the same syntax to store the content of each column in vectors.
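A minimal sketch of these steps; the player names and scores here are invented for illustration:

```r
# Create a data frame of players and their first-match scores.
# stringsAsFactors = FALSE keeps the names as character strings.
game_cards <- data.frame(name  = c("Anna", "Ben", "Cara", "Dan", "Eve"),
                         score = c(55, 72, 63, 81, 49),
                         stringsAsFactors = FALSE)

# Add a new field for a second match score using the $ syntax.
game_cards$score2 <- c(60, 70, 58, 90, 52)

# Store the content of each column in separate vectors.
players <- game_cards$name
match1  <- game_cards$score
match2  <- game_cards$score2
```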
Exercise 2: Do a simple analysis of the data frame that you have created.
First rename the columns into
match1
and match2
(Tip: find out about colnames()). Calculate the dimensions of the frame, the overall minimum score and overall maximum score, then the minimum and maximum scores for match1 and match2. Who were the players associated with those?
Order the scores for each match using the nested command
game_cards[order(game_cards$score),]
. Why do we need to use the function order()
in this way? What is its output? Use the markdown cell to explain.
Step1: rename columns
Step2: Calculate the dimensions of the frame, the overall minimum score and overall maximum score, then the minimum and maximum scores for match1 and match2. Identify the players associated with them.
Step3: order scores
The function order() gives as output the indices of the sorted elements; for this reason we have to place the indices (the output of order()
) as the row index of game_cards
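One possible sketch of the three steps, assuming the invented game_cards data frame from Exercise 1:

```r
game_cards <- data.frame(name   = c("Anna", "Ben", "Cara", "Dan", "Eve"),
                         score  = c(55, 72, 63, 81, 49),
                         score2 = c(60, 70, 58, 90, 52),
                         stringsAsFactors = FALSE)

# Step 1: rename the score columns.
colnames(game_cards) <- c("name", "match1", "match2")

# Step 2: dimensions, overall and per-match extremes, and the players involved.
dim(game_cards)                                # rows and columns
min(game_cards[, c("match1", "match2")])       # overall minimum score
max(game_cards[, c("match1", "match2")])       # overall maximum score
game_cards$name[which.min(game_cards$match1)]  # player with lowest match1 score
game_cards$name[which.max(game_cards$match2)]  # player with highest match2 score

# Step 3: order() returns the indices of the sorted elements,
# which we use as the row index of the data frame.
game_cards[order(game_cards$match1), ]
```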
Visualise the data
To visualise the data we use graphs that can be generated in different ways. The command plot()
is a general graphical function that enables data plotting.
Exercise 3: Explore plot()
with the R help and in a markdown cell describe its characteristics and how we can change/add features.
We can also subdivide the plotting area into different blocks to enable adjacent plots. We can do this using the command par(mfrow=c(nr,nc))
, where nr
is the number of rows and nc
is the number of columns. For example, to plot two bar charts on the same row we use par(mfrow=c(1,2))
.
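For example, with invented data:

```r
# Split the plotting area into 1 row and 2 columns,
# then draw two bar charts side by side.
par(mfrow = c(1, 2))
barplot(c(3, 5, 2), main = "First chart")
barplot(c(4, 1, 6), main = "Second chart")
par(mfrow = c(1, 1))  # reset to a single plot
```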
plot()
is a generic function to explore R objects. It is a basic function used for generic X-Y plotting. A simple syntax is plot(x,y,...)
, where x gives the x coordinates of the points in the plot and y gives the y coordinates. The ...
gives room to enter more arguments, such as parameters, type, titles, and labels of the graph.
The parameters of the plot can be defined using the par()
function, which can be used within the function plot()
or in the environment as described in the example.
Some of the graphical parameters are:
type
is used to define the plot type, for instance "p" for points, "l" for lines, and "h" for histogram-like vertical lines.
main
is used to display the title of the graph (sub
for subtitle).
xlab
and ylab
are used to label the x- and y- axes, respectively.
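A small sketch combining these parameters, with invented data:

```r
# A simple X-Y plot with type, title, subtitle, and axis labels.
x <- 1:10
y <- x^2
plot(x, y,
     type = "l",                # draw a line instead of points
     main = "A simple plot",    # main title
     sub  = "y = x squared",    # subtitle
     xlab = "x values",
     ylab = "y values")
```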
Exercise 4: The command par()
implements a variety of more or less complex settings. Explore the par()
command and in a markdown cell write a brief summary of settings that you think might be useful when presenting data.
There are a variety of graphical parameters that can be adjusted. Parameters can be specified as arguments in tag
=value
form, or as a list of tagged values. Some useful adjustments to help display data include:
adj
: a value between 0 and 1 specifying text justification; 0 for left-justified, 0.5 for centred, 1 for right-justified.
ann
: annotation. By default axis labels and titles are drawn; if set to FALSE, there will be no annotations.
bg
: to specify the colours to be used in the background.
cex
, cex.axis
, cex.lab
, cex.main
, cex.sub
: magnification of plotting text, with 1 as the default magnification value. Values less than 1 make it smaller.
col
, col.axis
, col.lab
, col.main
, col.sub
: specifies the default colour of the plot. Default is black
.
font
, font.axis
, font.lab
, font.main
, font.sub
: an integer specifying which font to use. 1 = plain default text, 2 = bold, 3 = italic, 4 = bold italic.
lab
: a numerical vector c(x,y,len)
that modifies the default way of axis annotation. Values of x
and y
are the approximate number of tick marks on the x and y axes, while len
specifies the label length. Default is c(5,5,7)
.
las
: style of axis labels; 0 = always parallel to axis (default); 1 = always horizontal; 2 = always perpendicular to axis; 3 = always vertical.
lty
: specifies line type. Can either be an integer (0 = blank, 1 = solid (default), 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash), or as character strings blank
, solid
, etc.
lwd
: line width. Positive number. Default = 1.
mfcol
, mfrow
: a vector c(nr,nc)
; draw subsequent figures in an nr
x nc
array, by columns (mfcol) or by rows (mfrow).
pin
: The current plot dimensions, (width, height), in inches.
xlog
, ylog
: Defaults to FALSE (linear scale). If TRUE, uses a logarithmic scale.
Exercise 5: What is the scatter plot useful for? Plot score
against itself. What do you get and why? Explore the data frame you created in Exercise 1. Explain in a markdown cell.
The scatter plot is useful to visualise the relationship between two variables, usually the independent variable is x
and dependent variable is y
. Plotting score
against score
yields points that form a 45-degree line, with a perfectly linear relationship. This is because x=y for every single point.
The data from Exercise 1 can be presented with a grouped bar plot instead of a scatter plot, since it would allow for a visual comparison of each player's scores for both games.
To create the bar plot above, I first converted my data frame into a matrix, since height in barplot()
can only be a vector or a matrix. Please note that the matrix has dimensions 2x5 as opposed to the 5x2 of the data frame. This is because the function barplot() treats each column as a group of bars. I set beside to TRUE to achieve grouped bars instead of stacked bars. Simple labelling was achieved through main, xlab, and ylab. The legend defaults to the names of each row, to illustrate the data from the different games. Finally, ylim was set to above 100, the theoretical maximum score in my card games, in order to make room for the legend box.
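A sketch of the grouped bar plot described above, assuming the invented game_cards data frame with columns match1 and match2:

```r
game_cards <- data.frame(name   = c("Anna", "Ben", "Cara", "Dan", "Eve"),
                         match1 = c(55, 72, 63, 81, 49),
                         match2 = c(60, 70, 58, 90, 52),
                         stringsAsFactors = FALSE)

# barplot() needs a vector or matrix: transpose the scores so each
# column is a player (a group of bars) and each row is a match.
scores <- t(as.matrix(game_cards[, c("match1", "match2")]))
colnames(scores) <- game_cards$name

barplot(scores,
        beside = TRUE,             # grouped rather than stacked bars
        legend.text = TRUE,        # legend defaults to the row names
        main = "Scores per player",
        xlab = "Player", ylab = "Score",
        ylim = c(0, 120))          # room above 100 for the legend box
```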
Exercise 6: Explore the iris
data available in R. In the same print area plot two scatter plots of Sepal length versus Petal length and Sepal Width versus Petal width. What do you find?
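A possible sketch of the two adjacent scatter plots:

```r
data(iris)  # built-in dataset

# Two scatter plots side by side in the same print area.
par(mfrow = c(1, 2))
plot(iris$Sepal.Length, iris$Petal.Length,
     main = "Sepal vs Petal length",
     xlab = "Sepal length", ylab = "Petal length")
plot(iris$Sepal.Width, iris$Petal.Width,
     main = "Sepal vs Petal width",
     xlab = "Sepal width", ylab = "Petal width")
par(mfrow = c(1, 1))  # reset the plotting area
```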
From the scatter plots it appears that both sets of data exhibit a positive correlation. Also, the data is separated into distinct classes that exhibit different levels of correlation. The bigger cluster is more positively correlated than the smaller cluster.
Exercise 7: Explore the iris
dataset with linear regression and explain in a markdown cell your findings. Add titles and axis legends to the plots. Explore different colors and markers.
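One of the regressions could be sketched as follows, with colours and markers chosen for illustration:

```r
data(iris)

# Fit a linear regression of petal length on sepal length.
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

plot(iris$Sepal.Length, iris$Petal.Length,
     main = "Petal vs Sepal length",
     xlab = "Sepal length", ylab = "Petal length",
     col = "darkgreen", pch = 19)        # filled green markers
abline(fit, col = "red", lwd = 2)        # add the fitted regression line

summary(fit)  # coefficients and R-squared quantify the fit
```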
I plotted the 4 regression models on the iris data. Sepal vs Petal Length and Petal Length vs Width appeared to have a positive, linear relationship, while the other two comparisons appeared to be moderately linear in a negative direction. You can use summary()
to quantify the fit.
Pie Charts
In 1887 the Michelson-Morley experiment attempted to find variations in the speed of light due to the Earth's motion through the aether. It was believed at the time that the aether was the medium through which light waves travelled. The data of this experiment is stored in the dataset morley
in R.
Exercise 8: Plot a pie chart as above with sample sizes for the experiments in morley
data.
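A possible sketch for the pie chart of sample sizes:

```r
data(morley)

# Sample size of each of the 5 experiments.
sizes <- table(morley$Expt)

pie(sizes,
    labels = paste("Expt", names(sizes)),
    main = "Sample sizes in the morley data")
```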
Boxplot
For the morley
data we can also use boxplot for example:
The default behaviour is for the whiskers to extend out to the full range of the data, showing the extremes. Unless, that is, the extremes are too far away, in which case they are considered outliers and plotted as circles. For the upper limit, 'too far' is taken as the upper quartile + 1.5 * the interquartile range. So, in this case, 'too far' would be:
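A sketch of the boxplot and of the upper 'too far' limit for the whole morley data:

```r
data(morley)

# One box per experiment.
boxplot(Speed ~ Expt, data = morley,
        main = "Speed by experiment",
        xlab = "Experiment", ylab = "Speed")

# The upper whisker limit: upper quartile + 1.5 * interquartile range.
q3    <- quantile(morley$Speed, 0.75)
limit <- q3 + 1.5 * IQR(morley$Speed)
limit
```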
Exercise 9: Using these examples calculate for the morley data all the quantiles, the IQR, the mean, the standard deviation (sd()
). Repeat the same for each experiment. (tip: morley$Speed[morley$Expt==1]
). Discuss findings in a markdown cell. What can you conclude?
You can write a function to return all the information about the morley
data in a single vector, and organise the results in a data frame that takes inputs from this function to summarise the data from all 5 experiments.
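For example, with a hypothetical helper function describe() (the name is invented here):

```r
data(morley)

# Hypothetical helper: all descriptive statistics for one vector.
describe <- function(x) {
  c(quantile(x),          # the five quantiles
    IQR  = IQR(x),
    mean = mean(x),
    sd   = sd(x))
}

# Apply it to every experiment and collect the rows in a data frame.
stats <- t(sapply(1:5, function(e) describe(morley$Speed[morley$Expt == e])))
rownames(stats) <- paste("Expt", 1:5)
as.data.frame(stats)
```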
In the morley data we can notice that the IQR is larger than the SD (except in experiment 4), while the means and medians are quite comparable across all 5 experiments (well within the SD and IQR). With further calculations, it was found that the maximum deviation before a data point is considered an outlier is larger using the Q3 + 1.5*IQR calculation than the mean + 2*SD calculation (again, except in experiment 4). Therefore the impact of outliers in this type of data is very important, which calls for careful use of the statistical tools.
Histogram
Histograms are very important plots to estimate density distributions from observed data. In the morley data we can look at the Speed measured for all experiments and plot a histogram of the measurements.
We might want to make it prettier:
We can calculate the density of the data and plot onto the histogram. We will have:
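A possible sketch of the histogram with the density overlaid:

```r
data(morley)

# Histogram on the density scale, with the estimated density on top.
hist(morley$Speed, freq = FALSE,
     main = "Speed measurements", xlab = "Speed", col = "lightblue")
lines(density(morley$Speed), col = "red", lwd = 2)
```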
Exercise 10: Plot a similar histogram for each experiment of the morley data. What do you conclude? Discuss it in a markdown cell.
Common distributions
Exercise 11: In R it is possible to generate more than one set of data from known probability distributions, changing the parameters accordingly. Using the commands:
rnorm()
-- normal distribution
runif()
-- uniform distribution
rbinom()
-- binomial distribution
to generate the data, calculate all the descriptive statistics. With the help of hist()
and density()
discuss what you have found.
I used 10000 samples to generate each histogram, to better estimate the density function from the histograms. For all three sets of data, the mean and median were nearly identical, suggesting that the distributions are symmetrical. unif
has bars of similar heights, as expected in a uniformly distributed population. If you increase the number of successes in the binomial distribution, it approximates a Gaussian. The rugs under each graph are an interesting visual reflection of the data too.
T Test
The t-test is a way to compare sets of data that share a common variance and whose unknown distribution can be approximated with a Normal distribution. You can perform a t-test in R using the command t.test()
. Explore it with the R help.
In the case of the morley data we can perform a t-test comparing Experiment 1 versus Experiment 2 using the following syntax: t.test(morley$Speed[morley$Expt==1], morley$Speed[morley$Expt==2])
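Running that command and inspecting the result:

```r
data(morley)

# Compare Experiment 1 against Experiment 2 with a two-sample t-test.
res <- t.test(morley$Speed[morley$Expt == 1],
              morley$Speed[morley$Expt == 2])
res$p.value   # the p-value of the test
```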
Exercise 12: Using the morley data, perform a t-test for each experiment and compare the results. What can you conclude from this dataset? Use the markdown cell to explain the results and discuss your conclusions.
The p-values are not consistent, and therefore the hypothesis that the presence of the aether would not change the speed of light when travelling at different angles is accepted.
The experiment concluded that there was no difference between the speed of light in the direction of movement through the presumed aether and the speed at right angles to it.