© Copyright 2016 Dr Marta Milo, University of Sheffield.
Week 3 Practical - Solutions
This Notebook contains the solutions to the practical assignments for Week 3.
These are solutions to the assigned exercises, but they are not unique. All the exercises proposed can be solved in different ways, some more efficient than others, but an approach is still correct if it produces the requested output.
For this reason, when you compare your solutions with these, bear in mind that you might have found the solution using a different algorithm and different commands; nonetheless they are still correct. These solutions are a guide and also a source of ideas that can be merged with yours to build your scripts and to improve them.
To help you evaluate your notebooks, please follow these criteria:
Overall clarity (score = 0.25)
Correctness of the code (score = 0.25)
Exhaustive coverage of the required analysis (score = 0.25)
Interpretation of the results (score = 0.25)
These criteria are used when peer-marking and self-marking. Please read the guidelines given to quantify the above criteria with a grade. Make sure that you score interpretation with a full mark if it is exhaustive and clear.
Accessing the data
Comma or tab delimited files
Exercise 1: Explore the command read.table() in R help. In a markdown cell give examples of the use of this command for .csv and .txt files.
Example for .csv files: read.csv(file, header=TRUE, sep=",", fill=TRUE)
Sometimes you also want to disable any quoting in the file that you read. This is done by adding quote = "" to the parameters of the function.
Example for space-delimited .txt files: read.delim(file, header=TRUE, sep=" ", fill=TRUE)
Example for tab-delimited .txt files: read.delim(file, header=TRUE, sep="\t", fill=TRUE)
Exercise 2: Load the data contained in the file smokers.csv and explore it. Check its structure using the command str(), and explain in a markdown cell what you have found and whether there are problems with missing data.
From the information that str() gives on the object data, we see that data is composed of two variables with 19 observations. There is no information on NA (Not Available) values, and because of the different types of observations within column 2, it is not considered numeric. We can use is.na(data[,1]) and is.na(data[,2]) to check for NA values, and also use other parameters within read.csv() to avoid generating factors that are difficult to manipulate.
We can see that there are NA values that we need to remove.
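A minimal sketch of this check, using a small made-up data frame in place of the smokers file (the column names and values here are illustrative only):

```r
# Hypothetical stand-in for the smokers data: column 2 contains NA values
data <- data.frame(Group = c("smoker", "smoker", "non-smoker", "non-smoker"),
                   BP = c(120, NA, 115, NA))
str(data)                 # two variables; the NAs are visible in BP
is.na(data[, 2])          # FALSE TRUE FALSE TRUE
sum(is.na(data[, 2]))     # 2 missing entries
```

Summing the logical vector returned by is.na() is a quick way to count the missing entries in a column.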
Cleaning the data
Once we have loaded the data we want to make sure that all values have the type that we expect. The file also contains Excel's calculations for the t-tests, which we do not need. Data cleansing is traditionally a rather complex step in data analysis.
Exercise 3: In the smokers data the strings equal and unequal in column 2 create problems when R converts the data into a data.frame. What problems do they create? Suggest ways of coping with this problem, without altering the dimensions of the data.frame, to ensure that column 2 is considered numeric. (TIP: Our solution was to avoid the problem when reading the file. Explore read.csv() for further information on this.)
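One way to follow the tip is the na.strings parameter of read.csv(), which converts listed strings to NA at read time so the column arrives numeric. A sketch, demonstrated on an in-memory text stand-in for the file (the column names are made up):

```r
# With the real file this would be:
# data <- read.csv("smokers.csv", header = TRUE,
#                  na.strings = c("NA", "equal", "unequal"),
#                  stringsAsFactors = FALSE)

# The same idea on an in-memory copy, so the example is self-contained
txt <- "Smoker,BP\n120,118\nequal,125\n130,unequal"
data <- read.csv(text = txt, na.strings = c("NA", "equal", "unequal"),
                 stringsAsFactors = FALSE)
str(data)   # both columns are now numeric, with NA replacing the strings
```

Because the strings become NA rather than being dropped, the dimensions of the data.frame are unchanged, as the exercise requires.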
Exercise 4: Create a data.frame from the smokers data with NA values and numerical data. Exclude the rows that are calculations from an Excel file (could they be the last two rows?). Write the data in the directory data_wk3 in a .csv file called "smokers_clean.csv".
I picked out rows 1-17 to write, in order to remove the calculation data. In addition, I added row.names=TRUE to keep track of row names. These can be changed to something more meaningful. If the second column has empty entries we need to remove them before calculating the mean; this would not happen if you wrote the data frame to file and then read it again to load the data back into your workspace.
Missing Data
Often in a data set we have missing values, expressed within R as NA entries. NA stands for 'Not Available' and means that data is missing. In this context, if we were looking at an Excel file the corresponding cells in the spreadsheet would be empty. These entries can cause problems with R functions and commands. We can tell R to exclude missing values with the parameter na.rm, in functions such as mean().
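A one-line illustration of the na.rm behaviour (the values are made up):

```r
x <- c(118, 125, NA, 132, NA)
mean(x)                # NA, because missing values propagate through the mean
mean(x, na.rm = TRUE)  # 125, the mean of the three observed values
```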
Exercise 5: In the cleaned smokers data (from Exercise 4) analyse the problem of missing values by calculating descriptive statistics of the data with and without missing values. Visualise the data in the way that is most appropriate to highlight the problem. Explain your choices in a markdown cell. Evaluate whether there is a significant difference between smokers and non-smokers with a t-test. Plot the results.
The barplot of the means (smokers and non-smokers) for the data with NA removed and the data with NA set to 0 shows that when the NAs are treated as zeros the means are smaller than when they are removed, because the zeros still count towards the sample size. This shows that it is important to remove the NAs from the data, because treating them as zeros can skew the data and make it appear more widely dispersed, which the next graph, a box plot, will illustrate.
In the box plot the range is set to 0, causing the whiskers to extend to the most extreme values; this includes the outliers in the plots. Clearly the data set with missing values kept is more dispersed, with minimum outliers of 0. The missing values do not affect the median value as much, however, since it is based on the rank order rather than on the average.
The test is evaluating the significance of the difference in blood pressure between smokers and non-smokers using a t-test with Welch's correction. The data set with missing values as 0 had a p-value of 0.48, while the data set with missing values removed had a p-value of 0.0004. Clearly, removing missing values resulted in a statistically significant difference. Therefore, the implications of missing values for statistical significance can be substantial.
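The comparison above can be sketched as follows, with simulated blood pressures standing in for the real data (so the p-values will differ from those quoted above, but the pattern is the same):

```r
set.seed(1)
smokers    <- c(rnorm(15, 135, 8), NA, NA)  # hypothetical values, two missing
nonsmokers <- rnorm(17, 120, 8)

# Welch's t-test; t.test() silently drops NA values by default
res_rm <- t.test(smokers, nonsmokers)

# Treating the NAs as zeros instead drags the smoker mean down
# and inflates the variance, weakening the test
res_zero <- t.test(ifelse(is.na(smokers), 0, smokers), nonsmokers)

res_rm$p.value
res_zero$p.value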
Exponential and Log manipulations.
In R we can use the command exp(x) to compute e^x, and log10(x), log2(x), logb(x, base=b) to compute the logarithm of x in base 10, base 2, or base b.
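A quick check of these commands at the R console:

```r
exp(1)              # e = 2.718282...
log10(100)          # 2
log2(8)             # 3
logb(27, base = 3)  # 3
```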
Exercise 6: Using the clean smokers data, take the log2 of the data and compare the density distribution of the data with and without the log transformation. Plot the densities on the same plot using the command lines(). Explain in a markdown cell what you can derive.
To visualise the density distribution of the original data vs log2-transformed data, we need to adjust the ranges of the plot since the values are on different scales. A solution for this can be:
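A sketch of this adjustment, using simulated positive values in place of the smokers data: the first density fixes xlim and ylim wide enough for both curves, and lines() overlays the second.

```r
set.seed(2)
x <- abs(rnorm(100, 50, 20)) + 1   # hypothetical positive-valued data

d_raw <- density(x)
d_log <- density(log2(x))

# Make the axes wide enough to contain both curves before overlaying
plot(d_raw,
     xlim = range(d_raw$x, d_log$x),
     ylim = range(d_raw$y, d_log$y),
     main = "Raw vs log2-transformed densities", col = "blue")
lines(d_log, col = "red")
legend("topright", legend = c("raw", "log2"),
       col = c("blue", "red"), lty = 1)
```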
The plots show that, although on different scales, the density distributions of the two datasets are the same. Therefore, it can be concluded that log-transformation does not alter the density distribution of the original data, and it allows comparisons with other datasets by normalising the scale and order of magnitude of the values.
Exercise 7: Build an MA plot of the smokers data using the above formulae. Enrich the plot with labels and colours, and in a markdown cell explain why M represents the fold change between samples and A the average signal.
We can use a function to solve this exercise:
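A sketch of such a function, tested here on simulated positive data (the function name and arguments are our own; the M and A formulae are the ones discussed below):

```r
# MA-plot helper: x and y are two samples on the raw (positive) scale
ma_plot <- function(x, y, ...) {
  M <- log2(x) - log2(y)        # fold change: equals log2(x/y)
  A <- (log2(x) + log2(y)) / 2  # average log2 signal intensity
  plot(A, M,
       xlab = "A (average log2 signal)",
       ylab = "M (log2 fold change)",
       col = "darkgreen", pch = 19, ...)
  abline(h = 0, lty = 2, col = "grey")  # M = 0 marks equal expression
  invisible(data.frame(A = A, M = M))
}

set.seed(3)
x <- 2^rnorm(50, 8)   # hypothetical sample 1
y <- 2^rnorm(50, 8)   # hypothetical sample 2
res <- ma_plot(x, y, main = "MA plot")
```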
M represents the fold change between samples because logb(x) - logb(y) = logb(x/y), effectively converting it into a ratio. A is the average log-signal intensity because (logb(x) + logb(y))/2 is simply the average of log(x) and log(y).
For Loops and conditional statements
Exercise 8: Use a for loop to plot the smokers data as shown in the example above, with and without the log transformation.
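A minimal sketch of such a loop, with simulated values in place of the smokers data: the two versions of the data (raw and log2-transformed) are stored in a named list, and the loop plots each in turn.

```r
set.seed(4)
# Hypothetical positive-valued stand-in for the smokers data
smokers_data <- data.frame(smoker    = 2^rnorm(17, 7),
                           nonsmoker = 2^rnorm(17, 7))

versions <- list(raw = smokers_data, log2 = log2(smokers_data))

par(mfrow = c(1, 2))                    # two panels, side by side
for (name in names(versions)) {
  boxplot(versions[[name]], main = paste("smokers data:", name))
}
par(mfrow = c(1, 1))                    # restore the default layout
```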
Real data case study
In this section we will analyse data collected from qPCR on single cell gene expression. The data is saved in a file called "Ct_data.csv" in the data_wk9 folder. The data are stored in a table of 384 rows and 3 columns, representing 384 qPCR reactions in 3 different runs. The genes that are quantified with qPCR are grouped in three different clusters of 8 genes (24 genes in total), and each run analyses one group of genes. In each column (experiment), every 8 rows the same markers repeat, but the samples in which they are tested are different: they are products from different cells (experiments). The target names are stored in a file called "targetNames.csv".
The number of times the cluster is repeated in each column is 48. These are the cells that have been analysed; each cell is considered an independent experiment.
In summary, we have the full data stored in a [384x3] matrix, and we want to build a data structure which is a [24x48] matrix, i.e. the 24 different genes analysed by the 48 cells (experiments) that we performed.
Exercise 9: Write a workflow and its associated pipeline for the analysis of this data, including the initial reorganisation of the data, the diagnostics, and the analysis to describe the data, saving results with column and row names.
Workflow
Reorganise data -> Transform data ->Analyse data-> Save results
Pipeline
Reorganise data: changing the data structure from a [384 x 3] matrix into a [24 x 48] matrix. Each row of the new matrix corresponds to a gene analysed, while each column represents the data collected from each experiment run (e.g. experiments on different genes in a different cell type).
Transform data: gene expression data is obtained by subtracting the background (RNA SPIKE) to give the Delta Ct, and quantified as 2^(-Delta Ct). However, in order to maintain the scale within the same order of magnitude, we keep the log form of the data.
Analyse data: compare the distribution of gene expression data in the 48 experiments using boxplots and density distribution curves. Pick out any visible outliers, and generate conjectures of cell function/behaviour based on these analyses.
Save results: results are saved with appropriate column and row names. Column names are the numbers of the experiments (1-48), while the row names are the names of the genes analysed.
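The reorganisation step of the pipeline can be sketched as follows, assuming (as described above) that each run column stacks the same 8 genes for 48 cells; the data here is simulated and the function name is our own.

```r
set.seed(5)
# Hypothetical Ct data: 384 rows (48 cells x 8 genes) by 3 runs
Ct <- matrix(rnorm(384 * 3, mean = 25, sd = 3), nrow = 384, ncol = 3)

# Reshape each [384 x 1] run column into [8 x 48] (genes x cells),
# then stack the three runs to get the [24 x 48] expression matrix
reorganise <- function(Ct) {
  rbind(matrix(Ct[, 1], nrow = 8),
        matrix(Ct[, 2], nrow = 8),
        matrix(Ct[, 3], nrow = 8))
}

expr <- reorganise(Ct)
dim(expr)   # 24 48
```

matrix() fills by column, so consecutive blocks of 8 rows in each run become consecutive columns (cells), which matches the layout described above.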
Exercise 10: Analyse the single cell qPCR data following your workflow. Ensure that you explain your analysis in a markdown cell and plot the diagnostics as well as the analysis on the transformed data. You might want to use any visualisation tool that you feel is appropriate for better explaining this data.
The data was reorganised into the 24x48 matrix. The command full_target<-c(targetNames[1:8,1],targetNames[1:8,2],targetNames[1:8,3]) was used to collect the 24 gene names from the table of 3 columns in which each column contains 8 gene names. Note: I added the parameter stringsAsFactors=FALSE to conserve the gene names as characters instead of factors.
The Delta Ct is calculated, and this enables us to quantify the gene expression as 2^(-Delta Ct). However, in our analysis we will use the log-transformed data, which is -Delta Ct.
Once the data has been reorganised and transformed we can start the analysis by visualising it using box plots and density plots.
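A sketch of both diagnostics on a simulated [24 x 48] matrix of -Delta Ct values (the values and column labels are made up):

```r
set.seed(6)
expr <- matrix(rnorm(24 * 48, mean = -5, sd = 4), nrow = 24, ncol = 48)
colnames(expr) <- paste0("cell", 1:48)

# One box per cell (column); las = 2 rotates the labels to keep them legible
boxplot(expr, las = 2, cex.axis = 0.5, main = "-Delta Ct per cell")

# Overlay the 48 density curves on a single plot
dens <- apply(expr, 2, density)
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- range(sapply(dens, function(d) range(d$y)))
plot(dens[[1]], xlim = xlim, ylim = ylim, main = "Density per cell")
for (d in dens[-1]) lines(d)
```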
From the box plot, we can notice that about 6 of the experiments had a higher median than the rest. Most of the medians lie within the -10 to 0 range, while these 6 experiments had medians between 5 and 20. This could suggest that the overall expression of these 24 genes is higher in these 6 experiments due to a functional difference, or that they are artifacts. The cells with a similar distribution of gene expression could be synchronous cells or cells that are functionally similar (same stage of differentiation). However, it would be worth analysing the expression of individual genes to find a more specific correlation.
The density curves (each curve = one experiment) are mostly similar, and the peaks for most experiments lie under 0.10. A few of the curves, however, peak between 0.10 and 0.17; these may correspond to the 6 higher experiments in the box plot. Conclusions could therefore be drawn in a similar manner with density plots.
Data Normalisation
Normalising the data is an important step in the analysis of gene expression levels. There are two types of data normalisation that we will use in this module: scaling normalisation and quantile normalisation.
**Exercise 11**: Using the single cell data, try the different types of data normalisation and run the high-level data analysis that you implemented in Exercise 10 on the normalised data. Explain in a markdown cell what you have found.
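Both normalisations can be sketched in a few lines on a simulated [24 x 48] matrix; the function name quant_norm is our own, and the scaling normalisation here is taken to mean aligning the column medians.

```r
set.seed(7)
expr <- matrix(rnorm(24 * 48, mean = -5, sd = 4), nrow = 24, ncol = 48)

# Scaling normalisation: subtract each column's median so all medians align
scale_norm <- sweep(expr, 2, apply(expr, 2, median), "-")

# Quantile normalisation: give every column the same distribution by
# replacing each value with the mean of the sorted columns at its rank
quant_norm <- function(m) {
  ranks <- apply(m, 2, rank, ties.method = "first")
  means <- rowMeans(apply(m, 2, sort))
  apply(ranks, 2, function(r) means[r])
}
expr_qn <- quant_norm(expr)
```

After scaling, every column's median is (numerically) zero; after quantile normalisation, every column has exactly the same set of values, which is why the boxplots become identical.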
The scaling normalisation shows an obvious change: the medians of the 48 experiments are now aligned, giving a better visualisation of the density distribution and highlighting the peaks.
The boxplot shows the radical effect that the quantile normalisation has had on the data. The experiments have the same quantiles and the medians are aligned. However, this normalisation makes it difficult to visualise differences in the data, and we lose the peaks that we had noticed above. We can therefore conclude that this normalisation highlights significant changes when the data we are analysing has strong expression signals. In this case it flattens the tails too much and we lose possible differences that are expressed here only at low signal. The scaling normalisation appears to be more appropriate.