
© Copyright 2016 Dr Marta Milo and Dr Mike Croucher, University of Sheffield.

Week 3 Practical

This Notebook contains practical assignments for Week 3.

It contains guidance on how

  • how to read and write data in R

  • how to load data into a data frame and clean the data ready for analysis

  • basic usage of for and while loops and conditional statements such as if

It also contains practical tasks that you will have to implement yourself. We will use data that is stored in your SageMathCloud folder; this is the data you will upload into your workspace. Any other data can be used as long as it is specifically mentioned in your notebook and access to it is possible via SageMathCloud.

You are free to base your work on the examples given here but you are also welcome to use different methods if you prefer, adding and/or creating new ways of displaying the data. You will need to add descriptions of what you have done in the assigned tasks and a comment on the results obtained using Markdown cells.

You will need to create a new notebook in the Week 3 folder of your SageMathCloud account, which you will call username_week3.ipynb (using your own username).

The notebooks will be marked and formative feedback given to you as shown in the class. The last version of your notebook saved by the deadlines indicated on the website will be the one that will receive feedback.

All the notebooks are meant to be used interactively. All the code needs to be written into code cells, which can be executed by an R kernel. The outputs are not present in this notebook, but the code cells are executable.

Accessing the data

For this practical we will mainly be using the data sets that are provided in the SageMathCloud folder. They are in the form of .txt files or .csv files. This is because the Week 3 practical is intended to provide guidance on how to use basic R commands for reading data stored in files and writing the output of your analysis into files. You are free to use other appropriate data of yours that you have uploaded into your R workspace. If you do so you MUST describe the data set in a markdown cell and comment on it.

We will access data that is stored as comma separated values, tab delimited values and, as an example here, .xls (Excel) files. This latter type will not be used for the final assignment, but it is useful to know that data can be loaded into the R workspace directly from an .xls file.

Comma or tab delimited files

A popular data format is a text file in which columns are separated by a tab, space or comma. We will use comma and tab delimited files, which have the following extensions:

  • .csv comma delimited

  • .txt tab or space delimited

Reading the data

In this notebook we will read both types of files into R. R reads tables of data from files with a command called read.table().

Exercise 1: Explore the command read.table() in R help. In a markdown cell give examples of the use of this command for .csv and .txt files.

A typical use of the command read.csv() is:

data <- read.csv("nameFile.csv", header = TRUE)

This will read the file called "nameFile.csv" from your working directory. The header will also be read and the column names, if present, will be stored. If the file is in a different location you need to specify where it is using the full path. For example, data <- read.csv("data_wk3/smokers.csv", header = TRUE) will read the file "smokers.csv" from your data folder called data_wk3.
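As a small sketch, read.table() also handles tab-delimited .txt files. Here we first create a temporary example file (a generated name, not one of the course data files) so the cell runs anywhere:

```r
# Create a tiny tab-delimited file to read back in
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tvalue", "1\t3.5", "2\t4.2"), tmp)

# sep = "\t" tells read.table the columns are tab separated
txt_data <- read.table(tmp, header = TRUE, sep = "\t")
str(txt_data)  # two columns: id and value
```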

After reading the data you need to create an appropriate data.frame for your analysis. The data.frame will have to contain all the information you need about the data. This will be important for when we start using the Bioconductor framework.

Exercise 2: Load the data contained in the file smokers.csv and explore it. Check its structure using the command str() and explain in a markdown cell what you have found and whether there are problems with missing data.

Cleaning the data

Once we have loaded the data we want to make sure that all values have the type that we expect. The file also contains Excel's calculations for the t-tests, which we don't need. It is traditional in data analysis for data cleansing to be a rather complex step.

Exercise 3: In the smokers data the strings equal and unequal in column 2 create problems when R converts the data into a data.frame. What problems do they create? Suggest ways of coping with this problem, without altering the dimensions of the data.frame, to ensure that column 2 is considered numeric. (TIP: Our solution was to avoid the problem when reading the file. Explore read.csv() for further information on this.)

Writing the data

We can write the output of our manipulations and calculations in R into our working directory, or into a location given by a path specified in the command. In R we use the command write.table(), or we can use the command write.csv() for comma separated files. Explore these options in R help before using them.

As an example we can use: write.csv(data,file="nameFile.csv")
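A minimal round-trip sketch, using a temporary file and made-up data rather than the course files, showing write.csv() and read.csv() working together:

```r
# A small data frame to write out and read back
df <- data.frame(subject = c("A", "B"), score = c(10, 12))
out <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing an extra index column
write.csv(df, file = out, row.names = FALSE)

back <- read.csv(out, header = TRUE)
dim(back)  # same dimensions as df
```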

Exercise 4: Create a data.frame from the smokers data with NA values and numerical data. Exclude the rows that are calculations from an Excel file (could they be the last two rows?). Write the data into the directory data_wk3 in a .csv file called "smokers_clean.csv".

Missing Data

Often in a data set we have missing values, expressed within R as NA entries. NA stands for 'Not Available' and means that data is missing. In this context, if we were looking at an Excel file the corresponding cells in the Excel spreadsheet would be empty. These entries can cause problems with R functions and commands. We can tell R to exclude missing values with the parameter na.rm, in functions such as mean().

You can use na.rm as follows:

y <- c(27, 40, 72, NA, 89)
my <- mean(y, na.rm = TRUE)

You can also check which values in a data set are NA using the command is.na() and set them to a fixed value. For example:

y[is.na(y)] <- 0
y

Exercise 5: In the cleaned smokers data (from Exercise 4) analyse the problem of missing values by calculating descriptive statistics of the data with and without missing values. Visualise the data in the way that is most appropriate to highlight the problem. Explain your choices in a markdown cell. Evaluate whether there is a significant difference between smokers and non-smokers with a t-test. Plot the results.

Exponential and Log manipulations.

In R we can use the command exp(x) to compute $e^x$, and log10(x), log2(x) and logb(x, base=b) to compute $\log_{10}(x)$, $\log_2(x)$ and $\log_b(x)$.
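These commands can be checked directly in a code cell, for example:

```r
x <- 8
exp(1)             # Euler's number e
log2(x)            # log base 2 of 8, i.e. 3
log10(100)         # log base 10 of 100, i.e. 2
logb(x, base = 2)  # same result as log2(x)
```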

Exercise 6: Using the clean smokers data take the $\log_2$ of the data and compare the density distribution of the data with and without the log transformation. Plot the densities on the same plot using the command lines(). Explain in a markdown cell what you can derive.

An important plot that we use in bioinformatics is the MA plot. It helps to identify differences among conditions; outliers are easily detected using this way of visualising the data. The MA plot is a plot of the following quantities:

$$M = \log_2(X) - \log_2(Y)$$
$$A = \frac{1}{2}\left(\log_2(X) + \log_2(Y)\right)$$

where M represents the ratio (also known as fold change) and A the overall signal. The MA plot is built by plotting M against A. This plot, as well as the volcano plot, is available in the package limma under Bioconductor. We will explore them with built-in functions in Week 10.
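A sketch of computing M and A from two simulated signal vectors X and Y (random stand-ins, not the smokers data):

```r
set.seed(1)
X <- 2^rnorm(100, mean = 5)  # simulated positive signals
Y <- 2^rnorm(100, mean = 5)

M <- log2(X) - log2(Y)          # log ratio (fold change)
A <- 0.5 * (log2(X) + log2(Y))  # average log signal

plot(A, M, pch = 20,
     xlab = "A (average log2 signal)",
     ylab = "M (log2 fold change)",
     main = "MA plot")
abline(h = 0, col = "red")  # no-change reference line
```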

Exercise 7: Build an MA plot of the smokers data using the above formulae. Enrich the plot with labels and colours, and in a markdown cell explain why M represents the fold change between samples and A the overall signal.

For Loops and conditional statements

Looping is an important part of a programming script. It allows you to repeat a set of instructions many times. There are many different types of looping structures. In this module we will use for and while loops and if as conditional statements.

**For loops** For loops are a very simple repeating structure: they just repeat the set of instructions as many times as indicated by the maximum value of the counter i. For example:

for(i in 1:10){
  print("Hello world!")
  print(i*i)
}

The set of instructions is always grouped in curly brackets { }.

We use for loops for any sort of repeated operation, not just printing or plotting. They are particularly useful when we need to create plots with many subplots. For example:

mydata <- rnorm(100, mean=0, sd=1)
mydata2 <- rnorm(100, mean=2, sd=1)
par(mfrow=c(2, 2))
for(i in 1:2){
  hist(mydata, col='red', prob=TRUE)
  lines(density(mydata), pch=3, lty=i, col="blue")
  hist(mydata2, col='red', prob=TRUE)
  lines(density(mydata2), pch=3, lty=i, col="blue")
  #title(main=paste("lty = ", i), col.main="blue")
}

Exercise 8: Use a for loop to plot the smokers data as shown in the example above, with and without the log transformation.

**While loops** Another important loop that is often used is the while loop. This looping structure enables you to repeat the same set of instructions while a given condition holds. For example:

z <- 0
while(z < 5) {
  z <- z + 2
  print(z)
}

This enables computations on a variable until it reaches a certain value (threshold).

**If statements** This is a slightly different structure from the ones above: it does not repeat instructions, but executes them conditional on a statement being TRUE. for and while loops can be nested in an if statement, and therefore loops can be created conditional on a statement being TRUE. The syntax for the if statement is:

if(statement) {
  instruction 1
  instruction 2
} else {
  instruction 3
  instruction 4
}

where statement can be any condition that gives a logical output, for example x == 5, x < 7 or name == "GATA3".
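A minimal runnable example of this syntax (the variable names are just for illustration):

```r
x <- 6
if (x < 7) {
  result <- "small"   # runs because 6 < 7 is TRUE
} else {
  result <- "large"
}
print(result)
```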

Real data case study

In this section we will analyse data collected from qPCR on single cell gene expression. The data is saved in a file called "Ct_data.csv" in the data_wk9 folder. The data are stored in a table of 384 rows and 3 columns, representing 384 qPCR reactions in 3 different runs. The genes that are quantified with qPCR are grouped in three different clusters of 8 genes (24 genes in total), and each run analyses one group of genes. In each column (run) the same markers repeat every 8 rows, but the samples in which they are tested are different: they are products from different cells (experiments). The target names are stored in a file called "targetNames.csv".

The number of 8-row blocks repeated in each column is 48. These are the cells that have been analysed; each cell is considered an independent experiment.

In summary, we have the full data stored in a [384x3] matrix and we want to build a data structure which is a [24x48] matrix, i.e. the 24 different genes analysed by the 48 cells (experiments) that we performed.
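The reshape can also be sketched in a vectorized way; the sim matrix here is simulated stand-in data of the same shape, not the real Ct values:

```r
# Stand-in for the [384x3] Ct table: values 1..1152 so positions are traceable
sim <- matrix(seq_len(384 * 3), nrow = 384, ncol = 3)

# Each column holds 48 blocks of 8 genes; matrix() fills column-wise,
# so every consecutive run of 8 values becomes one cell's column
blocks <- lapply(1:3, function(k) matrix(sim[, k], nrow = 8, ncol = 48))

# Stack the three 8x48 blocks into the [24x48] gene-by-cell structure
full <- do.call(rbind, blocks)
dim(full)  # 24 48
```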

Exercise 9: Write a workflow and its associated pipeline for the analysis of this data, including the initial reorganisation of the data, the diagnostics, the analysis to describe the data and saving the results with column and row names.

Exercise 10: Analyse the single cell qPCR data following your workflow. Ensure that you explain your analysis in a markdown cell and plot the diagnostics as well as the analysis on the transformed data. You might want to use any visualisation tool that you feel is appropriate for better explaining this data.

Below are some tips that might help you to execute this task.

Reorganisation of the data

In order to do this you need to use loops. One way of doing it is by combining for and while loops. For example, for only one column we can do:

  • load the data in Ct_data and targetNames

Ct_data <- read.csv("./data_wk9/Ct_data.csv", header=TRUE)
i <- 1  # initialise the counters
j <- 1
k <- 1
Ct <- matrix(0, 8, 48)        # make matrix template of zeros
Ct_full <- matrix(0, 24, 48)  # make matrix template of zeros
cell_number <- dim(Ct_data)[1]
step <- c(1, 9, 17)  # set the three partition points for organising the full data structure
for (k in 1:1){
  while (i <= cell_number) {
    Ct[, j] <- Ct_data[i:(i+7), k]
    i <- i + 8
    j <- j + 1
  }
  Ct_full[step[k]:(step[k]+7), ] <- Ct
  i <- 1
  j <- 1
}

When you write the data you also need to reorganise targetNames so that they match the 24 rows of the data structure. For example, something like this might help:

full_target<-c(targetNames[1:8,1],targetNames[1:8,2],targetNames[1:8,3])

**Explain in a markdown cell why this could be a solution to obtain the complete set of 24 gene names used in this experiment.**

Once the data structure is complete, write the output to a .csv file. The final structure should look like:

targetNames <- read.csv("data_wk9/targetNames.csv", header=FALSE)
full_target <- c(targetNames[1:8,1], targetNames[1:8,2], targetNames[1:8,3])
rownames(Ct_full) <- full_target
targetNames

**Data Transformation**

After the diagnostics on your raw data, you want to evaluate the gene expression values from it. Since this is qPCR data, to estimate the gene expression level we need to:

  • subtract the background from each gene and store the new values in a variable called deltaCt. The background is the Ct value of the first gene (RPLO)

  • calculate the gene expression as $2^{-deltaCt}$

To implement the first step you might want to think of the function sweep(), which can be used as follows:

deltaCt<-sweep(Ct_full,MARGIN=2,Ct_full[1,],FUN="-")
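To see what sweep() with MARGIN=2 and FUN="-" does here, a toy example on a small matrix (the numbers are made up):

```r
# Two columns: (1, 2, 3) and (10, 20, 30)
m <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3)

# Subtract the first row (the "background") from every row, column by column
d <- sweep(m, MARGIN = 2, m[1, ], FUN = "-")
d  # first row is now all zeros, like deltaCt for the RPLO gene
```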

**Analysis of the data**

When looking at the data diagnostics you might want to use box plots and density plots to check the difference in behaviour of each of the 48 cells (which are now the columns of your data structure). You might want to overlay those distributions to see whether there are groups of cells that behave differently despite being sorted and harvested as the same. Add some comments about the biology that you might deduce from this data analysis.

Data Normalisation

Normalising the data is an important step in the analysis of gene expression levels. There are two types of data normalisation that we will use in this module: scaling normalisation and quantile normalisation.

**Exercise 11**: Using the single cell data, try different types of data normalisation and run the high level data analysis that you implemented in Exercise 10 on the normalised data. Explain in a markdown cell what you have found.

We will use this technique within Bioconductor, but for this practical you can use the functions below for quantile normalisation and scaling normalisation.

# to perform a centering of the data by its median we can use
med_att <- apply(deltaCt, 2, median)  # median of the columns
scaledData <- sweep(data.matrix(deltaCt), 2, med_att)
# repeat using the Ct_full values

# A quantile normalisation function
quantile_normalisation <- function(df){
  df_rank <- apply(df, 2, rank, ties.method="average")
  df_sorted <- data.frame(apply(df, 2, sort))
  df_mean <- apply(df_sorted, 1, mean)

  index_to_mean <- function(my_index, my_mean){
    return(my_mean[my_index])
  }

  df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean)
  rownames(df_final) <- rownames(df)
  return(df_final)
}
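As a usage sketch, the quantile normalisation function can be checked on simulated data (the definition is repeated here so this cell runs on its own; toy is random stand-in data, not the qPCR values):

```r
quantile_normalisation <- function(df){
  df_rank <- apply(df, 2, rank, ties.method = "average")
  df_sorted <- data.frame(apply(df, 2, sort))
  df_mean <- apply(df_sorted, 1, mean)  # mean of each rank across columns
  index_to_mean <- function(my_index, my_mean) my_mean[my_index]
  df_final <- apply(df_rank, 2, index_to_mean, my_mean = df_mean)
  rownames(df_final) <- rownames(df)
  df_final
}

set.seed(42)
toy <- data.frame(matrix(rnorm(30), nrow = 10, ncol = 3))
qn <- quantile_normalisation(toy)

# After normalisation every column contains the same set of values,
# so their sorted versions are identical
apply(qn, 2, sort)
```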