Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupport News AboutSign UpSign In
| Download

R

Views: 4031
Kernel: R (R-Project)

Week 4 Practical notebook - Solutions

This notebook contains guidance on how

  • to load packages from Bioconductor and access the vignettes

  • how to load data from preprocessed data objects within bioconductor.

  • Analysis of gene expression data with different methods and normalisation techniques

  • Basic use of limma for Differential Expression Anaysis

These are just suggested solutions, you can find different ones and still correct, so please refer to them as a guidance and as suggestions.

options(jupyter.plot_mimetypes ='image/png')

Loading Packages

In Bioconductor thare are many packages that are normally installed on local R versions with the function biocLite(). On Sage Math Cloud this function has not been fully tested for use with our accounts, therefore we have many of them preinstalled. They can be activated using library() as you would normally do with any other R package.

Bioconductor packages may depend on other packages, which means that they can only work if the others are activated. As default behaviour Bioconductor automatically activates all dependencies, but it might happen that you need to do it manually with the command library(). To start our practical we need to load the basic Bioconductor package affy.

Exercise 1: Familiarise with the Bioconductor web portal. Load affy and check that the you have mas5() and rma() using the help function. In a markdown cell explain what are the objects that they require and what is the output ( Tip look out for Affybatch, ExpressionSet...)

library(affy)
Loading required package: BiocGenerics Loading required package: parallel Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:parallel’: clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB The following objects are masked from ‘package:stats’: IQR, mad, xtabs The following objects are masked from ‘package:base’: Filter, Find, Map, Position, Reduce, anyDuplicated, append, as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.

mas5() is the MAS 5.0 Expression measure function, which converts an Affybatch onto an Expressionset using Affymetrix's MAS 5.0 gene expression estimation method. This method is a single point statistic method.

syntax: mas5(object, normalize = TRUE, sc=500, analysis="absolute")

where:

  • object = an instance of AffyBatch

  • normalize = TRUE/FALSE - if TRUE scale normalization is used after the ExpressionSet is obtained.

  • sc = Scaling value for the arrays

  • analysis = absolute or comparison analsis

help(mas5) help(rma)

rma() is the function that implements the Robust Multiarray Average method. It converts an AffyBatch object into an ExpressionSet data using the RMA expression algorithm.

syntax: rma(object, subset=NULL, verbose=TRUE, destructive=TRUE, normalize=TRUE, background=TRUE, bgversion=2, ...)

where:

  • object an AffyBatch object.

  • subset a character vector with the the names of the probesets to be used in expression calculation.

  • verbose logical value. If TRUE, it writes out some messages indicating progress. If FALSE nothing should be printed.

  • destructive logical value. If TRUE, works on the PM matrix in place as much as possible, good for large datasets.

  • normalize logical value. If TRUE, normalize data using quantile normalization.

  • background logical value. If TRUE, background correct using RMA background correction.

  • bgversion integer value indicating which RMA background to use 1: use background similar to pure R rma background given in affy version 1.0 - 1.0.2 2: use background similar to pure R rma background given in affy version 1.1 and above

Processing array data with Bioconductor

There is a vast amount of data in Bioconductor, raw or preprocessed, that we can be used for some of the exercises below. this notebook is the backbone for the final project notebook, in which we will be processing data from the raw data files called .CEL files. This can take some processing time for this reason here we will use preprocessed data.

To load preprocessed data we can use:

  • the command data() if the objects are already presnt in the package;

  • the command library() with the name of data package we want to use;

  • the command load("fileName.RDA") to upload preprocessed data from a previously saved workspace.

Exercise 2: Set your working directory to data_w4, or define a path variable that contains the path to that data folder. Ensure that you know where you are using getwd() and setwd() commands.

getwd()
[1] "/projects/ddda6a8e-2bca-47f5-b1d6-79b2c48d0e30/Autumn2016/solutions/week4"
setwd("~/Autumn2016/Week4/data_wk4") # Setting the working directory to be data_wk4
getwd() # Confirming that the working directory is data_wk4
[1] "/projects/ddda6a8e-2bca-47f5-b1d6-79b2c48d0e30/Autumn2016/Week4/data_wk4"

Gene Expression Analysis with Bioconductor Workflow

  • Step 1: build the data set of your raw values into an object called AffyBatch

  • Step 2: process the affybatch with selected methods

  • Step 3: Normalise the data if needed (some methods have this step included)

  • Step 4: Visualise data

  • Step 5: High level analysis : DE analysis, clustering, PCA ..

  • Step 6: Selection of final targets

This workbook covers Step 1-4

In data_wk4 there is a file called Dilution.RDA. It conatins an affybatch of a benchmark data set that was used to test methods for affymetrix gene epression estimation. The data consists of two sources of mRNA, human liver tissue and a central nervous system cell line (CNS), were hybridized to human geneChip arrays (HG-U95A) in a range of different dilutions and proportions.

Exercise 3: Load the data from Dilution.RDA and check if it is loaded with the command ls(). The dilution data is already in the form of an affybatch. Use the command show(Dilution) to check the features. Use a markdown cell to describe it.

load("Dilution.RDA") # Loading the Dilution.RDA file into the workspace ls() # listing the data that is loaded to confirm that Dilution.RDA file has loaded show(Dilution) # Checking the features of the AffyBatch object, returns the size of arrays (640x640) etc.
[1] "Dilution" "mapCdfName"
Warning message: “replacing previous import ‘stats::mad’ by ‘BiocGenerics::mad’ when loading ‘IRanges’”Warning message: “replacing previous import ‘stats::IQR’ by ‘BiocGenerics::IQR’ when loading ‘IRanges’”Warning message: “multiple methods tables found for ‘active’”Warning message: “multiple methods tables found for ‘active<-’”Warning message: “multiple methods tables found for ‘evalSeparately’”Warning message: “multiple methods tables found for ‘subsetByFilter’”Warning message: “multiple methods tables found for ‘params’”Warning message: “multiple methods tables found for ‘filterRules’”Warning message: “multiple methods tables found for ‘mad’”Warning message: “multiple methods tables found for ‘IQR’”Warning message: “multiple methods tables found for ‘unlist’”Warning message: “multiple methods tables found for ‘rep.int’”Warning message: “replacing previous import ‘BiocGenerics::mad’ by ‘IRanges::mad’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘S4Vectors::filterRules’ by ‘IRanges::filterRules’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘S4Vectors::.__C__FilterRules’ by ‘IRanges::.__C__FilterRules’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘S4Vectors::params’ by ‘IRanges::params’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘S4Vectors::evalSeparately’ by ‘IRanges::evalSeparately’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘S4Vectors::FilterRules’ by ‘IRanges::FilterRules’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘S4Vectors::active<-’ by ‘IRanges::active<-’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘S4Vectors::.__C__expressionORfunction’ by ‘IRanges::.__C__expressionORfunction’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘S4Vectors::.__C__FilterMatrix’ by ‘IRanges::.__C__FilterMatrix’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘BiocGenerics::IQR’ by ‘IRanges::IQR’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘S4Vectors::FilterMatrix’ by ‘IRanges::FilterMatrix’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘S4Vectors::subsetByFilter’ by ‘IRanges::subsetByFilter’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘S4Vectors::active’ by ‘IRanges::active’ when loading ‘AnnotationDbi’”Warning message: “replacing previous import ‘AnnotationDbi::tail’ by ‘utils::tail’ when loading ‘hgu95av2cdf’”Warning message: “replacing previous import ‘AnnotationDbi::head’ by ‘utils::head’ when loading ‘hgu95av2cdf’”
AffyBatch object size of arrays=640x640 features (35221 kb) cdf=HG_U95Av2 (12625 affyids) number of samples=4 number of genes=12625 annotation=hgu95av2 notes=

The AffyBatch object contains 640x640 arrays 35221kb in size

The number of samples is 4

Number of genes is 12,625

Gene expression estimation with RMA and MAS5

When the data is processed using one of the methods we discussed (RMA, puma, MAS5, etc.), Bioconductor creates a new object called ExpressionSet. This object contains the expression values but also the phenotypic data, descriptions, annotations, sample names etc...

Exercise 4: Activate the package affyand create two different types of ExpressionSet, using rma() and mas5(). For example, you can do so with the assignment eset_rma<-rma(Dilution). Use the command show() to describe what you have obtained.

To extract gene expression values from the eset, you need the command exprs() that must be used with an ExpressionSet. For example: e_rma<-exprs(eset_rma).

library(affy) # loading the package affy into the workspace eset_rma<-rma(Dilution) # Assigning the expression set from the rma function to a variable
Background correcting Normalizing Calculating Expression
eset_mas5<-mas5(Dilution) # Assigning the expression set from the mas5 function to a variable
background correction: mas PM/MM correction : mas expression values: mas background correcting...done. 12625 ids to be processed | | |####################|
eset_rma # checking the contents of the variable containing the expression set calculated using the rma function. This returns the same as show(eset_rma) command.
ExpressionSet (storageMode: lockedEnvironment) assayData: 12625 features, 4 samples element names: exprs protocolData: none phenoData sampleNames: 20A 20B 10A 10B varLabels: liver sn19 scanner varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: hgu95av2
eset_mas5 # checking the contents of the variable containing the expression set calculated using the mas function. This returns the same as show(eset_mas) command.
ExpressionSet (storageMode: lockedEnvironment) assayData: 12625 features, 4 samples element names: exprs, se.exprs protocolData: none phenoData sampleNames: 20A 20B 10A 10B varLabels: liver sn19 scanner varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: hgu95av2
e_rma<-exprs(eset_rma) e_mas5<-exprs(eset_mas5)
identical(eset_mas5,eset_rma) # showing that the two functions have returned different expression sets, this is because the functions are using two different methods for analysing the same data
[1] FALSE
identical(e_rma,e_mas5) # showing that the two functions have returned different gene expression values
[1] FALSE

Exercise 5: Extract the expression values for mas5 and rma and do a first diagnostic on the data using density() and boxplot(). Use a markdown cell to explain what you have found. (Tip: Do you need to transform the mas5 data? Remember log2() is the transformation we use). Visualise any manipulation that you do to the data.

density(e_rma) # performing diagnostic on the gene expression data obtained using rma density(e_mas5) # performing diagnostic on the gene expression data obtained using mas5
Call: density.default(x = e_rma) Data: e_rma (50500 obs.); Bandwidth 'bw' = 0.2 x y Min. : 1.620 Min. :1.210e-06 1st Qu.: 4.737 1st Qu.:7.066e-03 Median : 7.855 Median :6.411e-02 Mean : 7.855 Mean :8.012e-02 3rd Qu.:10.972 3rd Qu.:1.483e-01 Max. :14.089 Max. :2.005e-01
Call: density.default(x = e_mas5) Data: e_mas5 (50500 obs.); Bandwidth 'bw' = 40.5 x y Min. : -121.1 Min. :0.000e+00 1st Qu.:10747.9 1st Qu.:8.090e-08 Median :21616.9 Median :4.438e-07 Mean :21616.9 Mean :2.353e-05 3rd Qu.:32486.0 3rd Qu.:1.346e-06 Max. :43355.0 Max. :2.692e-03
boxplot(e_mas5)
MIME type unknown not supported

Boxplot before the data is transformed - this box plot is difficult to red since the outliers have very large values.The transformation of the data helps to change the scale and make plots readable.

e_mas5<-log2(e_mas5) # Transforming the e_mas5
density(e_mas5)
Call: density.default(x = e_mas5) Data: e_mas5 (50500 obs.); Bandwidth 'bw' = 0.2614 x y Min. :-2.205 Min. :3.500e-07 1st Qu.: 2.393 1st Qu.:4.883e-03 Median : 6.990 Median :3.176e-02 Mean : 6.990 Mean :5.433e-02 3rd Qu.:11.587 3rd Qu.:1.061e-01 Max. :16.184 Max. :1.531e-01
boxplot(e_mas5, ylab="Gene Expression levels", xlab="Sample Names")
MIME type unknown not supported

boxplot for gene expression values calculated using the mas5 function: Medians are alligned and the datat looks normalised.

boxplot(e_rma, ylab="Gene Expression levels", xlab="Sample Names")
MIME type unknown not supported

Boxplot for gene expression values calculated via the rma() function: Bulk of gene expression values lies a lot lower on the y axis - Medians are alligned and there are no negative outliers.

** Gene Expression estimation with probabilistic models** You can evaluate the gene expression within puma package using mmgmos().

Exercise 6: Activate the puma package and create an eset from the Dilution data. For example using: e_puma<-exprs(eset_puma), it will take a little longer than the others so be patient. Apply the same diagnostics of Exercise 5. Explain in a markdown cell what you have found

library(puma) # Activating the puma package
Error in library(puma): there is no package called ‘puma’ Traceback: 1. library(puma) 2. stop(txt, domain = NA)
eset_puma<-mmgmos(Dilution) # evaluating the gene expression of Dilution data within puma e_puma<-exprs(eset_puma) # creating an expression set from the dilution data e_puma
plot(density(e_puma), main="Dilution data from puma", xlab="bins", ylab="Density") # plotting the density of the gene expression

The density plots helps o identify any irregularity in thr data, especially if there are more modes, which in this particular case we don't expect to be there. Data looks fine and it is already log2 transformed.

Exercise 7: Are there negative values in the experssions? Why is that? Set all the negative expressions to zero. First extract the expression from the eset and check the dim() of your data frame. Compare the three methods and explain what you found. Is the data Normalised?

Tip: You can use a for loop on the columns and set to zero the negative elements one column at the time. to do so there are many way one can be

temp<-test[,i] temp[temp < 0] <- 0 test[,i]<-temp

Be inventive!

The negatives values are due to the fact that there are a set of gene expression estimates that have values between 0 and 1. This means that when log2 tranformed they become negative. The genes that have these gene expression values are not expressed in the sample.

for (i in 1:4) { temp<-e_puma[,i] temp[temp<0] <- 0 e_puma[,i] <- temp }

Estrogen Data

We now use data loaded in pumadata to look at the effect of a different type of experimental design called time series. In this data we want to see the effect of presence or absence of estrogene at 10hr or 48hr in a cell culture.

Activate the pumadata package and use data(affybatch.estrogen) to upload the data into the workspace. We can add phenotypic data to describe the affybatch as follows. You can check the results using the command pData()

library(pumadata) # activating the pumadata package data(affybatch.estrogen) # uploading the data into the workspace
pData(affybatch.estrogen) <- data.frame( "estrogen"=c("absent","absent","present","present" #adding phenotypic data ,"absent","absent","present","present") , "time.h"=c("10","10","10","10","48","48","48","48") , row.names=rownames(pData(affybatch.estrogen))) pData(affybatch.estrogen) # checking the results of above

You can estimate gene expression levels with

eset_estrogen_mmgmos <- mmgmos(affybatch.estrogen, gsnorm="none") eset_estrogen_rma <- rma(affybatch.estrogen)

but if you have lack of time in the practical you can load the data already processed

data(eset_estrogen_mmgmos) data(eset_estrogen_rma)

Exercise 8: Use boxplot to visualise the two esets. What do you find? (Tip: If you need to normalise the data use eset_estrogen_mmgmos_normd <- pumaNormalize(eset_estrogen_mmgmos)) Why is it important to normalise? Answer: allows the data to be corrected and remove systematic bias

data(eset_estrogen_mmgmos) #loading the expression set for puma into the workspace data(eset_estrogen_rma) #loading the expression set for rma into the workspace eset_estrogen_mmgmos_normd <- pumaNormalize(eset_estrogen_mmgmos)
eset_rma<-exprs(eset_estrogen_rma) eset_mmgmos<-exprs(eset_estrogen_mmgmos)
names(pData(affybatch.estrogen))
boxplot(eset_rma, main="Estrogen data from RMA", xlab="Samples", ylab="expression")
par(mfrow=c(1,2)) boxplot(eset_mmgmos,main="Estrogen data from puma", xlab="Samples", ylab="expression") boxplot(exprs(eset_estrogen_mmgmos_normd),main="Normalised data from puma", xlab="Samples", ylab="expression")

Write results

To write results from puma eset we use write.reslts(eset_estrogen_mmgmos, file="eset_estrogen")

To write results from RMA we use write.csv(exprs(eset_estrogen_rma),"rma_data.csv")

We will not need to write the full table of results for the final project, this is just for your information.

DE Analysis

To perform Differential expression we need to summarise the replicated data, so that we have one values set for each condition. In pumawe can do this using

eset_estrogen_comb <- pumaComb(eset_estrogen_mmgmos_normd)

this might take a long time (hrs) so for this practical we use preprocessed data

data(eset_estrogen_comb) show(eset_estrogen_comb)

Exercise 9: Check the dimension and the pData() for the eset of the combined values. Calculate the FC and plot the data with a MA plot using the command ma.plot()

pData(eset_estrogen_comb)
dim(eset_estrogen_comb)

Before calculating the FC, we set the negative gene expression values to zero to avoid problems.

estrog_comb_puma<-exprs(eset_estrogen_comb) for (i in 1:4) { temp<-estrog_comb_puma[,i] temp[temp<0] <- 0 estrog_comb_puma[,i] <- temp }
fc<- estrog_comb_puma[,1:2] - estrog_comb_puma[,3:4] # calculating the fold change colnames(fc)<-c("A-P 10H", "A-P 48H")

** Exercise 10**: Do the same for eset_estrogen_RMA, using the mean to combine the values.

eset_comb_rma<-matrix(0,nrow=dim(eset_rma)[1], ncol=dim(eset_rma)[2]) eset_comb_rma[,1]<-mean(eset_rma[,1:2]) eset_comb_rma[,2]<-mean(eset_rma[,3:4]) eset_comb_rma[,3]<-mean(eset_rma[,5:6]) eset_comb_rma[,4]<-mean(eset_rma[,7:8]) rownames(eset_comb_rma)<-rownames(eset_estrogen_rma) # calculating the FC fc_rma<- eset_comb_rma[,1:2] - eset_comb_rma[,3:4] # calculating the fold change colnames(fc_rma)<-c("A-P 10H", "A-P 48H")

Visualise the FC with MA plot

groups<-c("A10","P10","A48","P48") estrogen_table<-data.frame(sampleNames(eset_estrogen_comb), groups) group10hr<-factor(groups[1:2]) group48hr<-factor(groups[3:4]) group10hr
estrogen_table
par(mfrow=c(2,2)) MAplot(eset_estrogen_comb) # MA plot of the full eset

First Steps with limma

To start using limma activate the package as normal using library(limma). The limma user guide is very useful, that you can access in the additional reading.

Three steps are required for limma:

  • Step 1: build the design contrast matrix

  • Step 2: fit the linear model

  • Step 3: calculate the p-values and FDRs with a empirical Bayes test

To simplify Step 1 we can make use of two functions that are in puma package.

design_rma<-createDesignMatrix(eset_estrogen_rma) # create design matrix contrast_rma<-createContrastMatrix(eset_estrogen_rma) # create constrast
library(limma) # loading the limma package
design_rma<-createDesignMatrix(eset_estrogen_rma) # creating the design matrix contrast_rma<-createContrastMatrix(eset_estrogen_rma) # creating contrast matrix design_rma contrast_rma

The contrast matrix describes all the possible comaprisons, that you can have in the data, where is 0 the sample is ignored. Where there is 1 that sample gooes to the numerator of the FC ratio. Clearly the sample with -1 goes to the denominator. So for example the first column the comparison will be present.10/absent.10 or in log2 scale present.10 - absent.10.

Step 2: fit the linear model to the design that we have created.

fit_rma<-lmFit(eset_estrogen_rma,design_rma)

Step 3: calculate the p-values from the linear model we fitted.

fit_rma2 <- contrasts.fit(fit_rma, contrast_rma) fit_rma2 <- eBayes(fit_rma2)
contrasts<-colnames(contrast_rma) # check which is the comparison of interest by storing them in an array contrasts colnames(contrast_rma)<-c("P10vsA10","A48vsA10","P48vsP10","P48vsA48","E_AvsE_P","10hva48h","E_vsTime")

You can rename the columns of the constrat matrix to have shorter names.

** Identify DE genes**

You can access the field of that objects using $. For example if you want the p.values you can use fit_rma2$p.value. You can plot histgrams of them for the constrast of interest. For example hist(fit_rma2$p.value[,1]) for constrast 1 (present.10_vs_absent.10). Finally you can access the most different genes in that constrast using topTable()

pval<-fit_rma2$p.value
DEgenes<-topTable(fit_rma2,coef=1, adjust.method="BH") # identify genes that are DE # Builds a vennDiagram to see which genes constrasts have in common results_rma_1<-decideTests(fit_rma2, method="global") colnames(results_rma_1)<-colnames(contrast_rma) vennDiagram(results_rma_1[,c(1,4)]) # vennDiagram for constrast 1 and 4

Exercise 11: Repeat the limma analysis for eset_estrogen_mmgmos. How do the results compare? mmgmos has a lot more DE expressed genes detected (more sensitive?) and a lot more overlapping as well.

design_mmgmos<-createDesignMatrix(eset_estrogen_mmgmos) # creating the design matrix contrast_mmgmos<-createContrastMatrix(eset_estrogen_mmgmos) # creating contrast matrix
fit_mmgmos<-lmFit(eset_estrogen_mmgmos,design_mmgmos)
fit_mmgmos2 <- contrasts.fit(fit_mmgmos, contrast_mmgmos) fit_mmgmos2 <- eBayes(fit_mmgmos2)
DEgenes<-topTable(fit_mmgmos2,coef=1, adjust.method="BH") # identify genes that are DE # Builds a vennDiagram to see which genes constrasts have in common results_mmgmos_1<-decideTests(fit_mmgmos2, method="global") colnames(results_mmgmos_1)<-colnames(contrast_rma) vennDiagram(results_mmgmos_1[,c(1,4)]) # vennDiagram for constrast 1 and 4

You can see from the Venn diagram that puma is able to identify many more DE genes and therefore it is right to say that this method is more sensitive than RMA.