Week 4 Practical notebook - Solutions
This notebook contains guidance on how
to load packages from Bioconductor and access the vignettes
how to load data from preprocessed data objects within bioconductor.
Analysis of gene expression data with different methods and normalisation techniques
Basic use of limma for Differential Expression Anaysis
These are just suggested solutions, you can find different ones and still correct, so please refer to them as a guidance and as suggestions.
Loading Packages
In Bioconductor thare are many packages that are normally installed on local R versions with the function biocLite()
. On Sage Math Cloud this function has not been fully tested for use with our accounts, therefore we have many of them preinstalled. They can be activated using library()
as you would normally do with any other R package.
Bioconductor packages may depend on other packages, which means that they can only work if the others are activated. As default behaviour Bioconductor automatically activates all dependencies, but it might happen that you need to do it manually with the command library()
. To start our practical we need to load the basic Bioconductor package affy
.
Exercise 1: Familiarise with the Bioconductor web portal. Load affy
and check that the you have mas5()
and rma()
using the help function. In a markdown cell explain what are the objects that they require and what is the output ( Tip look out for Affybatch
, ExpressionSet
...)
mas5()
is the MAS 5.0 Expression measure function, which converts an Affybatch onto an Expressionset using Affymetrix's MAS 5.0 gene expression estimation method. This method is a single point statistic method.
syntax: mas5(object, normalize = TRUE, sc=500, analysis="absolute")
where:
object = an instance of AffyBatch
normalize = TRUE/FALSE - if TRUE scale normalization is used after the ExpressionSet is obtained.
sc = Scaling value for the arrays
analysis = absolute or comparison analsis
rma()
is the function that implements the Robust Multiarray Average method. It converts an AffyBatch object into an ExpressionSet data using the RMA expression algorithm.
syntax: rma(object, subset=NULL, verbose=TRUE, destructive=TRUE, normalize=TRUE, background=TRUE, bgversion=2, ...)
where:
object an AffyBatch object.
subset a character vector with the the names of the probesets to be used in expression calculation.
verbose logical value. If TRUE, it writes out some messages indicating progress. If FALSE nothing should be printed.
destructive logical value. If TRUE, works on the PM matrix in place as much as possible, good for large datasets.
normalize logical value. If TRUE, normalize data using quantile normalization.
background logical value. If TRUE, background correct using RMA background correction.
bgversion integer value indicating which RMA background to use 1: use background similar to pure R rma background given in affy version 1.0 - 1.0.2 2: use background similar to pure R rma background given in affy version 1.1 and above
Processing array data with Bioconductor
There is a vast amount of data in Bioconductor, raw or preprocessed, that we can be used for some of the exercises below. this notebook is the backbone for the final project notebook, in which we will be processing data from the raw data files called .CEL files. This can take some processing time for this reason here we will use preprocessed data.
To load preprocessed data we can use:
the command
data()
if the objects are already presnt in the package;the command
library()
with the name of data package we want to use;the command
load("fileName.RDA")
to upload preprocessed data from a previously saved workspace.
Exercise 2: Set your working directory to data_w4
, or define a path variable that contains the path to that data folder. Ensure that you know where you are using getwd()
and setwd()
commands.
Gene Expression Analysis with Bioconductor Workflow
Step 1: build the data set of your raw values into an object called
AffyBatch
Step 2: process the affybatch with selected methods
Step 3: Normalise the data if needed (some methods have this step included)
Step 4: Visualise data
Step 5: High level analysis : DE analysis, clustering, PCA ..
Step 6: Selection of final targets
This workbook covers Step 1-4
In data_wk4 there is a file called Dilution.RDA
. It conatins an affybatch of a benchmark data set that was used to test methods for affymetrix gene epression estimation. The data consists of two sources of mRNA, human liver tissue and a central nervous system cell line (CNS), were hybridized to human geneChip arrays (HG-U95A) in a range of different dilutions and proportions.
Exercise 3: Load the data from Dilution.RDA
and check if it is loaded with the command ls(). The dilution data is already in the form of an affybatch. Use the command show(Dilution)
to check the features. Use a markdown cell to describe it.
The AffyBatch object contains 640x640 arrays 35221kb in size
The number of samples is 4
Number of genes is 12,625
Gene expression estimation with RMA and MAS5
When the data is processed using one of the methods we discussed (RMA, puma, MAS5, etc.), Bioconductor creates a new object called ExpressionSet. This object contains the expression values but also the phenotypic data, descriptions, annotations, sample names etc...
Exercise 4: Activate the package affy
and create two different types of ExpressionSet, using rma()
and mas5()
. For example, you can do so with the assignment eset_rma<-rma(Dilution)
. Use the command show()
to describe what you have obtained.
To extract gene expression values from the eset, you need the command exprs()
that must be used with an ExpressionSet. For example: e_rma<-exprs(eset_rma)
.
Exercise 5: Extract the expression values for mas5
and rma
and do a first diagnostic on the data using density()
and boxplot()
. Use a markdown cell to explain what you have found. (Tip: Do you need to transform the mas5 data? Remember log2()
is the transformation we use). Visualise any manipulation that you do to the data.
Boxplot before the data is transformed - this box plot is difficult to red since the outliers have very large values.The transformation of the data helps to change the scale and make plots readable.
boxplot for gene expression values calculated using the mas5 function: Medians are alligned and the datat looks normalised.
Boxplot for gene expression values calculated via the rma()
function: Bulk of gene expression values lies a lot lower on the y axis - Medians are alligned and there are no negative outliers.
** Gene Expression estimation with probabilistic models** You can evaluate the gene expression within puma
package using mmgmos()
.
Exercise 6: Activate the puma
package and create an eset from the Dilution data. For example using: e_puma<-exprs(eset_puma)
, it will take a little longer than the others so be patient. Apply the same diagnostics of Exercise 5. Explain in a markdown cell what you have found
Error in library(puma): there is no package called ‘puma’
Traceback:
1. library(puma)
2. stop(txt, domain = NA)
The density plots helps o identify any irregularity in thr data, especially if there are more modes, which in this particular case we don't expect to be there. Data looks fine and it is already log2 transformed.
Exercise 7: Are there negative values in the experssions? Why is that? Set all the negative expressions to zero. First extract the expression from the eset and check the dim()
of your data frame. Compare the three methods and explain what you found. Is the data Normalised?
Tip: You can use a for
loop on the columns and set to zero the negative elements one column at the time. to do so there are many way one can be
Be inventive!
The negatives values are due to the fact that there are a set of gene expression estimates that have values between 0 and 1. This means that when log2 tranformed they become negative. The genes that have these gene expression values are not expressed in the sample.
Estrogen Data
We now use data loaded in pumadata
to look at the effect of a different type of experimental design called time series. In this data we want to see the effect of presence or absence of estrogene at 10hr or 48hr in a cell culture.
Activate the pumadata
package and use data(affybatch.estrogen)
to upload the data into the workspace. We can add phenotypic data to describe the affybatch as follows. You can check the results using the command pData()
You can estimate gene expression levels with
but if you have lack of time in the practical you can load the data already processed
Exercise 8: Use boxplot to visualise the two esets. What do you find? (Tip: If you need to normalise the data use eset_estrogen_mmgmos_normd <- pumaNormalize(eset_estrogen_mmgmos)
) Why is it important to normalise? Answer: allows the data to be corrected and remove systematic bias
Write results
To write results from puma eset we use write.reslts(eset_estrogen_mmgmos, file="eset_estrogen")
To write results from RMA we use write.csv(exprs(eset_estrogen_rma),"rma_data.csv")
We will not need to write the full table of results for the final project, this is just for your information.
DE Analysis
To perform Differential expression we need to summarise the replicated data, so that we have one values set for each condition. In puma
we can do this using
this might take a long time (hrs) so for this practical we use preprocessed data
Exercise 9: Check the dimension and the pData()
for the eset of the combined values. Calculate the FC and plot the data with a MA plot using the command ma.plot()
Before calculating the FC, we set the negative gene expression values to zero to avoid problems.
** Exercise 10**: Do the same for eset_estrogen_RMA, using the mean to combine the values.
Visualise the FC with MA plot
First Steps with limma
To start using limma activate the package as normal using library(limma)
. The limma user guide is very useful, that you can access in the additional reading.
Three steps are required for limma
:
Step 1: build the design contrast matrix
Step 2: fit the linear model
Step 3: calculate the p-values and FDRs with a empirical Bayes test
To simplify Step 1 we can make use of two functions that are in puma
package.
The contrast matrix describes all the possible comaprisons, that you can have in the data, where is 0 the sample is ignored. Where there is 1 that sample gooes to the numerator of the FC ratio. Clearly the sample with -1 goes to the denominator. So for example the first column the comparison will be present.10/absent.10
or in log2 scale present.10 - absent.10
.
Step 2: fit the linear model to the design that we have created.
Step 3: calculate the p-values from the linear model we fitted.
You can rename the columns of the constrat matrix to have shorter names.
** Identify DE genes**
You can access the field of that objects using $
. For example if you want the p.values you can use fit_rma2$p.value
. You can plot histgrams of them for the constrast of interest. For example hist(fit_rma2$p.value[,1])
for constrast 1 (present.10_vs_absent.10). Finally you can access the most different genes in that constrast using topTable()
Exercise 11: Repeat the limma analysis for eset_estrogen_mmgmos. How do the results compare? mmgmos has a lot more DE expressed genes detected (more sensitive?) and a lot more overlapping as well.
You can see from the Venn diagram that puma is able to identify many more DE genes and therefore it is right to say that this method is more sensitive than RMA.