Project: Jingyi Xie - Autumn2016/BMS353

Path: Autumn2016 / Week4 / Week_4_practical.ipynb

Views: ⁴⁰³¹

Kernel: R (R-Project)

Week 4 Practical

This Notebook contains practical assignments for Week 4.

It contains guidance on how

to load packages from Bioconductor and access the vignettes
how to load data from preprocessed data objects within bioconductor.
Analysis of gene expression data with different methods and normalisation techniques
Basic use of limma for Differential Expression Anaysis

As for the other practicals, this notebook also contains practical tasks that you will have to implement yourself. We will use data that is stored in the preprocessed Bioconductor packages and if we will use different type of data, it will be stored in your SageMathCloud folder. Any other data can be used as long as it is specifically mentioned in your notebook and access to it is possible via SageMathCloud.

You are free to base your work on the examples given here but you are also welcome to use different methods if you prefer, adding and/or creating new ways of displaying the data. It is always a good idea to familiarise with the Bioconductor website, to get your way around help and searches: https://www.bioconductor.org/

** Please be aware ** that Bioconductor has many different packages which change/update very frequently, this might couse problem with the install of them onto SageMathCloud. If you need help on packeges installation please contact Dr Marta Milo and/or Dr Mike Croucher.

To complete this practical, you will need to create a new notebook in the Week 4 folder of your SageMathCloud account that you will call your username_week4.ipynb.

The notebooks will be self-marked following a set of guidelines that you will receive with a notebook that containes the solutions to the exercises. THIS is FORMATIVE feedback that you can use to improve your coding skills.

The last version of your notebook saved by the deadlines indicated on the module website will be the one that will be considered for self-marking. It will be moved in your assigment folder where you will find the guidelines and the solved notebook.

All the notebooks are meant to be used interactively. All the code needs to be written into code cells -- that can be executed by an R kernel. The outputs are not present in this notebook, but the code cells are executable.

We will be plotting large amounts of data. To avoid problems with file size, you should run the following line before any other which changes plot outputs to small .png files

In [ ]:

options(jupyter.plot_mimetypes ='image/png')

Loading Packages

In Bioconductor thare are many packages that are normally installed on local R versions with the function biocLite(). On Sage Math Cloud this function has not been fully tested for using within our accounts, therefore we have many of them preinstalled. They can be activated using library() as you would normally do with any other R package.

Bioconductor packages may depend on other packages, which means that they can only work if the others are activated. As default behaviour Bioconductor automatically activates all dependencies, but it might happen that you need to do it manually with the command library(). To start our practical we need to load the basic Bioconductor package affy.

Exercise 1: Familiarise with the Bioconductor web portal. Load affy and check that the you have mas5() and rma() using the help function. In a markdown cell explain what are the objects that they require and what is the output ( Tip: look out for Affybatch, ExpressionSet...)

Processing array data with Bioconductor

There is a substantial amount of preprocessed data in Bioconductor that we can use for some of the exercises below. In the final project we will be processing data from the raw files called .CEL files, which might take some time to process.

To load preprocessed data we can use:

the command data() if the objects are already presnt in the package;
the command library() with the name of data package we want to use;
the command load("fileName.RDA") to upload preprocessed data from a previously saved workspace.

In this practical we will use preprocessed data and raw .CEL files.

Exercise 2: Set your working directory to data_w4, or define a path variable that contains the path to that data folder. Ensure that you know where you are using getwd() and setwd() commands.

Bioconductor Workflow

Step 1: build the data set of your raw values into an object called AffyBatch
Step 2: process the affybatch with selected methods
Step 3: Normalise the data if needed (some methods have this step included)
Step 4: Visualise data
Step 5: High level analysis : DE analysis, clustering etc..
Step 6: Selection of final targets

In this practical we will cover the first four steps and start working on DE analysis using limma. In week 5 we will complete the workflow and add some functional analysis on the selected targets.

In data_wk4 there is a file called Dilution.RDA. It conatins an affybatch of a benchmark data set that was used to test methods for affymetrix gene epression estimation. The data consists of two sources of mRNA, human liver tissue and a central nervous system cell line (CNS), were hybridized to human geneChip arrays (HG-U95A) in a range of different dilutions and proportions.

Be aware that within Bioconductor it is possible to create conflict when same packages use same functions with different names. In our specific case puma creates a conflict with affy and in particular on the function rma(). SO PLEASE if you want to use rma() do so before activating puma or pumadata pacakages.

Exercise 3: Load the data from Dilution.RDA and check if it is loaded with the command ls(). The dilution data is already in the form of an affybatch. Use the command show(Dilution) to check the features. Use a markdown cell to describe it.

** Gene expression estimation with RMA and MAS5**

When the data is processed using one of the methods we discussed (RMA, puma, MAS5, etc.), Bioconductor creates a new object called ExpressionSet. This object contains the expression values but also the phenotypic data, descriptions, annotations, sample names etc...

Exercise 4: Activate the package affy and create two different types of ExpressionSet, using rma() and mas5(). For example, you can do so with the assignment eset_rma<-rma(Dilution). Use the command show() to describe what you have obtained.

To extract gene expression values from the eset, you need the command exprs() that must be used with an ExpressionSet. For example: e_rma<-exprs(eset_rma).

Exercise 5: Extract the expression values for mas5 and rma and do a first diagnostic on the data using density() and boxplot(). Use a markdown cell to explain what you have found. (Tip: Do you need to transform the mas5 data? Remember log2() is the transformation we use). Visualise any manipulation that you do to the data.

** Gene Expression estimation with probabilistic models** You can evaluate the gene expression within puma package using mmgmos().

Exercise 6: Activate the puma package and create an eset from the Dilution data. For example using: e_puma<-exprs(eset_puma), it will take a little longer than the others so be patient. Apply the same diagnostics of Exercise 5. Explain in a markdown cell what you have found

Exercise 7: Are there negative values in the experssions? Why is that? Set all the negative expressions to zero. First extract the expression from the eset and check the dim() of your data frame. Compare the three methods and explain what you found. Is the data Normalised?

Tip: You can use a for loop on the columns and set to zero the negative elements one column at the time. to do so there are many way one can be

temp<-test[,i]
temp[temp < 0] <- 0
test[,i]<-temp

Be inventive!

Estrogen Data

We now use data loaded in pumadata to look at the effect of a different type of experimental design called time series. In this data we want to see the effect of presence or absence of estrogene at 10hr or 48hr in a cell culture.

To upload your data in your workspace, you can activate the pumadata package and use data(affybatch.estrogen), or you can use load() to load the data from your data directory in teh Week4 folder. Once the data is uploaded in your workspace, we can add phenotypic data to describe the affybatch as follows. You can check the results using the command pData().

In [ ]:

pData(affybatch.estrogen) <- data.frame(
"estrogen"=c("absent","absent","present","present"
,"absent","absent","present","present")
, "time.h"=c("10","10","10","10","48","48","48","48")
, row.names=rownames(pData(affybatch.estrogen)))

You can estimate gene expression levels with

eset_estrogen_mmgmos <- mmgmos(affybatch.estrogen, gsnorm="none")
eset_estrogen_rma <- rma(affybatch.estrogen)

but if you have lack of time in the practical or you have activated pumadata package, you can load the data already processed from your data directory

load("eset_estrogen_mmgmos.RDA") 
load("eset_estrogen_rma.RDA")

Exercise 8: Use boxplot to visualise the two esets. What do you find? (Tip: If you need to normalise the data use eset_estrogen_mmgmos_normd <- pumaNormalize(eset_estrogen_mmgmos)) Why is it important to normalise?

Write results

To write results from puma eset we use write.reslts(eset_estrogen_mmgmos, file="eset_estrogen")

To write results from RMA we use write.csv(exprs(eset_estrogen_rma),"rma_data.csv")

We will not need to write the full table of results for the final project, this is just for your information.

DE Analysis

To perform Differential expression we need to summarise the replicated data, so that we have one values set for each condition. In puma we can do this using

eset_estrogen_comb <- pumaComb(eset_estrogen_mmgmos_normd)

this MIGHT take a long time (hrs) so for this practical DO NOT EXECUTE THIS COMMAND but ** use preprocessed the data** as above

Exercise 9: Check the dimension and the pData() for the eset of the combined values. Calculate the FC and plot the data with a MA plot using the command ma.plot()

** Exercise 10**: Do the same for eset_estrogen_RMA, using the mean to combine the values.

In [ ]:

?ma.plot()

First Steps with limma

To start using limma activate the package as normal using library(limma). The limma user guide is very useful, that you can access in the additional reading.

Three steps are required for limma:

Step 1: build the design contrast matrix
Step 2: fit the linear model
Step 3: calculate the p-values and FDRs with a empirical Bayes test

To simplify Step 1 we can make use of two functions that are in puma package.

design_rma<-createDesignMatrix(eset_estrogen_rma) # create design matrix
contrast_rma<-createContrastMatrix(eset_estrogen_rma) # create constrast

In [ ]:

library(limma)

In [ ]:

design_rma<-createDesignMatrix(eset_estrogen_rma)
contrast_rma<-createContrastMatrix(eset_estrogen_rma)

Step 2: fit the linear model to the design that we have created.

In [ ]:

fit_rma<-lmFit(eset_estrogen_rma,design_rma)

Step 3: calculate the p-values from the linear model we fitted.

In [ ]:

fit_rma2 <- contrasts.fit(fit_rma, contrast_rma)
fit_rma2 <- eBayes(fit_rma2)

In [ ]:

contrasts<-colnames(contrast_rma) # check which is the comparison of interest by storing them in an array

** Identify DE genes**

You can access the field of that objects using $. For example if you want the p.values you can use fit_rma2$p.value. You can plot histgrams of them for the constrast of interest. For example hist(fit_rma2$p.value[,1]) for constrast 1 (present.10_vs_absent.10). Finally you can access the most different genes in that constrast using topTable()

In [ ]:

DEgenes<-topTable(fit_rma2,coef=1, adjust.method="BH") # identify genes that are DE

# Builds a vennDiagram to see which genes constrasts have in common
results_rma_1<-decideTests(fit_rma2, method="global") 
results_rma_1
vennDiagram(results_rma_1[,c(1,4)]) # vennDiagram for constrast 1 and 4

Exercise 11: Repeat the limma analysis for eset_estrogen_mmgmos. How do the results compare? Have you found common genes?

In [ ]: