Jupyter notebook Autumn2015/Week12_cAt/mda13gsl_Week12.ipynb
BMS 353 Bioinformatics Project
Gi Shuen Kirsty Lee
This project investigates the role of a transcription factor, SP1, in colon cells. The data being analysed will be from cell cultures 48 hours after SP1 was silenced, and control cells where it was not silenced. There are two samples in each condition, and the gene expression profiles were quantified using Affymetrix GeneChip HGU133 PLUS 2.
The pipeline for the analysis of this project is as follows:
Load data and convert to affymetrix
Pre-process data with RMA to normalize data
Visualise data with MA plots and box plot
Analyse differential gene expression with Puma
Visualise data with MA plots and box plots
Extract the top differentially expressed genes
Annotate probeset with gene names and symbols
Principal component analysis
Hierarchical clustering with a heatmap
Functional pathway analysis with PANTHER
1. Loading data
This line of code was executed so as to make sure output plots are executed as PNG files, which are small in size - to keep the file size of this notebook small.
The affy package is then loaded, and the working directory set, in order to be able to load the data and convert it into an Affybatch Object for pre-processing.
Data is now loaded into an AffyBatch object, sp1.
Now that the data is loaded in an Affybatch Object, sp1, the data will be pre-processed using rma to normalize the data. The results will be extracted from an expression set into an object, eset_rma.
2. Pre-processing data
This pre-processed data will now be visualized with a boxplot, and MA plot to make sure that the data is normalized.
This MA plot allows us to visualize that the data has been normalized by RMA pre-processing, as a flat relationship can be observed between the ratio of signal between two arrays and signal level.
The comparison of the boxplots before and after RMA pre-processing further confirms that the data has been normalized - as it can be seen that the mean of the distributions are the same and are centered.
3. Differential expression analysis with Puma
The puma package will now be loaded so as to be able to perform differential expression analysis of the genes.
Phenodata for the dataframe was annotated so that the data can be combined with pumaCombImproved after expression levels of genes are estimated with the mmgmos package in puma.
The data was then processed with the mmgmos package in puma to estimate the expression levels of the genes.
This analysis by puma's mmgmos function was then visualized as an MA plot and a boxplot, seen below.
This MA plot allows us to see that the outliers of the data have been removed, and that the difference between the two conditions is symmetrical, and the spread is mostly centered around the mean.
From the boxplot, it can be observed that the data is already normalized and does not need any further normalisation. However it can be seen that there is variation outside of the upper and lower quartiles of the data. As the outliers have been removed, it can be seen that the means have decreased in value slightly, when comparing with the boxplot of the original data shown earlier.
The data was then combined with pumaCombImproved so as to be able to perform differential expression analysis. This combining was done using a bayesian Hierarchical model in puma. Due to the long processing time that this took, the code was executed and saved to a file, sp1_comb.RDA, which is seen to be loaded in the following cell.
The code that was used to combine the data is as follows:
Differential expression analysis was then carried out using the puma package, and results written into a .csv file. This allows us to compare relative gene expression between the control and silenced cell conditions, in order to see which genes are affected the most by the silencing of SP1.
The P-like value of the results from pumaDE was also calculated, and visualized as a histogram.
P-like values range between 0 and 1, where 0 has the highest probability of being differentially expressed, and 1 the lowest. From this histogram, it can be seen that a large majority of the genes were likely not differentially expressed, and there is a small set, of frequency roughly 100-200 that have a high probability of being differentially expressed.
The top differentially expressed genes were then extracted using >1 and <-1 for cut off values in fold change, and PPLR cut off values of >0.99 and <0.01 to compare different methods of extracting the top differentially expressed genes.
It can be seen that 261 genes were found to be the genes most likely to be affected by the the silencing of SP1.
The top differentially expressed genes are now to be extracted using the PPLR values, and merged into a dataset called topgenes.
It can be seen that the two ways of extracting the genes most affected by the silencing of SP1 differ in their stringency, and using the fold change is a much stricter method. To be on the conservative side, the genes extracted by their PPLR value will be used.
Topgenes, containing the probesets, was then annotated with the symbol, gene name, and probe ID - annotated_list, and the NAs present omitted. The final list of genes that display the most change between conditions is loaded under the name 'annotated_list_na'.
Principle Components Analysis was then carried out on the puma analysed data to visualise the variation between the conditions.
4. Principal Component Analysis
PCA allows for the data to be separated into unrelated dimensions, allowing for the data to be visualized without being affected by the other factors. From this PCA plot it can be seen that the two conditions where SP1 was silenced (red triangles) have a large variance in relation to the control conditions - silencing of SP1 affected gene transcription of a large number of genes.
** 5. Hierarchical Clustering**
Hierarchical clustering was then performed, with the output being a heatmap - seen below. Heatmaps allow for gene expression levels to be visualized across a number of samples or conditions, allowing for comparison across conditions.
It can be clearly observed from the heatmap generated that there is a clear difference in gene expression between the silenced cells and control cells. Genes that are seen to be downregulated in the control conditions are seen to be upregulated, and vice versa. A comparison of M48.2 and S48.2 (Sample 2 of Control and Silenced cells) reveals that the colours on the heatmap are completely opposite of each other (red becomes green, and vice versa) - allowing the inference that after silencing of SP1, the effects on gene transcription were reversed.
6. Functional pathway analysis with PANTHER
The data in the Symbols column of the annotated list of top differentially expressed genes was saved, and inputted to the Panther database in order to investigate which pathways were involved. The list of pathways generated by the Panther database was saved and has been loaded below, under the dataframe 'genepathways'.
The symbols inputted into Panther were also available for visualisation as a pie chart depending on the ontology. The pie charts for the biological process ontology and molecular function ontology can be seen below.
Pie chart showing the biological process ontology of the top differentially expressed genes 48 hours after SP1 silencing
It can be seen from this pie chart that majority of the genes that have been altered by the silencing of SP1 are largely related to metabolic or cellular processes in the cell, and to a lesser extent, biological regulation, developmental processes and localisation. As SP1 is a transcription factor that binds DNA, this is unsurprising - Tan and Khachigian (2009) have reported that it regulates the expression of a multitude of genes that control cellular processes.
Pie chart showing the molecular function ontology of the top differentially expressed genes 48 hours after SP1 silencing
From this pie chart, it is observed that majority of the genes are associated with binding and catalytic activity in the cell, which is unsurprising as SP1 is a transcription factor, and would therefore be associated with a number of binding factors and enzymatic reactions in the cell.
Conclusion
From the analysis of the results obtained from the AffyBatch GeneChip, it can be concluded that SP1 has a large effect on gene transcription in the colon cell. Up to 99 pathways are affected - seen in the 'genepathways' dataframe; majority of which are common signalling pathways seen in most cells, such as the p53 pathway, Wnt signalling pathway, and TGF beta signalling pathway. The genes changing the most between the two conditions can be observed through the dataset 'topgenes'. Through the analysis of the top differentially expressed genes, it was found that majority of the genes affected were related to metabolic and cellular processes in the cell, and their molecular function were associated to binding and catalytic activity. This is expected as SP1 being a transcription factor would likely be associated with cellular and metabolic activity, which would require catalytic reactions. Silencing of SP1 resulted in the genes being either upregulated or downregulated, opposite to what they would have been in a control condition.