In this lab you will use PCA to visualise single-cell gene expression data from Guo et al., "Resolution of Cell Fate Decisions Revealed by Single-Cell Gene Expression Analysis from Zygote to Blastocyst", Developmental Cell, Volume 18, Issue 4, 20 April 2010, Pages 675-685, available from http://dx.doi.org/10.1016/j.devcel.2010.02.012. A PDF of the paper is available in the handouts folder for Week 10.
The data are from a single-cell qPCR TaqMan array experiment in mouse, taken from the early stages of development when the blastocyst is forming. By the 32-cell stage the cells have already separated into the trophectoderm (TE), which goes on to form the placenta, and the inner cell mass (ICM). The ICM further differentiates into the epiblast (EPI), which gives rise to the endoderm, mesoderm and ectoderm, and the primitive endoderm (PE), which develops into the amniotic sac. Guo et al. selected 48 genes for expression measurement. They labelled the resulting cells, and these labels are included as an aid to visualisation.
Below we show how to get the data into your notebook.
Exercise 1: Adapt the code from the Iris-InteractivePCA notebook to carry out PCA and visualise the data.
Questions (add your answers below each question):
1.1 How many principal components (PCs) are required to explain more than 80% of the variance in the data?
Answer: Using the cumulative plot below we see that eleven PCs explain 80.83% of the variance in the data.
1.2 Which PC do you think is most useful for separating the TE cell-types (32TE and 64TE) from the rest?
Answer: Projecting onto the 1st PC separates them very well from the rest; there is just some mixing with two 32ICM cells, which look like outliers. If you project onto the 2nd or 3rd PC, there are significant overlaps with other cell-types.
1.3 Which pair of PCs chosen from 1, 2 and 3 best separates the 2-cell stage cells from the rest (e.g. PCs 1&2, 1&3 or 2&3)?
Answer: PCs 1&3 give the cleanest separation of these cells from the rest.
1.4 If you wanted to choose one gene to measure in order to separate the TE cell-types from the others, which would you choose? Explain why this is a good choice.
Answer: Esrrb contributes a lot to PC1 (in the negative direction), and PC1 is the component that separates the TE cell-types from the rest. Using the interactive plot (final plot below) we can see that low levels of this gene indicate a TE cell. Other good choices include Pecam1 (low in TE) and Dppa1 (high in TE), which are also strongly aligned with PC1.
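The reasoning behind answers 1.1 and 1.4 can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the Guo et al. measurements and not the code from the Iris-InteractivePCA notebook: the gene names, the two made-up cell groups and the helper functions `pca_svd` and `n_components_for` are all assumptions introduced here. It computes the cumulative explained variance to find how many PCs are needed to pass a threshold, then picks the gene with the largest absolute loading on PC1.

```python
import numpy as np


def pca_svd(X):
    """PCA via SVD of the column-centred data matrix.

    Returns (scores, components, explained_variance_ratio). Each row of
    `components` is one PC direction, i.e. the loadings of every gene on it.
    """
    Xc = X - X.mean(axis=0)                      # centre each gene
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variance = s ** 2 / (X.shape[0] - 1)         # variance along each PC
    return U * s, Vt, variance / variance.sum()


def n_components_for(ratio, threshold=0.80):
    """Smallest number of leading PCs whose cumulative variance exceeds threshold."""
    cumulative = np.cumsum(ratio)
    return int(np.argmax(cumulative > threshold) + 1)


# Synthetic stand-in data (an assumption for illustration only):
# 200 "cells" x 6 "genes", with hypothetical gene names.
rng = np.random.default_rng(0)
genes = ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"]
group = rng.integers(0, 2, size=200)             # two made-up cell types
X = rng.normal(size=(200, 6))
X[:, 0] -= 5.0 * group                           # geneA is low in group 1

scores, components, ratio = pca_svd(X)

# Question 1.1 style: how many PCs pass 80% of the variance?
k = n_components_for(ratio, threshold=0.80)

# Question 1.4 style: which gene has the largest absolute loading on PC1?
top_gene = genes[int(np.argmax(np.abs(components[0])))]

print("PCs needed for >80% variance:", k)
print("gene most aligned with PC1:", top_gene)
```

Note that the sign of a principal component is arbitrary, so a loading that is negative in one implementation may be positive in another. What matters is the size of the loading and whether the gene is high or low in the cell-type of interest.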
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import visualisation  # some functions defined for this lab
import plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=False)

# use seaborn plotting style defaults
doInteractive = True

def interactivePlots(fig, axes):
    # helper function to decide to use plotly interactive plots or not
    if doInteractive:
        plotly.offline.iplot_mpl(fig, show_link=False, strip_style=True)  # offline ipython notebook