Visualisation of single-cell expression data using PCA

In this lab you will use PCA to visualise some single-cell gene expression data from Guo et al. "Resolution of Cell Fate Decisions Revealed by Single-Cell Gene Expression Analysis from Zygote to Blastocyst" Developmental Cell, Volume 18, Issue 4, 20 April 2010, Pages 675-685, available from http://dx.doi.org/10.1016/j.devcel.2010.02.012. The paper pdf is available in the handouts folder for Week 10.

The data are from a qPCR TaqMan array single cell experiment in mouse. The data is taken from the early stages of development when the Blastocyst is forming. At the 32 cell stage the data is already separated into the trophectoderm (TE) which goes onto form the placenta and the inner cellular mass (ICM). The ICM further differentiates into the epiblast (EPI)---which gives rise to the endoderm, mesoderm and ectoderm---and the primitive endoderm (PE) which develops into the amniotic sack. Guo et al selected 48 genes for expression measurement. They labelled the resulting cells and their labels are included as an aide to visualization.

Below we show how to get the data into your notebook

Exercise 1: Adapt the code from the Iris-InteractivePCA notebook to carry out PCA and visualise the data

Questions (add your answers below each question):

1.1 How many principal components (PCs) are required to explain more than 80% of the variance in the data?

Answer: Using the cumulative plot below we see that eleven PCs explain 80.83% of the variance in the data.

1.2 Which PC do you think is most useful for separating the TE cell-types (32TE and 64TE) from the rest?

Answer: Projecting onto the 1st PC separates them very well from the rest - there is just some mixing with two 32ICM cells which look like outliers. If you project onto the 2nd or 3rd PC then there are significant overlaps with other cell-types.

1.3 Which combined pair of PCs 1,2 and 3 best separates the 2-cell stage from the rest (e.g. PCs 1&2, 1&3 or 2&3)?

Answer: PCs 1&3 gives the cleanest separation of these cells from the rest.

1.4 If you wanted to choose one gene to measure in order to separate the TE cell-types from the others, which would you choose? Explain why this is a good choice.

Answer: Esrrb contributes a lot to PC1 (in the negative direction) which separates these cell-types from the rest. Using the interactive plot (final plot below) we can see that low levels of this gene are indicative of being in the TE cell-type. Other good choices include Pecam1 (low in TE) and DppaI (high in TE) which are also strongly aligned with PC1.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import visualisation # some functions defined for this lab
import plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=False)
# use seaborn plotting style defaults
doInteractive = True
def interactivePlots(fig, axes):
    # helper function to decide to use plotly interactive plots or not
    if(doInteractive):
        plotly.offline.iplot_mpl(fig, show_link=False, strip_style=True) # offline ipython notebook 
In [2]:
GuoData = pd.read_csv('GuoData.csv', index_col=[0])
labels = GuoData.index # The labels give the cell-type for each cell 
N, D = GuoData.shape
print('Cells: %s, Genes: %s'%(N, D))
Cells: 437, Genes: 48

Exercise 1

Adapt the code from the Iris-InteractivePCA notebook to carry out PCA and visualise the data

Run PCA on the Guo data

In [3]:
Wt, scores, fracs = visualisation.do_pca(GuoData)
scores = scores/abs(scores).max().max() # Scale the scores by the maximum value in the score matrix
In [4]:
fig=plt.figure(figsize=(5, 5))
plt.plot(range(1,len(fracs)+1),fracs)
plt.xlabel('Principal component')
plt.ylabel('proportion of variance')
interactivePlots(fig, None)
In [5]:
fig=plt.figure(figsize=(5,5))
plt.plot(range(1,len(fracs)+1),np.cumsum(fracs))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
interactivePlots(fig, None)

2D PCA plot

Change the values to answer Q1.2 and Q1.3 above.

In [6]:
XaxisPC = 1 # Plot PC1 on X axis
YaxisPC = 2 # Plot PC2 on Y axis
# Create a trace
traceCelltype = list()
labelList = np.unique(labels)
# assert set(np.unique(labels)) == set(labelList), 'Missing labels'
for lbl in labelList:
    traceCelltype.append(go.Scatter(
        x = scores[labels==lbl].iloc[:,XaxisPC-1], # array index is from zero so we subtract 1 from PC number
        y = scores[labels==lbl].iloc[:,YaxisPC-1],
        mode='markers',
        name = lbl,
        text = lbl
        ))
iplot(traceCelltype, filename='IrisPCA')

If you change YaxisPC to 3 then you can see better separation of the early embryo cells but at the bottom left there is still a mixture of cell types

In [7]:
XaxisPC = 1 # Plot PC1 on X axis
YaxisPC = 3 # Plot PC2 on Y axis
# Create a trace
traceCelltype = list()
labelList = np.unique(labels)
# assert set(np.unique(labels)) == set(labelList), 'Missing labels'
for lbl in labelList:
    traceCelltype.append(go.Scatter(
        x = scores[labels==lbl].iloc[:,XaxisPC-1], # array index is from zero so we subtract 1 from PC number
        y = scores[labels==lbl].iloc[:,YaxisPC-1],
        mode='markers',
        name = lbl,
        text = lbl
        ))
iplot(traceCelltype, filename='IrisPCA')

3D PCA plot

Rotating the 3D plot shows better separation of all the developmental stages but PCA seems to be unable to separate 64PE and 32ICM here. This explains why Guo et al. only used data from the later embryos in the PCA plots in their paper (see Exercise 2). It seems impossible to separate all the cell types perfectly with a linear method like PCA.

In [8]:
# 3D plot
XaxisPC = 1
YaxisPC = 2
ZaxisPC = 3
traceCelltype3d = list()
layout = go.Layout(showlegend=True, title='Scatter plot of 3-D PCA space', annotations=list())
for lbl in labelList:
    traceCelltype3d.append(go.Scatter3d(
        x = scores[labels==lbl].iloc[:,XaxisPC-1],
        y = scores[labels==lbl].iloc[:,YaxisPC-1],
        z = scores[labels==lbl].iloc[:,ZaxisPC-1],
        mode='markers',
        name = lbl,
        text = lbl
        ))
fig = go.Figure(data=traceCelltype3d, layout=layout)
iplot(fig, filename='IrisPCA3D')

Plot each feature's contribution to factor loadings in 2D PCA space

I change topN to 10 so I can see more genes

In [9]:
XaxisPC = 1 # Plot PC1 on X axis
YaxisPC = 2 # Plot PC2 on Y axis
# Create a trace
traceCelltype = list()
labelList = np.unique(labels)
# assert set(np.unique(labels)) == set(labelList), 'Missing labels'
for lbl in labelList:
    traceCelltype.append(go.Scatter(
        x = scores[labels==lbl].iloc[:,XaxisPC-1], # array index is from zero so we subtract 1 from PC number
        y = scores[labels==lbl].iloc[:,YaxisPC-1],
        mode='markers',
        name = lbl,
        text = lbl
        ))
t, l = visualisation.GetPCVectorsForEachFeature(Wt, fLine=True, topN=10, XPC='PC'+str(XaxisPC), YPC='PC'+str(YaxisPC))
fig = go.Figure(data=traceCelltype+t, layout=l)
iplot(fig, filename='IrisPCs')

 Interactive plot where user can select features and other options

You have to change references from irisdata to GuoData to get this to work and I've made the default feature Klf5

In [10]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

def plotinteractPCA(feature='Klf5', Nfeatures=2, XaxisPC=1, YaxisPC=2):
    layout = go.Layout(showlegend=True, title='Interactive PCA plot',
                       annotations=list(), legend={'orientation': 'h'})
    traceFeature = list()
    
    if(feature == 'class'):
        for lbl in labelList:
            traceFeature.append(go.Scatter(
                x = scores[labels==lbl].iloc[:,XaxisPC-1],
                y = scores[labels==lbl].iloc[:,YaxisPC-1],
                mode='markers',
                name = lbl,
                text = lbl
                ))
    else:
        tmp = np.array([feature + " "]*len(GuoData))
        markertext = list()
        for i in range (1,len(GuoData)):
            markertext.append(tmp[i] + str(GuoData[feature].values[i]))
        traceFeature.append(go.Scatter(
            x = scores.iloc[:,XaxisPC-1],
            y = scores.iloc[:,YaxisPC-1],
            mode='markers',
            marker = {'color': GuoData[feature].values, 'size': 10, 'colorscale': 'Viridis',
                      'colorbar':{'title': 'level'}},
            text = markertext,
            name = 'Iris Data'
            ))
    if(Nfeatures > 0):
        t, l = visualisation.GetPCVectorsForEachFeature(Wt,topN=Nfeatures, 
                                                        fLine='True', XPC='PC'+str(XaxisPC), YPC='PC'+str(YaxisPC))
        layout.annotations += l.annotations
        traceFeature += t
    fig = go.Figure(data=traceFeature, layout=layout)
    iplot(fig, filename='GuoPCAInteractive')
_=interact(plotinteractPCA, feature=['class']+list(GuoData.columns.values), 
           Nfeatures=(0, GuoData.shape[1]), XaxisPC=(1, 4), YaxisPC=(2, 4))
In [0]: