
Licensed under the Apache License, Version 2.0 (the "License");

#@title Licensed under the Apache License, Version 2.0 (the "License"); { display-mode: "form" }
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Probabilistic PCA


Probabilistic principal components analysis (PCA) is a dimensionality reduction technique that analyzes data via a lower dimensional latent space (Tipping and Bishop 1999). It is often used when there are missing values in the data or for multidimensional scaling.

Imports

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability import edward2 as ed
import warnings

plt.style.use("ggplot")
warnings.filterwarnings('ignore')

The Model

Consider a data set $\mathbf{X} = \{\mathbf{x}_n\}$ of $N$ data points, where each data point is $D$-dimensional, $\mathbf{x}_n \in \mathbb{R}^D$. We aim to represent each $\mathbf{x}_n$ under a latent variable $\mathbf{z}_n \in \mathbb{R}^K$ with lower dimension, $K < D$. The set of principal axes $\mathbf{W}$ relates the latent variables to the data.

Specifically, we assume that each latent variable is normally distributed,

\begin{equation*}
\mathbf{z}_n \sim N(\mathbf{0}, \mathbf{I}).
\end{equation*}

The corresponding data point is generated via a projection,

\begin{equation*}
\mathbf{x}_n \mid \mathbf{z}_n \sim N(\mathbf{W}\mathbf{z}_n, \sigma^2\mathbf{I}),
\end{equation*}

where the matrix $\mathbf{W} \in \mathbb{R}^{D \times K}$ is known as the matrix of principal axes. In probabilistic PCA, we are typically interested in estimating the principal axes $\mathbf{W}$ and the noise term $\sigma^2$.

Probabilistic PCA generalizes classical PCA. Marginalizing out the latent variable, the distribution of each data point is

\begin{equation*}
\mathbf{x}_n \sim N(\mathbf{0}, \mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}).
\end{equation*}
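To see where this covariance comes from, note that the model is equivalent to writing $\mathbf{x}_n = \mathbf{W}\mathbf{z}_n + \boldsymbol{\epsilon}_n$ with independent noise $\boldsymbol{\epsilon}_n \sim N(\mathbf{0}, \sigma^2\mathbf{I})$; both terms have zero mean and their covariances add, so

\begin{equation*}
\mathrm{Cov}(\mathbf{x}_n) = \mathbf{W}\,\mathrm{Cov}(\mathbf{z}_n)\,\mathbf{W}^\top + \sigma^2\mathbf{I} = \mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}.
\end{equation*}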

Classical PCA is the specific case of probabilistic PCA when the covariance of the noise becomes infinitesimally small, $\sigma^2 \to 0$.

We set up our model below. In our analysis, we assume $\sigma$ is known, and instead of point estimating $\mathbf{W}$ as a model parameter, we place a prior over it in order to infer a distribution over principal axes.
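Concretely, matching the scale=2.0 used in the model code below, we place independent Normal priors on the entries of $\mathbf{W}$:

\begin{equation*}
w_{dk} \sim N(0, 2^2), \qquad d = 1, \dots, D, \quad k = 1, \dots, K.
\end{equation*}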

def probabilistic_pca(data_dim, latent_dim, num_datapoints, stddv_datapoints):  # (unmodeled) data
  w = ed.Normal(loc=tf.zeros([data_dim, latent_dim]),
                scale=2.0 * tf.ones([data_dim, latent_dim]),
                name="w")  # parameter
  z = ed.Normal(loc=tf.zeros([latent_dim, num_datapoints]),
                scale=tf.ones([latent_dim, num_datapoints]),
                name="z")  # parameter
  x = ed.Normal(loc=tf.matmul(w, z),
                scale=stddv_datapoints * tf.ones([data_dim, num_datapoints]),
                name="x")  # (modeled) data
  return x, (w, z)

log_joint = ed.make_log_joint_fn(probabilistic_pca)
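As a quick sanity check (a minimal sketch, not part of the original notebook), we can evaluate the returned log_joint at arbitrary values: it accepts the model's arguments plus one keyword argument per named random variable, and returns the scalar unnormalized log density $\log p(\mathbf{W}) + \log p(\mathbf{Z}) + \log p(\mathbf{X} \mid \mathbf{W}, \mathbf{Z})$.

# A toy evaluation on tiny shapes (hypothetical values, for illustration only).
w0 = tf.zeros([2, 1])
z0 = tf.zeros([1, 3])
x0 = tf.zeros([2, 3])
log_prob = log_joint(data_dim=2, latent_dim=1, num_datapoints=3,
                     stddv_datapoints=0.5,
                     w=w0, z=z0, x=x0)
with tf.Session() as sess:
  print(sess.run(log_prob))  # a single scalar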

The Data

We can use the Edward2 model to generate data.

num_datapoints = 5000
data_dim = 2
latent_dim = 1
stddv_datapoints = 0.5

model = probabilistic_pca(data_dim=data_dim,
                          latent_dim=latent_dim,
                          num_datapoints=num_datapoints,
                          stddv_datapoints=stddv_datapoints)

with tf.Session() as sess:
  x_train, (actual_w, actual_z) = sess.run(model)

print("Principal axes:")
print(actual_w)
Principal axes:
[[-0.19042172]
 [ 3.2923143 ]]

We visualize the dataset.

plt.scatter(x_train[0, :], x_train[1, :], color='blue', alpha=0.1)
plt.axis([-20, 20, -20, 20])
plt.title("Data set")
plt.show()
[Figure: scatter plot of the generated data set]

Maximum a Posteriori Inference

We first search for the point estimate of the latent variables that maximizes the posterior probability density. This is known as maximum a posteriori (MAP) inference, and is done by calculating the values of $\mathbf{W}$ and $\mathbf{Z}$ that maximize the posterior density $p(\mathbf{W}, \mathbf{Z} \mid \mathbf{X}) \propto p(\mathbf{W}, \mathbf{Z}, \mathbf{X})$.
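Equivalently, since the evidence $p(\mathbf{X})$ does not depend on the parameters, we maximize the log joint density (the code below minimizes its negative, labelled energy):

\begin{equation*}
(\hat{\mathbf{W}}, \hat{\mathbf{Z}}) = \arg\max_{\mathbf{W}, \mathbf{Z}} \left[ \log p(\mathbf{W}) + \log p(\mathbf{Z}) + \log p(\mathbf{X} \mid \mathbf{W}, \mathbf{Z}) \right].
\end{equation*}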

tf.reset_default_graph()

w = tf.Variable(np.ones([data_dim, latent_dim]), dtype=tf.float32)
z = tf.Variable(np.ones([latent_dim, num_datapoints]), dtype=tf.float32)

def target(w, z):
  """Unnormalized target density as a function of the parameters."""
  return log_joint(data_dim=data_dim,
                   latent_dim=latent_dim,
                   num_datapoints=num_datapoints,
                   stddv_datapoints=stddv_datapoints,
                   w=w, z=z, x=x_train)

energy = -target(w, z)

optimizer = tf.train.AdamOptimizer(learning_rate=0.05)
train = optimizer.minimize(energy)

init = tf.global_variables_initializer()

t = []

num_epochs = 200
with tf.Session() as sess:
  sess.run(init)

  for i in range(num_epochs):
    sess.run(train)
    if i % 5 == 0:
      cE, cw, cz = sess.run([energy, w, z])
      t.append(cE)

  w_inferred_map = sess.run(w)
  z_inferred_map = sess.run(z)
x = range(1, num_epochs, 5)
plt.plot(x, t)
[<matplotlib.lines.Line2D at 0x7f8f00f6b6a0>]
[Figure: energy (negative log joint density) over training epochs]

We can use the Edward2 model to sample data for the inferred values of $\mathbf{W}$ and $\mathbf{Z}$, and compare it to the actual dataset we conditioned on.

print("MAP-estimated axes:") print(w_inferred_map) def replace_latents(w=actual_w, z=actual_z): def interceptor(rv_constructor, *rv_args, **rv_kwargs): """Replaces the priors with actual values to generate samples from.""" name = rv_kwargs.pop("name") if name == "w": rv_kwargs["value"] = w elif name == "z": rv_kwargs["value"] = z return rv_constructor(*rv_args, **rv_kwargs) return interceptor with ed.interception(replace_latents(w_inferred_map, z_inferred_map)): generate = probabilistic_pca( data_dim=data_dim, latent_dim=latent_dim, num_datapoints=num_datapoints, stddv_datapoints=stddv_datapoints) with tf.Session() as sess: x_generated, _ = sess.run(generate) plt.scatter(x_train[0, :], x_train[1, :], color='blue', alpha=0.1, label='Actual data') plt.scatter(x_generated[0, :], x_generated[1, :], color='red', alpha=0.1, label='Simulated data (MAP)') plt.legend() plt.axis([-20, 20, -20, 20]) plt.show()
MAP-estimated axes:
[[-0.21629956]
 [ 3.4028006 ]]
[Figure: actual data vs. data simulated from the MAP estimates]

Variational Inference

MAP can be used to find the mode (or one of the modes) of the posterior distribution, but it does not provide any other insights about that distribution. We next use variational inference, where the posterior distribution $p(\mathbf{W}, \mathbf{Z} \mid \mathbf{X})$ is approximated by a variational distribution $q(\mathbf{W}, \mathbf{Z})$ parameterized by $\boldsymbol{\lambda}$. The aim is to find the variational parameters $\boldsymbol{\lambda}$ that minimize the KL divergence between $q$ and the posterior, $\mathrm{KL}(q(\mathbf{W}, \mathbf{Z}) \mid\mid p(\mathbf{W}, \mathbf{Z} \mid \mathbf{X}))$, or equivalently, that maximize the evidence lower bound, $\mathbb{E}_{q(\mathbf{W},\mathbf{Z};\boldsymbol{\lambda})}\left[ \log p(\mathbf{W},\mathbf{Z},\mathbf{X}) - \log q(\mathbf{W},\mathbf{Z}; \boldsymbol{\lambda}) \right]$.
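In the code below, this expectation is approximated with a single Monte Carlo sample drawn from the variational distribution, so the quantity being maximized is

\begin{equation*}
\log p(\tilde{\mathbf{W}}, \tilde{\mathbf{Z}}, \mathbf{X}) - \log q(\tilde{\mathbf{W}}, \tilde{\mathbf{Z}}; \boldsymbol{\lambda}), \qquad (\tilde{\mathbf{W}}, \tilde{\mathbf{Z}}) \sim q,
\end{equation*}

which is exactly what energy + entropy computes; because the Normal samples are reparameterized, gradients with respect to $\boldsymbol{\lambda}$ flow through them.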

tf.reset_default_graph()

def variational_model(qw_mean, qw_stddv, qz_mean, qz_stddv):
  qw = ed.Normal(loc=qw_mean, scale=qw_stddv, name="qw")
  qz = ed.Normal(loc=qz_mean, scale=qz_stddv, name="qz")
  return qw, qz

log_q = ed.make_log_joint_fn(variational_model)

def target_q(qw, qz):
  return log_q(qw_mean=qw_mean, qw_stddv=qw_stddv,
               qz_mean=qz_mean, qz_stddv=qz_stddv,
               qw=qw, qz=qz)

qw_mean = tf.Variable(np.ones([data_dim, latent_dim]), dtype=tf.float32)
qz_mean = tf.Variable(np.ones([latent_dim, num_datapoints]), dtype=tf.float32)
qw_stddv = tf.nn.softplus(tf.Variable(-4 * np.ones([data_dim, latent_dim]), dtype=tf.float32))
qz_stddv = tf.nn.softplus(tf.Variable(-4 * np.ones([latent_dim, num_datapoints]), dtype=tf.float32))

qw, qz = variational_model(qw_mean=qw_mean, qw_stddv=qw_stddv,
                           qz_mean=qz_mean, qz_stddv=qz_stddv)

energy = target(qw, qz)
entropy = -target_q(qw, qz)
elbo = energy + entropy

optimizer = tf.train.AdamOptimizer(learning_rate=0.05)
train = optimizer.minimize(-elbo)

init = tf.global_variables_initializer()

t = []

num_epochs = 100
with tf.Session() as sess:
  sess.run(init)

  for i in range(num_epochs):
    sess.run(train)
    if i % 5 == 0:
      t.append(sess.run([elbo]))

  w_mean_inferred = sess.run(qw_mean)
  w_stddv_inferred = sess.run(qw_stddv)
  z_mean_inferred = sess.run(qz_mean)
  z_stddv_inferred = sess.run(qz_stddv)
print("Inferred axes:") print(w_mean_inferred) print("Standard Deviation:") print(w_stddv_inferred) plt.plot(range(1, num_epochs, 5), t) plt.show() with ed.interception(replace_latents(w_mean_inferred, z_mean_inferred)): generate = probabilistic_pca( data_dim=data_dim, latent_dim=latent_dim, num_datapoints=num_datapoints, stddv_datapoints=stddv_datapoints) with tf.Session() as sess: x_generated, _ = sess.run(generate) plt.scatter(x_train[0, :], x_train[1, :], color='blue', alpha=0.1, label='Actual data') plt.scatter(x_generated[0, :], x_generated[1, :], color='red', alpha=0.1, label='Simulated data (VI)') plt.legend() plt.axis([-20, 20, -20, 20]) plt.show()
Inferred axes:
[[-0.1579684]
 [ 2.5440261]]
Standard Deviation:
[[0.00821651]
 [0.00942224]]
[Figure: ELBO over training epochs]
[Figure: actual data vs. data simulated from the variational posterior means]

Acknowledgements

This tutorial was originally written in Edward 1.0 (source). We thank all contributors who wrote and revised that version.

References

[1]: Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3): 611-622, 1999.