Probabilistic Programming and Bayesian Methods for Hackers Chapter 3
Original content (this Jupyter notebook) created by Cam Davidson-Pilon (@Cmrn_DP
)
Ported to Tensorflow Probability by Matthew McAteer (@MatthewMcAteer0
), with help from Bryan Seybold, Mike Shwe (@mikeshwe
), Josh Dillon, and the rest of the TFP team at Google ([email protected]
).
Welcome to Bayesian Methods for Hackers. The full Github repository is available at github/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers. The other chapters can be found on the project's homepage. We hope you enjoy the book, and we encourage any contributions!
Table of Contents
Dependencies & Prerequisites
Opening the black box of MCMC
The Bayesian landscape
Exploring the landscape using the MCMC
Why Thousands of Samples?
Algorithms to perform MCMC
Other aproximation solutions to the posterior
Example: Unsupervised Clustering Using a Mixture Model
Cluster Investigation
Important: Don't mix posterior samples
Returning to Clustering: Prediction
Diagnosing Convergence
Autocorrelation
How does this relate to MCMC convergence?
Thinning
Additional Plotting Options
Useful tips for MCMC
Intelligent starting values
Priors
Covariance matrices and eliminating parameters
The Folk Theorem of Statistical Computing
Conclusion
References
Dependencies & Prerequisites
If you're running this notebook in Jupyter on your own machine (and you have already installed Tensorflow), you can use the following
- For the most recent nightly installation:
pip3 install -q tfp-nightly
- For the most recent stable TFP release:
pip3 install -q --upgrade tensorflow-probability
- For the most recent stable GPU-connected version of TFP:
pip3 install -q --upgrade tensorflow-probability-gpu
- For the most recent nightly GPU-connected version of TFP:
pip3 install -q tfp-nightly-gpu
Opening the black box of MCMC
The previous two chapters hid the inner-mechanics of TFP, and more generally Markov Chain Monte Carlo (MCMC), from the reader. The reason for including this chapter is three-fold. The first is that any book on Bayesian inference must discuss MCMC. I cannot fight this. Blame the statisticians. Secondly, knowing the process of MCMC gives you insight into whether your algorithm has converged(Converged to what? We will get to that). Thirdly, we'll understand why we are returned thousands of samples from the posterior as a solution, which at first thought can be odd.
The Bayesian landscape
When we setup a Bayesian inference problem with unknowns, we are implicitly creating an dimensional space for the prior distributions to exist in. Associated with the space is an additional dimension, which we can describe as the surface, or curve, that sits on top of the space, that reflects the prior probability of a particular point. The surface on the space is defined by our prior distributions. For example, if we have two unknowns and , and priors for both are , the space created is a square of length 5 and the surface is a flat plane that sits on top of the square (representing that every point is equally likely).
Alternatively, if the two priors are and , then the space is all positive numbers on the 2-D plane, and the surface induced by the priors looks like a water fall that starts at the point (0,0)
and flows over the positive numbers.
The plots below visualize this. The more dark red the color, the more prior probability is assigned to that location. Conversely, areas with darker blue represent that our priors assign very low probability to that location.
These are simple examples in 2D space, where our brains can understand surfaces well. In practice, spaces and surfaces generated by our priors can be much higher dimensional.
If these surfaces describe our prior distributions on the unknowns, what happens to our space after we incorporate our observed data ? The data does not change the space, but it changes the surface of the space by pulling and stretching the fabric of the prior surface to reflect where the true parameters likely live. More data means more pulling and stretching, and our original shape becomes mangled or insignificant compared to the newly formed shape. Less data, and our original shape is more present. Regardless, the resulting surface describes the posterior distribution.
Again I must stress that it is, unfortunately, impossible to visualize this in large dimensions. For two dimensions, the data essentially pushes up the original surface to make tall mountains. The tendency of the observed data to push up the posterior probability in certain areas is checked by the prior probability distribution, so that less prior probability means more resistance. Thus in the double-exponential prior case above, a mountain (or multiple mountains) that might erupt near the (0,0)
corner would be much higher than mountains that erupt closer to (5,5)
, since there is more resistance (low prior probability) near (5,5)
. The peak reflects the posterior probability of where the true parameters are likely to be found. Importantly, if the prior has assigned a probability of 0
, then no posterior probability will be assigned there.
Suppose the priors mentioned above represent different parameters of two Poisson distributions. We observe a few data points and visualize the new landscape:
The plot on the left is the deformed landscape with the priors, and the plot on the right is the deformed landscape with the exponential priors. Notice that the posterior landscapes look different from one another, though the data observed is identical in both cases. The reason is as follows. Notice the exponential-prior landscape, top right figure, puts very little posterior weight on values in the upper right corner of the figure: this is because the prior does not put much weight there. On the other hand, the uniform-prior landscape is happy to put posterior weight in the upper-right corner, as the prior puts more weight there.
Notice also the highest-point, corresponding the the darkest red, is biased towards (0,0)
in the exponential case, which is the result from the exponential prior putting more prior weight in the (0,0)
corner.
The black dot represents the true parameters. Even with 1 sample point, the mountains attempts to contain the true parameter. Of course, inference with a sample size of 1
is incredibly naive, and choosing such a small sample size was only illustrative.
It's a great exercise to try changing the sample size to other values (try 2
, 5
, 10
, 100
?...) and observing how our "mountain" posterior changes.
Exploring the landscape using the MCMC
We should explore the deformed posterior space generated by our prior surface and observed data to find the posterior mountain. However, we cannot naively search the space: any computer scientist will tell you that traversing -dimensional space is exponentially difficult in : the size of the space quickly blows-up as we increase (see the curse of dimensionality). What hope do we have to find these hidden mountains? The idea behind MCMC is to perform an intelligent search of the space. To say "search" implies we are looking for a particular point, which is perhaps not an accurate as we are really looking for a broad mountain.
Recall that MCMC returns samples from the posterior distribution, not the distribution itself. Stretching our mountainous analogy to its limit, MCMC performs a task similar to repeatedly asking "How likely is this pebble I found to be from the mountain I am searching for?", and completes its task by returning thousands of accepted pebbles in hopes of reconstructing the original mountain.
When I say MCMC intelligently searches, I really am saying MCMC will hopefully converge towards the areas of high posterior probability. MCMC does this by exploring nearby positions and moving into areas with higher probability. Again, perhaps "converge" is not an accurate term to describe MCMC's progression. Converging usually implies moving towards a point in space, but MCMC moves towards a broader area in the space and randomly walks in that area, picking up samples from that area.
Why Thousands of Samples?
At first, returning thousands of samples to the user might sound like being an inefficient way to describe the posterior distributions. I would argue that this is extremely efficient. Consider the alternative possibilities:
Returning a mathematical formula for the "mountain ranges" would involve describing a N-dimensional surface with arbitrary peaks and valleys.
Returning the "peak" of the landscape, while mathematically possible and a sensible thing to do as the highest point corresponds to most probable estimate of the unknowns, ignores the shape of the landscape, which we have previously argued is very important in determining posterior confidence in unknowns.
Besides computational reasons, likely the strongest reason for returning samples is that we can easily use The Law of Large Numbers to solve otherwise intractable problems. I postpone this discussion for the next chapter. With the thousands of samples, we can reconstruct the posterior surface by organizing them in a histogram.
Algorithms to perform MCMC
There is a large family of algorithms that perform MCMC. Most of these algorithms can be expressed at a high level as follows: (Mathematical details can be found in the appendix.)
Start at current position.
Propose moving to a new position (investigate a pebble near you).
Accept/Reject the new position based on the position's adherence to the data and prior distributions (ask if the pebble likely came from the mountain).
If you accept: Move to the new position. Return to Step 1.
Else: Do not move to new position. Return to Step 1.
After a large number of iterations, return all accepted positions.
This way we move in the general direction towards the regions where the posterior distributions exist, and collect samples sparingly on the journey. Once we reach the posterior distribution, we can easily collect samples as they likely all belong to the posterior distribution.
If the current position of the MCMC algorithm is in an area of extremely low probability, which is often the case when the algorithm begins (typically at a random location in the space), the algorithm will move in positions that are likely not from the posterior but better than everything else nearby. Thus the first moves of the algorithm are not reflective of the posterior.
In the above algorithm's pseudocode, notice that only the current position matters (new positions are investigated only near the current position). We can describe this property as memorylessness, i.e. the algorithm does not care how it arrived at its current position, only that it is there.
Other approximation solutions to the posterior
Besides MCMC, there are other procedures available for determining the posterior distributions. A Laplace approximation is an approximation of the posterior using simple functions. A more advanced method is Variational Bayes. All three methods, Laplace Approximations, Variational Bayes, and classical MCMC have their pros and cons. We will only focus on MCMC in this book.
Example: Unsupervised Clustering using a Mixture Model
Suppose we are given the following dataset:
What does the data suggest? It appears the data has a bimodal form, that is, it appears to have two peaks, one near 120
and the other near 200
. Perhaps there are two clusters within this dataset.
This dataset is a good example of the data-generation modeling technique from last chapter. We can propose how the data might have been created. I suggest the following data generation algorithm:
For each data point, choose cluster 1 with probability , else choose cluster 2.
Draw a random variate from a Normal distribution with parameters and where was chosen in step 1.
Repeat.
This algorithm would create a similar effect as the observed dataset, so we choose this as our model. Of course, we do not know or the parameters of the Normal distributions. Hence we must infer, or learn, these unknowns.
Denote the Normal distributions and (having variables' index start at 0
is just Pythonic). Both currently have unknown mean and standard deviation, denoted and respectively. A specific data point can be from either or , and we assume that the data point is assigned to with probability .
An appropriate way to assign data points to clusters is to use a TF Categorical
variable. Its parameter is a -length array of probabilities that must sum to one and its value
attribute is a integer between 0
and randomly chosen according to the crafted array of probabilities (In our case ). A priori, we do not know what the probability of assignment to cluster 1 is, so we form a uniform variable on . We call call this , so the probability of belonging to cluster 2 is therefore .
Fortunately, we can we just give [p1, p2]
to our Categorical
variable. If needed, we can also to use tf.stack()
to combine and into a vector that it can understand. We pass this vector into the Categorical
variableto give an idea of the odds of choosing from our multiple distributions.
Looking at the above dataset, I would guess that the standard deviations of the two Normals are different. To maintain ignorance of what the standard deviations might be, we will initially model them as uniform on 0
to 100
. We will include both standard deviations in our model using a single line of TFP code:
Here, we are using a batch shape of 2, creating two independent distributions, that happen to have the same parameters. See the colab on TFP shapes for more info.
We also need to specify priors on the centers of the clusters. The centers are really the parameters in these Normal distributions. Their priors can be modeled by a Normal distribution. Looking at the data, I have an idea where the two centers might be — I would guess somewhere around 120
and 190
respectively, though I am not very confident in these eyeballed estimates. Hence I will set and .
Finally, we use the MixtureSameFamily distribution to implement a mixture of our two Normal distributions, employing our Categorical distribution as our selecting function.
Similarly, in the joint log_prob
function below, we create two clusters each with our priors on the centers and standard deviations. Then, we mix them in proportion to their weights as determined by our Categorical
variable, creating a mixture of gaussians distribution. Finally, for each data point, we generate a sample from that mixture.
Note that this model is marginalizing out the cluster assignment variable, so that all the remaining random variables are continuous, making it all amenable to simple MCMC--HamiltonianMonteCarlo
in particular.
We will use our HMC sampling methods to explore the space by using 25000 sample iterations below.
Let's examine the traces for our unknown parameters. In other words, the routes the unknown parameters (centers, precisions, and p ) have taken thus far.
Notice the following characteristics:
The traces converge, not to a single point, but to a distribution of possible points. This is convergence in an MCMC algorithm.
Inference using the first few thousand points is a bad idea, as they are unrelated to the final distribution we are interested in. Thus is it a good idea to discard those samples before using the samples for inference. We call this period before converge the burn-in period.
The traces appear as a random "walk" around the space, that is, the paths exhibit correlation with previous positions. This is both good and bad. We will always have correlation between current positions and the previous positions, but too much of it means we are not exploring the space well. This will be detailed in the Diagnostics section later in this chapter.
To achieve further convergence, we will perform more MCMC steps. In the pseudo-code algorithm of MCMC above, the only position that matters is the current position (new positions are investigated near the current position). To continue where we left off, we pass the current values of the unknown parameters into the initial_chain_state()
variable. The values that we have already calculated will not be overwritten. This ensures that our sampling continues where it left off in the same way that it left off.
We will sample the MCMC fifty thousand more times and visualize the progress below:
Cluster Investigation
We have not forgotten our main challenge: identify the clusters. We have determined posterior distributions for our unknowns. We plot the posterior distributions of the center and standard deviation variables below:
The MCMC algorithm has proposed that the most likely centers of the two clusters are near 120 and 200 respectively. Similar inference can be applied to the standard deviation.
In the PyMC3 version of this chapter, we depict the posterior distributions for the labels each data point.However, in our TFP version, since our model marginalized out the assignment variable, we don't have traces for that variable from the MCMC.
As a substitute, below we can generate a posterior predictive distribution over the assignments, and then generate some samples from it.
Below is a visualization of this. The y-axis represents our samples from the posterior predictive distribution. The x-axis are the sorted values of the original data points. A red square is an assignment to cluster 0, and a blue square is an assignment to cluster 1.
Looking at the above plot, it appears that the most uncertainty is between 150 and 170. The above plot slightly misrepresents things, as the x-axis is not a true scale (it displays the value of the th sorted data point.) A more clear diagram is below, where we have estimated the frequency of each data point belonging to the labels 0 and 1.
Even though we modeled the clusters using Normal distributions, we didn't get just a single Normal distribution that best fits the data (whatever our definition of best is), but a distribution of values for the Normal's parameters. How can we choose just a single pair of values for the mean and variance and determine a sorta-best-fit gaussian?
One quick and dirty way (which has nice theoretical properties we will see in Chapter 5), is to use the mean of the posterior distributions. Below we overlay the Normal density functions, using the mean of the posterior distributions as the chosen parameters, with our observed data:
Important: Don't mix posterior samples
In the above example, a possible (though less likely) scenario is that cluster 0 has a very large standard deviation, and cluster 1 has a small standard deviation. This would still satisfy the evidence, albeit less so than our original inference. Alternatively, it would be incredibly unlikely for both distributions to have a small standard deviation, as the data does not support this hypothesis at all. Thus the two standard deviations are dependent on each other: if one is small, the other must be large. In fact, all the unknowns are related in a similar manner. For example, if a standard deviation is large, the mean has a wider possible space of realizations. Conversely, a small standard deviation restricts the mean to a small area.
During MCMC, we are returned vectors representing samples from the unknown posteriors. Elements of different vectors cannot be used together, as this would break the above logic: perhaps a sample has returned that cluster 1 has a small standard deviation, hence all the other variables in that sample would incorporate that and be adjusted accordingly. It is easy to avoid this problem though, just make sure you are indexing traces correctly.
Another small example to illustrate the point. Suppose two variables, and , are related by . We model as a Normal random variable with mean 4 and explore 500 samples.
As you can see, the two variables are not unrelated, and it would be wrong to add the th sample of to the th sample of , unless .
Returning to Clustering: Prediction
The above clustering can be generalized to clusters. Choosing allowed us to visualize the MCMC better, and examine some very interesting plots.
What about prediction? Suppose we observe a new data point, say , and we wish to label it to a cluster. It is foolish to simply assign it to the closer cluster center, as this ignores the standard deviation of the clusters, and we have seen from the plots above that this consideration is very important. More formally: we are interested in the probability (as we cannot be certain about labels) of assigning to cluster 1. Denote the assignment of as , which is equal to 0 or 1, and we are interested in .
A naive method to compute this is to re-run the above MCMC with the additional data point appended. The disadvantage with this method is that it will be slow to infer for each novel data point. Alternatively, we can try a less precise, but much quicker method.
We will use Bayes Theorem for this. If you recall, Bayes Theorem looks like:
In our case, represents and is the evidence we have: we observe that . For a particular sample set of parameters for our posterior distribution, , we are interested in asking "Is the probability that is in cluster 1 greater than the probability it is in cluster 0?", where the probability is dependent on the chosen parameters. As the denominators are equal, they can be ignored (and good riddance, because computing the quantity can be difficult).
Giving us a probability instead of a label is a very useful thing. Instead of the naive
we can optimize our guesses using a loss function, which the entire fifth chapter is devoted to.
Diagnosing Convergence
Autocorrelation
Autocorrelation is a measure of how related a series of numbers is with itself. A measurement of 1.0 is perfect positive autocorrelation, 0 no autocorrelation, and -1 is perfect negative correlation. If you are familiar with standard correlation, then autocorrelation is just how correlated a series, , at time is with the series at time :
For example, consider the two series:
which have example paths like:
One way to think of autocorrelation is "If I know the position of the series at time , can it help me know where I am at time ?" In the series , the answer is No. By construction, are random variables. If I told you that , could you give me a better guess about ? No.
On the other hand, is autocorrelated. By construction, if I knew that , I can be very confident that will not be very far from 10. Similarly, I can even make a (less confident guess) about : it will probably not be near 0 or 20, but a value of 5 is not too unlikely. I can make a similar argument about , but again, I am less confident. Taking this to it's logical conclusion, we must concede that as , the lag between time points, increases the autocorrelation decreases. We can visualize this:
Notice that as increases, the autocorrelation of decreases from a very high point. Compare with the autocorrelation of which looks like noise (which it really is), hence we can conclude no autocorrelation exists in this series.
How does this relate to MCMC convergence?
By the nature of the MCMC algorithm, we will always be returned samples that exhibit autocorrelation (this is because of the step from your current position, move to a position near you
).
A chain that is not exploring the space well will exhibit very high autocorrelation. Visually, if the trace seems to meander like a river, and not settle down, the chain will have high autocorrelation.
This does not imply that a converged MCMC has low autocorrelation. Hence low autocorrelation is not necessary for convergence, but it is sufficient. TFP has a built-in autocorrelation tools as well.
Thinning
Another issue can arise if there is high-autocorrelation between posterior samples. Many post-processing algorithms require samples to be independent of each other. This can be solved, or at least reduced, by only returning to the user every th sample, thus removing some autocorrelation. Below we perform an autocorrelation plot for with differing levels of thinning:
With more thinning, the autocorrelation drops quicker. There is a tradeoff though: higher thinning requires more MCMC iterations to achieve the same number of returned samples. For example, 10 000 samples unthinned is 100 000 with a thinning of 10 (though the latter has less autocorrelation).
What is a good amount of thinning? The returned samples will always exhibit some autocorrelation, regardless of how much thinning is done. So long as the autocorrelation tends to zero, you are probably ok. Typically thinning of more than 10 is not necessary.
Useful tips for MCMC
Bayesian inference would be the de facto method if it weren't for MCMC's computational difficulties. In fact, MCMC is what turns most users off practical Bayesian inference. Below I present some good heuristics to help convergence and speed up the MCMC engine:
Intelligent starting values
It would be great to start the MCMC algorithm off near the posterior distribution, so that it will take little time to start sampling correctly. We can aid the algorithm by telling where we think the posterior distribution will be by specifying the testval
parameter in the Stochastic
variable creation. In many cases we can produce a reasonable guess for the parameter. For example, if we have data from a Normal distribution, and we wish to estimate the parameter, then a good starting value would be the mean of the data.
For most parameters in models, there is a frequentist estimate of it. These estimates are a good starting value for our MCMC algorithms. Of course, this is not always possible for some variables, but including as many appropriate initial values is always a good idea. Even if your guesses are wrong, the MCMC will still converge to the proper distribution, so there is little to lose.
Priors
If the priors are poorly chosen, the MCMC algorithm may not converge, or at least have difficulty converging. Consider what may happen if the prior chosen does not even contain the true parameter: the prior assigns 0 probability to the unknown, hence the posterior will assign 0 probability as well. This can cause pathological results.
For this reason, it is best to carefully choose the priors. Often, lack of covergence or evidence of samples crowding to boundaries implies something is wrong with the chosen priors (see Folk Theorem of Statistical Computing below).
The Folk Theorem of Statistical Computing
If you are having computational problems, probably your model is wrong.
Conclusion
TFP provides a very strong backend to performing Bayesian inference, mostly because it allows the user to fine-tune the inner workings of MCMC.
References
[1] Tensorflow Probability API docs. https://www.tensorflow.org/probability/api_docs/python/tfp