
Flávio Codeço Coelho - (http://pyinsci.blogspot.com)

license: Creative Commons non-commercial share-alike


This tutorial explores the problem of sampling from high-dimensional probability distributions. To run this worksheet you will need to install an additional package into Sage with this command: ./sage -python -m easy_install -U BIP

To illustrate the problem, let's start with two dimensions and a bivariate normal distribution.

 

%auto
import numpy as np
from numpy.random import normal, multivariate_normal
import scipy.stats as st
import time
import pylab as P
from matplotlib import mlab
from scipy.special import gamma
from BIP.Bayes import lhs, like

After some basic imports, let's look at the shape of the bivariate normal with means $\mu_1$ and $\mu_2$ equal to zero, standard deviations $\sigma_1$ and $\sigma_2$ equal to 1, and no covariance.

%auto
x, y = np.meshgrid(np.arange(-2.1, 2.1, .1), np.arange(-2.1, 2.1, .1))
z = mlab.bivariate_normal(x, y, 1, 1, 0, 0, 0)
P.clf()
P.contour(x, y, z)
P.imshow(z, extent=[-2.1, 2.1, -2.1, 2.1])
P.savefig('binormal.png')

We built this bivariate normal contour plot by sampling uniformly along x and y. If we were only interested in samples within 2 standard deviations of the mean, we can easily see from the figure above that all the samples coming from the dark blue regions would represent wasted sampling effort. The proportion of wasted effort here is approximately the area of the square minus the area of the largest circle in the figure, divided by the area of the square:

(16-pi*2**2)/16
$-\frac{1}{4} \, \pi + 1$

or, in numerical terms:

n((16-pi*2**2)/16)
$0.214601836602552$

Approximately 21% of the sampling effort is wasted.

But what happens to that ratio as we increase the dimensionality of the target distribution?

The volume of an n-dimensional hypersphere of radius $R$ is given by the following equation:

$$V_n = \frac{\pi^{n/2}}{\Gamma(n/2+1)} R^n$$

where $n$ is its dimensionality, $R$ is its radius, and $\Gamma$ is the gamma function.

The volume of the circumscribing hypercube, which has edges of length 2 (for the unit hypersphere, $R=1$), is given by $2^n$. For $n=2$, the fraction of samples falling outside the sphere, $1-\pi/4$, recovers the roughly 21% of wasted effort computed above. With these formulas, we can easily calculate the efficiency of sampling from an $n$-dimensional Gaussian.

%auto
w = [N(pi**(d/2)/gamma(d/2+1)/2**d) for d in range(20)]

@interact
def waste(nd=slider(0, 19, 1, label="number of dimensions", default=2)):
    """
    Plots the ratio of the volume of the unit hypersphere to that of its
    circumscribing hypercube, highlighting the chosen dimensionality.
    """
    P.clf()
    P.plot(range(20), w)
    P.xlabel("Dimensionality")
    P.ylabel("Fraction of samples inside the hypersphere")
    P.scatter(nd, w[nd], color='r')  # slider capped at 19 so w[nd] stays in range
    P.grid()
    P.savefig('dim.png')

As we can see, sampling in multidimensional spaces quickly becomes prohibitively inefficient.

However, these results assume uniform sampling along the axes. What if there were a way to concentrate our sampling effort on the higher-probability regions? The good news is that there are ways.

The most basic way to draw random samples from a probability distribution is some form of rejection sampling. This works by sampling uniformly from the support of the distribution (in the case of a normal distribution, from $-\infty$ to $\infty$) and retaining each point with a probability proportional to its density under the PDF. Even though the final sample will have fewer points in the tails, the underlying sampling algorithm still suffers from the curse of dimensionality and becomes very inefficient at high dimensionalities.
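As a concrete illustration, here is a minimal sketch of a rejection sampler for a one-dimensional standard normal (the function name, the truncation of the support to [-5, 5], and the number of proposals are arbitrary choices for this example, not part of BIP or NumPy):

def rejection_normal(n_proposals=10000, bound=5.0):
    # Propose uniformly on a (truncated) support...
    x = np.random.uniform(-bound, bound, size=n_proposals)
    # ...and accept each proposal with probability pdf(x)/max(pdf),
    # i.e. keep only the points falling under the curve of the PDF.
    u = np.random.uniform(0, st.norm.pdf(0), size=n_proposals)
    return x[u < st.norm.pdf(x)]

sample = rejection_normal()
print "accepted %d of 10000 proposals (%.1f%%)" % (len(sample), 100.0*len(sample)/10000)

Only about a quarter of the proposals are accepted here, and with this naive scheme the acceptance rate shrinks exponentially as the number of dimensions grows, which is exactly the inefficiency described above.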

However, more clever implementations of random number generators have successfully avoided the curse of dimensionality. Let's see how NumPy's multivariate_normal fares in this respect:

%python
def time_sampling(n):
    t0 = time.time()
    if n == 1:
        normal(size=10000)
    else:
        multivariate_normal(np.zeros(n), np.identity(n), size=10000)
    return time.time() - t0

P.clf()
P.plot(range(1, 30), [time_sampling(i) for i in range(1, 30)], 'o')
P.xlabel('dimensions')
P.ylabel('time to generate 10000 samples (s)')
P.savefig('mvsampleeff.png')

As we can see, the time cost to generate 10000 valid samples increases linearly with dimensionality, not exponentially as the curse of dimensionality might lead us to expect.

But loss of efficiency is not the only ill effect of the curse. Whenever we sample from a PDF, we want our sample histogram to resemble the PDF we are sampling from. Achieving this can be a big problem for distributions with complex shapes (e.g. multimodal ones) in high dimensions.

To address this particular issue we can resort to a clever sampling technique known as Latin Hypercube Sampling (LHS). This technique aims to guarantee that our samples are well spread out over the support, giving better adherence of the sample distribution to the original PDF.
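To build intuition, here is a minimal sketch of the idea in one dimension (a hypothetical helper written for this example, not the algorithm used by BIP): split the probability scale into n equal strata, draw one uniform point inside each, and map the points through the inverse CDF of the target distribution.

def lhs_1d(dist, n):
    # One uniform draw inside each of the n equal-probability strata [k/n, (k+1)/n)
    strata = (np.arange(n) + np.random.uniform(size=n)) / float(n)
    # Shuffling the strata matters when several dimensions are paired together
    np.random.shuffle(strata)
    # Transform to the target distribution via its inverse CDF
    return dist.ppf(strata)

sample = lhs_1d(st.norm, 100)  # exactly one sample in each probability band of width 1/100

Every probability band of width 1/n is guaranteed to receive exactly one sample, which is what keeps the empirical distribution close to the target even for modest sample sizes.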

The BIP package has an implementation of LHS in Python which integrates well with NumPy's and SciPy's RNGs. Like NumPy's multivariate_normal, it can take into account the correlation structure between the variables of our joint probability distribution, and it also allows each variable to have any distribution type supported by SciPy. One difference to be aware of is that it takes a correlation matrix as an argument, not a covariance matrix.

Let's take a look at it:

cm = np.array([[1, .8], [.8, 1]])
c = lhs.lhs([st.norm, st.beta], [(50, 1), (10, 2)], 2000, False, cm)
r, p = st.spearmanr(c[0], c[1])
print "Correlation: %1.4f, p-value: %s" % (r, p)
P.clf()
P.scatter(c[0], c[1])
P.xlabel("N(50,1)")
P.ylabel("Beta(10,2)")
P.savefig('lhs.png')
Correlation: 0.7909, p-value: 0.0

The figure above contains a scatter plot of a bivariate Normal-Beta distribution having a correlation of 0.8 between its variables.

Now that we have seen that it works, let's switch back to our discussion about the Curse and how LHS can help us avoid its ill effects. To check whether LHS can improve the quality of our sampling, let's go back to multivariate normals, for simplicity's sake.

We need a little more code first:

def get_mvn_samples(n, siz=100):
    siz //= n  # makes the total number of samples be siz, not siz*n
    ls = np.array(lhs.lhs([st.norm]*n, [(0, 1)]*n, siz, False, np.identity(n))).T
    rs = multivariate_normal(np.zeros(n), np.identity(n), size=siz)
    ls.shape = (siz, n)
    return rs, ls

def normality_test(n, siz):
    rs, ls = get_mvn_samples(n, siz)
    rsl = np.sum([like.Normal(rs[:, i], 0, 1) for i in range(n)])
    lsl = np.sum([like.Normal(ls[:, i], 0, 1) for i in range(n)])
    return np.exp(rsl - lsl)

The first function above generates samples from a multivariate normal distribution of a given size and dimensionality. It returns two samples: one drawn randomly and one drawn through LHS.

The second function above calculates the log-likelihood of each sample under a standard normal model and returns the likelihood ratio $\frac{\Lambda_{random}}{\Lambda_{LHS}}$, which on the log scale is simply the difference between the two log-likelihoods.
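In symbols, writing $\ell$ for the summed log-likelihoods computed with like.Normal, the quantity returned is

$$LR = \frac{\Lambda_{random}}{\Lambda_{LHS}} = \exp\left(\ell_{random} - \ell_{LHS}\right),$$

which is exactly the np.exp(rsl - lsl) in the code above; values below 1 mean the LHS sample fits the normal model better than the random sample does.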

So let's look at some results:

%python
P.clf()
P.plot(range(1, 60), [normality_test(i, 4000) for i in xrange(1, 60)], 'o')
P.title('Sample size = 4000')
P.yscale('log')
P.xlabel('dimensions')
P.ylabel('Likelihood ratio of samples N(0,1)')
P.savefig('lik_dims.png')

The plot above makes it clear that at higher dimensionalities the LHS sample resembles a multivariate normal distribution better than the random sample does, while at low to mildly high dimensionalities LHS does not give us much of an advantage over random sampling. Remember that in this plot, the smaller the likelihood ratio, the better LHS performs relative to random sampling.

%python
P.clf()
P.plot(range(100, 500), [normality_test(5, i) for i in xrange(100, 500)], 'o')
P.plot(range(100, 500), np.ones(400), 'r:')  # LR=1 reference line
P.xlabel('sample size - 5D MVN')
P.yscale("log")
P.ylabel('Likelihood ratio of samples N(0,1)')
P.savefig('lik_ssize.png')

The plot above indicates that sample size alone, when dimensionality is fixed (5D in this case), does not seem to affect how much the LHS method improves the normality of the sample. LHS is still better at just about all sample sizes tested here (LR < 1).