
Section 4: Bayesian Regression

Previously, we addressed the question: "is my chat response time affected by who I'm talking to?" We estimated model parameters for each individual I've had conversations with. But sometimes we want to understand the effect of additional factors such as "day of week," "time of day," etc. We can use GLMs (generalized linear models) to better understand the effects of these factors.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import scipy
import scipy.stats as stats
import seaborn.apionly as sns
import statsmodels.api as sm
import theano.tensor as tt
from sklearn import preprocessing

%matplotlib inline
plt.style.use('bmh')
colors = ['#348ABD', '#A60628', '#7A68A6', '#467821', '#D55E00',
          '#CC79A7', '#56B4E9', '#009E73', '#F0E442', '#0072B2']

messages = pd.read_csv('data/hangout_chat_data.csv')

Linear regression reminder

When we have a response $y$ that is continuous from $-\infty$ to $\infty$, we can consider using a linear regression represented by:

$$y \sim \mathcal{N}(\mu, \sigma)$$
$$\mu = \beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n$$

We read this as: our response is normally distributed around $\mu$ with a standard deviation of $\sigma$. The value of $\mu$ is described by a linear function of explanatory variables $X\beta$ with a baseline intercept $\beta_0$.
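To make this concrete, here is a minimal simulation sketch (with made-up values for $\beta_0$, $\beta_1$ and $\sigma$, not anything estimated from the chat data):

# Illustrative only: simulate data from the linear model above
import numpy as np

np.random.seed(42)
beta_0, beta_1, sigma = 2.0, 0.5, 1.0   # hypothetical parameters
x = np.random.uniform(0, 10, size=500)  # a single explanatory variable
mu = beta_0 + beta_1 * x                # linear predictor
y = np.random.normal(mu, sigma)         # response is normal around mu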

In the event you're not modeling a continuous response variable from $-\infty$ to $\infty$, you may need to use a link function to transform your response range. For a Poisson distribution, the canonical link function used is the log link. This can be formally described as:

$$y \sim \text{Poi}(\mu)$$
$$\log(\mu) = \beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n$$
$$\mu = e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n}$$
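To see what the log link buys us, the short sketch below (again with hypothetical coefficients, not values from the chat data) draws counts from a Poisson whose mean is the exponentiated linear predictor, so the simulated response can never be negative:

# Illustrative only: simulate counts from a Poisson GLM with a log link
import numpy as np

np.random.seed(0)
beta_0, beta_1 = 1.0, 0.3                 # hypothetical coefficients
x = np.random.uniform(0, 5, size=500)
mu = np.exp(beta_0 + beta_1 * x)          # inverse of the log link
y = np.random.poisson(mu)                 # counts are always >= 0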

This is considered to be a fixed effects model. The $\beta$ coefficients are estimated across the entire population as opposed to estimating separate parameters for each person (like in the pooled and partially pooled model in Section 3).

Fixed effects Poisson regression

To construct a Poisson regression in PyMC3, we need to apply the log link function to $\mu$. The underlying data model in PyMC3 uses Theano, and hence we need to use the Theano tensor method theano.tensor.exp() as shown below.

X = messages[['is_weekend','day_of_week','message_length','num_participants']].values
_, num_X = X.shape

with pm.Model() as model:
    # Weakly informative priors for the intercept and each coefficient
    intercept = pm.Normal('intercept', mu=0, sd=100)
    beta_message_length = pm.Normal('beta_message_length', mu=0, sd=100)
    beta_is_weekend = pm.Normal('beta_is_weekend', mu=0, sd=100)
    beta_num_participants = pm.Normal('beta_num_participants', mu=0, sd=100)

    # Apply the inverse of the log link with theano.tensor.exp()
    mu = tt.exp(intercept
                + beta_message_length*messages.message_length
                + beta_is_weekend*messages.is_weekend
                + beta_num_participants*messages.num_participants)

    y_est = pm.Poisson('y_est', mu=mu, observed=messages['time_delay_seconds'].values)

    start = pm.find_MAP()
    step = pm.Metropolis()
    trace = pm.sample(200000, step, start=start, progressbar=True)
100%|██████████| 200000/200000 [01:20<00:00, 2491.30it/s]
_ = pm.traceplot(trace)
[Trace plots for the intercept and beta coefficients]
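If you prefer a tabular view of the posterior to trace plots, PyMC3 also provides a summary function. A minimal sketch, assuming the trace sampled above and discarding the first 20,000 samples as burn-in (older PyMC3 versions, like the one used here, spell the keyword varnames):

# Posterior means, standard deviations and credible intervals
_ = pm.summary(trace[20000:], varnames=['intercept', 'beta_message_length',
                                        'beta_is_weekend', 'beta_num_participants'])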

As you can see from the above results, the baseline intercept $\beta_0$ has an estimated value between 2.5 and 2.9. So what does this mean?

Unfortunately, interpreting the parameters of a Poisson regression is more involved than for a simple linear regression ($y = \beta x$). In the linear regression, we could say that for every unit increase in $x$, $\hat{y}$ increases by $\beta$. However, in the Poisson regression we need to consider the link function. The following Cross Validated post explains in great detail how we arrive at the formulation below.

For a Poisson model, given a unit change in $x$, the fitted $\hat{y}$ changes by $\hat{y}\left(e^{\beta} - 1\right)$.

The main takeaway is that the effect of changing $x$ depends on the current value of $y$. Unlike in simple linear regression, a unit change in $x$ does not cause a consistent change in $y$.
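As a quick worked example with hypothetical numbers: taking an intercept near the middle of the range above, the baseline response time (all covariates at zero) is $e^{2.7} \approx 14.9$ seconds, and a coefficient of $\beta = 0.1$ would change a fitted value of $\hat{y} = 20$ seconds by roughly $20\,(e^{0.1} - 1) \approx 2.1$ seconds per unit increase in that covariate:

import numpy as np

intercept = 2.7                    # roughly the middle of the range reported above
print(np.exp(intercept))           # baseline response time, ~14.9 seconds

beta, y_hat = 0.1, 20.0            # hypothetical coefficient and current fitted value
print(y_hat * (np.exp(beta) - 1))  # change per unit increase in x, ~2.1 seconds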

Marginal and pairwise density plots

The plot below shows the marginal densities (along the diagonal) and the pairwise densities (in the off-diagonal panels). This plot is very useful for understanding how the parameter estimates interact with one another. In this example, we can see that as the coefficient for the number of participants increases, the baseline intercept decreases.

_ = sns.pairplot(pm.trace_to_dataframe(trace[20000:]), plot_kws={'alpha':.5})
[Pair plot of the marginal and pairwise posterior densities]

Mixed effects Poisson regression

We can make a small extension to the above model by including a random intercept parameter. This allows us to estimate a baseline parameter value $\beta_0$ for each person I communicate with, while all other parameters are estimated across the entire population. For each person $i$ and each message $j$, this is formally represented as:

$$y_{ji} \sim \text{Poi}(\mu)$$
$$\log(\mu) = \beta_{0_i} + \beta_1 x_1 + \ldots + \beta_n x_n$$

Introducing this random effects parameter $\beta_{0_i}$ for each person $i$ allows the model to establish a different baseline for each person responded to, whilst estimating the effects of the covariates on the response across the entire population.

# Convert categorical variables to integer
le = preprocessing.LabelEncoder()
participants_idx = le.fit_transform(messages['prev_sender'])
participants = le.classes_
n_participants = len(participants)

with pm.Model() as model:
    # One intercept per person (random intercept), shared slopes for everyone
    intercept = pm.Normal('intercept', mu=0, sd=100, shape=n_participants)
    slope_message_length = pm.Normal('slope_message_length', mu=0, sd=100)
    slope_is_weekend = pm.Normal('slope_is_weekend', mu=0, sd=100)
    slope_num_participants = pm.Normal('slope_num_participants', mu=0, sd=100)

    # Index the intercept by person and apply the inverse of the log link
    mu = tt.exp(intercept[participants_idx]
                + slope_message_length*messages.message_length
                + slope_is_weekend*messages.is_weekend
                + slope_num_participants*messages.num_participants)

    y_est = pm.Poisson('y_est', mu=mu, observed=messages['time_delay_seconds'].values)

    start = pm.find_MAP()
    step = pm.Metropolis()
    trace = pm.sample(200000, step, start=start, progressbar=True)
100%|██████████| 200000/200000 [01:25<00:00, 2333.30it/s]
_ = pm.traceplot(trace[20000:])
[Trace plots for the per-person intercepts and the population-level slopes]

The interpretation of the above results is interesting:

  • Each person has a different baseline response rate (as shown in the pooled and partially pooled model in Section 3)

  • Longer messages take marginally longer to respond to

  • You are more likely to get a slow response if you message me on the weekend

  • I tend to reply more quickly to conversations that have multiple people added to them (group chats)

And after accounting for the effect of each covariate on the response, the model estimates the below $\beta_0$ parameters.

_ = plt.figure(figsize=(5, 6))
_ = pm.forestplot(trace[20000:], varnames=['intercept'], ylabels=participants)
[Forest plot of the estimated intercept for each person]
# Apply pretty styles
from IPython.core.display import HTML

def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)

css_styling()