Section 3: Hierarchical modeling
A key strength of Bayesian modeling is the ease and flexibility with which one can implement a hierarchical model. This section will implement and compare a pooled & partially pooled model.
Model Pooling
Let's explore a different way of modeling the response time for my Hangouts conversations. My intuition would suggest that my tendency to reply quickly to a chat depends on who I'm talking to. I might be more likely to respond quickly to my girlfriend than to a distant friend. As such, I could decide to model each conversation independently, estimating parameters μ_i and α_i for each conversation i.
One consideration we must make is that some conversations have very few messages compared to others. As such, our estimates of response time for conversations with few messages will have a higher degree of uncertainty than for conversations with a large number of messages. The below plot illustrates the discrepancy in sample size per conversation.
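To make the sample-size concern concrete, here is a small simulated sketch (not the post's actual chat data, and the true mean of 20 seconds is made up): estimates of a conversation's mean response time computed from only a handful of messages vary far more than estimates computed from hundreds of messages.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_mean(n_messages, n_sims=2000, mu=20.0, alpha=2.0):
    """Repeatedly estimate the mean response time from n_messages samples.

    Response times are drawn from a Gamma-Poisson (negative binomial)
    process with true mean `mu` - a hypothetical stand-in for real data.
    """
    lam = rng.gamma(shape=alpha, scale=mu / alpha, size=(n_sims, n_messages))
    y = rng.poisson(lam)
    return y.mean(axis=1)

small = estimate_mean(n_messages=10)    # conversation with few messages
large = estimate_mean(n_messages=500)   # conversation with many messages

# The small-sample estimates are far more variable around the true mean.
print(small.std(), large.std())
```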
For each message j and each conversation i, we represent the model as a likelihood over response times y_ji with conversation-specific parameters μ_i and α_i, each given a flat uniform prior and estimated independently of every other conversation.
The above plots show the observed data (left) and the posterior predictive distribution (right) for three example conversations we modeled. As you can see, the posterior predictive distribution can vary considerably across conversations. This variation could accurately reflect the characteristics of each conversation, or it could be an artifact of small sample size.
If we combine the posterior predictive distributions across these models, we would expect the result to resemble the distribution of the overall observed dataset. Let's perform the posterior predictive check.
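The mechanics of that check can be sketched roughly as follows (with simulated stand-in draws rather than samples from the actual model trace; the three conversations and their means are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-conversation posterior-predictive samples. In reality
# these would come from the fitted model's trace.
ppc_per_conversation = {
    "conv_a": rng.poisson(rng.gamma(2.0, 10.0, size=5000)),  # mean ~20s
    "conv_b": rng.poisson(rng.gamma(2.0, 25.0, size=5000)),  # mean ~50s
    "conv_c": rng.poisson(rng.gamma(2.0, 5.0, size=5000)),   # mean ~10s
}

# Combine the per-conversation predictive draws into one overall distribution.
combined = np.concatenate(list(ppc_per_conversation.values()))

# Compare against the pooled observed data (stubbed here with draws from
# the same process, standing in for the real messages).
observed = np.concatenate([
    rng.poisson(rng.gamma(2.0, 10.0, size=300)),
    rng.poisson(rng.gamma(2.0, 25.0, size=300)),
    rng.poisson(rng.gamma(2.0, 5.0, size=300)),
])
print(combined.mean(), observed.mean())
```

If the model is adequate, summary statistics of the combined predictive draws should sit close to those of the observed data.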
Yes, the posterior predictive distribution resembles the distribution of the observed data. However, I'm concerned that some of the conversations have very little data, and hence their estimates are likely to have high variance. One way to mitigate this risk is to share information across conversations - while still estimating parameters for each conversation. We call this partial pooling.
Partial pooling
Just like in the pooled model, a partially pooled model has parameter values estimated for each conversation i. However, the parameters are connected together via hyperparameters. This reflects our belief that my response_times per conversation have similarities with one another via my own natural tendency to respond quickly or slowly.
Following on from the above example, we will estimate parameter values μ and α for a Negative Binomial distribution. Rather than using a uniform prior, I will use a Gamma distribution for both μ and α. This will enable me to introduce more prior knowledge into the model, as I have certain expectations as to what values μ and α will be.
First, let's have a look at the Gamma distribution. As you can see below, it is very flexible.
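A quick way to get a feel for that flexibility (a scipy sketch standing in for the post's plots): the Gamma distribution with shape a and rate b has mean a/b and variance a/b², and sweeping the shape moves the density from an exponential-like decay toward a symmetric bell. The particular (a, b) pairs below are arbitrary examples.

```python
from scipy import stats

# Gamma(shape=a, rate=b) has mean a/b and variance a/b**2.
# Note scipy parameterizes by scale, where scale = 1/rate.
for a, b in [(1, 0.1), (2, 0.1), (5, 0.5), (9, 2.0)]:
    dist = stats.gamma(a, scale=1 / b)
    print(f"shape={a}, rate={b}: mean={dist.mean():.1f}, sd={dist.std():.2f}")
```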
The partially pooled model can be formally described as follows: each conversation's parameters μ_i and α_i are drawn from Gamma priors, and the parameters of those Gamma distributions are hyperparameters shared across all conversations, with their own (hyper)priors.
You can see for the estimates of μ and α that we have multiple plots - one for each conversation i. The difference between the pooled and the partially pooled model is that the parameters of the partially pooled model (μ and α) have a hyperparameter that is shared across all conversations i. This brings two benefits:
Information is shared across conversations, so conversations with a limited sample size "borrow" knowledge from other conversations during estimation, reducing the variance of their estimates
We get an estimate for each conversation as well as an overall estimate across all conversations
Let's have a quick look at the posterior predictive distribution.
Shrinkage effect: pooled vs hierarchical model
As discussed, the partially pooled model shares a hyperparameter for both μ and α. By sharing knowledge across conversations, it has the effect of shrinking the estimates closer together - particularly for conversations that have little data.
This shrinkage effect is illustrated in the below plot. You can see how the μ and α parameters are drawn together by the effect of the hyperparameter.
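The mechanics of that pull can be sketched with a toy precision-weighted estimator - an empirical-Bayes-style stand-in for the full MCMC model, with all numbers below made up. Each conversation's estimate is a compromise between its own sample mean and the overall mean, and the fewer messages a conversation has, the harder it is pulled toward the overall mean.

```python
import numpy as np

# Hypothetical data: messages per conversation and per-conversation
# mean response times (seconds). None of these come from the real dataset.
counts = np.array([300, 150, 8, 4])
means = np.array([22.0, 45.0, 80.0, 3.0])
tau2 = 150.0    # assumed between-conversation variance (hyperparameter)
sigma2 = 400.0  # assumed within-conversation variance

grand_mean = np.average(means, weights=counts)

# Precision weighting: each estimate is pulled toward the grand mean,
# and the pull is strongest where the sample size n is smallest.
weight = tau2 / (tau2 + sigma2 / counts)
partially_pooled = weight * means + (1 - weight) * grand_mean
print(partially_pooled)
```

Conversations with hundreds of messages barely move, while the 8- and 4-message conversations are shrunk substantially toward the overall mean - the same qualitative behavior the plot shows for the MCMC estimates.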
Asking questions of the posterior
Let's start to take advantage of one of the best aspects of Bayesian statistics - the posterior distribution. Unlike frequentist techniques, we get a full posterior distribution as opposed to a single point estimate. In essence, we have a basket full of credible parameter values. This enables us to ask some questions in a fairly natural and intuitive manner.
What are the chances I'll respond to my friend in less than 10 seconds?
To estimate this probability, we can look at the posterior predictive distribution for Timothy & Andrew's response_time and check how many of the samples are < 10 seconds. When I first heard of this technique, I thought I had misunderstood because it seemed overly simplistic.
I find this methodology to be very intuitive and flexible. The left plot above separates the samples from the posterior predictive according to whether they are greater or less than 10 seconds. We can compute the probability by calculating the proportion of samples that are less than 10. The plot on the right simply computes this probability for each response-time threshold from 0 to 60. So, it looks like Anna & Yonas have a 36% & 20% chance of being responded to in less than 10 seconds, respectively.
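The same proportion-of-samples computation can be sketched in a few lines of numpy (with simulated stand-in posterior-predictive draws rather than the actual model trace, so the resulting probability is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical posterior-predictive samples for one conversation
# (mean response time ~20s); in practice these come from the trace.
ppc_samples = rng.poisson(rng.gamma(2.0, 10.0, size=20000))

# P(response_time < 10s) is simply the proportion of predictive
# samples below 10.
p_under_10 = (ppc_samples < 10).mean()
print(f"P(response < 10s) = {p_under_10:.2f}")

# The same trick gives the whole curve P(response < t) for t = 0..60,
# which is what the right-hand plot sweeps over.
ts = np.arange(61)
p_curve = (ppc_samples[None, :] < ts[:, None]).mean(axis=1)
```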
How do my friends pair off against each other?
>> Go to the Next Section
References
The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3 by Thomas Wiecki