This is the first of a two-part series of case studies in which I will use a simple data analysis example to illustrate different methods for Bayesian inference.
Until recently, I only had a very vague idea of what Bayesian inference is, and I knew even less about Markov Chain Monte Carlo methods and other related concepts. I realized that starting with a simple example and a fair amount of visualizations really helped me grasp the key ideas behind these Bayesian techniques. With this case study I hope to provide precisely that. My goal is to give a very basic notion of the Bayesian inference process, explain how some popular computational methods for Bayesian inference work and, hopefully, shed some light on why sophisticated sampling methods are needed in actual, complex scenarios. Needless to say, I am not an expert on this matter and this really is more a part of my own learning process than an authoritative explanation of the topic.
In this first demo, I will describe the problem at hand and show a very naïve way to perform Bayesian inference that works in very simple cases. In the second part of the series, I will describe how more sophisticated methods work and apply them to the same example. Let's begin!
The key aspect of Bayesian inference is that it defines a full probability distribution over the quantities of interest, i.e. the parameters (in parametric models) or the predictions of the target variable. This is achieved by specifying a sampling distribution for the data, conditional on the parameters, and a prior distribution over those parameters, which together give us, via Bayes' theorem, a posterior distribution:
$$ p(\theta | y) = \dfrac{p(y | \theta) p(\theta)}{\int p(y | \theta) p(\theta) d\theta} $$
where $\theta$ represents the parameters and $y$ is our data.
The integral in the denominator ($p(y)$, also known as the evidence) is a normalizing constant that does not depend on $\theta$. Taking this into account, the posterior is usually expressed, up to a multiplicative constant, as the product of the likelihood $p(y | \theta)$ and the prior $p(\theta)$:
$$ p(\theta | y) \propto p(y | \theta) p(\theta) $$
Much of the inference we are interested in can be expressed as expectations of functions over the posterior distribution (think of the mean or variance, but it really could be anything). Recall that expectations are simply integrals over the target probability distribution:
$$ \mathbb{E}[f(\theta)] = \int f(\theta)p(\theta)d\theta $$
What makes Bayesian analysis hard is precisely computing these integrals. In most cases, there is no analytical solution to them and we need to approximate them using numerical methods, some of which we will explore here.
To illustrate Bayesian data analysis in action I will use a very simple example, based on data from a study that measured whether subjects of different ages developed Coronary Heart Disease (Hosmer, 2013).
Taking a look at our data we see that for each record we have the age, age group and a binary indicator for Coronary Heart Disease (CHD):
| | id | age | agegrp | chd |
|---|---|---|---|---|
| 1 | 1 | 20 | 20-39 | No |
| 2 | 2 | 23 | 20-39 | No |
| 3 | 3 | 24 | 20-39 | No |
| 4 | 4 | 25 | 20-39 | No |
| 5 | 5 | 25 | 20-39 | Yes |
We can represent the data on a scatter plot, with age on the horizontal axis and the CHD indicator on the vertical axis (0 = No CHD, 1 = CHD). As we can see, and as expected, it looks like the proportion of subjects with CHD grows with age.
import numpy as np
# x: subjects' ages; y: binary CHD indicator (1 = 'Yes')
x = data['age'].values
y = (data['chd'] == 'Yes').values.astype(np.int64)
Now, given data like these, we might be interested in understanding the effect of age on the probability of Coronary Heart Disease. As you may have already guessed, this example fits well with a logistic regression model, which models the probability of a binary outcome as a function of a continuous independent variable.
Let us describe the data distribution of our logistic regression model as follows:
$$ y_i | \alpha, \beta, x_i \sim \text{Bernoulli}(\sigma(\alpha + \beta x_i)) $$
where $\sigma(\cdot)$ is the sigmoid or inverse logit function: $\sigma(z) = \text{logit}^{-1}(z) = \frac{1}{1 + e^{-z}}$.
Another way to represent this model is by introducing the (auxiliary) Bernoulli parameter $\theta_i$:
$$ y_i | \theta_i \sim \text{Bernoulli}(\theta_i) $$
$$ \theta_i = \sigma(\alpha + \beta x_i) = \text{logit}^{-1}(\alpha + \beta x_i) $$
In simple terms, the binary target variable $y$, which represents the presence of CHD, is Bernoulli-distributed with a parameter that depends on a single covariate $x$, i.e. the subject's age.
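To make this data distribution concrete, here is a minimal sketch that simulates a handful of observations from the model. The parameter values `alpha_true` and `beta_true`, the age range and the sample size are made up purely for illustration and are not estimates from the data:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, chosen only for illustration
alpha_true, beta_true = -5.0, 0.1
ages = rng.integers(20, 70, size=10)

# theta_i = sigma(alpha + beta * x_i), then y_i ~ Bernoulli(theta_i)
theta = 1 / (1 + np.exp(-(alpha_true + beta_true * ages)))
y_sim = rng.binomial(n=1, p=theta)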
To fully specify our Bayesian model, we need to define a prior $p(\alpha, \beta)$ over the model's parameters $\alpha$ and $\beta$. There is a lot to be discussed about the how and why of choosing a particular prior, but we will not get into it at the moment.
Following a similar example (Gelman, 2013), let us define a non-informative, uniform prior over $\alpha$ and $\beta$, as follows:
$$ p(\alpha, \beta) \propto 1 $$
Note that using a uniform prior like the one above is not always possible! It might lead to "improper" posteriors or nonsensical results. It is common to use weakly informative priors instead, which include a minimum of prior information.
Remember that we want to make inferences about the model's parameters $\alpha$ and $\beta$. Given the above likelihood and prior, we can express the posterior distribution, up to a multiplicative factor, as:
$$ p(\alpha, \beta | y) \propto p(\alpha, \beta) p(y | \alpha, \beta) \propto \prod_{i=1}^{n} p(y_i | \alpha, \beta, x_i) = \prod_{i=1}^{n} \sigma(\alpha + \beta x_i)^{y_i} (1 - \sigma(\alpha + \beta x_i))^{1 - y_i} $$
where the likelihood is simply factored into the product of each data point's likelihood, assuming the data are i.i.d.
For computational and other reasons, it is typical to work with the logarithm of the posterior which, up to an additive constant, is:
$$ \log p(\alpha, \beta | y) = \sum_{i=1}^{n} \left[ y_i \log \sigma(\alpha + \beta x_i) + (1 - y_i) \log\left(1 - \sigma(\alpha + \beta x_i)\right) \right] $$
We can define the likelihood, prior and posterior as Python functions, shown in the code block below:
from scipy.special import expit  # numerically stable sigmoid (inverse logit)

def lik(y, x, alpha, beta):
    # Likelihood over a grid of parameter values: broadcasting gives theta the
    # shape (len(beta), len(alpha), len(x)); the product runs over data points
    theta = expit(alpha[None, :, None] + beta[:, None, None] * x[None, None, :])
    return np.prod(theta ** y[None, None, :] * (1 - theta) ** (1 - y[None, None, :]), axis=-1)

def log_lik(y, x, alpha, beta):
    # Sum of the Bernoulli log-likelihoods over the data points
    theta = expit(alpha[None, :, None] + beta[:, None, None] * x[None, None, :])
    return np.sum(np.log(theta) * y[None, None, :] + np.log(1 - theta) * (1 - y[None, None, :]), axis=-1)

def prior(alpha, beta):
    # Uniform (improper) prior: constant density
    return 1

def log_prior(alpha, beta):
    return 0

def posterior(y, x, alpha, beta, prior, lik):
    # Unnormalized posterior: prior times likelihood
    return prior(alpha, beta) * lik(y, x, alpha, beta)

def log_posterior(y, x, alpha, beta, log_prior, log_lik):
    # Unnormalized log-posterior: log-prior plus log-likelihood
    return log_prior(alpha, beta) + log_lik(y, x, alpha, beta)
As simple as this model might be, the posterior distribution above does not have a closed-form solution from which we can readily sample or whose properties we can evaluate directly (as would be the case with well-known distributions such as the Gaussian).
Let's say we want to give a 90% posterior probability interval for each of our parameters or find out their posterior mean. How do we use our (unnormalized) posterior density in order to make inferences about our model's parameters?
One naïve approach is to evaluate the posterior for several possible parameter values, normally referred to as a grid of parameter values, and use that discrete set of evaluations as an approximation to the actual posterior. Using grid evaluation, we could empirically calculate the parameters' mean, standard deviation and other quantities of interest. More formally, this is equivalent to approximating expectations (integrals) using a quadrature rule:
$$ \mathbb{E}[f(\theta)] = \int f(\theta)p(\theta)d\theta \approx \sum_{s=1}^S w_s f(\theta_s) p(\theta_s) $$
where $w_s$ are weights that represent the area or volume of parameter space around the $s$-th grid point.
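As a quick, self-contained illustration of the quadrature idea (a toy example of mine, separate from the CHD problem), we can approximate the mean and variance of a standard normal distribution on an equally spaced grid, where every weight $w_s$ is simply the grid spacing:

from scipy import stats

# Equally spaced grid over a range that covers essentially all of the mass
theta_grid = np.linspace(-10, 10, 1001)
w = theta_grid[1] - theta_grid[0]  # constant weight: the grid spacing

density = stats.norm.pdf(theta_grid)                                # p(theta_s)
approx_mean = np.sum(w * theta_grid * density)                      # close to the true mean, 0
approx_var = np.sum(w * (theta_grid - approx_mean) ** 2 * density)  # close to the true variance, 1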
In general, there are some challenges with this approach. First, it does not scale to higher dimensions, that is, to models with more parameters. The number of parameter-space points required to cover a sufficient part of the actual posterior grows exponentially with the number of parameters: with 200 points per dimension, a model with 10 parameters would already require $200^{10} \approx 10^{23}$ posterior evaluations. This is yet another incarnation of the curse of dimensionality. In this case we only have two parameters to estimate, which makes grid evaluation feasible.
That said, there is a second challenge, which is defining the grid of points itself. Ideally, we would like to have enough posterior evaluations to capture a substantial part of the actual posterior distribution. What should be the range of the parameter grid? And what should be the grid's granularity?
To answer these last two questions, we can use a coarse estimate of the parameter values, based on their maximum likelihood estimate (MLE), and define the grid around that point.
The code block below uses SciPy's minimize function to find the minimum of the negative log-likelihood for our example, which yields the maximum likelihood estimate of the parameters:
from scipy import optimize

def log_lik_objective(arr, x, y):
    # Log-likelihood as a function of a single parameter vector (alpha, beta)
    alpha, beta = arr
    theta = expit(alpha + beta * x)
    return np.sum(np.log(theta) * y + np.log(1 - theta) * (1 - y))

# Minimize the negative log-likelihood to obtain the MLE
res = optimize.minimize(lambda arr: (-1) * log_lik_objective(arr, x=x, y=y),
                        x0=np.array([0.0, 0.0]),
                        method='L-BFGS-B')
alpha_mle, beta_mle = res.x
After some tweaking and experimenting, we define our grid roughly centered at the MLE of $(-5.3, 0.11)$, over $\alpha \in [-10, -1]$ and $\beta \in [0.03, 0.21]$. We create a 200-by-200 grid over these ranges of values. We can then evaluate and visualize the posterior over this 2D grid, as well as the marginal distributions of $\alpha$ and $\beta$, which are simply calculated by summing the posterior values over the appropriate axis.
As you can see below, the distribution is more or less centered around the MLE. From this visualization we can also gather how uncertain we are about our parameters, and the degree of correlation between them, which in this case is quite high.
As we would expect, age has a positive relationship with the probability of CHD, encoded in the parameter $\beta$, with very high probability (pretty much all of the posterior mass), although there is some uncertainty around the actual strength of that connection. The intercept term $\alpha$ is a bit harder to interpret, since the "probability of CHD at age 0" does not really make a lot of sense for this use case.
# Parameter grid around the MLE
alphas = np.linspace(-10, -1, 200)
betas = np.linspace(0.03, 0.21, 200)
alphas_grid, betas_grid = np.meshgrid(alphas, betas)  # 2D grid of (alpha, beta) combinations

# Unnormalized posterior evaluated over all (alpha, beta) pairs; shape (len(betas), len(alphas))
posterior_grid = posterior(y, x, alphas, betas, prior=prior, lik=lik)

# Marginal for alpha: sum out beta, then normalize
marginal_alpha = posterior_grid.sum(axis=0)
marginal_alpha = marginal_alpha / marginal_alpha.sum()
marginal_cdf_alpha = marginal_alpha.cumsum() / marginal_alpha.sum()

# Marginal for beta: sum out alpha, then normalize
marginal_beta = posterior_grid.sum(axis=1)
marginal_beta = marginal_beta / marginal_beta.sum()
marginal_cdf_beta = marginal_beta.cumsum() / marginal_beta.sum()
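For example, here is a minimal sketch (my own, building on the arrays computed above, with variable names of my choosing) of how the posterior means and 90% probability intervals asked about earlier could be read off the grid marginals, treating each grid point's normalized mass as its weight and inverting the marginal CDFs by interpolation:

# Posterior means: weighted averages of the grid values
alpha_mean = np.sum(alphas * marginal_alpha)
beta_mean = np.sum(betas * marginal_beta)

# Approximate 90% posterior intervals by inverting the marginal CDFs
alpha_interval_90 = np.interp([0.05, 0.95], marginal_cdf_alpha, alphas)
beta_interval_90 = np.interp([0.05, 0.95], marginal_cdf_beta, betas)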
We can produce approximate samples from the posterior by sampling each grid point with probability proportional to its density value and adding some small jitter, as shown in the scatter plot on the left side, below. For each sampled parameter pair $(\alpha_s, \beta_s)$ we can visualize the resulting logistic curve on top of our data, as shown in the plot to the right; one particular posterior sample and its corresponding curve have been highlighted in red for illustration purposes. This way, we get an idea not only of the distribution of our model's parameters, but also of the kind of predictions we would make with them, i.e. of the possible logistic curves that could fit our data, with varying probabilities. In addition, we can find the mean curve and construct probability intervals for our logistic curve, which give us an idea of the uncertainty in predicting the probability of CHD for a given subject based on their age.
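Here is a minimal sketch of that grid-sampling step (my own take on it; the number of samples, the random seed and the jitter of half a grid step are arbitrary choices):

rng = np.random.default_rng(42)

# Flatten the grid and turn it into a discrete probability distribution
p_flat = posterior_grid.ravel() / posterior_grid.sum()

# Draw grid cells with probability proportional to their posterior density
n_samples = 1000
idx = rng.choice(p_flat.size, size=n_samples, p=p_flat)
beta_idx, alpha_idx = np.unravel_index(idx, posterior_grid.shape)

# Add uniform jitter within each cell so samples do not sit exactly on grid points
da, db = alphas[1] - alphas[0], betas[1] - betas[0]
alpha_samples = alphas[alpha_idx] + rng.uniform(-da / 2, da / 2, size=n_samples)
beta_samples = betas[beta_idx] + rng.uniform(-db / 2, db / 2, size=n_samples)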
The procedure described above, by which we visualize the data distribution averaged over parameter values drawn from our posterior, is an approximation of the posterior predictive distribution. More formally:
$$ p(\tilde{y} | y) = \int p(\tilde{y} | \theta)p(\theta | y) d\theta $$
Bayesian analysis allows us to account for the uncertainty in our knowledge of the parameters when making predictions about unseen instances, which is, in my view, a very powerful concept.
With the ability to sample from our posterior, we can answer all sorts of questions, beyond the simple parameter inferences and predictive visualizations above. For example, we might be interested in the probability of CHD for a subject aged 60, as shown in the histogram below. According to our model and data, the probability of developing CHD for a 60-year-old subject (from the population from which our data was gathered) seems to lie between 66% and 88% with 90% probability.
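As a sketch, using the alpha_samples and beta_samples arrays from the grid-sampling snippet above (names of my choosing), this interval can be computed as follows:

# Posterior distribution of P(CHD) at age 60: one value per posterior sample
p_chd_60 = expit(alpha_samples + beta_samples * 60)

# 90% posterior interval
lower, upper = np.percentile(p_chd_60, [5, 95])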
We might also want to know the age at which the probability of CHD reaches 50%, which, given our model, is simply the ratio $-\frac{\alpha}{\beta}$ (since $\sigma(\alpha + \beta x) = 0.5$ exactly when $\alpha + \beta x = 0$). We can easily use our posterior samples to construct a histogram for this particular quantity, as shown below. In this case, our posterior tells us that the median age at which the probability of CHD equals 50% is roughly 48 years old.
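Again as a sketch with the same posterior samples:

# Age at which P(CHD) = 0.5, i.e. -alpha / beta, one value per posterior sample
age_at_50 = -alpha_samples / beta_samples

age_at_50_median = np.median(age_at_50)
age_at_50_interval_90 = np.percentile(age_at_50, [5, 95])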
It would be fair to question whether these inferences are reasonable. Taken at face value, our model might lead us to believe that Coronary Heart Disease is far more likely than we might expect. We would need to understand how our data were sampled and for what population our inferences are valid.
That said, this example simply aims to illustrate how Bayesian inference works, at its most basic level. Without going into whether these inferences regarding CHD are reasonable or not, I hope that these examples show how Bayesian data analysis provides us with powerful tools to represent the uncertainty around our estimates.
To summarize, we have walked through a fairly simple data analysis example to show how Bayesian inference works. The key idea behind Bayesian data analysis is that we define probability distributions over both data $y$ and parameters $\theta$, and we make inferences about our parameters (and any other quantity, including future predictions) through the posterior probability distribution $p(\theta | y)$.
In this simple case, we have used a straightforward evaluation of the posterior in a discrete set of possible parameter values. However, this naïve method based on a grid-evaluation of our posterior is less than ideal. We have already seen that some tweaking was required to construct the parameter grid and grid-evaluation is completely impractical the moment we scale up to slightly more complex models.
In the next part of this series, while keeping the same simple logistic regression example, we will look at other methods based on sampling, with a particular focus on some relevant MCMC methods, which will allow us to make posterior inferences in a much more efficient way.
You can check out the code used to generate this report on my GitHub.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). New York: Wiley.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC.