In this post we explain a Bayesian approach to inferring the impact of interventions or actions. We show that representing causality within a standard Bayesian approach softens the boundary between tractable and impossible queries and opens up potential new approaches to causal inference. This post is a detailed but informal presentation of our arXiv papers: Replacing the do calculus with Bayes rule, and Causal inference with Bayes rule - also see our video presentation Bayesian Causality.
Causality – what it is and how to infer it – has been one of the most controversial subjects of machine learning and statistics. The recent publication of the Book of Why has re-ignited a long-running debate on whether causal inference can be done within the standard Bayesian modelling paradigm or if it requires a fundamentally different approach. This debate began between Pearl and Rubin in the 1990s and continues today – particularly on Andrew Gelman's blog - see Gelman and Pearl. In this post we discuss some of our recent work that aims to bridge this debate.
But first, much disagreement has arisen from differences in terminology – so let’s clarify what we mean by causal inference.
(Observational) causal inference: estimating the impact of an action or intervention in some system from data collected without directly testing the intervention.
The figure below contrasts observational causal inference with standard statistics. In standard statistical problems, we have data generated by some system and we want to use that data to infer some property of the system. In observational causal inference, we want to use data generated by one system – the system prior to some intervention – to infer properties about another system – the system after intervention. This requires that we make assumptions about how these two systems are related (or equivalently, how the intervention changes the original system) and that we model these assumptions to determine what data sampled from System A can tell us about System B.
Fundamentally, causal inference requires two things:
A model that relates the data we can observe (pre-intervention) to the question we care about (post-intervention)
A way to combine this model with (finite) data to perform inference
The first of these two points is at the core of the controversy. Whether causal inference requires something beyond statistics depends on what you think of as statistics.
If you define statistics narrowly as a set of operators (e.g. mean, variance, covariance) that can be applied to a matrix of data, then notions of causality are indeed extra-statistical. Without some assumptions about how the system pre-intervention is related to the system post-intervention, data collected from the former is uninformative about the latter.
However, if you view statistics as the process of drawing inference about (latent) quantities of interest by combining a model with data, causal inference is no different to any other inference – your model must capture the relationship between the system pre and post intervention, and as with any inference, the validity of your conclusions will depend on the model being correctly specified. This viewpoint allows you to use standard Bayesian inference to answer causal questions, but provides little guidance on how to construct a model that captures how an intervention will change a system.
In the next section, we demonstrate how we algorithmically construct ordinary Bayesian models that encode identical assumptions about a specific intervention in a system as are implicit in a Pearlian causal graphical model. The goal of doing so is to construct a shared viewpoint between Bayesian modellers and causality people – and to provide a framework combining the strengths of both approaches: Causal graphs for modelling intervention, and Bayesian inference for combining model assumptions with data.
The heart of our proposal is Algorithm 1, which transforms a Pearl style intervention on a causal graph to a probabilistic graphical model.
Let's demonstrate how this works with a classic example that was previously elucidated using causal graphical models [9].
Suppose you are a doctor and offer two treatment alternatives, denoted by \(T = 0\) and \(T = 1\) for some condition. You have collected data on the treatment each patient selected, \(T\), a binary patient characteristic, \(Z\), and the success of the treatment, \(Y\).
You would like to know which treatment is better overall so you compute how many patients recover under each treatment.
$$P(Y = 1| T = 0) = 0.56 \qquad P(Y=1|T=1) = 0.54$$
It appears treatment \(T = 0\) is better overall.
However, wondering if treatment effectiveness depends on the patient characteristic, \(Z\), you decide to compute the recovery rates separately for each group of patients – those with \(Z = 0\) and those with \(Z = 1\).
$$
\begin{align*}
P(Y=1|Z=0,T=0) &= 0.25 \\
P(Y=1|Z=0,T=1) &= 0.5 \\
P(Y=1|Z=1,T=0) &= 0.8 \\
P(Y=1|Z=1,T=1) &= 0.9 \\
\end{align*}
$$
Treatment \(T = 1\) seems to be better within both groups. How can this be if treatment \(T = 0\) is better overall? If you had to pick just one treatment for all patients, which should it be?
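The reversal is easy to reproduce numerically. The sketch below assumes the patient counts that appear in the fractions used later in this post (for example, 50 recoveries out of 200 patients with \(Z = 0\) and \(T = 0\)); the names `COUNTS` and `recovery_rate` are ours, chosen only for illustration.

```python
from fractions import Fraction

# (Z, T) -> (number recovered, number of patients in that cell).
# Counts are taken from the fractions used later in the post.
COUNTS = {
    (0, 0): (50, 200),
    (0, 1): (180, 360),
    (1, 0): (200, 250),
    (1, 1): (36, 40),
}


def recovery_rate(t, z=None):
    """Empirical P(Y=1 | T=t) if z is None, else P(Y=1 | Z=z, T=t)."""
    cells = [
        cell
        for (zz, tt), cell in COUNTS.items()
        if tt == t and (z is None or zz == z)
    ]
    recovered = sum(r for r, _ in cells)
    total = sum(n for _, n in cells)
    return Fraction(recovered, total)


for t in (0, 1):
    print(f"P(Y=1 | T={t}) = {float(recovery_rate(t)):.2f}")
    for z in (0, 1):
        print(f"  P(Y=1 | Z={z}, T={t}) = {float(recovery_rate(t, z)):.2f}")
```

Pooling over \(Z\) favours \(T = 0\), while each stratum favours \(T = 1\). The reversal arises because the two treatment groups contain very different mixes of \(Z\).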
The key to resolving Simpson’s Paradox is to recognise that the question “which treatment should you recommend” is a causal one. You are asking about the probability a patient will recover if you intervene to select the treatment they take, not the probability they will recover conditional on them selecting a given treatment.
Which treatment you should recommend depends on how the characteristic \(Z\) is related to treatment selection currently, and whether or not this will change if you decide the treatment for each patient (rather than allowing them to choose).
Now let's see how we can resolve Simpson's paradox and answer the doctor's question, either by using causal graphical models and the do-calculus or by constructing a standard Bayesian model to represent the same assumptions.
Suppose the characteristic \(Z\) was gender and you believe it may be influencing the choice of treatment because one of the drugs is primarily marketed toward women. You also believe that gender may directly influence the likelihood of recovery (via some biochemical pathway).
The causal graphical model representing these assumptions is shown below.
Arrows are interpreted causally - i.e. the arrow from \(Z\) to \(T\) represents the assumption that gender influences treatment selection. This model encodes the assumption that intervening on \(T\) removes the influence of gender on treatment, since \(T\) is then set exogenously. Everything else about the data generating process is assumed to be invariant under intervention, in particular the distributions \(P(Z)\) and \(P(Y|Z,T)\).
Here are the assumptions this causal model encodes about how an intervention will change the system:
Given the invariance assumptions encoded in the causal graphical model, the do-calculus is a set of rules that transform post-interventional quantities (i.e. questions about the distribution generated by the post-intervention model, shown on the right in the video) into functions of the original joint distribution. The do-calculus is expressed in terms of the do-notation, which is just shorthand for referring to distributions generated by the post-intervention model; \(P(Y|do(T = t))\) is just the distribution of \(Y\) post-intervention.
Here is a simplified version of the do-calculus for interventions on a single variable (for the full version see [9]):
The do-calculus is complete [4,17]: a post-interventional quantity can be expressed purely in terms of the original joint distribution (and thus estimated exactly at the infinite data limit) if and only if it can be transformed from a do-type statement to an ordinary probability statement via a series of applications of the do-calculus. Applying the do-calculus in this case yields:
$$
\begin{align*}
P(Y=1|do(T=t)) &= \sum_{z \in Z} P(z)P(Y=1|z,t) \\
&= P(Z=0)P(Y=1|Z=0,t) + P(Z=1)P(Y=1|Z=1,t)
\end{align*}
$$
So,
$$
\begin{align*}
P(Y=1|do(T=0)) &= \frac{50}{200}\frac{560}{850} + \frac{200}{250}\frac{290}{850} \\
& \approx 0.438
\end{align*}
$$
and
$$
\begin{align*}
P(Y=1|do(T=1)) &= \frac{180}{360}\frac{560}{850} + \frac{36}{40}\frac{290}{850} \\
& \approx 0.636
\end{align*}
$$
Treatment $T = 1$ is better overall under these causal assumptions, because it is the intervention that leads to the highest probability of recovery.
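The adjustment formula above is straightforward to evaluate from the raw counts. The following is a minimal sketch, again assuming the cell counts that appear in the fractions above; the helper names (`p_z`, `p_y_given_zt`, `p_y_do_t`) are ours and not part of any library.

```python
from fractions import Fraction

# (Z, T) -> (number recovered, number of patients in that cell).
COUNTS = {
    (0, 0): (50, 200),
    (0, 1): (180, 360),
    (1, 0): (200, 250),
    (1, 1): (36, 40),
}


def p_z(z):
    """Empirical P(Z=z), assumed invariant under intervention."""
    n_z = sum(n for (zz, _), (_, n) in COUNTS.items() if zz == z)
    n_all = sum(n for _, n in COUNTS.values())
    return Fraction(n_z, n_all)


def p_y_given_zt(z, t):
    """Empirical P(Y=1 | Z=z, T=t), also assumed invariant."""
    recovered, n = COUNTS[(z, t)]
    return Fraction(recovered, n)


def p_y_do_t(t):
    """Adjustment formula: sum_z P(z) * P(Y=1 | z, t)."""
    return sum(p_z(z) * p_y_given_zt(z, t) for z in (0, 1))


for t in (0, 1):
    print(f"P(Y=1 | do(T={t})) = {float(p_y_do_t(t)):.3f}")
```

The key difference from the naive estimate is that each stratum is weighted by \(P(z)\) rather than by \(P(z|t)\), which is what removes the influence of \(Z\) on treatment selection.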
Following Algorithm 1, we construct a probabilistic graphical model to represent intervention on \(T\) given the assumptions encoded in the causal graphical model. Note we make exactly the same assumptions about what the generative model is before and after intervention - but represent these assumptions explicitly within a single model. The result is a standard Probabilistic Graphical Model (PGM) - on which we can do inference with standard probability theory.
To estimate how effective each treatment is, we need to compute the posterior over \(Y^*\) given the observed data, $\boldsymbol D = \{Z_m,T_m,Y_m\}_{m = 1,\ldots,M}$, and the intervention \(T^* = t\), that is:
$$P(Y^*{=}1 | \boldsymbol D, T^*{=}t) = \displaystyle\iiint\limits_{\phi,\gamma,\psi} P(\phi,\gamma,\psi | \boldsymbol D) \sum_{z^*} P(Y^* {=}1,z^* |T^*{=}t,\phi,\gamma,\psi) \textrm{d}\phi \textrm{d}\gamma \textrm{d}\psi$$
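From the equation above, we see that estimation splits naturally into two parts: computing the posterior over the parameters given the observed data, \(P(\phi,\gamma,\psi | \boldsymbol D)\), and computing the post-intervention predictive distribution given those parameters, \(\sum_{z^*} P(Y^* {=}1,z^* |T^*{=}t,\phi,\gamma,\psi)\).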
If all variables are binary and we use uniform priors we get:
$$P(Y^*{=}1 | \boldsymbol D, T^*{=}0) = \frac{51}{202}\frac{561}{852} + \frac{201}{252}\frac{291}{852}$$ and $$P(Y^*{=}1 | \boldsymbol D, T^*{=}1)= \frac{181}{362}\frac{561}{852} + \frac{37}{42}\frac{291}{852}$$
This is exactly the same as the result from the do-calculus - except for the Laplace smoothing that comes from applying the uniform prior.
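To make the two-part computation concrete, here is a rough sketch for the binary case. It assumes independent uniform Beta(1, 1) priors on \(P(Z{=}1)\) and on each \(P(Y{=}1|z,t)\), so each posterior is a Beta distribution with mean (successes + 1) / (trials + 2). Because these posteriors are independent, the integral reduces to a sum of products of posterior means. The function names are ours, and the Monte Carlo routine is included only as a sanity check on the closed form, not as the method used in the papers.

```python
import random
from fractions import Fraction

# (Z, T) -> (number recovered, number of patients in that cell).
COUNTS = {
    (0, 0): (50, 200),
    (0, 1): (180, 360),
    (1, 0): (200, 250),
    (1, 1): (36, 40),
}
N_TOTAL = sum(n for _, n in COUNTS.values())


def n_z(z):
    """Number of patients with Z=z."""
    return sum(n for (zz, _), (_, n) in COUNTS.items() if zz == z)


def beta_posterior(successes, trials):
    """Beta parameters after updating a uniform Beta(1, 1) prior."""
    return 1 + successes, 1 + trials - successes


def posterior_mean(successes, trials):
    a, b = beta_posterior(successes, trials)
    return Fraction(a, a + b)


def predictive_exact(t):
    """Closed form: sum_z E[P(z) | D] * E[P(Y=1 | z, t) | D]."""
    total = Fraction(0)
    for z in (0, 1):
        recovered, n = COUNTS[(z, t)]
        total += posterior_mean(n_z(z), N_TOTAL) * posterior_mean(recovered, n)
    return total


def predictive_monte_carlo(t, draws=20000, seed=0):
    """Approximate the same integral by sampling from the Beta posteriors."""
    rng = random.Random(seed)
    a_z, b_z = beta_posterior(n_z(1), N_TOTAL)  # posterior over P(Z=1)
    acc = 0.0
    for _ in range(draws):
        p_z1 = rng.betavariate(a_z, b_z)
        p_y = {
            z: rng.betavariate(*beta_posterior(*COUNTS[(z, t)]))
            for z in (0, 1)
        }
        acc += (1 - p_z1) * p_y[0] + p_z1 * p_y[1]
    return acc / draws


for t in (0, 1):
    print(t, float(predictive_exact(t)), predictive_monte_carlo(t))
```

With these counts the exact values come out at roughly 0.44 for \(T^* = 0\) and 0.63 for \(T^* = 1\), close to the do-calculus estimates. In models with latent variables no such closed form is available, and a sampling approach along these lines becomes the main route.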
Now suppose that, instead of gender, $Z$ represents whether or not patients met the minimum weekly exercise recommendations after starting treatment. You (as the doctor) believe that the amount of exercise patients do is influenced by the treatment (as it reduces pain) and that exercise improves the likelihood of recovery. You are confident there are no other variables that might affect both which treatment was selected and the success of treatment. Under these assumptions, $Z$ is a consequence rather than a cause of $T$ in the causal graphical model, and there is no common cause of $T$ and $Y$.
Here is the causal graphical model representing these assumptions. Note that $T$ is now assumed to cause $Z$, so the direction of that arrow has changed.
Here's what this causal graphical model implies about the relationship between systems pre and post-intervention.
The do-calculus gives:
$$P(Y=1|do(T=t)) = P(Y = 1| T = t)$$
This should not be surprising: since intervening on $T$ does not change the system, the distributions pre and post intervention are the same.
So, $P(Y = 1|do( T = 0)) = 0.56$ and $P(Y=1|do(T=1)) = 0.54$
Treatment $T = 0$ is better under these assumptions. Note: the data is identical for case 1 and case 2. What differs is the model we put around the data, which encodes our expectations of what will change when we intervene in the system.
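One way to check this numerically: under the case 2 graph, \(Z\) still responds to the chosen treatment, so the post-intervention recovery rate weights each stratum by \(P(z|t)\) instead of \(P(z)\). The sketch below, using the same assumed counts and our own helper names, confirms that this weighting simply recovers the observational \(P(Y{=}1|T{=}t)\).

```python
from fractions import Fraction

# (Z, T) -> (number recovered, number of patients in that cell).
COUNTS = {
    (0, 0): (50, 200),
    (0, 1): (180, 360),
    (1, 0): (200, 250),
    (1, 1): (36, 40),
}


def p_z_given_t(z, t):
    """Empirical P(Z=z | T=t): exercise depends on the treatment taken."""
    n_t = sum(n for (_, tt), (_, n) in COUNTS.items() if tt == t)
    return Fraction(COUNTS[(z, t)][1], n_t)


def p_y_given_zt(z, t):
    """Empirical P(Y=1 | Z=z, T=t)."""
    recovered, n = COUNTS[(z, t)]
    return Fraction(recovered, n)


def p_y_given_t(t):
    """Empirical P(Y=1 | T=t), pooling over Z."""
    cells = [cell for (_, tt), cell in COUNTS.items() if tt == t]
    return Fraction(sum(r for r, _ in cells), sum(n for _, n in cells))


def p_y_do_t_case2(t):
    """sum_z P(z | t) * P(Y=1 | z, t), the case 2 post-intervention rate."""
    return sum(p_z_given_t(z, t) * p_y_given_zt(z, t) for z in (0, 1))


for t in (0, 1):
    print(t, float(p_y_do_t_case2(t)), float(p_y_given_t(t)))
```

The two columns agree exactly, which is just the law of total probability applied within each treatment group.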
Here's the probabilistic graphical model we obtain by applying Algorithm 1 to represent the assumptions under case 2.
As before, we need to parameterise the model, estimate a posterior for the parameters and solve this big integral:
$$P(Y^*{=}1 | \boldsymbol D, T^*{=}t) = \displaystyle\iiint\limits_{\phi,\gamma,\psi} P(\phi,\gamma,\psi | \boldsymbol D) \sum_{z^*} P(Y^* {=}1,z^* |T^*{=}t,\phi,\gamma,\psi) \textrm{d}\phi \textrm{d}\gamma \textrm{d}\psi$$
However, the parameterisation and post-intervention predictive distributions are not the same as for case 1, because of the differences in the structure of the model. This time, working through the same steps, and again assuming binary variables and uniform priors, we get:
$$
\begin{align*}
P(Y^*{=}1 | \boldsymbol D, T^*{=}0) &= \frac{51}{202}\frac{201}{452} +\frac{201}{252}\frac{251}{452} \\
& \approx 0.555
\end{align*}
$$
and,
$$
\begin{align*}
P(Y^*{=}1 | \boldsymbol D, T^*{=}1) &= \frac{181}{362}\frac{361}{402} +\frac{37}{42}\frac{41}{402}\\
& \approx 0.539
\end{align*}
$$
Again, this agrees with the result expected via the do-calculus.
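The same closed-form shortcut used for case 1 applies here, with one change: the weight on each stratum is now the posterior mean of \(P(z|t)\) rather than of \(P(z)\). The sketch below assumes independent uniform Beta(1, 1) priors on each \(P(Z{=}1|t)\) and each \(P(Y{=}1|z,t)\); the function names are illustrative only.

```python
from fractions import Fraction

# (Z, T) -> (number recovered, number of patients in that cell).
COUNTS = {
    (0, 0): (50, 200),
    (0, 1): (180, 360),
    (1, 0): (200, 250),
    (1, 1): (36, 40),
}


def posterior_mean(successes, trials):
    """Mean of the Beta posterior under a uniform prior (Laplace smoothing)."""
    return Fraction(successes + 1, trials + 2)


def n_t(t):
    """Number of patients who took treatment t."""
    return sum(n for (_, tt), (_, n) in COUNTS.items() if tt == t)


def predictive_case2(t):
    """sum_z E[P(z | t) | D] * E[P(Y=1 | z, t) | D]."""
    total = Fraction(0)
    for z in (0, 1):
        recovered, n_zt = COUNTS[(z, t)]
        weight = posterior_mean(n_zt, n_t(t))        # posterior mean of P(z | t)
        recovery = posterior_mean(recovered, n_zt)   # posterior mean of P(Y=1 | z, t)
        total += weight * recovery
    return total


for t in (0, 1):
    print(f"P(Y*=1 | D, T*={t}) = {float(predictive_case2(t)):.3f}")
```

This reproduces the values 0.555 and 0.539 above, which differ from the raw rates 0.56 and 0.54 only through the smoothing introduced by the prior.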
This case nicely clarifies the confusion over conditioning on variables. The causal modelling community repeatedly warns that you must not condition on $Z$ in this example - by which they mean adjust for it or include it in a regression model. In contrast, in Bayesian terminology, conditioning is simply learning by Bayes rule, and the Bayesian position is that all learning should be done by conditioning on all variables. In our calculations we do condition on $Z$ (along with all the other observable information) when computing the posterior over the parameters. However, because of the structure of the model, this information has essentially no impact on the posterior predictive of the treatment effect.
This difference in terminology has certainly contributed to confusion and added heat to debates, including the Statistics in Medicine debate [12,16,18] and discussions on blogs: see Wasserman, and Andrew Gelman's blog, with extensive comments by Pearl and Bareinboim (see here, here, here and here).
The primary goal of this post was to clarify the somewhat philosophical point that Bayesian inference is sufficiently general to model causality. The critical component for causal inference is a model that encodes assumptions about how an intervention will change a system. How we represent that model (as a Bayesian network explicitly showing the connections between the system pre and post intervention, as a causal graphical model or using counterfactuals) is a secondary consideration.
The goal is not to replace the do-calculus - which is an incredibly powerful tool, leveraging just the conditional independence information encoded in a causal graph to determine if and how we can obtain an expression for the quantity of interest post-intervention as a function over the original joint distribution. The solutions obtained via the do-calculus are clean and simple, compared to the cumbersome computation required for the direct probabilistic approach.
However, a limitation of the do-calculus is that inference is impossible if the problem is not identifiable. It is also non-trivial to introduce additional information on the functional form of the relationship between variables.
Our approach offers some hope of addressing these limitations and softening the boundary between identifiable and non-identifiable problems. It is possible to obtain informative posteriors in cases where the query is non-identifiable via the do-calculus (although unless we have sufficient additional information on the functional form of the relationships between variables, the posterior will remain sensitive to the prior even at the infinite data limit). The technical challenge, of course, is the presence of very high dimensional integrals - particularly if there are latent variables - which we will explore in a future post.
We have provided only the most rudimentary summary of causal graphical models in our attempt to hint at a synthesis with Bayesian inference. Some may find our approach controversial (honestly this is probably inevitable given the diversity of opinion in causality) – we suggest interested readers solicit broad views in addition to the one we present here. The start of a good reading list is:
[1] Bernardo, J. M., & Smith, A. F. (2009). Bayesian theory (Vol. 405). John Wiley & Sons.
[5] Jordan, M. I. (2004). Graphical models. Statistical Science, 19(1):140–155.
[6] Kadane, J. B. (2011). Principles of uncertainty. Chapman and Hall/CRC.
[7] Lad, F. (1996). Operational Subjective Statistical Methods: a mathematical, philosophical, and historical introduction (Vol. 315). Wiley-Interscience.
[9] Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–688.
[11] Pearl, J. (2000). Causality. Cambridge University Press.
[12] Pearl, J. (2009). Myth, confusion, and science in causal analysis.
[13] Pearl, J. (2014). Comment: understanding Simpson’s paradox. The American Statistician, 68(1):8–13.
[15] Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
[16] Rubin, D. B. (2009). Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups?. Statistics in Medicine, 28(9), 1420-1423.
[18] Shrier, I. (2009). Propensity scores. Statistics in Medicine, 28(8), 1317-1318.
Lattimore, F. and Rohde, D. (2019) Bayesian Causality - Video Presentation
Lattimore, F. and Rohde, D. (2019) Replacing the do calculus with Bayes rule
Lattimore, F. and Rohde, D. (2019) Causal inference with Bayes rule
Lattimore, F. (2017) Learning how to act: making good decisions with machine learning