# Causal Inference with Bayes Rule

An in-depth look at modelling interventions: joint work between Finn Lattimore (Gradient Institute) & David Rohde (Criteo AI Lab)

In this post we explain a Bayesian approach to inferring the impact of interventions or actions. We show that representing causality within a standard Bayesian approach softens the boundary between tractable and impossible queries and opens up potential new approaches to causal inference. This post is a detailed but informal presentation of our arXiv papers, Replacing the do calculus with Bayes rule and Causal inference with Bayes rule; see also our video presentation, Bayesian Causality.

Causality, what it is and how to infer it, has been one of the most controversial subjects in machine learning and statistics. The recent publication of The Book of Why has reignited a long-running debate on whether causal inference can be done within the standard Bayesian modelling paradigm or whether it requires a fundamentally different approach. This debate began between Pearl and Rubin in the 1990s and continues today, particularly on Andrew Gelman’s blog (see the posts by Gelman and Pearl). In this post we discuss some of our recent work that aims to bridge this debate.

But first, much disagreement has arisen from differences in terminology – so let’s clarify what we mean by causal inference.

(Observational) causal inference: estimating the impact of an action or intervention in some system from data collected without directly testing the intervention.

Observational causal inference example: estimate what the average reading level of five-year-olds would be if all children attended three days of preschool, based on historical data on preschool attendance and subsequent reading ability (noting that the decision whether or not to attend preschool is currently made by each child's parents).

The figure below contrasts observational causal inference with standard statistics. In standard statistical problems, we have data generated by some system and we want to use that data to infer some property of the system. In observational causal inference, we want to use data generated by one system – the system prior to some intervention – to infer properties about another system – the system after intervention. This requires that we make assumptions about how these two systems are related (or equivalently, how the intervention changes the original system) and that we model these assumptions to determine what data sampled from System A can tell us about System B.

## Does causality require something outside statistics?

Fundamentally, causal inference requires two things:

1. A model that relates the data we can observe (pre-intervention) to the question we care about (post-intervention)

2. A way to combine this model with (finite) data to perform inference

The first of these two points is at the core of the controversy. Whether causal inference requires something beyond statistics depends on what you think of as statistics.

If you define statistics narrowly as a set of operators (e.g. mean, variance, covariance) that can be applied to a matrix of data, then notions of causality are indeed extra-statistical. Without some assumptions about how the system pre-intervention is related to the system post-intervention, data collected from the former is uninformative about the latter.

However, if you view statistics as the process of drawing inference about (latent) quantities of interest by combining a model with data, causal inference is no different to any other inference – your model must capture the relationship between the system pre and post intervention, and as with any inference, the validity of your conclusions will depend on the model being correctly specified. This viewpoint allows you to use standard Bayesian inference to answer causal questions, but provides little guidance on how to construct a model that captures how an intervention will change a system.

In the next section, we demonstrate how we algorithmically construct ordinary Bayesian models that encode identical assumptions about a specific intervention in a system as are implicit in a Pearlian causal graphical model. The goal of doing so is to construct a shared viewpoint between Bayesian modellers and causality people – and to provide a framework combining the strengths of both approaches: Causal graphs for modelling intervention, and Bayesian inference for combining model assumptions with data.

## The do-calculus vs Bayes rule

We now honour Pearl’s oft-repeated (and sensible) request to tackle a simple problem and work through it in detail. This post draws on a significant body of work in which standard probability theory is used in combination with graph theory, building on Jordan, Ghahramani and many others, including Pearl.

The heart of our proposal is Algorithm 1, which transforms a Pearl-style intervention on a causal graph into a probabilistic graphical model.

Let’s demonstrate how this works with a classic example that was previously elucidated using causal graphical models [9].

Suppose you are a doctor and offer two treatment alternatives, denoted by $T = 0$ and $T = 1$ for some condition. You have collected data on the treatment each patient selected, $T$, a binary patient characteristic, $Z$, and the success of the treatment, $Y$.

You would like to know which treatment is better overall, so you compute the proportion of patients who recover under each treatment.

$$P(Y = 1| T = 0) = 0.56 \qquad P(Y=1|T=1) = 0.54$$

It appears treatment $T = 0$ is better overall.

However, wondering if treatment effectiveness depends on the patient characteristic, $Z$, you decide to compute the recovery rates separately for each group of patients: those with $Z = 0$ and those with $Z = 1$.

\begin{align*}
P(Y=1|Z=0,T=0) &= 0.25 \\
P(Y=1|Z=0,T=1) &= 0.5 \\
P(Y=1|Z=1,T=0) &= 0.8 \\
P(Y=1|Z=1,T=1) &= 0.9
\end{align*}

Treatment $T = 1$ seems to be better within both groups. How can this be if treatment $T = 0$ is better overall? If you had to pick just one treatment for all patients, which should it be?
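
To make the apparent contradiction concrete, the short sketch below recomputes the pooled and per-group recovery rates from patient counts. The counts are taken from the fractions used later in this post (for example, 50 recoveries out of 200 patients with $Z=0$ and $T=0$); the `counts` dictionary and the `recovery_rate` helper are illustrative names, not part of any library.

```python
from fractions import Fraction

# Patient counts keyed by (z, t): (number recovered, number of patients).
# Assumption: these are the counts behind the rates quoted in the post.
counts = {
    (0, 0): (50, 200),
    (0, 1): (180, 360),
    (1, 0): (200, 250),
    (1, 1): (36, 40),
}


def recovery_rate(t, z=None):
    """Empirical P(Y=1 | T=t), or P(Y=1 | Z=z, T=t) when z is given."""
    cells = [(r, n) for (zz, tt), (r, n) in counts.items()
             if tt == t and (z is None or zz == z)]
    recovered = sum(r for r, _ in cells)
    total = sum(n for _, n in cells)
    return Fraction(recovered, total)


for t in (0, 1):
    print(f"P(Y=1|T={t}) = {float(recovery_rate(t)):.3f}")
    for z in (0, 1):
        print(f"  P(Y=1|Z={z},T={t}) = {float(recovery_rate(t, z)):.3f}")
```

Pooling over $Z$ favours $T=0$, while each group taken separately favours $T=1$. The reversal arises because most patients with $Z=1$, who recover more often under either treatment, chose $T=0$.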

The key to resolving this apparent contradiction, known as Simpson’s paradox, is to recognise that the question “which treatment should you recommend” is a causal one. You are asking about the probability a patient will recover if you intervene to select the treatment they take, not the probability they will recover conditional on them selecting a given treatment.

Which treatment you should recommend depends on how the characteristic $Z$ is related to treatment selection currently, and whether or not this will change if you decide the treatment for each patient (rather than allowing them to choose).

Now let’s see how we can resolve Simpson’s paradox and answer the doctor’s question, either by using causal graphical models and the do-calculus or by constructing a standard Bayesian model that represents the same assumptions.

Case 1: suppose the characteristic $Z$ is gender and you believe it may influence the choice of treatment because one of the drugs is primarily marketed toward women. You also believe that gender may directly influence the likelihood of recovery (via some biochemical pathway).

#### Solution via the do calculus

The causal graphical model representing these assumptions is shown below.

Arrows are interpreted causally, i.e. the arrow from $Z$ to $T$ represents the assumption that gender influences treatment selection. This model encodes the assumption that intervening on $T$ removes the influence of gender on treatment, since $T$ is then set exogenously. Everything else about the data generating process is assumed to be invariant under intervention, in particular the distributions $P(Z)$ and $P(Y|Z,T)$.

Aside: this graph represents much more than a toy problem. This seemingly simple graph can represent a huge range of interesting settings, including medical trials, recommender systems and all the observational studies in economics or social science where causal effects are estimated by "adjusting" for other variables. Many researchers live long and prosper using only this graph. The variable $Z$ need not be a single binary variable: it could be an arbitrarily large collection of continuous variables, with interactions between them, provided they are observable and are not consequences of $T$ or $Y$ (this assumption is encoded by the absence of arrows from $T$ and $Y$ to $Z$).

Here are the assumptions this causal model encodes about how an intervention will change the system:

Given the invariance assumptions encoded in the causal graphical model, the do-calculus is a set of rules that transform post-interventional quantities (i.e. questions about the distribution generated by the post-intervention model, shown on the right in the video) into functions of the original joint distribution. The do-calculus is expressed in terms of the do-notation, which is just shorthand for referring to distributions generated by the post-intervention model; $P(Y|do(T = t))$ is just the distribution of $Y$ post-intervention.

Here is a simplified version of the do-calculus for interventions on a single variable (for the full version see [9]):

The do-calculus is complete [4,17]: a post-interventional quantity can be expressed purely in terms of the original joint distribution (and thus estimated exactly in the infinite data limit) if and only if it can be transformed from a do-type statement into an ordinary probability statement via a series of applications of the do-calculus. Applying the do-calculus in this case yields:

\begin{align*} P(Y=1|do(T=t)) &= \sum_{z \in Z} P(z)P(Y=1|z,t) \\ &= P(Z=0)P(Y=1|Z=0,t) + P(Z=1)P(Y=1|Z=1,t) \end{align*}

So,

\begin{align*}
P(Y=1|do(T=0)) &= \frac{50}{200}\frac{560}{850} + \frac{200}{250}\frac{290}{850} \\
&\approx 0.438
\end{align*}

and

\begin{align*}
P(Y=1|do(T=1)) &= \frac{180}{360}\frac{560}{850} + \frac{36}{40}\frac{290}{850} \\
&\approx 0.636
\end{align*}

Treatment $T = 1$ is better overall under these causal assumptions, because it is the intervention that leads to the highest probability of recovery.
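
The adjustment formula above is easy to check numerically. The following is a minimal sketch that plugs empirical frequencies into $\sum_z P(z)P(Y=1|z,t)$, assuming the patient counts that appear in the fractions above; the function names are illustrative only.

```python
from fractions import Fraction

# Patient counts keyed by (z, t): (number recovered, number of patients).
counts = {
    (0, 0): (50, 200),
    (0, 1): (180, 360),
    (1, 0): (200, 250),
    (1, 1): (36, 40),
}
n_total = sum(n for _, n in counts.values())  # 850 patients in total


def p_z(z):
    """Empirical P(Z=z), assumed invariant under intervention on T."""
    n_z = sum(n for (zz, _), (_, n) in counts.items() if zz == z)
    return Fraction(n_z, n_total)


def p_y_given_zt(z, t):
    """Empirical P(Y=1 | Z=z, T=t), also assumed invariant."""
    recovered, n = counts[(z, t)]
    return Fraction(recovered, n)


def p_y_do_t(t):
    """Adjustment formula: sum over z of P(z) * P(Y=1 | z, t)."""
    return sum(p_z(z) * p_y_given_zt(z, t) for z in (0, 1))


for t in (0, 1):
    print(f"P(Y=1|do(T={t})) = {float(p_y_do_t(t)):.3f}")
```

Exact fractions are used so that the result can be compared term by term with the hand calculation.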

#### Solution via Bayes rule

Following Algorithm 1, we construct a probabilistic graphical model to represent intervention on $T$ given the assumptions encoded in the causal graphical model. Note we make exactly the same assumptions about what the generative model is before and after intervention - but represent these assumptions explicitly within a single model. The result is a standard Probabilistic Graphical Model (PGM) - on which we can do inference with standard probability theory.

To estimate how effective each treatment is, we need to compute the posterior over $Y^*$ given the observed data, $\boldsymbol D = \{Z_m,T_m,Y_m\}_{m = 1,\ldots,M}$, and the intervention $T^* = t$, that is:

$$P(Y^*{=}1 | \boldsymbol D, T^*{=}t) = \displaystyle\iiint\limits_{\phi,\gamma,\psi} P(\phi,\gamma,\psi | \boldsymbol D) \sum_{z^*} P(Y^* {=}1,z^* |T^*{=}t,\phi,\gamma,\psi) \textrm{d}\phi \textrm{d}\gamma \textrm{d}\psi$$

From the equation above, we see that estimation splits naturally into two parts:

1. Estimate the posterior over the model parameters given the observed data, $P(\phi,\gamma,\psi | \boldsymbol D)$.
2. Estimate the distribution over the latent, post-intervention variables conditional on the model parameters and the intervention, $P(Y^*,Z^* |T^*{=}t,\phi,\gamma,\psi)$.

If all variables are binary and we use uniform priors, we get:

$$P(Y^*{=}1 | \boldsymbol D, T^*{=}0) = \frac{51}{202}\frac{561}{852} + \frac{201}{252}\frac{291}{852} \approx 0.439$$

and

$$P(Y^*{=}1 | \boldsymbol D, T^*{=}1)= \frac{181}{362}\frac{561}{852} + \frac{37}{42}\frac{291}{852} \approx 0.630$$

This is exactly the same as the result from the do-calculus, except for the Laplace smoothing that comes from applying the uniform prior.
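
As a rough check on these numbers, the sketch below computes the posterior predictive in two ways: exactly, using the means of the beta posteriors, and approximately, by averaging over Monte Carlo draws from the posterior. It assumes uniform $\textrm{Beta}(1,1)$ priors and the patient counts used above; the helper names are illustrative, and the Monte Carlo step is only one of several ways the integral could be approximated.

```python
import random
from fractions import Fraction

# Patient counts keyed by (z, t): (number recovered, number of patients).
counts = {
    (0, 0): (50, 200),
    (0, 1): (180, 360),
    (1, 0): (200, 250),
    (1, 1): (36, 40),
}
ALPHA, BETA = 1, 1  # uniform Beta(1, 1) prior on every parameter


def beta_posterior(successes, trials):
    """Beta posterior parameters after observing Bernoulli data."""
    return ALPHA + successes, BETA + trials - successes


def beta_mean(a, b):
    return Fraction(a, a + b)


n_total = sum(n for _, n in counts.values())
n_z1 = sum(n for (z, _), (_, n) in counts.items() if z == 1)
gamma_post = beta_posterior(n_z1, n_total)  # posterior for P(Z=1)
psi_post = {zt: beta_posterior(r, n) for zt, (r, n) in counts.items()}  # P(Y=1|z,t)


def predictive_exact(t):
    """E[(1-gamma) psi_0t + gamma psi_1t]; the posteriors are independent."""
    g = beta_mean(*gamma_post)
    return (1 - g) * beta_mean(*psi_post[(0, t)]) + g * beta_mean(*psi_post[(1, t)])


def predictive_monte_carlo(t, draws=20000, seed=0):
    """Approximate the same integral by sampling parameters from the posterior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        g = rng.betavariate(*gamma_post)
        psi0 = rng.betavariate(*psi_post[(0, t)])
        psi1 = rng.betavariate(*psi_post[(1, t)])
        total += (1 - g) * psi0 + g * psi1
    return total / draws


for t in (0, 1):
    print(t, float(predictive_exact(t)), predictive_monte_carlo(t))
```

The posterior over $\phi$ never appears in the code, because the post-intervention predictive does not depend on how treatment was chosen in the observed data.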

To see where these numbers come from, let's start with the posterior over the parameters. We factor the likelihood according to the graph and put independent priors on the parameters. Since the variables are all binary, we can parameterise the model in terms of Bernoulli distributions. Parameterising $P(z)$ requires one parameter, $\gamma$. Parameterising the conditional distribution $P(t|z)$ requires two, $\phi_0$ and $\phi_1$, to represent $P(t|Z=0)$ and $P(t|Z=1)$ respectively, and parameterising $P(y|z,t)$ requires four, one for each combination of $Z$ and $T$.

We can then use beta distributions as conjugate priors to get the posterior for each parameter analytically. Let $\alpha$ and $\beta$ be the parameters of the prior and let $N(\cdot)$ denote the number of times a variable combination occurred in the data. Each posterior is then a beta distribution whose parameters are $\alpha$ and $\beta$ plus the corresponding counts.

The post-intervention predictive can also be factorised. We then put the parameter posterior and the post-intervention predictive together and integrate out the parameters. We can drop the posterior over $\phi$, as the predictive distribution has no dependence on it: we can pull it out the front and integrate it away to 1. Each of the four remaining integrals is the expectation of a beta distribution, which has an analytical solution: $\frac{\alpha}{\alpha + \beta}$.

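
The formulas behind this walkthrough can be sketched as follows. This is a reconstruction under the assumptions stated in the text (Bernoulli likelihoods and independent $\textrm{Beta}(\alpha,\beta)$ priors); the subscript convention $\phi_z$, $\psi_{zt}$ and the choice that each parameter is the probability of the value 1 are assumptions made for illustration.

\begin{align*}
P(\phi,\gamma,\psi | \boldsymbol D) &\propto P(\gamma)\,P(\phi)\,P(\psi) \prod_{m=1}^{M} P(z_m|\gamma)\,P(t_m|z_m,\phi)\,P(y_m|z_m,t_m,\psi) \\
\gamma | \boldsymbol D &\sim \textrm{Beta}\big(\alpha + N(Z{=}1),\; \beta + N(Z{=}0)\big) \\
\phi_z | \boldsymbol D &\sim \textrm{Beta}\big(\alpha + N(Z{=}z,T{=}1),\; \beta + N(Z{=}z,T{=}0)\big) \\
\psi_{zt} | \boldsymbol D &\sim \textrm{Beta}\big(\alpha + N(Z{=}z,T{=}t,Y{=}1),\; \beta + N(Z{=}z,T{=}t,Y{=}0)\big) \\
P(Y^*{=}1, z^* | T^*{=}t,\phi,\gamma,\psi) &= P(z^*|\gamma)\,P(Y^*{=}1|z^*,t,\psi) = \gamma^{z^*}(1-\gamma)^{1-z^*}\,\psi_{z^* t} \\
P(Y^*{=}1 | \boldsymbol D, T^*{=}t) &= E[1-\gamma | \boldsymbol D]\,E[\psi_{0t} | \boldsymbol D] + E[\gamma | \boldsymbol D]\,E[\psi_{1t} | \boldsymbol D]
\end{align*}

The last line uses the fact that the posteriors over $\gamma$ and $\psi$ are independent, so the expectation of each product is the product of expectations. With $\alpha = \beta = 1$ and the counts in our data, $E[1-\gamma | \boldsymbol D] = \frac{561}{852}$ and $E[\psi_{00} | \boldsymbol D] = \frac{51}{202}$, which reproduces the first term of the result for $T^* = 0$.
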
A philosophical aside on probability: Pearl appears to take a frequentist view of probability in framing the do-calculus, defining the problem of causation in terms of a stochastic system from which we observe many realisations in the pre-intervention world and where we consider a post-intervention world as a new but related stochastic system:

“probability theory deals with beliefs about an uncertain, yet static world, while causality deals with changes that occur in the world itself, (or in one’s theory of such changes). More specifically, causality deals with how probability functions change in response to influences (e.g., new conditions or interventions) that originate from outside the probability space, while probability theory, even when given a fully specified joint density function on all (temporally-indexed) variables in the space, cannot tell us how that function would change under such external influences. Thus, “doing” is not reducible to “seeing”, and there is no point trying to fuse the two together.” Pearl (2001)[10]

With the Bayesian approach, we have only a single realisation of the entire system, including both the pre- and post-intervention graphs simultaneously. Probability is now a decision-theoretic primitive rather than a long-run frequency. The framework we adopt can be viewed as the operational subjective view of de Finetti (see also the textbook treatments in [1,6,7]).

Case 2: now suppose that, instead of gender, $Z$ represents whether or not patients met the minimum weekly exercise recommendations after starting treatment. You (as the doctor) believe that the amount of exercise patients do is influenced by the treatment (as it reduces pain) and that exercise improves the likelihood of recovery. You are confident there are no other variables that might affect both which treatment was selected and the success of treatment. Under these assumptions, $Z$ is a consequence rather than a cause of $T$ in the causal graphical model, and there is no common cause of $T$ and $Y$.

#### Solution via the do calculus

Here is the causal graphical model representing these assumptions. Note that $T$ is now assumed to cause $Z$, so the direction of that arrow has changed.

Here’s what this causal graphical model implies about the relationship between systems pre and post-intervention.

The do-calculus gives:

$$P(Y=1|do(T=t)) = P(Y = 1| T = t)$$

This should not be surprising: since intervening on $T$ does not change the system, the distributions pre- and post-intervention are the same.

So, $P(Y = 1|do( T = 0)) = 0.56$ and $P(Y=1|do(T=1)) = 0.54$.
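
One way to see why intervening and conditioning coincide here is through the truncated factorisation of the joint distribution. The sketch below assumes the factorisation $P(t)P(z|t)P(y|z,t)$ with no common cause of $T$ and $Y$, as described above; intervening on $T$ only replaces the factor $P(t)$, and $T$ has no parents whose influence could be cut.

\begin{align*}
P(z, y | do(T{=}t)) &= P(z|t)\,P(y|z,t) \\
P(Y{=}1 | do(T{=}t)) &= \sum_{z} P(z|t)\,P(Y{=}1|z,t) = P(Y{=}1 | T{=}t) \\
P(Y{=}1 | do(T{=}0)) &= \frac{200}{450}\frac{50}{200} + \frac{250}{450}\frac{200}{250} = \frac{250}{450} \approx 0.56 \\
P(Y{=}1 | do(T{=}1)) &= \frac{360}{400}\frac{180}{360} + \frac{40}{400}\frac{36}{40} = \frac{216}{400} = 0.54
\end{align*}
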

Treatment $T = 0$ is better under these assumptions. Note that the data is identical for case 1 and case 2. What differs is the model we put around the data, which encodes our expectations of what will change when we intervene in the system.

#### Solution via Bayes rule

Here’s the probabilistic graphical model we obtain by applying Algorithm 1 to represent the assumptions under case 2.

As before, we need to parameterise the model, estimate a posterior for the parameters and solve this big integral:

$$P(Y^*{=}1 | \boldsymbol D, T^*{=}t) = \displaystyle\iiint\limits_{\phi,\gamma,\psi} P(\phi,\gamma,\psi | \boldsymbol D) \sum_{z^*} P(Y^* {=}1,z^* |T^*{=}t,\phi,\gamma,\psi) \textrm{d}\phi \textrm{d}\gamma \textrm{d}\psi$$

However, the parameterisation and post-intervention predictive distributions are not the same as for case 1, because of the differences in the structure of the model. This time, working through the same steps, and again assuming binary variables and uniform priors, we get:

\begin{align*}
P(Y^*{=}1 | \boldsymbol D, T^*{=}0) &= \frac{51}{202}\frac{201}{452} +\frac{201}{252}\frac{251}{452} \\
& \approx 0.555
\end{align*}

and

\begin{align*}
P(Y^*{=}1 | \boldsymbol D, T^*{=}1) &= \frac{181}{362}\frac{361}{402} +\frac{37}{42}\frac{41}{402}\\
& \approx 0.539
\end{align*}

Again, this agrees with the result obtained via the do-calculus, up to the small contribution of the prior.
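
The case 2 numbers can be reproduced with a few lines of code. This is a minimal sketch assuming uniform $\textrm{Beta}(1,1)$ priors and the same patient counts as in case 1; it compares the Bayesian posterior predictive with the plug-in estimate $P(Y=1|T=t)$ given by the do-calculus. The function names are illustrative.

```python
from fractions import Fraction

# Patient counts keyed by (z, t): (number recovered, number of patients).
counts = {
    (0, 0): (50, 200),
    (0, 1): (180, 360),
    (1, 0): (200, 250),
    (1, 1): (36, 40),
}


def post_mean(successes, trials):
    """Posterior mean of a Bernoulli probability under a uniform Beta(1, 1) prior."""
    return Fraction(successes + 1, trials + 2)


def n_treated(t):
    """Number of patients who took treatment t."""
    return sum(n for (_, tt), (_, n) in counts.items() if tt == t)


def predictive_case2(t):
    """Sum over z of E[P(Z=z | T=t)] * E[P(Y=1 | Z=z, T=t)]."""
    total = Fraction(0)
    for z in (0, 1):
        recovered, n_zt = counts[(z, t)]
        p_z_given_t = post_mean(n_zt, n_treated(t))  # uniform prior is symmetric in z
        p_y_given_zt = post_mean(recovered, n_zt)
        total += p_z_given_t * p_y_given_zt
    return total


def plug_in_do(t):
    """Do-calculus answer for case 2: the empirical P(Y=1 | T=t)."""
    recovered = sum(r for (_, tt), (r, _) in counts.items() if tt == t)
    return Fraction(recovered, n_treated(t))


for t in (0, 1):
    print(t, round(float(predictive_case2(t)), 3), round(float(plug_in_do(t)), 3))
```

The only structural change from case 1 is that $P(z)$ is replaced by $P(z|t)$, which is what reverses the recommendation.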

To see where these numbers come from, we parameterise the model, starting with $\phi$, and again use beta distributions as conjugate priors to get the posterior for each parameter. The post-intervention predictive factorises according to the structure of the case 2 model. Putting it all together, each of the resulting integrals is again the expectation of a beta distribution, so we can simply compute the relevant ratios of counts, yielding a result equivalent to the do-calculus except for a minor contribution from the parameter priors.
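
The formulas for case 2 can be sketched in the same way as for case 1. The parameter names here are an assumption made for illustration: $\phi$ for $P(T{=}1)$, $\gamma_t$ for $P(Z{=}1|T{=}t)$ and $\psi_{zt}$ for $P(Y{=}1|Z{=}z,T{=}t)$, each with an independent $\textrm{Beta}(\alpha,\beta)$ prior.

\begin{align*}
P(\phi,\gamma,\psi | \boldsymbol D) &\propto P(\phi)\,P(\gamma)\,P(\psi) \prod_{m=1}^{M} P(t_m|\phi)\,P(z_m|t_m,\gamma)\,P(y_m|z_m,t_m,\psi) \\
\phi | \boldsymbol D &\sim \textrm{Beta}\big(\alpha + N(T{=}1),\; \beta + N(T{=}0)\big) \\
\gamma_t | \boldsymbol D &\sim \textrm{Beta}\big(\alpha + N(Z{=}1,T{=}t),\; \beta + N(Z{=}0,T{=}t)\big) \\
\psi_{zt} | \boldsymbol D &\sim \textrm{Beta}\big(\alpha + N(Z{=}z,T{=}t,Y{=}1),\; \beta + N(Z{=}z,T{=}t,Y{=}0)\big) \\
P(Y^*{=}1, z^* | T^*{=}t,\phi,\gamma,\psi) &= P(z^*|t,\gamma)\,P(Y^*{=}1|z^*,t,\psi) = \gamma_t^{z^*}(1-\gamma_t)^{1-z^*}\,\psi_{z^* t} \\
P(Y^*{=}1 | \boldsymbol D, T^*{=}t) &= E[1-\gamma_t | \boldsymbol D]\,E[\psi_{0t} | \boldsymbol D] + E[\gamma_t | \boldsymbol D]\,E[\psi_{1t} | \boldsymbol D]
\end{align*}

With $\alpha = \beta = 1$, $E[1-\gamma_0 | \boldsymbol D] = \frac{201}{452}$ and $E[\psi_{00} | \boldsymbol D] = \frac{51}{202}$, which reproduces the first term of the result for $T^* = 0$. As in case 1, the posterior over $\phi$ integrates to 1 because the predictive does not depend on it.
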

This case nicely clarifies the confusion over conditioning on variables. The causal modelling community repeatedly warns that you must not condition on $Z$ in this example, by which they mean adjust for it or include it in a regression model. In contrast, in Bayesian terminology conditioning is simply learning by Bayes rule, and the Bayesian position is that all learning should be done by conditioning on all observed variables. In our calculations we do condition on $Z$ (along with all the other observable information) when computing the posterior over the parameters. However, because of the structure of the model, this information has essentially no impact on the posterior predictive of the treatment effect.
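
One way to illustrate the last point is to compare the case 2 posterior predictive, which conditions on $Z$, with the predictive from a simpler model that ignores $Z$ and learns $P(Y|T)$ directly. The sketch below assumes uniform $\textrm{Beta}(1,1)$ priors in both models; the simpler model is introduced here purely for comparison and is not part of the construction described in this post.

```python
from fractions import Fraction

# Patient counts keyed by (z, t): (number recovered, number of patients).
counts = {
    (0, 0): (50, 200),
    (0, 1): (180, 360),
    (1, 0): (200, 250),
    (1, 1): (36, 40),
}


def post_mean(successes, trials):
    """Posterior mean of a Bernoulli probability under a uniform Beta(1, 1) prior."""
    return Fraction(successes + 1, trials + 2)


def n_treated(t):
    return sum(n for (_, tt), (_, n) in counts.items() if tt == t)


def predictive_with_z(t):
    """Case 2 predictive: sum over z of E[P(z|t)] * E[P(Y=1|z,t)]."""
    return sum(post_mean(counts[(z, t)][1], n_treated(t)) * post_mean(*counts[(z, t)])
               for z in (0, 1))


def predictive_without_z(t):
    """Predictive from a hypothetical model that never looks at Z."""
    recovered = sum(r for (_, tt), (r, _) in counts.items() if tt == t)
    return post_mean(recovered, n_treated(t))


for t in (0, 1):
    gap = abs(float(predictive_with_z(t) - predictive_without_z(t)))
    print(f"T*={t}: difference between the two predictives = {gap:.5f}")
```

The two predictives differ only through the extra pseudo-counts that the priors on the $Z$-related parameters introduce, so the gap shrinks as the amount of data grows.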

This difference in terminology has certainly contributed to confusion and added heat to debates, including the Statistics in Medicine debate [12,16,18] and discussions on blogs such as Wasserman’s and Andrew Gelman’s, the latter with extensive comments by Pearl and Bareinboim.

## Conclusion

The primary goal of this post was to clarify the somewhat philosophical point that Bayesian inference is sufficiently general to model causality. The critical component for causal inference is a model that encodes assumptions about how an intervention will change a system. How we represent that model (as a Bayesian network explicitly showing the connections between the system pre and post intervention, as a causal graphical model or using counterfactuals) is a secondary consideration.

The goal is not to replace the do-calculus - which is an incredibly powerful tool, leveraging just the conditional independence information encoded in a causal graph to determine if and how we can obtain an expression for the quantity of interest post-intervention as a function over the original joint distribution. The solutions obtained via the do-calculus are clean and simple, compared to the cumbersome computation required for the direct probabilistic approach.

However, a limitation of the do calculus is that inference is impossible if the problem is not identifiable. It is also non-trivial to introduce additional information on the functional form of the relationship between variables.

Our approach offers some hope of addressing these limitations and softening the boundary between identifiable and non-identifiable problems. It is possible to obtain informative posteriors in cases where the query is non-identifiable via the do calculus (although unless we have sufficient additional information on the functional form of the relationships between variables, the posterior will remain sensitive to the prior even at the infinite data limit). The technical challenge of course is the presence of the very high dimensional integrals - particularly if there are latent variables - which we will explore in a future post.

We have provided only the most rudimentary summary of causal graphical models in our attempt to hint at a synthesis with Bayesian inference. Some may find our approach controversial (honestly, this is probably inevitable given the diversity of opinion in causality), so we suggest interested readers solicit broad views in addition to the one we present here. The start of a good reading list is:

[1] Bernardo, J. M., & Smith, A. F. (2009). Bayesian theory (Vol. 405). John Wiley & Sons.

[7] Lad, F. (1996). Operational subjective statistical methods: A mathematical, philosophical, and historical introduction (Vol. 315). Wiley-Interscience.

[11] Pearl, J. (2000). Causality. Cambridge University Press.

[15] Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

[16] Rubin, D. B. (2009). Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups?. Statistics in Medicine, 28(9), 1420-1423.

[18] Shrier, I. (2009). Propensity scores. Statistics in Medicine, 28(8), 1317-1318.

## More discussion of our idea

Lattimore, F. and Rohde, D. (2019) Bayesian Causality - Video Presentation

Lattimore, F. and Rohde, D. (2019) Replacing the do calculus with Bayes rule

Lattimore, F. and Rohde, D. (2019) Causal inference with Bayes rule

Lattimore, F. (2017) Learning how to act: making good decisions with machine learning