Blog

Home » Practicing Principles » Modern Causal Inference » Augmenting » On media » Introducing DoWhy

Introducing DoWhy

The human mind has a remarkable ability to associate causes with a specific event. From the outcome of an election to an object dropping on the floor, we are constantly associating chains of events that cause a specific effect. Neuropsychology refers to this cognitive ability as causal reasoning. Computer science and economics study a specific form of causal reasoning known as causal inference which focuses on exploring relationships between two observed variables. Over the years, machine learning has produced many methods for causal inference but they remain mostly difficult to use in mainstream applications. Recently, Microsoft Research open sourced DoWhy, a framework for causal thinking and analysis.

The challenge with causal inference is not that is a new discipline, quite the opposite, but that the current methods represent a very small and simplistic version of causal reasoning. Most models that try to connect causes such as linear regression rely on empirical analysis that makes some assumption about the data. Pure causal inference relies on counterfactual analysis which is a closer representation to how humans make decisions. Imagine a scenario in which you are traveling with your families for vacations to an unknown destination. Before and after the vacation you are wrestling with a few counterfactual questions:

Answering these questions is the focus of causal inference. Unlike supervised learning, causal inference depends on estimation of unobserved quantities. This if often known as the “fundamental problem” of causal inference which implies that a model never has a purely objective evaluation through a held-out test set. In our vacation example, you can either observe the effects on going on vacation or not going on vacations but never both. This challenge forces causal inference to make critical assumptions about the data generation process. Traditional machine learning frameworks for causal inference try to take shortcuts around the “fundamental problem” resulting on a very frustrating experience for data scientists and developers.

Introducing DoWhy

Microsoft’s DoWhy is a Python-based library for causal inference and analysis that attempts to streamline the adoption of causal reasoning in machine learning applications. Inspired by Judea Pearl’s do-calculus for causal inference, DoWhy combines several causal inference methods under a simple programming model that removes many of the complexities of traditional approaches. Compared to its predecessors, DoWhy makes three key contributions to the implementation of causal inference models.

Conceptually, DoWhy was created following two guiding principles: asking causal assumptions explicit and testing robustness of the estimates to violations of those assumptions. In other words, DoWhy separates the identification of a causal effect from the estimation of its relevance which enables the inference of very sophisticated causal relationships.

To accomplish its goal, DoWhy models any causal inference problem in a workflow with four fundamental steps: model, identify, estimate and refute.

Using DoWhy

Developers can start using DoWhy by installing the Python module using the following command:

python setup.py install

Like any other machine learning program, the first step of a DoWhy application is to load the dataset. In this example, imagine that we are trying to infer the correlation between different medical treatments and outcomes represented by the following dataset.

Treatment    Outcome        w0
0   2.964978   5.858518 -3.173399
1   3.696709   7.945649 -1.936995
2   2.125228   4.076005 -3.975566
3   6.635687  13.471594  0.772480
4   9.600072  19.577649  3.922406

DoWhy relies on pandas dataframes to capture the input data:

rvar = 1 if np.random.uniform() >0.5 else 0 
data_dict = dowhy.datasets.xy_dataset(10000, effect=rvar, sd_error=0.2) 
df = data_dict['df']
print(df[["Treatment", "Outcome", "w0"]].head())

At this point, we simply need about four steps to infer causal relationships between the variables. The four steps correspond to the four operations of DoWhy: model, estimate, infer and refute. We can start by modeling the problem as a causal graph:

model= CausalModel(
        data=df,
        treatment=data_dict["treatment_name"],
        outcome=data_dict["outcome_name"],
        common_causes=data_dict["common_causes_names"],
        instruments=data_dict["instrument_names"])
model.view_model(layout="dot")
from IPython.display import Image, display
display(Image(filename="causal_model.png"))

The next step is to identify the causal relationships in the graph:

identified_estimand = model.identify_effect()

Now we can estimate the causal effect and determine if the estimation is correct. This example uses linear regression for simplicity:

estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression")
# Plot Slope of line between treamtent and outcome =causal effect
dowhy.plotter.plot_causal_effect(estimate, df[data_dict["treatment_name"]], df[data_dict["outcome_name"]])

Finally, we can use different technique to refute the causal estimator:

res_random=model.refute_estimate(identified_estimand, estimate, method_name="random_common_cause")

DoWhy is a very simple and useful framework to implement causal inference models. The current version can be used as a standalone library or integrated into popular deep learning frameworks such as TensorFlow or PyTorch. The combination of multiple causal inference methods under a single framework and the four-step simple programming model makes DoWhy incredibly simple to use for data scientist tackling causal inference problems.

References

  1. https://towardsdatascience.com/introducing-dowhy-cc58b75d61ac