Radical Empiricism and Machine Learning Research

July 26, 2020

Enticed by a recent seminar on this subject, I have re-read Breiman’s influential paper and would like to share with readers a re-assessment of its contributions to the art of statistical modeling.

When the paper first appeared, in 2001, I had the impression that, although the word “cause” did not appear explicitly, Breiman was trying to distinguish data-descriptive models from models of the data-generation process, also called “causal,” “substantive,” “subject-matter,” or “structural” models. Unhappy with his over-emphasis on prediction, I was glad nevertheless that a statistician of Breiman’s standing had recognized the on-going confusion in the field, and was calling for making the distinction crisp.

Upon re-reading the paper in 2020 I have realized that the two cultures contrasted by Breiman are not descriptive vs. causal but, rather, two styles of descriptive modeling, one interpretable, the other uninterpretable. The former is exemplified by predictive regression models, and the latter by modern big-data algorithms such as deep-learning, BART, trees and forests. The former carries the potential of being interpreted as causal, the latter leaves no room for such interpretation; it describes the prediction process chosen by the analyst, not the data-generation process chosen by nature. Breiman’s main point is: If you want prediction, do prediction for its own sake and forget about the illusion of representing nature.

Breiman’s paper deserves its reputation as a forerunner of modern machine learning techniques, but falls short of telling us what we should do if we want the model to do more than just prediction, say, to extract some information about how nature works, or to guide policies and interventions. For him, accurate prediction is the ultimate measure of merit for statistical models, an objective shared by the present-day machine learning enterprise, which accounts for many of its limitations (https://ucla.in/2HI2yyx).

In their comments on Breiman’s paper, David Cox and Bradley Efron noticed this deficiency and wrote:

“… fit, which is broadly related to predictive success, is not the primary basis for model choice and formal methods of model choice that take no account of the broader objectives are suspect. [The broader objectives are:] to establish data descriptions that are potentially causal.” (Cox, 2001)

And Efron concurs:

“Prediction by itself is only occasionally sufficient. … Most statistical surveys have the identification of causal factors as their ultimate goal.” (Efron, 2001)

As we read Breiman’s paper today, armed with what we know about the proper symbiosis of machine learning and causal modeling, we may say that his advocacy of algorithmic prediction was justified. Once guided by a causal model for identification and bias reduction, the predictive component of our model can safely be trusted to non-interpretable algorithms. The interpretation can be accomplished separately by the causal component of our model, as demonstrated, for example, in https://ucla.in/2HI2yyx.

Separating data-fitting from interpretation, an idea that was rather innovative in 2001, has withstood the test of time.

Judea

**ADDENDUM-1**

**ADDENDUM-2**

The following is an email exchange between Ying Nian Wu (UCLA, Statistics) and Judea Pearl (UCLA, Computer Science/Statistics).

Dear Judea,

I feel all models are about making predictions for future observations. The only difference is that a causal model predicts *p*(*y*|*do*(*x*)) in your notation, where the testing data (after cutting off the arrows into *x* by your diagram surgery) come from a different distribution than the training data, i.e., we want to extrapolate from training data to testing data (in fact, extrapolation and interpolation are relative: a simple model that can interpolate over a vast range is quite extrapolative). Ultimately a machine learning model also aims at extrapolative prediction, as in so-called transfer learning and meta learning, where testing data are different from training data, or the current short-term experience (small new training data) is different from the past long-term experience (big past training data).
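
The diagram surgery mentioned above can be written out in a minimal sketch. It assumes, purely for illustration, a three-variable graph in which a covariate *z* affects both *x* and *y*, and *x* affects *y*; this graph is an assumption and is not specified in the exchange. The observational (training) distribution keeps every factor, while the post-surgery (testing) distribution drops the factor for *x*:

```latex
\begin{align*}
p(x, y, z) &= p(z)\, p(x \mid z)\, p(y \mid x, z) && \text{(observational)} \\
p(y, z \mid do(x)) &= p(z)\, p(y \mid x, z) && \text{(after cutting the arrow } z \to x\text{)} \\
p(y \mid do(x)) &= \sum_z p(z)\, p(y \mid x, z) \\
p(y \mid x) &= \sum_z p(z \mid x)\, p(y \mid x, z)
\end{align*}
```

Under this assumed graph, the two predictive quantities differ only in the weight placed on *z*: *p*(*z*) after the surgery versus *p*(*z*|*x*) before it.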

About learning the model from data, we can learn *p*(*y*|*x*), but we can also learn *p*(*y*, *x*) = *p*(*y*) *p*(*x*|*y*). We may call *p*(*y*|*x*) predictive, and *p*(*x*|*y*) (or *p*(*y*, *x*)) generative, and both may involve hidden variables *z*. The generative model can learn from data where *y* is often unavailable (the so-called semi-supervised learning). In fact, learning a generative model *p*(*y*, *z*, *x*) = *p*(*z*) *p*(*y*, *x*|*z*) is necessary for predicting *p*(*y*|*do*(*x*)). I am not sure if this is also related to the two cultures mentioned by Breiman. I once asked him (at a workshop at Banff, while enjoying some second-hand smoking) about the two models, and he actually preferred the generative model, although in his talk he also emphasized that a non-parametric predictive model such as a forest is still interpretable in terms of assessing the influences of variables.

To digress a bit further, there is no such thing as how nature works according to the Copenhagen interpretation of quantum physics: there must be an observer, the observer makes a measurement, and the wave function predicts the probability distribution of the measurement. As to the question of what happens when there is no observer or the observer is not observing, the answer is that such a question is irrelevant.

Even back in the classical regime, where we can ask such a question, Ptolemy’s epicycle model of planetary motion, Newton’s model of gravitation, and Einstein’s model of general relativity are not that different. Ptolemy’s model is actually more general and flexible (being a Fourier expansion, where the cycle on top of cycles is similar in style to the perceptron on top of perceptrons of a neural network). Newton’s model is simpler, while Einstein’s model fits the data better (being equally simple but more involved in calculation). They are all illusions about how nature works, learned from the data, and intended to predict future data. Newton’s illusion is action at a distance (which he himself did not believe), while Einstein’s illusion is about bending of spacetime, which is more believable, but still an illusion nonetheless (to be superseded by a deeper illusion such as a string).

So Box is still right: all models are wrong, but some are useful. Useful in terms of making predictions, especially making extrapolative predictions.

Ying Nian

Dear Ying Nian,

Thanks for commenting on my “Causally Colored Reflections.”

I will start from the end of your comment, where you concur with George Box that “All models are wrong, but some are useful.” I have always felt that this aphorism is painfully true but hardly useful. As one of the most quoted aphorisms in statistics, it ought to have given us some clue as to what makes one model more useful than another – it doesn’t.

A taxonomy that helps decide model usefulness should tell us (at the very least) whether a given model can answer the research question we have in mind, and where the information encoded in the model comes from. Lumping all models in one category, as in “all models are about making predictions for future observations,” does not provide this information. It reminds me of Don Rubin’s statement that causal inference is just a “missing data problem,” which naturally raises the question of what problems are NOT missing data problems, say, mathematics, chess or astrology.

In contrast, the taxonomy defined by the Ladder of Causation (see https://ucla.in/2HI2yyx): 1. Association, 2. Intervention, 3. Counterfactuals, does provide such information. Merely looking at the syntax of a model one can tell whether it can answer the target research question, and where the information supporting the model should come from, be it observational studies, experimental data, or theoretical assumptions. The main claim of the Ladder (now a theorem) is that one cannot answer questions at level i unless one has information of type i or higher. For example, there is no way to answer policy related questions unless one has experimental data or assumptions about such data. As another example, I look at what you call a generative model *p*(*y*,*z*,*x*) = *p*(*z*)*p*(*y, x*|*z*) and I can tell right away that, no matter how smart we are, it is not sufficient for predicting *p*(*y*|*do*(*x*)).
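
One way to see why a joint distribution alone cannot settle *p*(*y*|*do*(*x*)) is a minimal sketch with two hypothetical binary models, omitting the hidden variable *z* for brevity. In the first, *x* causes *y*; in the second, *y* causes *x*. The noise level `NOISE`, the function names, and both mechanisms are assumptions made for illustration; they do not come from the exchange. The two models induce the same observational joint, yet the surgery on *x* yields different answers.

```python
from itertools import product

NOISE = 0.1  # assumed probability that an effect differs from its cause


def exogenous():
    """Enumerate the exogenous terms: u1 ~ Bernoulli(0.5), u2 ~ Bernoulli(NOISE)."""
    for u1, u2 in product((0, 1), repeat=2):
        prob = 0.5 * (NOISE if u2 else 1 - NOISE)
        yield u1, u2, prob


def model_x_causes_y(u1, u2, do_x=None):
    """Hypothetical model A: x is set by u1 (or by intervention), y copies x unless u2 flips it."""
    x = u1 if do_x is None else do_x
    y = x ^ u2
    return x, y


def model_y_causes_x(u1, u2, do_x=None):
    """Hypothetical model B: y is set by u1, x copies y unless u2 flips it (or x is set by intervention)."""
    y = u1
    x = (y ^ u2) if do_x is None else do_x  # surgery: the arrow y -> x is cut
    return x, y


def distribution(model, do_x=None):
    """Exact distribution over (x, y), observational or under do(x = do_x)."""
    dist = {}
    for u1, u2, prob in exogenous():
        outcome = model(u1, u2, do_x)
        dist[outcome] = dist.get(outcome, 0.0) + prob
    return dist


def p_y1_when_x1(dist):
    """P(y = 1 | x = 1) in the given distribution.

    Applied to a distribution produced with do_x=1, all mass has x = 1,
    so this equals P(y = 1 | do(x = 1)).
    """
    numerator = dist.get((1, 1), 0.0)
    return numerator / (numerator + dist.get((1, 0), 0.0))


if __name__ == "__main__":
    for name, model in [("x causes y", model_x_causes_y),
                        ("y causes x", model_y_causes_x)]:
        observational = p_y1_when_x1(distribution(model))
        interventional = p_y1_when_x1(distribution(model, do_x=1))
        print(f"{name}: P(y=1|x=1)={observational:.2f}, "
              f"P(y=1|do(x=1))={interventional:.2f}")
```

Under these assumptions both models agree on *p*(*y*=1|*x*=1), while *p*(*y*=1|*do*(*x*=1)) is 0.9 in the first and 0.5 in the second, so observational data alone cannot tell them apart on the interventional question.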

If you doubt the usefulness of this taxonomy, just examine the amount of effort that has been spent (and is still being spent) by the machine learning community on the so-called “transfer learning” problem. This effort has been futile because elementary inspection of the extrapolation task tells us that it cannot be accomplished using non-experimental data, shifting or not. See https://ucla.in/2N7S0K9.

In summary, unification of research problems is helpful when it facilitates the transfer of tools across problem types. Taxonomy of research problems is helpful too, for it spares us the effort of trying the impossible, and it tells us where we should seek the information to support our models.

Thanks again for engaging in this conversation,

Judea

Dear Judea,

Thanks for the inspiring discussion. Please allow me to formulate our consensus, and I will stop here.

Unification 1: All models are for prediction.

Unification 2: All models are for the agent to plan the action. Unification 2 is deeper than Unification 1. But Unification 1 is a good precursor.

Taxonomy 1: (a) models that predict *p*(*y*|*x*), (b) models that predict *p*(*y*|*do*(*x*)), or (c) models that can fill in Rubin’s table.

Taxonomy 2: (a) models that fit data, do not necessarily make sense, and serve only for prediction, (b) models that understand how nature works and are interpretable.

Taxonomy 1 is deeper and more precise than Taxonomy 2, thanks to the foundational work of you and Rubin. It is based on precise, well-defined, operational mathematical language and formulation.

Taxonomy 2 is useful and is often aligned with Taxonomy 1, but we need to be aware of the limitation of Taxonomy 2, which is all I want to say in my comments. Much ink has been spilled on Taxonomy 2 because of its imprecise and non-operational nature.

Ying Nian
