It is obvious that the answer to the question in the title of this essay is “yes.” However, the philosophies of the social sciences and machine learning are very different, and some clarity should be provided on how machine learning tools can mesh with the study of sociological theories. Many sociologists start with a causal hypothesis and aim to investigate it, often using assumptions about how the data were generated. For instance, we might assume the data come from a linear model. The generative assumptions are important: if they hold, then the model is meaningful in describing the causal relationship. In machine learning, the philosophy is usually very different. We might aim to predict an outcome with the only assumption being that the data are drawn independently from an unknown distribution. In that case, we would not assume a particular form for the model that generates the data, but we would try to predict on new data drawn from that distribution anyway. These deep philosophical differences do not mean, however, that the two areas are incompatible; quite the contrary. But they do raise the question of how machine learning can help to address critical social science questions. At the very least, we should clarify the difference in scientific philosophies between sociological theory and machine learning. Some of the key differences are:

- Sociological theory is often hypothesis-driven, whereas machine learning is data-driven. In machine learning, one starts with a dataset to build a hypothesis, whereas in sociology, one often starts with a hypothesis. Both fields use (or at least *should* use) out-of-sample evaluation to test their hypotheses.
- In machine learning, the focus is generally on prediction, not on explanation. One can predict without explaining why a phenomenon happens.
- In machine learning, one does not believe the model is right. That is, the model is not assumed to be the data-generating mechanism. Models are evaluated only on how well they predict on data they were not created from, not on how well they explain. In sociology, one commonly considers whether a coefficient of a linear model is distinguishable from zero; this makes strong assumptions about the data-generating mechanism that machine learning researchers would not consider valid.
- The focus of machine learning has traditionally not been on causal effects, though that is changing. Machine learning can be useful in causal inference problems.

I will elaborate on each of these points below. Before I do this, I would like to address the term “data mining.” Data mining is a well-respected, very broad field of computer science and statistics, and the data mining community has heavy overlap with artificial intelligence, knowledge discovery (which is now called Big Data or data science), and machine learning. In sociology, the term “data mining” is commonly taken to mean “overfitting,” that is, finding patterns in a dataset that do not generalize beyond that dataset. Ironically, overfitting is exactly what the fields of machine learning and data mining aim to avoid. This will be clarified in the elaboration of point 1, where we discuss the emphasis on cross-validation and out-of-sample testing.

**Elaboration of point 1: Sociological theory is often hypothesis-driven, whereas machine learning is data-driven.**

A sociological study often starts with a hypothesis of how the world works. For instance, one might assume that a linear model with normal noise is generating values of salary, where the variables include demographic information and graduate education. One then infers coefficients and finds that the coefficient of graduate education is significantly different from zero, which leads to the conclusion that education may have an effect on salary. This makes a core assumption: that a linear model with the given covariates, plus normal noise, generates the data. One can criticize this approach as follows: what if there were two equally good models, measured on an out-of-sample test set, where one uses the “graduate education” covariate and one does not? We could argue that an alternative, equally simple explanation like this could cast serious doubt on the original hypothesis that graduate education is important.

In machine learning (in particular, statistical learning), the key assumption is that the data are drawn iid (independent and identically distributed) from an unknown distribution. This may be a strong assumption, but it is the one that is used. There are no other assumptions about the data-generation mechanism; the data are *not* generally assumed to be drawn from a linear model. In cross-validation, one splits the data into 10 folds (usually following a random permutation of the observations). The model is trained on 9 folds, and the 10th fold is held out as the “test” fold, on which we evaluate prediction quality. Each fold is used in turn as the test fold. This allows us to obtain a measure of uncertainty (standard deviation) on the prediction quality over the test folds. Prediction quality is assessed as usual; for instance, for a classification problem, we would measure the number of correct predictions (or the true positive and false positive rates, or in some cases, the area under the Receiver Operating Characteristic (ROC) curve). For regression, we might use the sum of squared differences. Regardless of what distribution the data were drawn from, we aim to show that we predict well on data drawn from the same distribution.
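As a concrete sketch of this 10-fold procedure (the dataset and the choice of logistic regression are purely illustrative, and scikit-learn is assumed to be available):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Illustrative synthetic classification data; any iid (X, y) would do.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 10 folds, following a random permutation of the observations.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])        # train on 9 folds
    accuracies.append(model.score(X[test_idx], y[test_idx]))  # held-out fold

# Mean prediction quality, with a measure of uncertainty over folds.
print(f"accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
```

The standard deviation over the 10 test folds is the uncertainty measure mentioned above; no assumption about the data-generating model is made anywhere in this procedure.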

We often test several different machine learning methods, rather than assuming one type of model is the correct one beforehand. In sociology this might be considered cheating, since the model is chosen after looking at some of the data. In a sense, the machine learner’s training set substitutes for the sociologists’ prior knowledge in forming the model.

**Elaboration of point 2: In machine learning, the focus is generally on prediction, not on explanation.**

Consider the following example of the (potentially Minority-Report-ish) problem of predicting criminal recidivism, based on age, criminal history, and other characteristics of the person. We do not know exactly what form of model is correct (whether it is linear, polynomial, or something more complicated), but we can still create a model that has high out-of-sample accuracy. This model would not necessarily help with our understanding of recidivism, since one cannot claim that a person’s criminal history or age causes them to commit more crimes, but one can definitely use it for predicting recidivism. Even if the model were very complex, if it were shown to predict well out of sample, it could still be useful for assessing risk.

Many people find the idea of a complicated machine learning model unappealing (the author of this essay included). On the other hand, one does not always need to see inside the “black box.” For instance, credit card companies use machine learning to predict fraud. When a transaction occurs, the algorithm processes it to determine whether it is potentially fraudulent. Regardless of what the form of the model is, and regardless of whether it explains why fraud has occurred, we have managed to identify the fraud correctly.

**Elaboration of point 3: In machine learning, one does not believe the model is right.**

This is best explained through the foundation of machine learning called statistical learning theory. The simplest possible learning-theoretic bound is the Occam’s Razor Bound. This is a probabilistic guarantee on out-of-sample prediction quality. In this basic classification setting, our machine learning method chooses one function *f* from a finite class of functions *F* of size |*F*|. Here we consider classification, where the goal is to predict “yes” or “no,” as in logistic regression.

**Theorem** *(Occam’s Razor Bound).* We have data $(x_i, y_i)$, $i = 1, \ldots, n$, drawn iid from an unknown distribution $\mu$ on $X \times \{0, 1\}$. We have a function class $F$, where each $f \in F$ is a function $f : X \to \{0, 1\}$.

In practice, our machine learning algorithm will choose a predictive model from $F$, but this bound will hold for all $f \in F$, so the bound is algorithm-independent.

Define the true risk as the probability of misclassification on an unknown point drawn out of sample from $\mu$:
$$R^{\text{true}}(f) = \mathbb{P}_{(x,y)\sim\mu}\left[f(x) \neq y\right].$$

Define the empirical risk as the fraction of misclassified training points:
$$R^{\text{emp}}(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{[f(x_i) \neq y_i]}.$$

Then, with probability at least $1 - \delta$, for all $f$ in $F$,
$$R^{\text{true}}(f) \;\leq\; R^{\text{emp}}(f) + \sqrt{\frac{\log|F| + \log(1/\delta)}{2n}}.$$

This is the simplest theorem from statistical learning theory, and the proof is just Hoeffding’s inequality combined with the union bound over all of the functions in *F*. What this bound says is that if your algorithm is performing well in-sample, and it uses only simple functions, then it is likely to generalize well to new data drawn from µ. Performing well in-sample means your empirical risk is small. The simplicity of the function class is the number of functions in it for this bound, but we have other ways of measuring the complexity of a set of functions (for instance, covering numbers). This bound tells us that *if we can create a simple model that describes the training data well, it will likely generalize well to new data.*

In this way, we do not need to assume a model for generating the data. As long as we have a good model that is simple, we get a guarantee on prediction quality. This is why effort in machine learning is focused on out-of-sample prediction quality, rather than assumptions about the data generation mechanism.
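To get a feel for the magnitude of the guarantee, one can compute the complexity term of the Occam’s Razor bound for some illustrative numbers (the values of |F|, n, and δ below are hypothetical, chosen only for demonstration):

```python
import math

def occam_gap(num_functions, n, delta):
    """Complexity term of the Occam's Razor bound:
    sqrt((log|F| + log(1/delta)) / (2n))."""
    return math.sqrt((math.log(num_functions) + math.log(1.0 / delta)) / (2 * n))

# Hypothetical: 1000 candidate functions, 10,000 samples, 95% confidence.
gap = occam_gap(num_functions=1000, n=10_000, delta=0.05)
print(f"true risk <= empirical risk + {gap:.4f}")  # gap is roughly 0.022
```

Note how the gap shrinks as $n$ grows and widens only logarithmically in the size of the function class, which is why restricting attention to simple (small) function classes pays off.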

If you start performing lots of experiments with different machine learning algorithms, you will immediately find that many algorithms have approximately the same out-of-sample performance. This is a well-known observation, called the *Rashomon Effect*, named after a Japanese film in which many equally good perspectives were provided on one crime. The framework of statistical learning theory is consistent with this experimental observation that there are many equally good models. Knowing this, if you had made an assumption about the data-generation mechanism, your model might perform well, but a lot of other totally different models will perform just as well. This casts doubt on whether the data really were generated from your model. This is why a p-value on a coefficient is not very meaningful in a machine learning sense: the p-value computation is based on strong assumptions about the data-generation process that very clearly may not be true.

On the other hand, there is an appeal to having a model with a generative story behind it. Many machine learning researchers feel that way too, despite not actually believing that the data are generated from the model.

**Elaboration on point 4: The focus of machine learning has traditionally not been on causal effects.**

Since the focus of machine learning has traditionally been on prediction, these models usually do not come with a causal explanation. This is very problematic from a sociological scientific perspective, where the goal is to find out what the causal effect is. On the other hand, machine learning methods can be directly useful in causal inference in two ways. First, one can use a machine learning model in the single-treatment framework. The simplest way to do this is as follows: first, create a predictive model for the outcome, given that the patient had the treatment. This will be a model that is a function of covariates *x* for the patient. Second, create a separate predictive model for the outcome, given that the patient is in the control group. The difference between these two models is the estimated causal effect, which is a function of covariates *x*. In that case, one has directly modeled the causal effect, and both predictive models can be machine learning models. One can also model the propensity to receive the treatment as a function of *x*. Another way to use machine learning is to use the training data to uncover possible causal structure in the data. In particular, one can construct a graphical model from the training data that shows possible causal relationships. Both of these areas are becoming popular, in the machine learning literature and in the economics literature.
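The two-model recipe described above can be sketched in a few lines. Everything here is illustrative: the synthetic data, the constant treatment effect of 2.0, and the choice of random forests as the two predictive models are all assumptions made for the example, not part of any particular study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: outcome depends on covariates x plus a constant
# treatment effect of 2.0 (chosen arbitrarily for illustration).
n = 2000
X = rng.normal(size=(n, 5))
treated = rng.integers(0, 2, size=n).astype(bool)
y = X[:, 0] + 2.0 * treated + rng.normal(scale=0.5, size=n)

# First predictive model: outcome given that the subject had the treatment.
model_treated = RandomForestRegressor(random_state=0).fit(X[treated], y[treated])
# Second predictive model: outcome given that the subject was in the control group.
model_control = RandomForestRegressor(random_state=0).fit(X[~treated], y[~treated])

# The difference of the two models is the estimated causal effect,
# itself a function of the covariates x.
effect = model_treated.predict(X) - model_control.predict(X)
print(f"estimated average treatment effect: {effect.mean():.2f}")  # close to 2.0
```

Both component models here are ordinary machine learning models evaluated by prediction quality; the causal interpretation of their difference rests on the usual ignorability-style assumptions, not on anything in the code itself.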

*Acknowledgments:* Thank you to the organizers of the Big Data Meets Urban Social Science Workshop at Radcliffe. In particular, this note was inspired by a conversation with sociologists Mario Small and Robert Sampson. Thank you also to Chris Winship.