Social science explores human interaction. So, now that we have data on virtually every type of human interaction, can we, once and for all, see exactly how human society works? Sort of. The potential of “big data” is enormous. But data by themselves are not enough. In this essay, I will argue that research still requires accurate theoretical models to provide guidance, and compelling research strategies to understand causal relationships. The size of the data cannot make up for the absence of these other pillars of research in social science.

One common technical definition of “big data” describes a setting in which a researcher has more pieces of information about each observation than there are observations. Although this definition applies often in computer science or genomics, and occasionally in social sciences,1Xavier X. Sala-i-Martín, “I Just Ran Two Million Regressions,” American Economic Review 87, no. 2 (May, 1997): 178–183. Alexandre Belloni, Daniel Chen, Victor Chernozhukov, and Christian Hansen, “Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain,” Econometrica, November 2012. it does not describe many of the large-scale empirical exercises that have recently been influential in economics. Instead, economists have obtained rich new data from administrative or private sources where the number of observations exceeds the amount of information about each observation. These datasets are comprehensive and detailed, whether the universe of US tax returns,2Danny Yagan, “Moving to Opportunity? Migratory Insurance over the Great Recession,” Working paper, University of California at Berkeley, January 2014. the near-universe of credit card accounts,3Sumit Agarwal, Souphala Chomsisengphet, Neale Mahoney and Johannes Stroebel, “Regulating Consumer Financial Products: Evidence from Credit Cards,” Quarterly Journal of Economics, February 2015. or the universe of eBay transactions.4Liran Einav, Theresa Kuchler, Jonathan Levin, and Neel Sundaresan, “Assessing Sale Strategies in Online Markets using Matched Listings,” American Economic Journal: Microeconomics, May 2015. They may avoid some of the challenges of analyzing “big data” in the traditional sense. But my arguments here are at least as relevant when the observations become more detailed than numerous.

Economics has a long history of well-developed theory and attention to causal inference.5 James Stock and Francesco Trebbi, “Who Invented Instrumental Variable Regression?” Journal of Economic Perspectives, August 2003. This allows it to take advantage of the new opportunities that big data are now opening up, which is exactly what I try to do in my own research. For example, Jeffrey Clemens and I have shown that the structure of doctors’ payments affects how aggressively they practice medicine and which specialties they choose.6Jeffrey Clemens and Joshua D. Gottlieb, “Do Physicians’ Financial Incentives Affect Treatment Patterns and Patient Health?” American Economic Review, April 2014. So the next natural question is, how are the payments for these treatments set? Despite the topic’s importance, this price determination has historically been opaque. Big data allow us to tease out how private prices are set in granular detail. In ongoing work with Clemens and Tímea Laura Molnár, we use data on 71 million payments to study exactly this question.7Jeffrey Clemens, Joshua D. Gottlieb, and Tímea Laura Molnár, “The Anatomy of Physician Payments: Contracting Subject to Complexity,” National Bureau of Economic Research Working Paper No. 21642, October 2015.

Even with 71 million price observations, we can never hope to know exactly how every price is set. So, we have to first ask the right, limited questions based on economic theory. Because Medicare is the largest insurer, it can significantly influence the remaining private market.8Jeffrey Clemens and Joshua D. Gottlieb, “In the Shadow of a Giant: Medicare’s Influence on Private Physician Payments,” Journal of Political Economy, forthcoming. This observation suggests that we should look for private prices to be linked to Medicare payment rates. Guided by this principle, we look for prices in our data to be split up into two categories: some payments that depend directly on Medicare rates, and others that don’t. We use two empirical approaches to measure how often each type of price setting occurs.

Figure 1: This graph shows all of the different payments that a large private insurer makes to a group of doctors. It shows that most prices, but not all, are directly related to Medicare’s reimbursements. Each dot shows a different medical service and a price paid for that service. The values along the horizontal axis show what the group would earn for treating a Medicare patient, and the height shows what the private insurer actually paid. The graph shows that the private and Medicare prices line up most of the time, though the private prices are marked up over Medicare levels by different amounts. Nevertheless, some payments aren’t on any of these lines. Source: Adapted from Jeffrey Clemens, Joshua D. Gottlieb, and Tímea Laura Molnár, “The Anatomy of Physician Payments: Contracting Subject to Complexity.” National Bureau of Economic Research Working Paper No. 21642, October 2015.

Figure 1 shows the payments for one specific physician group. It plots private insurers’ payments for each specific treatment against the amount Medicare would pay for the same care. When we make this graph for any particular physician group, we tend to see the majority of private insurance payments lining up perfectly against Medicare’s rates. Figure 1 exemplifies this. Nevertheless, there are some deviations from a perfect relationship. These deviations seem to represent an attempt to improve on Medicare’s price list in situations where it is flawed. So, despite the convenient shortcut, private pricing contracts do not inherit all of Medicare’s inefficiencies.
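One way to make the pattern in Figure 1 concrete is to compute each service's private-to-Medicare price ratio and ask how many services share a ratio with several others. The sketch below is purely illustrative: the function name, the rounding precision, the threshold of three services, and the payment amounts are all assumptions made for this example, not the procedure or the data used in the paper.

```python
from collections import Counter


def share_on_common_markups(payments, min_services=3, digits=2):
    """Return the share of services priced at a commonly used markup.

    payments: list of (medicare_price, private_price) pairs, one per service.
    A markup counts as "common" if at least min_services services share it
    after rounding to `digits` decimals. Both thresholds are assumptions.
    """
    ratios = [round(private / medicare, digits) for medicare, private in payments]
    counts = Counter(ratios)
    common = {ratio for ratio, n in counts.items() if n >= min_services}
    on_a_line = sum(1 for ratio in ratios if ratio in common)
    return on_a_line / len(ratios), sorted(common)


# Hypothetical payments for one physician group: (Medicare price, private price).
payments = [
    (50, 60), (80, 96), (120, 144), (200, 240),  # 20 percent markup
    (40, 56), (100, 140), (150, 210),            # 40 percent markup
    (70, 95), (90, 101),                         # not on either line
]

share, markups = share_on_common_markups(payments)
print(f"Share of services on a common markup line: {share:.2f}")
print(f"Common markups over Medicare: {markups}")
```

In this made-up example, seven of the nine services sit on one of two markup lines, and the remaining two are the kind of deviation the text describes.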

This type of exercise would be impossible without access to the raw underlying payment data. Data on overall health-care spending amalgamate information from diverse physician groups; often conflate the price level with the amount of care; and lack the level of detail necessary to carefully investigate these phenomena. At the same time, we were aware of the limitations inherent in simply graphing this particular pricing relationship. Theory and empirical strategies allowed us to take the next step.

We next developed a separate method to estimate the direct, causal effect of Medicare prices on private ones. This strategy relies on instances when Medicare changes its payments. When private prices are linked to Medicare’s, a given percentage change in the latter should be reflected in an identical percentage change in the former. In contrast, when the privately negotiated rates are independent of Medicare, a Medicare change should leave private payments unaffected.

This theoretical and empirical model told us how to measure the private-public pricing spillovers. By looking at the relationship between percent changes in Medicare payments across different services and percent changes in private payments for those same services, we were able to infer how often the two are linked. We then looked for variation in these relationships across different physician groups and categories of care. Based on which theoretical model best matches this variation, we learned key information about the negotiating costs, and the groups’ objectives in their price adjustments.
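The logic of this approach can be illustrated with a small simulation. If linked prices move one-for-one with Medicare and independent prices do not move at all, then the slope from regressing private percent changes on Medicare percent changes should be close to the share of linked prices. The sketch below assumes exactly that stylized world; the function names, the sample size, the range of Medicare changes, and the 75 percent linked share are hypothetical choices for illustration, not estimates from the paper.

```python
import random


def ols_slope(x, y):
    """Slope from a least-squares regression of y on x with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


def simulate_price_changes(n_services, share_linked, seed=0):
    """Simulate percent changes in Medicare and private prices by service."""
    rng = random.Random(seed)
    medicare_change, private_change = [], []
    for _ in range(n_services):
        dm = rng.uniform(-0.10, 0.10)          # Medicare change (assumed range)
        linked = rng.random() < share_linked   # is this price tied to Medicare?
        dp = dm if linked else 0.0             # linked: one-for-one; otherwise no change
        medicare_change.append(dm)
        private_change.append(dp)
    return medicare_change, private_change


medicare_change, private_change = simulate_price_changes(20000, share_linked=0.75)
estimated_share = ols_slope(medicare_change, private_change)
print(f"Estimated share of Medicare-linked prices: {estimated_share:.3f}")
```

Real payment data are noisier than this, and private prices can change for reasons unrelated to Medicare, so the slope is only as informative as the assumption that Medicare's changes are unrelated to those other forces.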

We concluded that privately negotiated prices are in fact frequently set as a direct markup over Medicare rates. This is a useful shortcut to reduce physicians’ and insurers’ negotiating costs. Only the combination of extremely rich data, simplifications guided by theory, and the strategy for causal identification allowed us to make meaningful progress in understanding this fascinating sector of the economy.

Even as data grow, the most comprehensive dataset is inherently incomplete. The universe of tax returns omits people who don’t work and don’t file taxes. Even detailed health statistics and comprehensive medical records for the population of the world would inherently lack information on the deceased. Researchers have long been aware of the selection bias problem: relationships in the data we see may mislead us about the data we don’t see. This conceptual problem applies even in the most thorough datasets we could imagine.

To see how a combination of economic theory and clever research designs can overcome some of these limitations of big data, consider my student Oscar Becerra’s research for his PhD dissertation.9Oscar Becerra, “Pension Incentives and Formal-Sector Labor Supply: Evidence from Colombia,” Working paper, University of British Columbia, November 2015. Becerra is interested in the informal labor markets that are pervasive in developing countries. Many firms in these countries don’t register their existence with the government, and this informality reduces productivity10Rafael La Porta and Andrei Shleifer, “Informality and Development,” Journal of Economic Perspectives, Summer 2014. and hampers overall economic development. One of the main tradeoffs firms and workers face in deciding whether to register involves taxes. Registration is expensive in many places (including Colombia, which Becerra studies), since it generally opens the firm up to hefty taxes. On the other hand, registration gives the firm’s employees access to valuable government benefits. Becerra’s research examines how firms and workers handle this tradeoff, and measures how valuable these benefits, such as promises of a future pension, are to workers.

He obtained comprehensive data from the pension system in Colombia, so he has records on all workers who have contributed to that system. But informal labor markets exist precisely to avoid the government system. So even this full administrative dataset omits those working in the informal economy. Looking at work histories or wages for people who register with the pension system would be systematically misleading.

To resolve this problem, Becerra developed a theoretical model that guides his interpretation of the missing data. Those who are most likely to benefit from the pension contributions are likely to work in the formal sector. Those who are least likely to benefit view their payroll tax as just that—a tax. So they will search for informal jobs. Becerra’s model predicted who is likely to appear and who will be missing from the formal sector data. So a worker’s absence becomes just as informative as the worker’s presence. What would otherwise have been a problematic selection bias becomes a useful result.

The model, and basic intuition, suggests that formal-sector and informal-sector workers could be very different. So Becerra still needed a way to measure how government policies affect workers’ job-search decisions. This is where policy changes become valuable. Much like the developed world, many developing countries face financial problems with the long-term solvency of their pension systems. Colombia recently reformed its system to address this problem, and changed the criteria for pension eligibility. Becerra focused on one particular change: the requirements for earning a pension changed notably depending on a worker’s exact birth date. Men born on or after April 1, 1954, have to work 25 years in the formal sector to qualify for a pension. But everyone born on March 31, 1954, or earlier only has to work 20 years. Since the exact birth date is effectively random across a short window—such as March 31 vs. April 1—we can assume that two people close in age but subject to the different pension requirements are extremely similar. This nearly random variation in exact birth dates provided Becerra with the quasi-experiment he needed. Focusing on narrow variation in ages, he can address the direct effect of the policy change: how did the reform affect workers’ decisions to work in the formal or informal sectors?
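The comparison behind this quasi-experiment can be sketched in a few lines: take workers born within a narrow window on either side of the cutoff and compare their formal-sector employment rates. The example below uses simulated data only. The cutoff date and the 20- and 25-year requirements come from the description above, while the function names, the 90-day window, the sample size, and the employment probabilities of 0.40 and 0.30 are hypothetical values chosen for illustration and are not Becerra's estimates.

```python
import random
from datetime import date, timedelta

CUTOFF = date(1954, 4, 1)  # born on or after this date: stricter requirement


def required_years(birth_date):
    """Years of formal-sector work needed to qualify for a pension."""
    return 25 if birth_date >= CUTOFF else 20


def simulate_workers(n_workers, seed=0):
    """Simulate (birth_date, formal) records within a year of the cutoff."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_workers):
        birth_date = CUTOFF + timedelta(days=rng.randint(-365, 364))
        # Hypothetical behavior: the easier requirement raises formal work.
        p_formal = 0.40 if required_years(birth_date) == 20 else 0.30
        records.append((birth_date, 1 if rng.random() < p_formal else 0))
    return records


def local_difference(records, bandwidth_days):
    """Formal employment rate just before the cutoff minus just after it."""
    before = [f for b, f in records if 0 < (CUTOFF - b).days <= bandwidth_days]
    after = [f for b, f in records if 0 <= (b - CUTOFF).days < bandwidth_days]
    return sum(before) / len(before) - sum(after) / len(after)


records = simulate_workers(40000)
gap = local_difference(records, bandwidth_days=90)
print(f"Difference in formal employment at the cutoff: {gap:.3f}")
```

A narrower window makes the two groups more alike but leaves fewer workers to compare, which is the usual tradeoff in this kind of design.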

Figure 2: These graphs show the effect of the Colombian pension reform on labor supply in the formal sector. In both graphs, the horizontal axis represents age above the cutoff for pension generosity. Positive numbers imply older ages, and thus workers who can qualify more easily for a pension. Workers to the left of the zero marker, in contrast, face more stringent requirements. The height of each dot shows formal sector employment for workers of each age. The left panel shows that in 2005, when older workers expected substantial benefits from working in the formal sector, they were 16 percent more likely to do so. The right panel shows that in 2011, when these workers had mostly qualified for their pension already, they were less likely to work in formal jobs. Source: Adapted from Oscar Becerra, “Pension Incentives and Formal-Sector Labor Supply: Evidence from Colombia.” Working paper, University of British Columbia, November 2015.

Figure 2 provides the answer: workers respond strongly to changes in the pension benefits they anticipate. Those who have to pay the substantial pension contributions, but don’t expect to get anything in return, shy away from formal sector work. As the worker’s personal valuation of formality increases, this force reverses and workers seek out formal-sector jobs.

Despite the absence of informal workers from Becerra’s large administrative datasets, his model allows us to learn a great deal about their behavior and preferences. Yet the model alone would not have been sufficient. By seeing the administrative records of all formal workers’ labor market histories, he makes a convincing case that the absences provide information on their choices to work in the informal sector. This type of analysis shows the power of large-scale administrative data in combination with high-quality research strategies and theoretical frameworks. In combination, these inputs are advancing the research frontier.

Although my examples discuss research in economics, the same principles apply in the social sciences more broadly. Big data open up new opportunities in politics, sociology, psychology, and beyond. But these disciplines face the same challenges as economics: to exploit these data effectively, they also need well-designed research strategies developed with causal inference in mind. There is promising evidence that researchers in these fields are taking up the challenge and using big data appropriately.11Matthew Gentzkow and Jesse Shapiro, “Ideological Segregation Online and Offline,” Quarterly Journal of Economics, November 2011. Markus Gangl, “Causal Inference in Sociological Research,” Annual Review of Sociology, 2010. This kind of research will pave the way to generating insights that improve public policy and benefit society at large.