Parameters

Making Big Data Informative Data

by Joshua D. Gottlieb September 14, 2016

Social science explores human interaction. So, now that we have data on virtually every type of human interaction, can we, once and for all, see exactly how human society works? Sort of. The potential of “big data” is enormous. But data by themselves are not enough. In this essay, I will argue that research still requires accurate theoretical models to provide guidance, and compelling research strategies to understand causal relationships. The size of the data cannot make up for the absence of these other pillars of research in social science.

One common technical definition of “big data” is a dataset in which the researcher has more pieces of information about each observation than there are observations. Although this definition applies often in computer science or genomics, and occasionally in social sciences,¹ it does not describe many of the large-scale empirical exercises that have recently been influential in economics. Instead, economists have obtained rich new data from administrative or private sources where the number of observations exceeds the information about each observation. These datasets are comprehensive and detailed, whether the universe of US tax returns,² a near-universe of credit card accounts,³ or the universe of eBay transactions.⁴ They may avoid some of the challenges of analyzing “big data” in the traditional sense. But my arguments here are at least as relevant when the observations become more detailed than numerous.

Economics has a long history of well-developed theory and attention to causal inference.⁵ This allows it to take advantage of the new opportunities that big data are now opening up, which is exactly what I try to do in my own research. For example, Jeffrey Clemens and I have shown that the structure of doctors’ payments affects how aggressively they practice medicine and which specialties they choose.⁶ So the next natural question is, how are the payments for these treatments set? Despite the topic’s importance, this price determination has historically been opaque. Big data allow us to tease out how private prices are set in granular detail. In ongoing work with Clemens and Tímea Laura Molnár, we use data on 71 million payments to study exactly this question.⁷

Even with 71 million price observations, we can never hope to know exactly how every price is set. So, we have to first ask the right, limited questions based on economic theory. Because Medicare is the largest insurer, it can significantly influence the remaining private market.⁸ This observation suggests that we should look for private prices to be linked to Medicare payment rates. Guided by this principle, we look for prices in our data to be split up into two categories: some payments that depend directly on Medicare rates, and others that don’t. We use two empirical approaches to measure how often each type of price setting occurs.

Figure 1: This graph shows all of the different payments that a large private insurer makes to a group of doctors. It shows that most prices, but not all, are directly related to Medicare’s reimbursements. Each dot shows a different medical service and a price paid for that service. The values along the horizontal axis show what the group would earn for treating a Medicare patient, and the height shows what the private insurer actually paid. The graph shows that the private and Medicare prices line up most of the time, though the private prices are marked up over Medicare levels by different amounts. Nevertheless, some payments aren’t on any of these lines. Source: Adapted from Jeffrey Clemens, Joshua D. Gottlieb, and Tímea Laura Molnár, “The Anatomy of Physician Payments: Contracting Subject to Complexity.” National Bureau of Economic Research Working Paper No. 21642, October 2015.

Figure 1 shows the payments for one specific physician group. It plots private insurers’ payments for each specific treatment against the amount Medicare would pay for the same care. When we make this graph for any particular physician group, we tend to see the majority of private insurance payments lining up perfectly against Medicare’s rates. Figure 1 exemplifies this. Nevertheless, there are some deviations from a perfect relationship. These deviations seem to represent an attempt to improve on Medicare’s price list in situations where it is flawed. So, despite the convenient shortcut, private pricing contracts do not inherit all of Medicare’s inefficiencies.
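One way to make the pattern in Figure 1 concrete is to compute the ratio of each private payment to the Medicare rate for the same service and check how many payments share a common ratio. The sketch below is only an illustration under stated assumptions: the payment values are hypothetical, the function names are invented for this example, and the rule that a markup counts as "common" once at least three services share it is an arbitrary choice, not the procedure used in the working paper.

```python
from collections import Counter

def implied_markups(payments, decimals=2):
    """Return the private/Medicare price ratio for each payment, rounded.

    `payments` is a list of (medicare_rate, private_price) pairs.
    """
    return [round(private / medicare, decimals) for medicare, private in payments]

def share_on_common_markups(payments, min_count=3, decimals=2):
    """Share of payments whose markup is shared by at least `min_count` services.

    The `min_count` threshold is an illustrative assumption.
    """
    ratios = implied_markups(payments, decimals)
    counts = Counter(ratios)
    common = {ratio for ratio, n in counts.items() if n >= min_count}
    on_line = sum(1 for ratio in ratios if ratio in common)
    return on_line / len(ratios), sorted(common)

# Hypothetical payments for one physician group: (Medicare rate, private price).
payments = [
    (50.0, 60.0), (80.0, 96.0), (120.0, 144.0), (200.0, 240.0),  # 20% markup
    (40.0, 56.0), (100.0, 140.0), (150.0, 210.0),                # 40% markup
    (70.0, 95.0), (90.0, 99.0), (60.0, 100.0),                   # no shared markup
]

share, markups = share_on_common_markups(payments)
print(f"Share of payments on a common markup line: {share:.0%}")
print(f"Common markups over Medicare: {markups}")
```

In this toy example, seven of the ten payments sit on one of two markup lines, and the remaining three are the kind of deviations discussed above.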

This type of exercise would be impossible without access to the raw underlying payment data. Data on overall health-care spending amalgamate information from diverse physician groups; often conflate the price level with the amount of care; and lack the level of detail necessary to carefully investigate these phenomena. At the same time, we were aware of the limitations inherent in simply graphing this particular pricing relationship. Theory and empirical strategies allowed us to take the next step.

We next developed a separate method to estimate the direct, causal effect of Medicare prices on private ones. This strategy relies on instances when Medicare changes its payments. When private prices are linked to Medicare’s, a given percentage change in the latter should be reflected in an identical percentage change in the former. In contrast, when the privately negotiated rates are independent of Medicare, a Medicare change should leave private payments unaffected.

This theoretical and empirical model told us how to measure the private-public pricing spillovers. By looking at the relationship between percent changes in Medicare payments across different services and percent changes in private payments for those same services, we were able to infer how often the two are linked. We then looked for variation in these relationships across different physician groups and categories of care. Based on which theoretical model best matches this variation, we learned key information about the negotiating costs, and the groups’ objectives in their price adjustments.
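The logic of this approach can be sketched with simulated data. If linked prices move one-for-one with Medicare and independent prices do not move at all, then the slope from regressing percent changes in private payments on percent changes in Medicare payments approximates the share of payments that are linked. Everything in the sketch below is an assumption made for illustration: the data are simulated, the 75 percent linked share and the noise level are arbitrary, and a simple least-squares slope stands in for whatever specification the actual study uses.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope from regressing ys on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

rng = random.Random(0)          # fixed seed so the simulation is reproducible
TRUE_LINKED_SHARE = 0.75        # assumed share of prices tied to Medicare

medicare_change = []            # percent change in Medicare payment, per service
private_change = []             # percent change in private payment, same service
for _ in range(20000):
    dm = rng.uniform(-0.2, 0.2)
    linked = rng.random() < TRUE_LINKED_SHARE
    # Linked prices follow Medicare one-for-one; independent prices do not move.
    dp = (dm if linked else 0.0) + rng.gauss(0, 0.01)
    medicare_change.append(dm)
    private_change.append(dp)

estimated_share = ols_slope(medicare_change, private_change)
print(f"Estimated share of linked payments: {estimated_share:.2f}")
```

With enough services, the estimated slope should land close to the assumed linked share, which is why the slope can be read as a measure of how often the two prices are tied together.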

We concluded that privately negotiated prices are in fact frequently set as a direct markup over Medicare rates. This is a useful shortcut to reduce physicians’ and insurers’ negotiating costs. Only the combination of extremely rich data, simplifications guided by theory, and the strategy for causal identification allowed us to make meaningful progress in understanding this fascinating sector of the economy.

Even as data grow, the most comprehensive dataset is inherently incomplete. The universe of tax returns omits people who don’t work and don’t file taxes. Even detailed health statistics and comprehensive medical records for the population of the world would inherently lack information on the deceased. Researchers have long been aware of the selection bias problem: relationships in the data we see may mislead us about the data we don’t see. This conceptual problem applies even in the most thorough datasets we could imagine.

To see how a combination of economic theory and clever research designs can overcome some of these limitations of big data, consider my student Oscar Becerra’s research for his PhD dissertation.⁹ Becerra is interested in the informal labor markets that are pervasive in developing countries. Many firms in these countries don’t register their existence with the government, and this informality reduces productivity¹⁰ and hampers overall economic development. One of the main tradeoffs firms and workers face in deciding whether to register involves taxes. Registration is expensive in many places—including Colombia, which Becerra studies—since it generally opens the firm up to hefty taxes. On the other side, registration gives their employees access to valuable government benefits. Becerra’s research examines how firms and workers handle this tradeoff, and measures how valuable these benefits—such as promises of a future pension—are to workers.

He obtained comprehensive data from the pension system in Colombia, so he has records on all workers who have contributed to that system. But informal labor markets exist precisely to avoid the government system. So even this full administrative dataset omits those working in the informal economy. Looking at work histories or wages for people who register with the pension system would be systematically misleading.

To resolve this problem, Becerra developed a theoretical model that guides his interpretation of the missing data. Those who are most likely to benefit from the pension contributions are likely to work in the formal sector. Those who are least likely to benefit view their payroll tax as just that—a tax. So they will search for informal jobs. Becerra’s model predicted who is likely to appear and who will be missing from the formal sector data. So a worker’s absence becomes just as informative as the worker’s presence. What would otherwise have been a problematic selection bias becomes a useful result.

The model, and basic intuition, suggests that formal-sector and informal-sector workers could be very different. So Becerra still needed a way to measure how government policies affect workers’ job-search decisions. This is where policy changes become valuable. Much like the developed world, many developing countries face financial problems with the long-term solvency of their pension systems. Colombia recently reformed its system to address this problem, and changed the criteria for pension eligibility. Becerra focused on one particular change: the requirements for earning a pension changed notably depending on a worker’s exact birth date. Men born on or after April 1, 1954, have to work 25 years in the formal sector to qualify for a pension. But everyone born on March 31, 1954, or earlier only has to work 20 years. Since the exact birth date is effectively random across a short window—such as March 31 vs. April 1—we can assume that two people close in age but subject to the different pension requirements are extremely similar. This nearly random variation in exact birth dates provided Becerra with the quasi-experiment he needed. Focusing on narrow variation in ages, he can address the direct effect of the policy change: how did the reform affect workers’ decisions to work in the formal or informal sectors?
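The birth-date comparison described above is a regression discontinuity design. A minimal sketch of the idea, using simulated data, fits a straight line to formal-sector employment on each side of the cutoff within a narrow window and compares the two fitted values at the cutoff. The data, the size of the jump, the 90-day window, and the function names are all assumptions chosen for illustration; this is not necessarily the specification Becerra uses.

```python
import random

def ols_line(xs, ys):
    """Return (intercept, slope) from a least-squares fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rd_jump(age_days, formal, bandwidth):
    """Local linear estimate of the jump in formal employment at the cutoff.

    age_days > 0 means the worker is older than the cutoff (the easier rule).
    """
    older = [(a, f) for a, f in zip(age_days, formal) if 0 < a <= bandwidth]
    younger = [(a, f) for a, f in zip(age_days, formal) if -bandwidth <= a <= 0]
    intercept_older, _ = ols_line(*zip(*older))
    intercept_younger, _ = ols_line(*zip(*younger))
    return intercept_older - intercept_younger

rng = random.Random(0)   # fixed seed so the simulation is reproducible
TRUE_JUMP = 0.10         # assumed effect of the easier pension rule

age_days, formal = [], []
for _ in range(200000):
    a = rng.randint(-365, 365)   # days older (+) or younger (-) than the cutoff
    # Smooth age trend plus a discrete jump for workers under the easier rule.
    p = 0.30 + 0.0001 * a + (TRUE_JUMP if a > 0 else 0.0)
    age_days.append(a)
    formal.append(1 if rng.random() < p else 0)

estimate = rd_jump(age_days, formal, bandwidth=90)
print(f"Estimated jump in formal employment at the cutoff: {estimate:.3f}")
```

Fitting a line on each side, rather than comparing simple averages, keeps the smooth age trend from being mistaken for an effect of the policy. The narrow window reflects the assumption that workers born a few weeks apart are otherwise similar.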

Figure 2: These graphs show the effect of the Colombian pension reform on labor supply in the formal sector. In both graphs, the horizontal axis represents age above the cutoff for pension generosity. Positive numbers imply older ages, and thus workers who can qualify more easily for a pension. Workers to the left of the zero marker, in contrast, face more stringent requirements. The height of each dot shows formal sector employment for workers of each age. The left panel shows that in 2005, when older workers expected substantial benefits from working in the formal sector, they were 16 percent more likely to do so. In 2011, when these workers had mostly qualified for their pension already, they were less likely to work in formal jobs. Source: Adapted from Oscar Becerra, “Pension Incentives and Formal-Sector Labor Supply: Evidence from Colombia.” Working paper, University of British Columbia, November 2015.

Figure 2 provides the answer: workers respond strongly to changes in the pension benefits they anticipate. Those who have to pay the substantial pension contributions, but don’t expect to get anything in return, shy away from formal sector work. As the worker’s personal valuation of formality increases, this force reverses and workers seek out formal-sector jobs.

Despite the absence of informal workers from Becerra’s large administrative datasets, his model allows us to learn a great deal about their behavior and preferences. Yet the model alone would not have been sufficient. By seeing the administrative records of all formal workers’ labor market histories, he makes a convincing case that the absences provide information on their choices to work in the informal sector. This type of analysis shows the power of large-scale administrative data in combination with high-quality research strategies and theoretical frameworks. In combination, these inputs are advancing the research frontier.

Although my examples discuss research in economics, the same principles apply in the social sciences more broadly. Big data open up new opportunities in politics, sociology, psychology, and beyond. But these disciplines face the same challenges as economics: to exploit these data effectively, they also need well-designed research strategies developed with causal inference in mind. There is promising evidence that researchers in these fields are taking up the challenge and using big data appropriately.¹¹ This kind of research will pave the way to generating insights that improve public policy and benefit society at large.

Joshua D. Gottlieb

Joshua Gottlieb is an assistant professor in the Vancouver School of Economics at the University of British Columbia, and faculty research fellow in the National Bureau of Economic Research. He is also a visiting scholar at the Federal Reserve Bank of San Francisco, and was a visiting assistant professor at Stanford University in 2015-16. Gottlieb completed his PhD in economics at Harvard University in 2012. He won the 2015 Kenneth Arrow Award for best paper in health economics and the 2012 National Tax Association Dissertation Award. Gottlieb's work has been published in outlets such as the Journal of Political Economy,...