The politics of social science access to data are shifting rapidly in the United States as in other developed countries. It used to be that states were the most important source of data on their citizens, economy, and society. States needed to collect and aggregate large amounts of information for their own purposes. They gathered this directly—e.g., through censuses of individuals and firms—and also constructed relevant indicators. Sometimes state agencies helped to fund social science projects in data gathering, such as the National Science Foundation’s funding of the American National Election Studies over decades. While scholars such as James Scott and John Brewer disagreed about the benefits of state data gathering, they recognized the state’s primary role.

In this world, the politics of access to data were often the politics of engaging with the state. Sometimes the state was reluctant to provide information, either for ethical reasons (e.g., the privacy of its citizens) or out of self-interest. However, democratic states did typically provide access to standard statistical series and the like, and where they did not, scholars could bring pressure to bear on them. This led to well-understood rules about the common availability of standard data for many research questions and built the foundations for standard academic practices. It was relatively easy for scholars to criticize each other’s work when they were drawing on common sources. This had costs—scholars tended to ask the kinds of questions that readily available data allowed them to ask—but also significant benefits. In particular, it made research more easily reproducible.

We are now moving to a very different world. On the one hand, open data initiatives in government are making more data available than in the past (albeit often without much in the way of background resources or documentation). On the other, for many research purposes, large firms such as Google or Facebook (or even Apple) have much better data than the government. The new universe of private data is reshaping social science research in ways that are still poorly understood. Here are some of the issues that we need to think about:

No common rules of access

There are no common and well-understood rules for external researchers’ access to commercial data. Typically, large firms do not have standard rules providing social scientists with common access to data. Instead, they forge specific relationships with individual researchers, or small groups of researchers, whose work might be valuable to the firm. These relationships are furthermore usually covered by non-disclosure agreements (NDAs) and/or other contractual rules determining the ways in which researchers can use the data and summarize them in published academic research. Sometimes, however, it is possible to get rough and ready access to aggregated data via tools like Google Trends, or data that are made available for other purposes (e.g. to give potential advertisers a sense of the population of potential markets).

Differential access to data may start to have important consequences for the success or failure of research careers, exacerbating inequalities between scholars. Those with access to the right social and research networks, and hence the opportunity to obtain privately held data, may be advantaged over those who lack such access.

Non-transparency of data generation

State-constructed datasets were flawed in many ways in their heyday, and continue to be. However, as collective professional standards improved, the flaws became better understood and more transparent. This is not necessarily so for new forms of data. They are collected primarily for commercial purposes, rather than for research. They are also often collected from services that change significantly over time, using methods (for example, machine-learning techniques) that are often opaque even to their creators. Finally, findings from these data are often deployed on the fly to reshape algorithms and change human behavior (e.g., by making individuals more likely to click on ads). In combination, these factors can make the data very hard to interpret. For example, to what extent might changes in behavior on Facebook be driven by underlying changes in society, and to what extent by changes to Facebook’s algorithms? Except under certain circumstances (e.g., where Facebook runs controlled experiments) it can be very hard to say.

Non-reproducible research

The politics of access have obvious implications for the reproducibility of social science research. If a piece of research is based on data that are not publicly available, it will be hard for others to evaluate it and discover weaknesses in the analysis. NDAs and other agreements may not only prevent researchers from sharing data, but may also hamper them from providing valuable information about how the data were gathered and processed. We may be about to witness a collision between the reproducibility movement, which is gaining ground in the social sciences, and the temptations of new proprietary data, which pull researchers away from reproducibility and toward reliance on unavailable data.

Non-cumulative research

Commercial enterprises have only limited incentives to share data with academic researchers. They have even less incentive to share data with their competitors, since such data can be a source of enormous commercial advantage. This helps reinforce a general fragmentation of knowledge, in which competing firms hold different kinds of data that could illuminate a problem from multiple perspectives. For example, someone studying the relationship between information availability and collective action might want data from both Google and Facebook (as well, perhaps, as third parties, depending on the kinds of collective action they were thinking about). Except under unusual circumstances, it is hard to bring those different kinds of data together.

Selection bias

The final problem is that the kinds of research we carry out are often a product of the data that are available. Businesses’ control of their proprietary data could lead to two kinds of selection bias, one specific and the other general. The first is straightforward–that unflattering findings will not be published. For example, Uber recently funded social scientists to carry out research on whether or not their service was cheaper and faster than standard taxis. The research suggested that Uber was indeed cheaper and faster, but Uber insisted on retaining control over whether or not the results were published. It doesn’t take an especially suspicious mind to hypothesize that Uber might have withheld permission for publication if the results had suggested that its services were worse than taxis. When businesses use proprietary access to data and legal agreements to retain control over publication, they will have an obvious commercial incentive to only allow the publication of material that is flattering to their interests. Over time, this will lead to the skewing of publicly available research.

The second kind of selection bias is more subtle–but also more insidious. As scholars begin to look to private businesses for data, the contours of entire academic fields may become subject to pervasive forms of selection bias, as certain research topics and methods are favored, while others fall by the wayside. As already noted, businesses are rarely willing to encourage research that might hurt their corporate reputation with consumers or governments. This may guide researchers away from entire topics of research.

It is highly unlikely, for example, that Facebook will ever allow its proprietary data to be used for public research on the role that it played in facilitating upheavals during the Arab Spring. This is a crucial question for social science, but Facebook understandably does not want to give non-democratic governments the impression that its network of users might take action to topple them. It may and does, however, allow research on topics that seem more politically anodyne (although even here, it may stumble) or that have the potential of flattering Facebook’s corporate image. While scholars have certainly written on this topic, it has been hard for them to establish authoritatively what happened, because they do not have access to directly relevant data. This in turn makes it harder for them to publish in the top journals of their field, biasing researchers away from these topics, and toward topics where the interests of research and the interests of the businesses holding the key data are more readily compatible.

The fundamental point is not that we are necessarily worse off than we were twenty or thirty years ago. Many forms of behavior that were previously invisible now leave trackable electronic spoor. It is far easier than it used to be to carry out large-scale experiments. This is a potential boon to scholarship, even as it raises complicated ethical questions. It isn’t only large corporations that face the temptation to abuse the possibilities. The Simpsons character Dr. Marvin Monroe harbors the ambition to build a “Monroe Box,” in which he will keep an infant until the age of thirty, subjecting it at random moments to electrocution and showers of icy water, in order to test the hypothesis that it will feel resentment towards its captor. All social scientists have a little Marvin Monroe in their hearts.

However, we do face a more complex politics of information, with fewer tools and rules than we are accustomed to having. Even though some people are beginning to think about these issues, there are few high-level conversations happening. The power of the disciplines and disciplinary associations may be limited, given the incentives of individual researchers and the bargaining power of the large firms that hold the data. However, it may be enough to bring about some change. Businesses too may sometimes benefit from clear rules, norms, and statements of best practice, since they themselves lack clarity on what the appropriate rules are. Scholarly disciplines, which have been thinking through these issues for decades, should at least try to begin the conversation and see where it may go.

A version of this article was later published in the Chronicle of Higher Education.