In December 2014, the Radcliffe Institute for Advanced Study at Harvard brought together 40 scholars from a wide range of disciplines in the social and natural sciences to talk about the future of “big data” in social science research. Major technical advancements have given social scientists access to new forms of data and sophisticated analytical tools, but the full potential of these resources has not yet been realized. The conference, hosted by the Joint Initiative on Neighborhood, Social Organization, and the Future of the City, aimed to share information about promising and successful approaches in order to promote broader participation in the use of big data for social science.
Within and beyond academic institutions, several research centers have emerged that use big data to advance the understanding of social issues. In the world of urban science and urban analytics, people trained in physics, statistics, and computer science, fields whose analytical techniques can handle the vast amounts of data available today, are applying their knowledge to important contemporary urban processes. The work produced in these centers is consistently interesting but also largely unconnected to ongoing academic scholarship in economics, political science, and sociology.
Meanwhile, within sociology, contemporary approaches to the city focus on poverty, neighborhood effects, and unemployment. These are critical issues, but researchers often rely on old methods and fail to take advantage of cutting-edge data and analytic techniques. Big data holds immense promise, but in order to take advantage of it, the technical experts and the subject-matter experts must bring these two distinct fields into conversation.
The conference was organized as a series of panels addressing either empirical applications or models and methods for employing big data. Panelists gave brief presentations followed by a longer discussion with a discussant, the other panelists, and the conference attendees.
Participants adopted a broad definition of big data, encompassing techniques that range from community mapping to machine learning. Panelists presented both methodological innovations and substantive findings, but at the heart of the conversation were the challenges and possibilities of big data. Across the presentations and rich discussions, six key themes emerged: 1) the forms and challenges of data collection; 2) the diversity of data and methods available; 3) emerging analytical techniques; 4) the role of institutions in collecting, housing, and facilitating access to data; 5) the importance of privacy and security; and 6) the promises and challenges of cross-disciplinary collaboration. In this summary, we discuss the themes that emerged from the discussion, highlighting several directions for future work. Many of these themes are also highlighted in the accompanying short papers from approximately ten conference presenters.
1. Data collection: The world is increasingly awash in new forms of data. “Big data” is about both the scale of the information and its increasing diversity as we begin to capture richer behavioral and preference data through online activity. But while these data proliferate, researchers still face traditional challenges in harnessing and managing the data available.
a. Shifting collection methods: One excellent source of data is the major search and social media companies. Several participants presented findings that drew on Google search and social media data. But while researchers often require consistency, corporations are constantly changing their collection methods in order to optimize performance, adjusting their systems in response to users; as one participant put it, “the system under study is reactive.” As a result, one seemingly continuous set of data can in fact be generated using different parameters as companies adjust their algorithms over time. One presenter ran into just this problem, finding that the algorithm underlying “Google Flu” failed to provide accurate predictions of the course of the flu epidemic in part because Google’s search algorithm changed. In such cases, transparency about algorithms and data-generation processes is key to successful research.
b. User self-selection: Another important source of data is user-generated quantitative and qualitative content. Researchers must be cautious, however, as user self-selection can lead to non-random samples and produce concerns about data validity. To take one example, it appears that Twitter users are disproportionately young and affluent compared to non-users.
c. Observed data: Many new sources of data, particularly those that emerge from city agencies, are “observed” rather than designed. These data are collected for administrative purposes and often pose problems of quality, coverage, and bias. The successful use of administrative data is immensely promising, but researchers will likely encounter bumps along the way.
d. Hard-to-access data: Administrative data present administrative challenges as well: organizations’ data are often siloed, locked away in departments that hoard their data and are reluctant to share. Whether the concern is privacy, security, or territoriality, researchers may have to overcome obstacles to gain access.
e. Traditional problems and new tools: In the developing world, collecting traditional data is a difficult process, particularly in slums and other contested communities. Often, mayors are reluctant to acknowledge that these areas exist because it opens them to criticism, such that the data collection itself is politicized. Interactive tools can be used to engage these communities in self-mapping and community-based data collection. While new technologies might be useful in these cases, researchers must be careful to ensure that communities retain control over their data.
2. Diverse methods for employing big data: Big data and emerging analytic techniques provide new methods for gaining insight into social phenomena. Workshop participants presented substantive and methodological findings on how the rise of big data is enabling new discoveries.
a. Google searches can detect racism: One conference participant presented research that found that the incidence of Google searches for racist terms was associated with increased racial discrimination across American cities.
b. Methods from other sciences can be applied to social science data: One study used 311 data in New York City and applied edge detection algorithms to census maps in order to measure the distinctiveness of neighborhood boundaries.
c. A universal understanding of social mobility: Access to data on the complete population allows researchers to develop more specific findings and to follow individuals over time and across generations. Researchers are beginning to use IRS data, which covers the U.S. population and allows for tracing individuals over time, to estimate effects of neighborhoods on child development and intergenerational social mobility.
d. Comparing new and old data: Sometimes the old data collection methods are useful for validating big data or highlighting inconsistencies across methods of collection. One presenter compared 311 data to field surveys of physical and social disorder in order to develop a richer understanding of how 311 reporting aligns with or diverges from actual conditions on the ground.
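To make the edge-detection idea in item 2b more concrete, the following is a minimal sketch, not the presenters' actual method. It assumes that a census variable has already been rasterized onto a regular grid (the `toy_map` array and its values are hypothetical) and uses a standard Sobel filter, chosen here only as one common edge detector. High values in the result mark places where the variable changes sharply between adjacent cells, which is one possible way to quantify how distinct a neighborhood boundary is.

```python
import numpy as np


def sobel_magnitude(grid):
    """Return the Sobel gradient magnitude of a 2-D array.

    Applies the 3x3 Sobel kernels by cross-correlation; the sign
    convention does not matter because only the magnitude is kept.
    """
    grid = np.asarray(grid, dtype=float)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    # Repeat border cells so the map edge is not mistaken for a boundary.
    padded = np.pad(grid, 1, mode="edge")
    rows, cols = grid.shape
    gx = np.zeros_like(grid)
    gy = np.zeros_like(grid)
    for i in range(3):
        for j in range(3):
            window = padded[i:i + rows, j:j + cols]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)


# Hypothetical rasterized census variable: two adjacent "neighborhoods"
# with sharply different values (illustrative numbers, not real data).
toy_map = np.zeros((6, 6))
toy_map[:, 3:] = 10.0

edges = sobel_magnitude(toy_map)
# The largest values sit in the two columns that straddle the boundary.
boundary_columns = np.flatnonzero(edges.max(axis=0) > 0)
```

In real applications census geographies are irregular polygons, so the rasterization step (or a graph-based analogue that compares adjacent tracts) is itself a modeling choice that this sketch leaves out.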
3. Emerging analytical techniques: Big data is both content and method, providing new inputs or variables for existing models while also facilitating the development of new analytical techniques. Workshop participants shared new methods for using data, enabled by the new data collection and analysis tools.
a. Accurately estimating distance: Studies have long relied on average estimates of distance to understand spatial relationships, but people experience distance in terms of traffic, road conditions, and mode of transportation. Measures of travel time are thus being developed by Bing Maps to take into account the multiple dimensions of going from one place to another. These new measures, developed in partnership with Microsoft, will allow researchers to capture the actual time and effort of traveling from a given census block to important resources like banks and food stores.
b. Computational ethnography: With big data, researchers are gaining new insights into behavior. One presenter argued that the ability to watch people navigate online and see how they make decisions is creating new opportunities for understanding decision structures and micro-level processes of evaluation. Whether on dating sites or real estate clearinghouses such as Zillow, researchers can gather and model data on sequential actions and decisions to better understand preferences and the processes that lead to action.
c. Potential to misinterpret online behavior: While these tools allow us to see what people do online, there are also challenges: researchers lack information about the individual user, which limits the kinds of questions that can be answered. Moreover, there is a fundamental ambiguity regarding intentionality: are users actively searching, or passively browsing? Researchers must be wary of these blind spots in order to effectively take advantage of new tools.
d. The role of behavioral models: Whether researchers look at Google search frequencies or 311 complaints, their insights will be limited without a meaningful behavioral model: without a strong model, it would be impossible to know whether people Google racist terms in order to figure out what they mean or because they are racist. Technology companies are beginning to develop methods for resolving this ambiguity using sequential search models.
e. Machine learning: The emerging capacity to learn from data using techniques of “machine learning” is allowing researchers to start answering questions using both new and traditional data sources. Cynthia Rudin, for example, presented the results of a machine learning project focused on detecting crime patterns (described in the accompanying paper). Traditionally, crime analysts would try to determine a crime series by hand. While this labor-intensive process was inefficient for analysis, it ensured that the data were high quality. Now, using machine learning, the researchers could simultaneously identify series along with distinctive characteristics of these series. They were able to generate theories and test them on a sample of more recent data.
4. Data and institutions: Who will collect big data? Who will house it and ensure its security? And who will develop the tools needed to facilitate access to the data for researchers? Big data is growing, but institutions are needed to support its collection, maintenance, and use. Workshop participants discussed the challenges and promises of institutional support for big data social science.
a. The role of cities: Cities have struggled to realize the potential of big data for a number of reasons. One is that the use of big data is generally confined to service delivery optimization, prediction, and anomaly detection, with few public entities digging into the data to generate new ideas. Another is that the questions cities ask of their data differ from those that computer scientists and social scientists would ask. Cities can be rich sources of new data sets, but they are unlikely to serve an agenda-setting role.
b. Universities are centers for collaboration: Universities are playing a new role in building and maintaining system infrastructure for integrating data from different public entities and private sources, which often fail to agree among themselves. Universities can thus be important players in facilitating collaboration and moderating competing interests.
c. Universities as infrastructure providers: In facilitating collection of and access to big data, infrastructure is a critical component. But infrastructure requires major investments, both in up-front capital and in long-term maintenance and support. What role should universities have in establishing and sustaining the infrastructure that supports big data? Participants argued that there is a business case for investing in infrastructure within and between universities, in terms of access to grants and partnerships with the private sector. But collaboration beyond the university is required. An ideal model may be a consortium, leveraging university analytic capabilities, public sector data, and private sector resources.
d. Collaboration and ethics: Managing the relationship among cities, universities, and private entities will require a lot of thought. How do universities ensure that cities refrain from using academic analysis for partisan purposes? How can researchers maintain their independence from major private sponsors? Collaboration is necessary but requires thoughtful planning.
5. Privacy and security: As with all new technologies, privacy and security remain critical issues. Researchers are at the forefront of data collection and analysis, but this means they also must consider new vulnerabilities and invest in protection.
a. Protecting privacy as data get bigger: Researchers can gain important insights by linking datasets, but this also presents potential risks for privacy and security. Meanwhile, there is interest in making data more broadly available in order to facilitate replication of results, raising concerns about the privacy of subjects. Some researchers are increasing transparency while ensuring privacy by sharing their code online and offering to run any code that other researchers request on their secured data.
b. The role of secure centers: Some data are locked up in secure centers: IRS tax data, for instance, will never leave the IRS. But researchers can collaborate to more effectively make use of these data where they are, attaching other datasets by sending data into the host institution and working within their secured centers.
c. National models: How can data be made available in a way that respects the interests of everyone involved? One potential solution is a central warehouse that stores many linked datasets securely and provides one central access point. Scandinavian countries provide a model, with central statistical agencies accessible to researchers.
d. Funding and security levels: Several key issues must be addressed in setting up any secure center. Funding is a crucial concern, as is the appropriate level of security, which can range from technical safeguards to physical safeguards to, at the extreme, an air gap.
6. Cross-disciplinary collaboration: The goal of the workshop was to facilitate cross-disciplinary collaboration. But the distinct fields involved in bringing big data to social science ask different questions and are motivated by different problems, which can produce challenges for collaboration. Participants engaged in frank discussion of where these gaps continue to create obstacles.
a. Differing goals: Fields vary in how they understand the goal of a project and when this goal has been achieved. Computer scientists are often concerned with developing a functional capacity rather than perfecting a technology—for instance, computer vision researchers might consider detecting the gender of a person in video to be resolved if the program succeeds 80% of the time—while social science requires a higher level of accuracy. How can social scientists keep computer scientists engaged beyond the initial level of problem solving?
b. Replication: As data become larger and more sensitive, some researchers are thinking about the shifting norms around replicability. These norms should not be a default outcome; researchers can help to shape replication practices by directly addressing the issues of access. For instance, should authors be responsible for running the tests of colleagues on sensitive data? These questions require more thought.
c. The career problem: Social scientists, statisticians, and computer scientists—among others—have different academic careers and different audiences. To some degree, one must have faith that working together will produce something that is valuable and mutually rewarding. Universities and research organizations need to be able to facilitate open-ended collaboration where professional pressures are aligned.
d. Prediction vs. insight: Some emerging technologies are remarkably effective at predicting outcomes, but the mechanisms that allow for accurate prediction remain a black box to researchers. One key barrier to the use of machine-learning methods is this black box: social scientists—as well as policymakers—often insist that the underlying rationale must be understood, no matter how accurate the predictions may be.
e. Identifying causality, or not: Many computer science methods are not designed to determine causality. In some cases, these methods can be further developed to allow for causal understanding. The question of how and to what extent machine-learning methods can enhance or substitute for experiments in determining causality is still open. But there is also much to be learned about the social world that is non-causal. Social scientists can start to be creative about identifying questions that can be answered with these methods, while also working with computer scientists to develop a shared understanding of why causality is important.
f. The importance of theory: Developing theory to go along with the new methods and data is critical, but is often sidelined. Engineering and control theory (or big data “without theory”) work well when there is a measurable outcome, a simple policy to correct for it, and fast enough reaction time that the correction can be implemented while it is still appropriate. In cities, this is the process used to optimize service delivery. But this approach does not work well for complex social systems with long time horizons.
We invite readers to peruse the accompanying papers to gain further insight on the issues raised here.
This conference was organized by the “Joint Initiative on Neighborhood, Social Organization, and the Future of the City,” funded by the John D. and Catherine T. MacArthur Foundation (Mario L. Small and Robert J. Sampson, Principal Investigators). We thank Alaina Harkness from the MacArthur Foundation for advice and the Radcliffe Institute for hosting the conference. We also thank Laura Adler and Robert Manduca, graduate students in Sociology at Harvard, for taking notes and summarizing the conference.