Current social science research and writing face a number of possibilities that seem to be constrained by three major challenges. The first is the limits of the imagination; the second is knowing what kinds of data are now out there; and the third is having the tools to aggregate and mine them.
Extend this beyond the act of thinking about the publication of the work to doing the research itself, that is, to almost any other question in any social science field. Now that we are citizens of the Internet, there are sensors everywhere: traffic sensors, security footage, digital tracks that we strew all over. These digital traces take many forms: there are records that are being kept, sometimes passively, sometimes actively, sometimes curated, sometimes not; there are tracks of data that we are all leaving and have been leaving for at least the past two decades that could answer questions, or pose interesting questions to ask, that require the active stirring of human curiosity to imagine. Add to that the text and data mining of enormous collections of literary texts and the digitizing of earlier analog data sources, and the possibilities within which to apply our cognitive skills grow further.
The second challenge is tied to the first: for that curiosity to be activated, one would need to have a sense of what data are actually out there and therefore what one would use and what one would need to gather anew. What do we know about real-time traffic patterns, or the movements of people from one domicile to another, or the economic transactions of a certain group of people over time? We know that there are certain kinds of archives out there—like the collection of tweets at the Library of Congress or the Internet Archive of the web—but how many social scientists are aware of the work of the New York firm Sparks and Honey that has been tracking trends across the planet in ways that are crucial to corporations but could also be the most valuable kind of research data archive for any number of issues of immense interest to various social science fields? They claim that they work in a space of “people and platforms, man and machine, ideas and algorithms, magic and math.” The data they collect are curated, carefully housed, and searchable in myriad ways. And what about those tech companies who know what we read and when we read it and how much of what we are reading we actually page through, what we buy and how much it costs, where we are at a certain moment in time, and can sort those data in categories that might not yet have been imagined by social scientists?
The third challenge (let us posit that we can imagine what it is that we want to ask and that we can start to get a sense of what possible data are out there; I am more sanguine about the first than the second) is to get access to the data and the permission to publish from them. Some privileged researchers can get to Google searches at a more finely grained level, though how finely grained is likely not known fully by the researchers themselves, because they may not know how much is really available or what portion they are being allowed to see. Then there are the kinds of issues that may strike terror into an institutional review board: Can one conceivably use the kinds of information that have been collected in ways that would pass muster through the traditional process? What if we are dealing with materials across national borders with different legal privacy regimes? What private agreements have corporations that collect data as a matter of course made in different countries over time, and how is that reflected in the data that are extant?
In the face of these three challenges, we continue to do research about questions that we pose and write up the results. One way that some of us deal with these issues is to delimit our work so that we do not wander into areas that require us to think about all of the sources we might imagine, confining our questions to the worlds of research with which we are most comfortable while avoiding privacy issues as far as we can. But I suspect that none of us is willing to constrain ourselves in the long run to tools of the trade that are losing their finely honed edges. Learning what is out there is becoming more and more of a profession in itself: the data curator and the data scientist are two of our newest job titles, and the holders of these positions are now working in research libraries and research universities in all fields of the social sciences. The Alfred P. Sloan Foundation has been one of the leaders in this area, developing the careers of recent PhDs in conjunction with the Council on Library and Information Resources (CLIR) by placing postdocs in libraries. Working with the Moore Foundation, Sloan is funding data scientists at the University of Washington, at Berkeley, and at New York University. So finding a colleague in one’s field who is a data curator or a data scientist when one is posing a research question would be a good way to start. The murky legal terrain of privacy and of copyright across national borders is a much more complicated issue, and the fear of putting people’s lives in danger, because multiple sources could be triangulated to uncover the identity of a supposedly anonymous person, makes it very difficult to rely upon older methods of protecting one’s informants.
Let me end on another note, not a fourth challenge to the social sciences per se, but to the entire world of scholarship. We are increasingly in a position where the basic way of interacting with words, numbers, images, and data of all sorts is no longer possible without machine intervention. We read distantly, we search electronically, we collect and file digitally, we write on a machine. It seems to me that the one tool that once remained within our own individual control was the act of reading and what followed from that: thinking about and writing about what we read. Our only technical aids were perhaps a pair of glasses and the organizational skills of those who collected and curated what was published, making it available to us. When reading becomes distant, when interaction with data becomes entirely electronic, and when the searching for materials is mediated by software, the scholar cannot in good faith rely on algorithms into whose operation the scholar has had no input, or searches that are mediated by a third party, or software that makes assumptions of which the researcher is unaware. Building our own tools, understanding how the tools that others build operate, and knowing that we are working with data in a way that we understand and can rely upon might be the greatest challenge of the years ahead. Searching for and finding everything relevant on a certain topic, the work that used to be dominated by the values of the library and university community, is key to that challenge: finding a way to take that work back into the hands of those whose commitment is to the stewardship of information on and knowledge of our cultural heritage should be central to the work of scholarship.