Parameters

Challenges to Archives in an Age of Digital Abundance

As almost everyone by now is aware, the twenty-first century is the age of data—big, plentiful, and full of threat and promise.

by Paul DiMaggio July 20, 2016

As almost everyone by now is aware, the twenty-first century is the age of data—big, plentiful, and full of threat and promise. In this essay, I shall focus less on the promise (although that is very substantial) and more on the threats, especially to professionals concerned with curating, archiving, and preserving knowledge of the past. By “knowledge” I refer both to the kinds of datasets that social scientists analyze and the kinds of documents with which historical archivists have been concerned, the latter of which increasingly are distributed in digital form. I will discuss challenges that the digital age poses for two types of digital knowledge—social-science data and, more briefly, conventional publications and historical records.

Big Data

Back in 1965, Gordon Moore, a computer scientist at Fairchild Semiconductor, noted that the number of transistors one could fit on a computer chip (a number closely associated with processing speed) had doubled every year to eighteen months for six years, and he predicted that this exponential growth in computing speed and power would continue for some time. Indeed, this occurred for long enough that his prediction became known as Moore’s Law. Whether or not it continues to be literally correct, one thing is certain: technological change has made it possible to generate exponentially more data, to process those data more quickly, and to store them more efficiently, than humans had imagined possible.¹

For social scientists, and for social-science data archives, this profusion of data poses both problems and opportunities. Until recently, quantitative data analysis in the social sciences has depended on survey data, and, in particular, major institutionally supported data collection efforts. Such ongoing studies—General Social Survey, National Election Survey, American Community Survey, Panel Study of Income Dynamics, and more specialized efforts—have accounted for a great deal of the progress that social scientists have made in understanding social processes and informing social policy decisions.Unprecedented quantities of data, almost all in digital form, are being generated as a byproduct of behavior undertaken for purposes other than research. Recently, however, traditional methods of data collection have become more difficult (persons sampled for surveys are far less willing to participate than they once were) and more expensive (a problem compounded by the difficulty of enlisting participants).

At the same time, unprecedented quantities of data, almost all in digital form, are being generated as a byproduct of behavior undertaken for purposes other than research.² All of it falls under what is popularly known as “big data,” but this rubric is misleading if it leads us to conflate types of information, and modes of gathering information, that are significantly different and pose significantly different challenges to researchers and archivists. Although journalists and even scholars often speak of “big data” as if it is one thing, there are at least six categories of big data, which have little in common except that, were it not for high-speed computing and digitization, we would either not have them or not be able to put them to ready use.

Three Well-Behaved Types of Big Data

Three types of data are well-behaved in the sense that there are well-established analytic methods, competent professionals are already archiving them (though access to archives may be complicated), and privacy issues are either nonexistent or tractable. All three are well on their way to being assimilated into the social science toolbox.

Analysis of large datasets gathered by public agencies in the course of their work and merger of such data with other kinds of information. The administrative databases of governments (e.g., records of such agencies as the IRS, Department of Labor, Department of Commerce, or Census Bureau) are immensely valuable not only for the data they contain but, more importantly, for the scale at which they assemble it. “Big” in this sense means comprehensive, so that researchers need worry less about sample bias, and immense, which means that researchers can study the behavior of subgroups that never show up in sufficient numbers in even large sample surveys. Use of these data raises real issues: it can be difficult to ensure respondent privacy, and the measures one takes to do so limit replicability (especially when analysts merge data from several sources). Nonetheless, data quality is ordinarily high, and an increasing number of social scientists have developed collaborations with agencies that permit them to work with these data productively, in some cases merging data from different agencies or from government and private-sector entities. This is a frontier, but one that social scientists are penetrating with considerable success.³

Textual data. Another area in which there have been dramatic advances and where social-science uses are become routine is in the analysis of large textual datasets, stimulated by the rapid emergence and diffusion of machine-learning approaches to identifying latent themes in large corpora.⁴ Often researchers acquire such data from public archives (e.g., newspaper and periodical series, collections of books, court records, patents, or legislative documents), enabling researchers to work with entire populations of texts and making replication of well-documented studies simple.⁵

Online experiments. Online experiments employ a well-established method at a grander scale than has been possible before. I treat them here as a species of “big data,” because they are a product of the digital age (specifically, of the Internet) and because the data they generate are plentiful in comparison to data generated by conventional lab or lab-in-field experiments. Indeed, scale—temporal and numerical—is the raison d’être for online experiments: one can include an order of magnitude more subjects than in a lab experiment and (with appropriate inducements) keep them at it for a much longer time. More subjects make it possible to test many more experimental conditions (and thus more theories) than is possible in the lab, whereas more time enables one to produce more complex interventions and track change over a longer duration.⁶ And these substantial virtues come without any special obstacles to data archiving.

Out on the Big Data Frontier

Research with administrative data, textual data, and data from online experiments is becoming, if not routine, at least well understood and ever more frequent. One can predict with confidence that these research genres will take their place in the social scientist’s standard-issue toolkit. The challenges they present to digital archivists, while in a few cases significant, will be familiar and resolvable. By contrast, the next three types of big data are far less well understood and will pose much greater challenges to digital archivists.

Data Generated Online by Social Media Users. When users post to Twitter, change their status on Facebook, or send photos by Instagram, they make public declarations (at least to their chosen correspondents) that are recorded by media companies and, in many cases, may be accessible to researchers. Social scientists are making use of these databases, and some have done so ingeniously. But it will be difficult to make effective use of such found data to generalize beyond the people who produce it until we better understand the generative processes behind these data. Take, for example, Twitter, which is perhaps the most public of social media and also one that (because of the public quality of most tweets) has been most accessible to researchers. We know that, as a group, users of social media like Twitter are far from being a cross-section of the public.⁷ We also know that a disproportionate share of tweets is produced by a small percentage of users, who are even more unusual than their less warblesome peers.⁸ And when we go from aggregate to specifics, we know even less about what makes a particular person tweet (or not) in response to one event, but remain silent in the wake of another. Despite some nascent efforts to match user identities to such public data sources as voter rolls or to predict demographics from following patterns, we know little about specific people who use social media for specific purposes.⁹ But we do know that even active tweeters behave in different ways: for example, depending on the topic and the stimulus, tweets may trend well to the left or to the right of public opinion.¹⁰ In short, until we understand the dynamics that determine, first, who signs up for social media services, second, who actually uses them (and how much they do), third, why different people choose to use them to communicate about different kinds of events, and, finally, the communications ecology within which they play a role (the division of labor among social media and their connection to more old-fashioned forms of communication like talking face to face), they provide a weak basis for generalization. Moreover, access to social media data must often be negotiated with providers that may be unwilling to allow data to be archived for the use of other researchers after its use. Until such problems are overcome, social media data might be considered an attractive nuisance. ¹¹

Social science analyses of proprietary databases. Online proprietary databases maintained by corporate entities called aggregators hold vast amounts of information about the online behavior of persons across many sites, including search behavior and websites visited, e-mails sent, and so on.¹²Until such problems are overcome, social media data might be considered an attractive nuisance. Credit card companies have detailed information about your purchases (which include not just what you spend money on, but where and when you spend it). Your supermarket knows what consumer products you buy (at least when you buy them at the usual spot), your music service knows about your real-time listening behavior, and Uber may know a lot about your travel patterns. Cell phone providers know not only the numbers you have called but also where you were when you called them. The electric company knows how much energy you used and when you used it. Your Fitbit and similar devices collect and transmit time-stamped data on your weight, quality, steps taken, and even sexual activity to cloud-based servers. Such data, which are collected continuously on many millions of people, have tremendous promise for social science, but also raise significant issue of privacy and, potentially, liability for the companies involved. Companies often go to great pains to avoid using all the personal information they possess, lest the “creep factor” (an industry term) drive away traffic. And, as we have seen during some well-publicized hacking incidents, release of private data can be harmful to users. Given the commercial value of the data they possess and the risks they face in sharing it, companies are understandably reluctant to collaborate with researchers. There have been some exceptions: some social scientists have used extensive information from intranet exchanges among employees within a particular company, limited data on web searches, information about credit histories, cell phone data, and complete data on musical and financial online communities.¹³ But we have barely scratched the surface in developing modes of collaboration that protect the interests of all involved, while at the same time generating usable data for social scientists. Archiving for replication is at present but a distant dream.

Surveillance data on the urban environment. Many authors have written with justifiable apprehension about the threats of the “surveillance society.” At the same time, all this surveillance is generating a lot of data that could be helpful in understanding human behavior. By “surveillance data” I refer to information collected on places rather than individuals, though in some cases a thin line lies between them. Such data is often place-specific and is almost always stamped for time and place. To take just one example, surveillance cameras in urban neighborhoods routinely collect information about how many people are on the street, whether they linger or pass through quickly, in what kinds of groups (with some possibility of classification through machine learning by age, gender, and race), and, to some extent, what they are doing, as well as changes to the physical environment that may occur for whatever reason. Students of neighborhood effects have gathered less complete versions of such data at great expense, but have rarely been able to study the temporality of street life or over-time trends with great precision. Is there any way to make such data available in a form that would be useful to social scientists while protecting the privacy of persons in the community? Similarly, traffic sensors capture the intensity of vehicle traffic throughout urban areas and on many highways. Moreover, some police departments have adopted sophisticated approaches to integrating information from multiple systems, producing datasets that could be useful to social scientists.¹⁴ In addition to always-on data sensors, administrative data that track behavior of service providers in urban systems (e.g., information on taxi fares and routes collected by regulatory agencies) have also been applied to social-scientific questions.Ironically, the more dystopic the scenario for humanity, the more utopic the prospects for social science.¹⁵ If the “smart cities” movement takes off, a vast array of new data on the urban environment will follow: trash bins with censors will let sanitation departments plan their routes (and also record how much trash residents are generating), streetlights with sensors will control energy use (and record levels of traffic on particular blocks), even park benches may be equipped with sound sensors (ostensibly to monitor noise levels, but can voice recordings be far behind?). Ironically, the more dystopic the scenario for humanity, the more utopic the prospects for social science. (But will a society that tolerates the level of invasiveness and control that new technologies permit also tolerate critical scholarly inquiry? Time will tell.)

In any case, such systems pose more mundane challenges for data archivists. For one thing, much of the information is in video and may soon be in audio form. Even where privacy is not an issue (and in some ways anonymization will be easier with data from environmental probes than with data collected on individuals), what exactly should archives contain? If, for example, I use video data to measure the density and pace of movement, and basic demographic characteristics of people moving, across city blocks, should my raw data be archived (with identifying features of particular people obscured), or only the data as I have coded it?

* * *

Social media, data recorded as a byproduct of everyday activities, and environmental surveillance data are the frontiers of “big data.” Social scientists have just begun to address the possibilities, and both methodological and practical challenges are formidable, but if researchers solve challenges of data management, access, and ethics, archiving will require innovation in stewardship of both digital documents and the interests of those whose activities generate them.

The Integrity and Security of Digital Archives

Thus far, I have been concerned with data archives holding the raw material out of which knowledge is constructed. In this section, I turn to archives that contain more conventional documents, but in digital form. Digitization entails challenges to both the integrity and security of archives.

The most serious problem has to do with the nature of publication and of reading in a digital world. By all accounts, Americans and Europeans read today as much as they ever did. But much more of that reading takes place on computer screens—news stories on pages that update every hour, blogs, specialized websites with particular types of information, daily feeds of stories from the digital desks of web magazines, and so on. People still read books, of course—bound books seem to be becoming the new vinyl for younger readers—and libraries still collect them. But much of what used to reside in microfilm collections of newspapers or bound collections of periodicals now appears online, changes from moment to moment, and may not be collected or recorded by anyone. The situation is not entirely unprecedented: during the 1960s, for example, the underground press and social movement groups produced much significant material outside of scope of regularly archived publications. But despite some heroic efforts, archivists as a whole under-collected such material to the detriment of future generations. The same is true of the flourishing of zines during the heyday of punk rock. Before the Internet, significant writing outside normally collected channels tended to burst out during periods of social, political, or artistic ferment. But in the Internet age, the diversity and evanescence of significant writing, and the challenge of capturing what people are indeed reading, has become a permanent challenge.¹⁶

But digital archiving may be problematic even for conventional media. Take, for example, digital archives of newspapers, which have for the most part replaced microfiche archives, which in turn replaced paper archives and the “morgues” (categorized clipping files for the use of reporters but often opened to legitimate researchers) that existed through the 1990s. Physical archives and microfiche provided either the original documents or photographic facsimiles thereof, ensuring that the historical record is complete, or that omissions are visible (because interruptions in a newspaper series are evident and scissors leave physical evidence of their use). Online archives are a great improvement because they are readily searchable, so that work that might have taken years of person-hours can be accomplished in a few hours. But online archives are also vulnerable to the removal of documents in ways that are impossible to detect. For example, as a result of the New York Times v. Tasini decision (2001), which gave freelance writers digital rights to their work in cases where their contract did not transfer those rights explicitly, newspaper publishers and aggregators removed many articles by freelancers from digital archives.¹⁷ Not only did they fail to indicate the removal of an article (e.g., by including its title in search results with a note that it was not available), but I can report, based on personal experience, that they used differing criteria for deleting articles and were reluctant to report what those criteria were.¹⁸

Newspapers that maintain their own archives and even aggregators that supply such archives to academic libraries are at least supposed to follow best practices in archiving—don’t remove materials and, if you must, indicate which materials you have removed. But commercial entities are not required to adhere to such standards, and often they do not. This is nothing new, of course: many important archives were maintained by nonprofessionals for years before finding their way to professional archivists, and nonprofessionals often consider it their responsibility to eliminate files that might reflect badly on the leadership of a company or nonprofit organization.

But as more and more information finds its way to the Internet, the function of archiving past publications has changed with the nature of publication itself, so that much of the archiving function is distributed and, in effect, entrusted to people who maintain web servers and companies that maintain search engines. Some challenges are built into search engine algorithms: most searches turn up more “hits” than any one or two researchers can pursue, so that the order in which Google’s algorithm returns hits has significant consequences for what is de facto accessible and what is not.

To these problems have been added legal threats. One such family of threats, well documented by Wendy Seltzer, is associated with the “safe harbor” clause of the Digital Millenium Copyright Act (the major piece of US copyright legislation), which requires Internet service providers, including all websites, to take down any link on a website upon receipt of a formal complaint alleging that the website contains material protected by copyright. Although these provisions have been used legitimately to require websites to take down copyrighted materials, they have also been abused by IP holders whose definition of their rights is more expansive than courts would uphold, and by private interests (e.g., political operatives attempting to suppress information on the eve of an election) with no legitimate IP claims at all.¹⁹ To be sure, website operators subject to DMCA takedown notices have the right to respond. By the time a web page is reposted in response to an appeal, however, it may be no longer timely; when a small operator is up against a corporation with deep pockets, even takedowns without merit may be difficult to reverse. An even more radical bill—the Stop Online Piracy Act—appeared headed for passage in 2012 until an unprecedented public response (including a 7,000-website daylong strike) led to its defeat. SOPA would have required takedowns of entire websites, not simply offending pages, as well as centralized blocking of IP addresses, without due process based on IP-holder complaints.²⁰

Another threat to the archiving function online comes from court decisions upholding, under the broader class of privacy rights, a “right to be forgotten,” that permits persons to demand that search engines block access to websites containing discrediting information (in some cases, even if that information is accurate). Such a right has been established in the European Union (both through European Commission directive and through case law in several European countries) and has been proposed elsewhere (successfully, in Argentina). Under such law, individuals may appeal to websites (e.g., Wikipedia) or search engines to remove information that, for example, publicizes a criminal history or personal scandal, and to bring suit if that request is refused.²¹ In the first several years, Google (the one company that publicized requests and responses) received hundreds of takedown requests from citizens of the European Union and responded positively to many of them (while denying many others). One influential example from case law: a Spanish businessman who had declared bankruptcy in the late 1990s demanded that Google suppress search results leading to information about his past insolvency. The EU court ruled against Google (which had turned down the claim) on the grounds that the man’s personal data was “inadequate, irrelevant, or no longer relevant…” Within five months after the decision Google received 143,000 de-indexation requests asking it to take down almost half a million links. On April 14, 2016, the European Parliament passed a new General Data Protection Regulation that strengthened the “right to be forgotten” yet further (putting European law even more at variance with US law on protected speech).²²

An equally serious threat, about which I have found little in print, involves the security of information within digital archives of historical source materials. Digitization of such materials has many advantages (especially if archivists can figure out how to keep files up to date through cycles of technological change), including easy access by researchers anywhere in the world and ready searchability. But it has become evident that even government agencies and private corporations with sophisticated security consultants are vulnerable to incursions by even more sophisticated hackers. Most university and nonprofit archives, I suspect, cannot compete with Sony, the Iranian Nuclear Agency, or the New York Fed (three notable victims of hacker attacks) on Internet security. I assume that some archivists are thinking about this, but an admittedly cursory online search found only a few Digital Humanities courses entitled “Hacking the Archive,” all of which used “Hacking” as a benign synonym for gaining legal online entry to sources one is entitled to enter. Without sustained attention to this issue, the prospect of a motivated attacker—imagine, e.g., Stalin, Putin, North Korea, or the Nixon White House—literally changing the historical record by gaining unauthorized entry to digital archives and editing digital documents seems like a real concern. Wikipedia—in effect a public archive—may present a kind of model in that behind the archive of information is an archive recording every change in that information. Some such automatic recording of changes might be applied to digital archives as a routine practice (though even such a system could be easy for a sophisticated assailant to work around).

* * *

Clearly, this is an exciting time to be an archivist. As we find ourselves in a new world in which more information goes online each week than appeared in print for centuries after the introduction of the printing press, the work of the archivist has become even more complex, and the contributions of archivists to scholarship in the social sciences and humanities has become even more indispensable. What to keep, what to discard, how to respect the privacy of individuals, how to maintain the integrity of archival collections—all of these are issues that archivists have dealt with for decades. But when the flow of information is measured in zettabytes, scaling up established routines is unlikely to be an option. Those of us who depend on archivists and information scientists to ensure the availability of the data and documents we need must look on with appreciation, apprehension, and hope.

Photo credit: iStock.com/ViewApart

Paul DiMaggio

Paul DiMaggio is Professor of Sociology at New York University, where he is studying applications of computational text analysis to the study of cultural change. He is also A. Barton Hepburn Professor Emeritus of Sociology and Public Affairs at Princeton University, where he taught from 1992-2016, served as Chair and Director of Graduate Studies in the Sociology Department, and headed the Center for the Study of Social Organization. From 1979 to 1992, he taught at Yale University, where he directed the Program on Non-Profit Organizations. DiMaggio, who has written widely on culture, organizational analysis, and economic sociology, has been a... Read more

Previous article In Defense of Poverty

Next article The Effect of Sponsorship upon Social Science Research

Challenges to Archives in an Age of Digital Abundance

Big Data

Three Well-Behaved Types of Big Data

Out on the Big Data Frontier

The Integrity and Security of Digital Archives

Paul DiMaggio

You may also like

Smart Cities: What Do We Need to Know to Plan and Design Them Better?

The Importance of Critical Data Analysis for the Social Sciences

Reshaping Tenure & Promotion into a Catalyst for Innovative & Invigorating Scholarship

Interview: Sound Recording, Oral Positionality, and Audio as Ethnographic Object