As almost everyone by now is aware, the twenty-first century is the age of data—big, plentiful, and full of threat and promise. In this essay, I shall focus less on the promise (although that is very substantial) and more on the threats, especially to professionals concerned with curating, archiving, and preserving knowledge of the past. By “knowledge” I refer both to the kinds of datasets that social scientists analyze and the kinds of documents with which historical archivists have been concerned, the latter of which increasingly are distributed in digital form. I will discuss challenges that the digital age poses for two types of digital knowledge—social-science data and, more briefly, conventional publications and historical records.

Big Data

Back in 1965, Gordon Moore, a computer scientist at Fairchild Semiconductor, noted that the number of transistors one could fit on a computer chip (a number closely associated with processing speed) had doubled every year to eighteen months for six years, and he predicted that this exponential growth in computing speed and power would continue for some time. Indeed, this occurred for long enough that his prediction became known as Moore’s Law. Whether or not it continues to be literally correct, one thing is certain: technological change has made it possible to generate exponentially more data, to process those data more quickly, and to store them more efficiently, than humans had imagined possible.1 On Moore’s Law, see Ilkka Tuomi, “The Lives and Death of Moore’s Law,” First Monday 7, no. 11 (Nov. 2002); on the range of technologies responsible for the expansion of computing power, see Gill Pratt, “Is a Cambrian Explosion Coming for Robotics?” J. Econ. Perspectives 29 (2015): 51–60.

For social scientists, and for social-science data archives, this profusion of data poses both problems and opportunities. Until recently, quantitative data analysis in the social sciences has depended on survey data, and, in particular, major institutionally supported data collection efforts. Such ongoing studies—General Social Survey, National Election Survey, American Community Survey, Panel Study of Income Dynamics, and more specialized efforts—have accounted for a great deal of the progress that social scientists have made in understanding social processes and informing social policy decisions. Recently, however, traditional methods of data collection have become more difficult (persons sampled for surveys are far less willing to participate than they once were) and more expensive (a problem compounded by the difficulty of enlisting participants).

At the same time, unprecedented quantities of data, almost all in digital form, are being generated as a byproduct of behavior undertaken for purposes other than research.2 In thinking about these issues, I have benefited immeasurably from conversations with colleagues at Princeton’s Center for Information Technology Policy, with members of the Russell Sage Foundation’s Working Group on Computational Data Analysis, and, especially, with Matt Salganik, whose definitive study of many of these issues will be published soon by Princeton University Press. All of it falls under what is popularly known as “big data,” but this rubric is misleading if it leads us to conflate types of information, and modes of gathering information, that are significantly different and pose significantly different challenges to researchers and archivists. Although journalists and even scholars often speak of “big data” as if it were one thing, there are at least six categories of big data, which have little in common except that, were it not for high-speed computing and digitization, we would either not have them or not be able to put them to ready use.

Three Well-Behaved Types of Big Data

Three types of data are well-behaved in the sense that there are well-established analytic methods, competent professionals are already archiving them (though access to archives may be complicated), and privacy issues are either nonexistent or tractable. All three are well on their way to being assimilated into the social science toolbox.

Analysis of large datasets gathered by public agencies in the course of their work and merger of such data with other kinds of information. The administrative databases of governments (e.g., records of such agencies as the IRS, Department of Labor, Department of Commerce, or Census Bureau) are immensely valuable not only for the data they contain but, more importantly, for the scale at which they assemble it. “Big” in this sense means comprehensive, so that researchers need worry less about sample bias, and immense, which means that researchers can study the behavior of subgroups that never show up in sufficient numbers in even large sample surveys. Use of these data raises real issues: it can be difficult to ensure respondent privacy, and the measures one takes to do so limit replicability (especially when analysts merge data from several sources). Nonetheless, data quality is ordinarily high, and an increasing number of social scientists have developed collaborations with agencies that permit them to work with these data productively, in some cases merging data from different agencies or from government and private-sector entities. This is a frontier, but one that social scientists are penetrating with considerable success.3 Health care systems researchers and researchers in educational policy, poverty, and inequality have been using such data for some time. See, for example, Taryn W. Morrissey, Don Oellerich, Erica Meade, Jeffrey Simms, and Ann Stock, “Neighborhood Poverty and Children’s Food Insecurity,” Children and Youth Services Review 66 (2016): 85–93; Greg Sacks, Elise Lawson, Aaron Dawes, et al., “Variation in Hospital Use of Postacute Care After Surgery and the Association with Care Quality,” Medical Care 54 (2016): 172–79. For an excellent discussion of technical issues in linking administrative datasets, see David S. Johnson, Catherine Massey, and Amy O’Hara, “The Opportunities and Challenges of Using Administrative Data Linkages to Evaluate Mobility,” Annals of the American Academy of Political and Social Science 657 (2015): 247–64.

Textual data. Another area in which there have been dramatic advances and where social-science uses are becoming routine is the analysis of large textual datasets, stimulated by the rapid emergence and diffusion of machine-learning approaches to identifying latent themes in large corpora.4 Paul DiMaggio, Manish Nag, and David Blei, “Exploring Affinities Between Topic Modeling and the Sociological Perspective on Culture: Applications to Newspaper Coverage of U.S. Government Arts Funding,” Poetics 41 (2013): 570–606. Often researchers acquire such data from public archives (e.g., newspaper and periodical series, collections of books, court records, patents, or legislative documents), enabling them to work with entire populations of texts and making replication of well-documented studies simple.5 Christopher Bail, “The Fringe Effect: Civil Society Organizations and the Evolution of Media Discourse about Islam Since the September 11th Attacks,” American Sociological Review 77 (2012): 855–79; Sara Klingenstein, Tim Hitchcock, and Simon DeDeo, “The Civilizing Process in London’s Old Bailey,” Proceedings of the National Academy of Sciences 111 (2014): 9419–9424; Sarah Kaplan and Keyvan Vakili, “The Double-Edged Sword of Recombination in Breakthrough Innovation,” Strategic Management Journal 36 (2015): 1435–57; K. M. Quinn, B. L. Monroe, M. Colaresi, M. H. Crespin, and D. R. Radev, “How to Analyze Political Attention with Minimal Assumptions and Costs,” American Journal of Political Science 54 (2010): 209–228.

Online experiments. Online experiments employ a well-established method at a grander scale than has been possible before. I treat them here as a species of “big data,” because they are a product of the digital age (specifically, of the Internet) and because the data they generate are plentiful in comparison to data generated by conventional lab or lab-in-field experiments. Indeed, scale—temporal and numerical—is the raison d’être for online experiments: one can include an order of magnitude more subjects than in a lab experiment and (with appropriate inducements) keep them at it for a much longer time. More subjects make it possible to test many more experimental conditions (and thus more theories) than is possible in the lab, whereas more time enables one to produce more complex interventions and track change over a longer duration.6 Matthew Salganik, Peter Dodds, and Duncan Watts, “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market,” Science 311 (2006): 854–56; Damon Centola, “The Spread of Behavior in an Online Social Network Experiment,” Science 329 (2010): 1194–97. And these substantial virtues come without any special obstacles to data archiving.

Out on the Big Data Frontier

Research with administrative data, textual data, and data from online experiments is becoming, if not routine, at least well understood and ever more frequent. One can predict with confidence that these research genres will take their place in the social scientist’s standard-issue toolkit. The challenges they present to digital archivists, while in a few cases significant, will be familiar and resolvable. By contrast, the next three types of big data are far less well understood and will pose much greater challenges to digital archivists.

Data Generated Online by Social Media Users. When users post to Twitter, change their status on Facebook, or send photos by Instagram, they make public declarations (at least to their chosen correspondents) that are recorded by media companies and, in many cases, may be accessible to researchers. Social scientists are making use of these databases, and some have done so ingeniously. But it will be difficult to make effective use of such found data to generalize beyond the people who produce it until we better understand the generative processes behind these data. Take, for example, Twitter, which is perhaps the most public of social media and also one that (because of the public quality of most tweets) has been most accessible to researchers. We know that, as a group, users of social media like Twitter are far from being a cross-section of the public.7 Andrew Perrin, “Social Media Usage: 2005–2015,” Pew Research Center: Internet, Science and Technology Report, October 8, 2015. We also know that a disproportionate share of tweets is produced by a small percentage of users, who are even more unusual than their less warblesome peers.8 Kristina Lerman and Rumi Ghosh, “Information Contagion: An Empirical Study of the Spread of News on Digg and Twitter Social Networks,” Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, 2010. And when we go from aggregate to specifics, we know even less about what makes a particular person tweet (or not) in response to one event, but remain silent in the wake of another. Despite some nascent efforts to match user identities to such public data sources as voter rolls or to predict demographics from following patterns, we know little about specific people who use social media for specific purposes.9 Aron Culotta, Nirmal Kumar Ravi, and Jennifer Cutler, “Predicting the Demographics of Twitter Users from Website Traffic Data,” Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. But we do know that even active tweeters behave in different ways: for example, depending on the topic and the stimulus, tweets may trend well to the left or to the right of public opinion.10 Amy Mitchell and Paul Hitlin, “Twitter Reaction to Events Often at Odds with Overall Public Opinion,” Pew Research Center Report, March 4, 2013. In short, until we understand the dynamics that determine, first, who signs up for social media services, second, who actually uses them (and how much they do), third, why different people choose to use them to communicate about different kinds of events, and, finally, the communications ecology within which they play a role (the division of labor among social media and their connection to more old-fashioned forms of communication like talking face to face), social media data provide a weak basis for generalization. Moreover, access to social media data must often be negotiated with providers that may be unwilling to allow data to be archived for the use of other researchers after its use. Until such problems are overcome, social media data might be considered an attractive nuisance.11 There are two exceptions to this gloomy assessment. Where researchers collaborate with online companies to conduct well-designed behavioral experiments, the results can be valuable. See, e.g., Robert Bond, Chris Fariss, Jason Jones, et al., “A 61-Million-Person Experiment in Social Influence and Political Mobilization,” Nature 489 (2012): 295–99; or Michael Restivo and Arnout Van de Rijt, “An Experimental Study of Informal Rewards in Peer Production,” PLOS One, March 29, 2012. Such studies share the virtues of other online experiments, though they are less likely to generate archived data and bear other risks, such as negative reactions from users or observers in politics or the press.

Social science analyses of proprietary databases. Online proprietary databases maintained by corporate entities called aggregators hold vast amounts of information about the online behavior of persons across many sites, including search behavior and websites visited, e-mails sent, and so on.12 Balachander Krishnamurthy and Craig Wills, “Characterizing Privacy in Online Social Networks,” Proceedings of ACM SIGCOMM Workshop on Online Social Networks, 2008; Gina Marie Stevens, “Data Brokers: Background and Industry Overview,” Congressional Research Service Report RS22137, 2007; Hal Abelson, Ken Ledeen, and Harry Lewis, “Naked in the Sunlight: Privacy Lost, Privacy Abandoned,” chapter 2 (pp. 19–72) in Blown to Bits: Your Life, Liberty and Happiness after the Digital Explosion, Boston: Addison-Wesley, 2008. Credit card companies have detailed information about your purchases (which include not just what you spend money on, but where and when you spend it). Your supermarket knows what consumer products you buy (at least when you buy them at the usual spot), your music service knows about your real-time listening behavior, and Uber may know a lot about your travel patterns. Cell phone providers know not only the numbers you have called but also where you were when you called them. The electric company knows how much energy you used and when you used it. Your Fitbit and similar devices collect and transmit time-stamped data on your weight, sleep quality, steps taken, and even sexual activity to cloud-based servers. Such data, which are collected continuously on many millions of people, have tremendous promise for social science, but also raise significant issues of privacy and, potentially, liability for the companies involved. Companies often go to great pains to avoid using all the personal information they possess, lest the “creep factor” (an industry term) drive away traffic. And, as we have seen during some well-publicized hacking incidents, release of private data can be harmful to users. Given the commercial value of the data they possess and the risks they face in sharing it, companies are understandably reluctant to collaborate with researchers. There have been some exceptions: some social scientists have used extensive information from intranet exchanges among employees within a particular company, limited data on web searches, information about credit histories, cell phone data, and complete data on musical and financial online communities.13 Amir Goldberg, Sameer B. Srivastava, V. Govind Manian, et al., “Fitting in or Standing out? The Tradeoffs of Structural and Cultural Embeddedness,” Stanford University Graduate School of Business Research Paper No. 15-31, 2015; Victor Stango and Jonathan Zinman, “Borrowing High Versus Borrowing Higher: Price Dispersion and Shopping Behavior in the U.S. Credit Card Market,” Review of Financial Studies, published online, 2015; Pierre Deville, Chaoming Song, Nathan Eagle, et al., “Scaling Identity Connects Human Mobility and Social Interactions,” Proceedings of the National Academy of Sciences, PNAS Early Edition, pnas.1525443113; Amir Goldberg, Where Do Social Categories Come From? A Comparative Analysis of Online Interaction and Categorical Emergence in Music and Finance, PhD Dissertation, Princeton University, 2012. But we have barely scratched the surface in developing modes of collaboration that protect the interests of all involved, while at the same time generating usable data for social scientists. Archiving for replication is at present but a distant dream.

Surveillance data on the urban environment. Many authors have written with justifiable apprehension about the threats of the “surveillance society.” At the same time, all this surveillance is generating a lot of data that could be helpful in understanding human behavior. By “surveillance data” I refer to information collected on places rather than individuals, though in some cases a thin line lies between them. Such data are often place-specific and are almost always stamped for time and place. To take just one example, surveillance cameras in urban neighborhoods routinely collect information about how many people are on the street, whether they linger or pass through quickly, in what kinds of groups (with some possibility of classification through machine learning by age, gender, and race), and, to some extent, what they are doing, as well as changes to the physical environment that may occur for whatever reason. Students of neighborhood effects have gathered less complete versions of such data at great expense, but have rarely been able to study the temporality of street life or over-time trends with great precision. Is there any way to make such data available in a form that would be useful to social scientists while protecting the privacy of persons in the community? Similarly, traffic sensors capture the intensity of vehicle traffic throughout urban areas and on many highways. Moreover, some police departments have adopted sophisticated approaches to integrating information from multiple systems, producing datasets that could be useful to social scientists.14 Sarah Brayne, Stratified Surveillance: Policing in the Age of Big Data, PhD Dissertation, Princeton University, 2015. In addition to always-on data sensors, administrative data that track behavior of service providers in urban systems (e.g., information on taxi fares and routes collected by regulatory agencies) have also been applied to social-scientific questions.15 Henry S. Farber, “Reference-Dependent Preferences and Labor Supply: The Case of New York City Taxi Drivers,” American Economic Review 98 (2008): 1069–82. If the “smart cities” movement takes off, a vast array of new data on the urban environment will follow: trash bins with sensors will let sanitation departments plan their routes (and also record how much trash residents are generating), streetlights with sensors will control energy use (and record levels of traffic on particular blocks), even park benches may be equipped with sound sensors (ostensibly to monitor noise levels, but can voice recordings be far behind?). Ironically, the more dystopic the scenario for humanity, the more utopic the prospects for social science. (But will a society that tolerates the level of invasiveness and control that new technologies permit also tolerate critical scholarly inquiry? Time will tell.)

In any case, such systems pose more mundane challenges for data archivists. For one thing, much of the information is in video and may soon be in audio form. Even where privacy is not an issue (and in some ways anonymization will be easier with data from environmental probes than with data collected on individuals), what exactly should archives contain? If, for example, I use video data to measure the density and pace of movement, and basic demographic characteristics of people moving, across city blocks, should my raw data be archived (with identifying features of particular people obscured), or only the data as I have coded it?

*     *     *

Social media, data recorded as a byproduct of everyday activities, and environmental surveillance data are the frontiers of “big data.” Social scientists have just begun to address the possibilities, and both methodological and practical challenges are formidable, but if researchers solve challenges of data management, access, and ethics, archiving will require innovation in stewardship of both digital documents and the interests of those whose activities generate them.

The Integrity and Security of Digital Archives

Thus far, I have been concerned with data archives holding the raw material out of which knowledge is constructed. In this section, I turn to archives that contain more conventional documents, but in digital form. Digitization entails challenges to both the integrity and security of archives.

The most serious problem has to do with the nature of publication and of reading in a digital world. By all accounts, Americans and Europeans read today as much as they ever did. But much more of that reading takes place on computer screens—news stories on pages that update every hour, blogs, specialized websites with particular types of information, daily feeds of stories from the digital desks of web magazines, and so on. People still read books, of course—bound books seem to be becoming the new vinyl for younger readers—and libraries still collect them. But much of what used to reside in microfilm collections of newspapers or bound collections of periodicals now appears online, changes from moment to moment, and may not be collected or recorded by anyone. The situation is not entirely unprecedented: during the 1960s, for example, the underground press and social movement groups produced much significant material outside the scope of regularly archived publications. But despite some heroic efforts, archivists as a whole under-collected such material to the detriment of future generations. The same is true of the flourishing of zines during the heyday of punk rock. Before the Internet, significant writing outside normally collected channels tended to burst out during periods of social, political, or artistic ferment. But in the Internet age, the diversity and evanescence of significant writing, and the difficulty of capturing what people are actually reading, have become a permanent challenge.16 No one is obliged to maintain a website or to keep material available on websites they maintain, so collecting and archiving must be ongoing, not episodic. The Internet Archive has done wonderful work, but no single organization is capable of meeting the challenge.

But digital archiving may be problematic even for conventional media. Take, for example, digital archives of newspapers, which have for the most part replaced microfiche archives, which in turn replaced paper archives and the “morgues” (categorized clipping files for the use of reporters but often opened to legitimate researchers) that existed through the 1990s. Physical archives and microfiche provided either the original documents or photographic facsimiles thereof, ensuring that the historical record was complete, or that omissions were visible (because interruptions in a newspaper series are evident and scissors leave physical evidence of their use). Online archives are a great improvement because they are readily searchable, so that work that might have taken years of person-hours can be accomplished in a few hours. But online archives are also vulnerable to the removal of documents in ways that are impossible to detect. For example, as a result of the New York Times v. Tasini decision (2001), which gave freelance writers digital rights to their work in cases where their contract did not transfer those rights explicitly, newspaper publishers and aggregators removed many articles by freelancers from digital archives.17 Tasini forbade reproduction of a freelancer’s work only in a novel context, not in a digital reproduction of the original newspaper in which it occurred. Nonetheless, searchable archives from which articles could be retrieved piecemeal were held to constitute a new context. See Mark Radkefeld, “The Medium is the Message: Copyright Law Confronts the Information Age in New York Times v. Tasini,” University of Akron Review 36 (2003): 545–87. Not only did they fail to indicate the removal of an article (e.g., by including its title in search results with a note that it was not available), but I can report, based on personal experience, that they used differing criteria for deleting articles and were reluctant to report what those criteria were.18 For an excellent discussion of Tasini and related issues, see June M. Besek, Philippa S. Loengard, and Jane C. Ginsburg, Maintaining the Integrity of Digital Archives, New York: Kernochan Center for Law, Media and the Arts, 2007. See also June Besek and Philippa Loengard, “Maintaining the Integrity of Digital Archives,” Columbia Journal of Law & the Arts 31, no. 3 (2008): 267–353.

Newspapers that maintain their own archives and even aggregators that supply such archives to academic libraries are at least supposed to follow best practices in archiving—don’t remove materials and, if you must, indicate which materials you have removed. But commercial entities are not required to adhere to such standards, and often they do not. This is nothing new, of course: many important archives were maintained by nonprofessionals for years before finding their way to professional archivists, and nonprofessionals often consider it their responsibility to eliminate files that might reflect badly on the leadership of a company or nonprofit organization.

But as more and more information finds its way to the Internet, the function of archiving past publications has changed with the nature of publication itself, so that much of the archiving function is distributed and, in effect, entrusted to people who maintain web servers and companies that maintain search engines. Some challenges are built into search engine algorithms: most searches turn up more “hits” than any one or two researchers can pursue, so that the order in which Google’s algorithm returns hits has significant consequences for what is de facto accessible and what is not.

To these problems have been added legal threats. One such family of threats, well documented by Wendy Seltzer, is associated with the “safe harbor” clause of the Digital Millennium Copyright Act (the major piece of US copyright legislation), which requires Internet service providers, including all websites, to take down any link on a website upon receipt of a formal complaint alleging that the website contains material protected by copyright. Although these provisions have been used legitimately to require websites to take down copyrighted materials, they have also been abused by IP holders whose definition of their rights is more expansive than courts would uphold, and by private interests (e.g., political operatives attempting to suppress information on the eve of an election) with no legitimate IP claims at all.19 Wendy Seltzer, “Free Speech Unmoored in Copyright’s Safe Harbor: Chilling Effects of the DMCA on the First Amendment,” Harvard Journal of Law and Technology 24 (2010): 171ff. To be sure, website operators subject to DMCA takedown notices have the right to respond. By the time a web page is reposted in response to an appeal, however, it may be no longer timely; when a small operator is up against a corporation with deep pockets, even takedowns without merit may be difficult to reverse. An even more radical bill—the Stop Online Piracy Act—appeared headed for passage in 2012 until an unprecedented public response (including a 7,000-website daylong strike) led to its defeat. SOPA would have required takedowns of entire websites, not simply offending pages, as well as centralized blocking of IP addresses, without due process based on IP-holder complaints.20 Michael Carrier, “SOPA, PIPA, ACTA, TPP: An Alphabet Soup of Innovation-Stifling Copyright Legislation and Agreements,” Northwestern Journal of Technology and Intellectual Property 11, no. 2 (2013): 21–31.

Another threat to the archiving function online comes from court decisions upholding, under the broader class of privacy rights, a “right to be forgotten” that permits persons to demand that search engines block access to websites containing discrediting information (in some cases, even if that information is accurate). Such a right has been established in the European Union (both through European Commission directive and through case law in several European countries) and has been proposed elsewhere (successfully, in Argentina). Under such law, individuals may appeal to websites (e.g., Wikipedia) or search engines to remove information that, for example, publicizes a criminal history or personal scandal, and to bring suit if that request is refused.21 Jeffrey Rosen, “The Right to be Forgotten,” Stanford Law Review Online 64 (2012): 88. In the first several years, Google (the one company that publicized requests and responses) received hundreds of takedown requests from citizens of the European Union and responded positively to many of them (while denying many others). One influential example from case law: a Spanish businessman who had declared bankruptcy in the late 1990s demanded that Google suppress search results leading to information about his past insolvency. The EU court ruled against Google (which had turned down the claim) on the grounds that the man’s personal data was “inadequate, irrelevant, or no longer relevant…” Within five months after the decision Google received 143,000 de-indexation requests asking it to take down almost half a million links. On April 14, 2016, the European Parliament passed a new General Data Protection Regulation that strengthened the “right to be forgotten” yet further (putting European law even more at variance with US law on protected speech).22 Lee Bygrave, “Law and Technology: A Right to Be Forgotten?” Communications of the ACM 58, no. 1 (2015): 35–37.

An equally serious threat, about which I have found little in print, involves the security of information within digital archives of historical source materials. Digitization of such materials has many advantages (especially if archivists can figure out how to keep files up to date through cycles of technological change), including easy access by researchers anywhere in the world and ready searchability. But it has become evident that even government agencies and private corporations with sophisticated security consultants are vulnerable to incursions by even more sophisticated hackers. Most university and nonprofit archives, I suspect, cannot compete with Sony, the Iranian Nuclear Agency, or the New York Fed (three notable victims of hacker attacks) on Internet security. I assume that some archivists are thinking about this, but an admittedly cursory online search found only a few Digital Humanities courses entitled “Hacking the Archive,” all of which used “Hacking” as a benign synonym for gaining legal online entry to sources one is entitled to enter. Without sustained attention to this issue, the prospect of a motivated attacker—imagine, e.g., Stalin, Putin, North Korea, or the Nixon White House—literally changing the historical record by gaining unauthorized entry to digital archives and editing digital documents seems like a real concern. Wikipedia—in effect a public archive—may present a kind of model in that behind the archive of information is an archive recording every change in that information. Some such automatic recording of changes might be applied to digital archives as a routine practice (though even such a system could be easy for a sophisticated assailant to work around).
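
To make the idea of an automatic record of changes concrete, here is a minimal sketch in Python of one way such a log might work. It is an illustration under stated assumptions, not a description of how Wikipedia or any existing archive operates: the class name RevisionLog, its methods, and the document identifier "memo-001" are all hypothetical. Each recorded revision carries a SHA-256 hash that also covers the hash of the revision before it, so a quiet edit to an earlier entry no longer matches the chain that follows.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class RevisionLog:
    """Hypothetical append-only log; each entry is chained to the previous one by hash."""
    entries: list = field(default_factory=list)

    def _digest(self, prev_hash, doc_id, text):
        # Hash the previous entry's hash together with this revision's content.
        payload = json.dumps(
            {"prev": prev_hash, "doc": doc_id, "text": text}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, doc_id, text):
        # Append a new revision and return its hash.
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        entry = {
            "doc": doc_id,
            "text": text,
            "prev": prev_hash,
            "hash": self._digest(prev_hash, doc_id, text),
        }
        self.entries.append(entry)
        return entry["hash"]

    def verify(self):
        # Recompute every hash in order; any altered entry breaks the chain.
        prev_hash = ""
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != self._digest(prev_hash, entry["doc"], entry["text"]):
                return False
            prev_hash = entry["hash"]
        return True


log = RevisionLog()
log.record("memo-001", "Original text of the memo.")
log.record("memo-001", "Corrected text of the memo.")
print(log.verify())  # prints True

log.entries[0]["text"] = "Quietly altered text."
print(log.verify())  # prints False
```

A log of this kind detects alteration only if the most recent hash is also kept somewhere the intruder cannot reach (deposited with another institution, say), since anyone able to rewrite the entire log could recompute every hash in it. That is one form of the workaround by a sophisticated assailant noted above.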

*     *     *

Clearly, this is an exciting time to be an archivist. As we find ourselves in a new world in which more information goes online each week than appeared in print for centuries after the introduction of the printing press, the work of the archivist has become even more complex, and the contributions of archivists to scholarship in the social sciences and humanities have become even more indispensable. What to keep, what to discard, how to respect the privacy of individuals, how to maintain the integrity of archival collections—all of these are issues that archivists have dealt with for decades. But when the flow of information is measured in zettabytes, scaling up established routines is unlikely to be an option. Those of us who depend on archivists and information scientists to ensure the availability of the data and documents we need must look on with appreciation, apprehension, and hope.
