Web archiving has received increased attention in the popular media over the past few years. The Internet Archive’s Wayback Machine, which can replay past versions of web pages, has been mentioned in news articles in the New York Times and the Washington Post and has been highlighted by MSNBC’s Rachel Maddow and HBO’s John Oliver (Rachel Maddow, The Rachel Maddow Show, MSNBC, December 16, 2016; John Oliver, Last Week Tonight, HBO, March 18, 2018). The Wayback Machine itself has been the subject of articles in the New Yorker and The Atlantic (in 2015 and 2017). Web archives have been used as evidence in court cases and in the court of public opinion, often to hold politicians and governments accountable for things they have said in the past.

But what is web archiving, and is it any better than simply taking a screenshot of a web page? A screenshot may suffice as a quick reminder of what a web page looked like, but images can be easily edited and manipulated (and people know this), so they are not suitable as evidence. In addition, screenshots are static. There can be no interaction with the page: no scrolling, no hovering, no clicking of links, not even a way to reveal where the links on the page pointed.

Web archives, on the other hand, record the entire contents of a web page, including its source HTML and its embedded images, stylesheets, and JavaScript. Upon playback, the user can interact with the archived page, including clicking links to explore what the web page was connected to. In addition, public web archives are created and stored by independent archival organizations, such as the Internet Archive, so we can trust that their contents have not been tampered with or maliciously manipulated.

Although the Internet Archive’s Wayback Machine is the oldest and largest public web archive, it is not the only public web archive. Many countries and national libraries run their own web archives. Some prominent public web archives include the UK Web Archive, the US Library of Congress web archive, archive.is, Archive-It, and the Portuguese web archive. A large list of international web archiving initiatives is available on Wikipedia.

Although web archives provide a valuable service, they are not perfect, and archiving a web page is very different from archiving a physical object or even a static file such as a PDF. Web pages have become increasingly complex over the years, with many loading hundreds or even thousands of images, stylesheets, and JavaScript resources, including advertisements and trackers. These JavaScript resources are executed by web browsers, and not all web archives can capture the interactions they produce. The embedded and linked nature of HTML makes direct replay of archived web pages difficult, so web archives must make limited transformations to the original page. This includes rewriting links and the locations of embedded resources so that they are loaded from the archive instead of the live web. This prevents someone viewing a web page captured in 2012, for instance, from seeing an advertisement from 2018 embedded in that 2012 page.
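
To make the rewriting concrete, here is a minimal sketch in Python, assuming the Wayback Machine’s well-known URL pattern (https://web.archive.org/web/&lt;timestamp&gt;/&lt;url&gt;); it illustrates the idea and is not the archive’s actual rewriting code.

```python
# A minimal sketch of archive-style link rewriting, assuming the Wayback
# Machine's public URL pattern; real archives rewrite links server-side
# at replay time.

def rewrite_to_archive(live_url: str, timestamp: str) -> str:
    """Map a live-web URL to its archived form for a given capture time.

    The timestamp uses the 14-digit YYYYMMDDhhmmss convention.
    """
    return f"https://web.archive.org/web/{timestamp}/{live_url}"

# An ad script embedded in a 2012 capture is loaded from the archive's
# 2012 holdings rather than from the live web:
print(rewrite_to_archive("http://example.com/ads/banner.js", "20120615000000"))
# -> https://web.archive.org/web/20120615000000/http://example.com/ads/banner.js
```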

The Internet Archive and other archives have been recording portions of the web since 1996, providing social science scholars with an immense amount of historical information about the web itself, recent history and culture, and how the web has changed the way people communicate. Historian Ian Milligan has used web archives to explore the GeoCities online communities, popular in the late 1990s and no longer available on the live web. He investigated how users formed their own communities and engaged with others online in a time before social media. Milligan also credits web archives with enabling and enhancing the study of important cultural and historical events from the past 20 years: “Imagine writing a history of Bill Clinton’s scandals during the mid-1990s or of the September 11, 2001, terrorist attacks without using archived websites” (Ian Milligan, “Welcome to the Web: The Online Community of GeoCities during the Early Years of the World Wide Web,” in The Web as History, ed. Niels Brügger and Ralph Schroeder [London: UCL Press, 2017], https://doi.org/10.14324/111.9781911307563). Further examples of the use of web archives for humanities research are described in the recent edited volume The Web as History (Brügger and Schroeder, eds.).

So, how can social science scholars and researchers take advantage of web archives?

For the past eight years, our Web Science and Digital Libraries (WS-DL) group at Old Dominion University (ODU) has been studying the challenges of allowing researchers to create and share their own web archives. Our work focuses more on close reading of archived material than on distant reading. For those interested in distant reading of web archives, the Archives Unleashed project, a collaboration between historians, librarians, and computer scientists, is developing excellent tools to enable researchers to perform large-scale analysis of web archives.

Our WS-DL group has developed tools that allow users to locally archive web pages as they browse the web and to submit URLs for archiving in public archives. One issue with submitting a URL for archiving, as opposed to creating a local archive, is that what you are viewing in your browser will likely not be exactly what the archive records. When you submit a URL, a web crawler loads the web page from its own vantage point, without your geolocation or cookies. Another issue is that some web crawlers, such as Heritrix, used by the Internet Archive, do not execute JavaScript when archiving web pages, so they may miss resources that are only loaded after the browser executes JavaScript (such as after user interaction with the page).

Creating and viewing local web archives

Often researchers want to create an archive of a web page as they are viewing it. This was the motivation behind our NEH-funded “Archive What I See Now” project. We built a Google Chrome extension, WARCreate, that creates a local archive of the web page currently being viewed in the browser. This can be a page loaded after interaction, such as scrolling that causes more content to load, or a page that is only displayed after authentication, such as a social media account. As the name implies, WARCreate creates a file in the standard WARC (Web ARChive) format, saved on the user’s local computer. WARC files are used by most web archives to store the results of web crawls. A single WARC file can store multiple web resources, containing their content along with metadata, including HTTP header information.
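
For readers who want to look inside a WARC file, the open-source warcio Python library (from the Webrecorder project) can iterate over its records; below is a minimal sketch with a hypothetical filename, not part of WARCreate itself.

```python
# A minimal sketch of inspecting a WARC file with the warcio library
# (pip install warcio); "example.warc.gz" is a hypothetical filename.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            # Each response record carries the captured URL and capture
            # datetime in its WARC headers, plus the full HTTP response.
            print(record.rec_headers.get_header("WARC-Target-URI"),
                  record.rec_headers.get_header("WARC-Date"),
                  record.http_headers.get_statuscode())
```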

Once users have WARC files, they need to be able to replay them, so we built WAIL (Web Archiving Integration Layer), a standalone application for replaying local archives. In addition, WAIL allows users to run web crawls: instead of archiving just a single page, as with WARCreate, WAIL can create web archives of a web page and all of its links, or even of an entire website. Our latest version of WAIL uses pywb, a Python-based version of the Wayback Machine software, to manage local archive collections, along with a browser-based crawler that executes JavaScript before creating the archive. Users with local WARC files can also use Rhizome’s Webrecorder Player, which can replay one WARC at a time.

Submitting web pages to public web archives

In some cases, generating a local web archive is necessary, but often researchers are interested in archiving public web pages and may want to easily share those archived pages in the future. For these situations, submitting the web page’s URL to a public web archive is the best option. This essentially requests that the archive make an independent observation of the web page, which is then made publicly available for replay. The Internet Archive’s Save Page Now service is relatively well known, but we highly encourage the use of multiple web archives. There have been recent cases where web page owners have placed restrictions on the playback of their pages from the Internet Archive, but not all archives are subject to those restrictions.

We’ve built some tools that allow users to submit web pages to multiple web archives at the same time. The Mink Chrome browser extension not only provides access to archives of the page currently being viewed in the browser, but also allows users to submit the page for archiving by three different archives: the Internet Archive, archive.is, and WebCite. Those familiar with Python can install and use archivenow, which also allows users to create local WARCs (see the sketch below). Finally, we built a Twitter bot, ICanHazMemento, for archiving URLs found in tweets. A user can include #icanhazmemento in a tweet with a URL (or in a reply to a tweet that contains a URL), and the bot will reply with a link to the archived web page.
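
As a brief example of the archivenow workflow, the sketch below pushes a hypothetical URL to several archives; the handler IDs follow archivenow’s documentation (“ia” for the Internet Archive, “is” for archive.is, “wc” for WebCite).

```python
# A minimal sketch using the archivenow library (pip install archivenow);
# the URL is hypothetical. Each push returns the URL(s) of the new memento.
from archivenow import archivenow

url = "https://example.com/"

# "ia" = Internet Archive, "is" = archive.is, "wc" = WebCite
for archive_id in ("ia", "is", "wc"):
    print(archive_id, archivenow.push(url, archive_id))
```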

There are also several services available for on-demand archiving of web pages, including the Internet Archive’s Save Page Now service, archive.is, WebCite, and Webrecorder.io. In particular, Webrecorder.io is an excellent browser-based archiving service. Webrecorder can create high-fidelity archives, including all of the JavaScript on a web page. Unlike many other on-demand archiving services, Webrecorder.io can archive pages that are behind authentication. One issue, though, is that all traffic passes through Webrecorder’s servers, including sensitive requests or credentials required to load certain web pages.

Accessing public web archives

Once web pages have been crawled by a web archive, they can be replayed in a web browser. The default way to access an archived web page is to go to an individual web archive and request the URL; the archive will typically return a list of the archived versions of that web page. As mentioned earlier, different web archives often have different holdings, so mechanisms have been developed to help users query multiple archives with one request.
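
For example, the Internet Archive exposes its index of captures programmatically through its CDX API; the minimal sketch below lists a few captures of a URL. Other archives offer similar, but not identical, query interfaces.

```python
# A minimal sketch of listing captures via the Internet Archive's CDX API.
import requests

resp = requests.get("http://web.archive.org/cdx/search/cdx",
                    params={"url": "example.com", "output": "json", "limit": 5})
rows = resp.json()
header, captures = rows[0], rows[1:]  # the first row names the fields
for row in captures:
    capture = dict(zip(header, row))
    print(capture["timestamp"], capture["original"], capture["statuscode"])
```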

Memento is an HTTP protocol extension, developed by Los Alamos National Laboratory (LANL) and ODU, that allows for time-based negotiation of web pages. Many of the public web archives mentioned earlier, including the Internet Archive, archive.is, and the UK Web Archive, support Memento. Memento aggregators allow multiple web archives to be queried in a single request. Several tools, many listed on mementoweb.org and some described below, have been developed to take advantage of the Memento protocol. Because the Memento protocol is foundational to our work, from here on we will refer to archived versions of a web page as mementos of the web page.
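
At the protocol level, a Memento client sends an Accept-Datetime header to a TimeGate, which redirects to the memento closest to the requested datetime. A minimal sketch against the public aggregator TimeGate at mementoweb.org (the target URL and datetime are examples):

```python
# A minimal sketch of Memento datetime negotiation using the public
# aggregator TimeGate.
import requests

timegate = "http://timetravel.mementoweb.org/timegate/http://example.com/"
headers = {"Accept-Datetime": "Thu, 01 Jan 2015 00:00:00 GMT"}

resp = requests.get(timegate, headers=headers, allow_redirects=False)
# The TimeGate answers with a redirect; Location points at the closest memento.
print(resp.status_code, resp.headers.get("Location"))
```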

The Mink Chrome browser extension (mentioned above) uses Memento to report how many times the currently viewed web page has been archived in multiple web archives and provides an interface to access those mementos. In addition to the tools that we’ve developed, our collaborators at LANL have also developed Memento-based browser extensions: Memento Time Travel for Chrome and Firefox.

To access public web archives without browser extensions, or if the desired web page is not available on the live web, the Time Travel service provided by mementoweb.org is the best option. The user supplies the desired URL and datetime, and Time Travel uses Memento to return a list of mementos from multiple web archives closest to the specified datetime. The UK Web Archive also offers a Memento service; it uses the same protocol as Time Travel, but with a different interface.
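
Time Travel also offers a JSON API for scripted lookups; a minimal sketch, assuming the documented /api/json/&lt;datetime&gt;/&lt;url&gt; endpoint (the exact response structure may vary):

```python
# A minimal sketch of the Time Travel JSON API; the response groups the
# first, last, and closest mementos found across the aggregated archives.
import requests

resp = requests.get(
    "http://timetravel.mementoweb.org/api/json/20150101/http://example.com/")
data = resp.json()
closest = data["mementos"]["closest"]
print(closest["datetime"], closest["uri"])  # uri may list several archives
```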

The previously mentioned services all require that users at least have the URL they want to explore, but some services also allow users to browse archive holdings. The UK Web Archive has classified its holdings by subject and special collection, suitable for browsing. Archive-It is a subscription-based service from the Internet Archive that allows organizations to create collections of mementos. Archive-It’s 400+ partners include museums, libraries, universities, and state governments, with collections covering a wide range of topics. Because Archive-It is operated by the Internet Archive, all of Archive-It’s public holdings can also be replayed in the Internet Archive’s Wayback Machine.

Research issues in web archiving

Our work on enabling users to create their own web archives and building tools to improve access to web archives has uncovered several interesting research issues that we are continuing to investigate.

Summarizing and visualizing web archives. Our current NEH-funded work focuses on choosing representative mementos to show as a summary, or overview, of how a single web page has changed over time. This work targets web pages that have a large number of mementos, too many to expect a user to replay each one to understand how the page has changed. For efficiency, we compare the source HTML of the mementos and replay and take screenshots of only the most distinctive ones (a simple version of this idea is sketched below). We can then arrange these thumbnail-sized screenshots in a grid layout, on a timeline, or as an animation. We will be releasing a preliminary web service in the near future, and we are continuing to investigate how researchers might use this type of service and what other techniques could efficiently choose the most representative mementos.
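
To illustrate the idea, the sketch below keeps a memento only when its HTML differs enough from the previously kept capture; this is a simplified stand-in, not our actual selection algorithm.

```python
# A minimal sketch of filtering near-duplicate mementos by HTML similarity;
# a simplified stand-in for representative-memento selection.
import difflib

def select_distinctive(mementos, threshold=0.9):
    """mementos: chronological list of (datetime, html) pairs.

    Keep a memento only if its HTML is sufficiently different from the
    most recently kept capture.
    """
    kept = []
    for dt, html in mementos:
        if kept:
            similarity = difflib.SequenceMatcher(
                None, kept[-1][1], html).ratio()
            if similarity >= threshold:  # near-duplicate of previous capture
                continue
        kept.append((dt, html))
    return kept
```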

Our previous work in summarizing how a web page changes over time takes advantage of Twitter and Tumblr. The What Did It Look Like Twitter bot takes the first memento from each year for a particular web page and creates an animated GIF, which is then posted to Tumblr. As with our ICanHazMemento bot, a user can tweet a URL (or reply to a tweet that contains a URL) with the hashtag #whatdiditlooklike to invoke the service, which will reply with a link to the Tumblr post.

We also have ongoing work in summarizing a collection of mementos. The “Dark and Stormy Archives Framework” (described in Dr. Yasmin AlNoamany’s PhD wrap-up blog post) takes Archive-It collections, chooses representative mementos, and imports them into a Storify story. Unfortunately, the Storify service is no longer available, but we are investigating alternatives, including creating our own service to generate Memento-aware social cards.

Selecting high-quality mementos. One issue highlighted by our attempts to summarize a collection or a web page over time is how to select high-quality mementos. There are several reasons why a particular memento may not be high quality.

Sometimes the web page that was archived was actually behind a paywall. These could be articles from news organizations, like the New York Times and the Wall Street Journal, or from academic publishers, like Springer and Elsevier. We investigated the prevalence of such mementos in the Internet Archive and are developing methods for detecting them.

We know that web pages can change over time and even go off-topic for various reasons. Sometimes the content of a web page changes so much that it can no longer be considered to be about its original topic. Sometimes there were database errors, the site was down for maintenance, or the page was hacked. These instances are often captured by web crawlers and saved in web archives. Our Off-Topic Memento Toolkit can automatically identify such instances so that off-topic mementos can be filtered out before summarization or visualization.

Another issue is that not all of the resources associated with a web page may have been captured when the page was crawled by the web archive. This could be due to transient errors in loading embedded resources or to an inability to capture resources loaded by JavaScript, as mentioned earlier. The result of missing resources in a replayed memento is memento damage. We have developed a technique to estimate the amount of memento damage present in a memento, which allows users to pick the best-archived memento from a set of similar ones.

One issue we’ve mentioned with archiving is the inability of some web crawlers to capture resources loaded by JavaScript. One reason for this is that capturing these types of resources takes significantly more time. Traditional web crawlers like Heritrix, used by the Internet Archive, are focused on crawling as many web pages as possible, as quickly as possible. Ensuring that all JavaScript-loaded resources are captured for each page would greatly reduce the number of captures the crawler could make in the same amount of time.

Many on-demand archiving services, like Webrecorder.io, are focused on creating high-fidelity archives of web pages rather than speed. These systems use browser-based capture tools (such as our Squidwarc tool) that load the entire web page, including all JavaScript-loaded resources, before creating the capture.

Finally, we also discovered issues with the replay of certain web pages. For instance, mementos of the front page of cnn.com have not been replayable since November 2016 due to changes in how the site loads the page. Fortunately, many of the page’s resources have been captured, so the mementos can be replayed with the Wayback++ browser extension (available for both Chrome and Firefox) that we developed.

Integrating web archives into the web browser. As mentioned earlier, the Mink Google Chrome extension allows users to access mementos of web pages that they are currently viewing, as well as submit the web page to multiple web archives. Mink shows us what might be possible with native support for Memento and web archives in the browser.

As described above, we are developing techniques to provide more information about memento quality (off-topic detection, memento damage) when users request archived web pages. An interface like Mink could display this information along with the datetime of each capture, so that users could make informed decisions about which mementos to view.

Mink can provide access to mementos from multiple archives. We are currently developing a framework to allow users to integrate access to both public web archives and their own private web archives. Mink could serve as an interface for users to access this type of aggregation.

Conclusion

Web archives are becoming increasingly important for those studying the culture and history of the past 20 years. This article has provided an overview of how researchers and scholars can use web archives in their own research—from creating archives of current web pages to accessing mementos of the past. We have also introduced several research issues involved in improving how individuals can interact with web archives. Our overarching motivation is the belief that bringing the past web into the web browser is a key component of enabling more access to and use of web archives.

For more information about the WS-DL research group at ODU, follow us on Twitter (@WebSciDL) or on our blog.