Umbra Search African American History: Aggregating African American Digital Archives

by Dorothy Berry December 14, 2016

There are topics on which American archives have based their foundations for generations. Literature, political figures, labor struggles, regional history. Major collecting institutions across the United States have been describing and arranging these records for years, building research destinations with clear focuses. If you want to study the American South, you go to Chapel Hill. If you want to study gravestones, you go to Amherst. African American history, however, is easily perceived as under-collected. While there are some online guides to African American archival collections, there is no centralized or authoritative source, especially when it comes to smaller, less researched collections. Stories are spread out across the nation following the trails of academics in Mississippi who collected photos of the Southside of Chicago, music librarians in Durham who collected African American sheet music, and other even more surprising routes. In many cases, the best way to find a collection has been to serendipitously know someone who has already found it.

Big names do, of course, exist in the world of African American archival materials, but the full story of Blackness in the archival record is difficult to tell through the often small holdings dispersed across a large nation. Digitization efforts have dramatically changed archival accessibility, but the scattershot nature of African American collections, and of their digitization, has made discoverability an entirely different matter.

Umbra Search African American History (umbrasearch.org), a search tool and widget developed by the University of Minnesota Libraries’ Givens Collection of African American Literature, was designed to counter the web of links and depositories by aggregating African American records from across the digital landscape, pulling from major sources like the Schomburg Center and the Library of Congress, as well as the minor but potent collections held by historical societies and county libraries. Aggregating more than 400,000 digital holdings from over 1,000 institutions creates a new collection whose repositories are not physical spaces but rather the ur-archive of the human mind. Records in digital space carry the metadata of provenance encoded within them, but when seen all together, photos of rural life from seven different repositories are no longer presented primarily as institutional research capital but rather as the associated records, photos, and ephemera that make up African American history.

The first steps toward such lofty goals can be found at umbrasearch.org, where over 400,000 records from more than 1,000 institutions constitute an aggregation linking records in a manner that 20 years earlier would have taken a researcher multiple research trips, and 5 years earlier would have taken a researcher multiple hours scouring libraries’ websites and furiously organizing bookmarks folders. These numbers are good and represent huge amounts of work on the part of the Umbra Search team as well as archivists, catalogers, and registrars from across our contributing institutions. In the past year, however, University of Minnesota Libraries was awarded a Council on Library and Information Resources (CLIR) grant to digitize African American holdings from across the U of M collections, and on this campus alone over half a million records were identified. In two years’ time, Umbra Search will have almost doubled its current size based on this project alone, raising real questions about how data is aggregated across institutions with widely idiosyncratic descriptive metadata.

Efforts to create computing systems that can identify records relevant to Umbra Search hit roadblocks not because of a lack of materials, but rather because of the information accompanying those materials. A recent experiment at a campus coding event shone a direct light on this issue. A set of 1,500 records pulled from the Digital Public Library of America, one of the largest sources for records on Umbra Search, was used as a test set and marked as relevant or irrelevant by members of the Umbra Search team. Research system engineer David Olsen designed an algorithm that, using the frequency of keywords, matched that human-defined set about 90 percent of the time. Examining the discrepancy revealed an overarching issue with gathering together archival records in a cross-institutional, digital setting: descriptive practices that privilege the intellectual gatekeeping of physical accessibility to discrete human knowledges.

An example I often use to illustrate this issue is a record titled “Justice Department Report on the Shooting of Michael Brown by Ferguson, Missouri Police Officer Darren Wilson.” The human-defined set marked this record as relevant, but the computing system did not. When I share this example, people react in shock: of course this record is relevant to African American history! The metadata for the file, however, included only the keywords “Civil Unrest,” “Justice Department,” “Michael Brown,” “Darren Wilson,” and “Ferguson, Missouri.” Those keywords meld together in the zeitgeist, creating a shared set of extended keywords like “Police Violence” and “Black Lives Matter,” but the computer systems work only with the technical metadata tied to a record, not the intellectual metadata tied to contemporary understandings.
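To make the failure concrete, here is a minimal sketch of keyword-frequency relevance scoring. It is not the algorithm David Olsen built, whose details are not given here; the term list, the threshold, the second sample record, and all function names are hypothetical. The only data taken from the example above is the set of keywords attached to the Justice Department report.

```python
from collections import Counter

# Hypothetical list of terms a scorer might treat as signals of relevance.
RELEVANT_TERMS = {
    "african american",
    "african americans",
    "civil rights",
    "police violence",
    "black lives matter",
}


def relevance_score(keywords):
    """Count how many of a record's keywords appear in RELEVANT_TERMS."""
    counts = Counter(k.strip().lower() for k in keywords)
    return sum(n for term, n in counts.items() if term in RELEVANT_TERMS)


def is_relevant(keywords, threshold=1):
    """A record is kept when its score reaches the (assumed) threshold."""
    return relevance_score(keywords) >= threshold


def agreement(records, human_labels):
    """Fraction of records where the scorer matches the human label."""
    matches = sum(
        is_relevant(record) == label
        for record, label in zip(records, human_labels)
    )
    return matches / len(records)


# Keywords attached to the Justice Department report, as quoted above.
ferguson = [
    "Civil Unrest",
    "Justice Department",
    "Michael Brown",
    "Darren Wilson",
    "Ferguson, Missouri",
]

# A hypothetical record whose description names its subject directly.
described = ["African Americans", "Civil Rights", "Chicago"]

print(is_relevant(ferguson))   # False: no keyword matches the term list
print(is_relevant(described))  # True
print(agreement([ferguson, described], [True, True]))  # 0.5
```

A human reviewer marks both records relevant, but the scorer can only see the keywords it is given, so the Ferguson report falls through.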

Challenges around metadata are some of the largest facing the growth of Umbra Search as a collection of aggregated digital holdings. At this point, those obstacles appear in three broad categories: conventions around archival description, limited resources for archival processing, and a lack of transferable documentation. Before diving into these issues, a clear point should be made that these are not call-outs of particular institutions or the archival profession, but simply problematics that will become ever more apparent as digital user bases widen and begin to expect an ease of access and discoverability that mirrors what they are used to on far less academically vetted sources with far more resources, like Wikipedia or Google.

The first obstacle is the limited resources for archival processing. Most archives have or have had huge backlogs of unprocessed collections, compounded by the continuous acquisition of new materials. Collection backlogs have been addressed in recent years by doing what is known as “minimal processing,” an approach inspired by Mark A. Greene and Dennis Meissner’s incredibly influential article “More Product, Less Process: Revamping Traditional Archival Processing.” To simplify, Greene and Meissner argue that archives have awarded “a higher priority to serving the perceived needs of our collections than to serving the demonstrated needs of our constituents” (212) by allowing large amounts of materials to sit unprocessed until such a time that resources allowed for complete, high-level inventorying and finding aid creation. Their article championed methods to expedite physical access including, most relevantly to digital archives aggregation, description of materials “sufficient to promote use” (213).

Description sufficient to promote use in a physical setting can be drastically different from description sufficient to promote use both by computer systems looking for records to aggregate and by researchers searching for individual records in mass digital settings. While the methodology promoted in “More Product, Less Process,” known colloquially as MPLP, has led to huge strides in making collections available to researchers either in person or through finding aids available online, processing collections at a box level or folder level can lead to digital records with an inefficiently minimal level of description. To provide an illustrative hypothetical: an archive has an unprocessed collection of twenty boxes from a local African American social worker, titled the Jane Doe Collection. In a pre-minimal processing world, an individual archivist might spend hours and hours researching Jane Doe, arranging and describing the materials, and writing up a finding aid that describes the foldered content of each box, with time spent on detailing the condition of each individual record, all before any researchers had access. With minimal processing principles in mind, that collection might be processed into a finding aid quickly, either at box level (“Photographs”) or folder level (“Photographs; November 1942-January 1943”). This greatly speeds up access for researchers on site, who can look at the finding aid and request the box “Photographs” from the Jane Doe Collection, and for researchers requesting scans of the records in “Photographs; November 1942-January 1943.”

When that same archive decides to put some of its records into a digital archive setting, however, that means there are now 250 item-level records all with the title “Photographs; November 1942-January 1943.” Since the keywords for that folder were designed to apply to the totality of the folder-level record, researchers searching for Christmas 1942 photographs have to sift through all the photos, each of which shares the keywords “Jane Doe,” “Social Workers,” and “Photographic Negatives.” This research obstacle is not, necessarily, an issue with MPLP on a basic level, but rather an emerging division between archival accessibility and digital discoverability.
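The effect can be sketched in a few lines. The record structure, field names, identifiers, and search function below are assumptions made for illustration, not Umbra Search's actual data model; the title, keywords, and count of 250 come from the hypothetical above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemRecord:
    """One digitized item, described only by what its folder was given."""
    identifier: str
    title: str
    keywords: tuple


FOLDER_TITLE = "Photographs; November 1942-January 1943"
FOLDER_KEYWORDS = ("Jane Doe", "Social Workers", "Photographic Negatives")

# Each digitized item inherits the folder-level description unchanged.
items = [
    ItemRecord(f"jane-doe-{n:03d}", FOLDER_TITLE, FOLDER_KEYWORDS)
    for n in range(1, 251)
]


def search(records, query):
    """Return records whose title or keywords contain every query word."""
    words = query.lower().split()
    hits = []
    for record in records:
        text = " ".join((record.title,) + record.keywords).lower()
        if all(word in text for word in words):
            hits.append(record)
    return hits


print(len(search(items, "Christmas 1942")))    # 0: nothing says "Christmas"
print(len(search(items, "photographs 1942")))  # 250: every item matches
```

A specific query finds nothing and a general one finds everything, which is the gap between a description that works at the reading-room desk and one that works in a search box.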

Conventions around archival description are myriad and standardized in the United States in the form of Describing Archives: A Content Standard (DACS). DACS is an output-independent standard designed to work with a variety of information formats, including “manual and electronic catalogs, databases, and other finding aid formats,” while acknowledging that “DACS will be used principally with the two most commonly employed forms of access tools, catalogs and inventories.” While DACS is definitively the guidebook to archival description, a digital archive composed of image files requires a descriptive level completely outside the traditions of archival access. When you visit an archive and tell the archivist you are looking for photos of Cab Calloway at the Cotton Club, they will most likely bring out a box from the Cab Calloway Collection, from which you will pull a folder titled something like “Photographs; Cotton Club, 1931.” In the digital archives setting, you simply type “Cab Calloway” and limit your results to images. The structures that make up archival description are built on iterative levels of organization that, while still in place, become invisible in a digital aggregate. Aggregation by its nature contradicts what has been referred to as the guiding principle of archival theory: respect des fonds. Michel Duchein defines respect des fonds as “to group, without mixing them with others, the archives (documents of every kind) created by or coming from an administration, establishment, person, or corporate body.” The entire appeal of a platform like Umbra Search or the Digital Public Library of America is the mixing of archives with others! In traditional archives, a single photo is not handed out, removed from the context of its folder and box, while in a digital setting having to go through a box’s worth of photos to get the one you want is seen as a systems failure.

This reordering of priorities is made even more clear in the use of keywords and/or subject headings. Grouping information by hyperlinked keywords is a given in a digital setting, but the use of keywords in the archival setting is one of the most idiosyncratic elements of descriptions. Library of Congress Subject Headings are a common standard but lack many signifying headings that would assist in pulling together relevant data, as illustrated by a search for the subject “African American culture” or “United States–History–African American.” This is compounded by the complexities around identification of African Americans over time and across communities (Colored vs. Negro vs. Afro-American vs. Black vs. African American). These descriptive issues come together to create digital records that are difficult for computers to parse for relevancy, as seen in the Ferguson, Missouri, example above.
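One way to picture the problem of shifting terminology is a lookup that treats the historical identifiers listed above as variants of a single concept. This is a hypothetical illustration, not a controlled vocabulary in use at Umbra Search, and the sample subject strings are invented. Real records would need far more careful handling of context: "Black" and "Colored," for example, also appear as ordinary adjectives and surnames.

```python
import re

# Variants taken from the list above; all are matched case-insensitively.
VARIANTS = ("colored", "negro", "afro-american", "black", "african american")

# Match any variant as a whole word, allowing a trailing plural "s".
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(v) for v in VARIANTS) + r")s?\b",
    re.IGNORECASE,
)


def mentions_concept(text):
    """True when the text uses any of the listed identifiers."""
    return PATTERN.search(text) is not None


# Invented subject strings in the style of older and newer descriptions.
subjects = [
    "Negro spirituals",
    "Afro-American newspapers",
    "Colored Women's Clubs",
    "Sheet music",
]

print([mentions_concept(s) for s in subjects])  # [True, True, True, False]
```

Even this generous matching does nothing for a record like the Ferguson report, whose keywords use none of these terms at all.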

A lack of transferable documentation is the crux of the first two obstacles. Though minimal processing may lead to digital records with overly broad titles, and the difficulties around description for digitized records may lead to those same records being decontextualized and lacking in efficient keywords, this is not because the information does not exist. Archivists are astounding for their ability to amass subject knowledge in their collection’s holdings, but the funding and time to retrofit descriptive practice for the digital age simply isn’t there. Metadata enhancement is written into the grant for Umbra Search’s CLIR-funded mass digitization project, but most digitization efforts are constantly juggling time management and financial issues that don’t leave a lot of space for creating extensive subject authorities or spending hours writing folder-level descriptions. Researchers with physical and financial capabilities can visit archives and benefit from the archivists’ subject knowledge, but those using digital archives as their sole option miss out. This is an even more extreme issue when dealing with collections of marginalized people. A single archivist may be a champion for African American records in their collection and hold a wealth of knowledge about their holdings, but if there is no transferable documentation, when that standard bearer retires or passes away a huge hole appears in our understanding of some of the most at-risk communities’ histories.

As Umbra Search African American History continues to be shaped, we hope that solutions to these challenges can be discovered through collaborations with researchers, scholars, information professionals, and systems developers. As archives strive to provide access beyond the ivory tower, and users seek information in digital spaces before physical ones, Umbra Search will increasingly offer the possibility not only of putting African American history into the hands of its true inheritors by creating a large-scale digital collection uniting archives across the country, but also of exploring the upcoming problematics of archives in the age of digital reproduction.

Dorothy Berry

Dorothy Berry is the digitization and metadata lead for Umbra Search African American History at the University of Minnesota. Her work has focused on the intersections of information science and African American history and has included research on nineteenth-century Black performing arts in the digital archives, exhibit creation for the Black Film Center/Archive and the Archives of African American Music and Culture, object conservation for the Smithsonian’s National Museum of African American History and Culture, and digital resource creation for the Ozarks Afro-American History Museum.