Parameters

Quality Assurance under Conditions of Scale and Complexity

This collection is distinctively poised between digital genres: with over 400 texts (approximately 11 million words), the collection is sizeable and well beyond the scope of a typical scholarly edition, but the level of detail and human attention represented in the encoding distinguishes it from typical large-scale text digitization efforts. Given such size and complexity, the management of error and inconsistency while encoding these texts in XML is a crucial task.

by Sarah Connell and Julia Flanders, October 3, 2018

Introduction

Over the past thirty years, the Women Writers Project (WWP) has developed a digital collection of full-text transcriptions of texts by pre-Victorian women writers, following the standards set by the Text Encoding Initiative (TEI) Guidelines.¹ This collection is distinctively poised between digital genres: with over 400 texts (approximately 11 million words), the collection is sizeable and well beyond the scope of a typical scholarly edition, but the level of detail and human attention represented in the encoding distinguishes it from typical large-scale text digitization efforts. The use of TEI and XML transforms the basic textual content into something much more powerful and complex by adding information about the texts’ structures, content, genres, and interconnections. Given such size and complexity, the management of error and inconsistency while encoding these texts in XML is a crucial task. Applying the TEI Guidelines to a wide range of genres over a long timespan invites variances in practice, and the processes of hand transcription and encoding inevitably produce some level of error even in the hands of the most highly trained staff. While consumers of large-scale digital collections have become more acclimated to error as the cost of scale (Google Books and other optical character recognition–based collections being significant examples), the WWP’s audience and usage model rely on a high level of accuracy and consistency, despite the size of the collection.

Finding and fixing errors in such a collection requires automation, and there are a variety of XML technologies (such as XQuery, XPath, and XSLT) that offer powerful tools for discovering patterns of variation and for making global changes to complex encoded data. However, there are several reasons for not treating error correction as simply a technical challenge, an automated cleanup stage in a workflow that passes the text from (humanist) transcribers to (technologist) data-cleanup experts. First, in practical terms the complexity of decision-making at the level of transcription and encoding means that although patterns of inconsistency are easy to detect with automated tools, they need to be interpreted by those with textual expertise before they can be read as “error.” Quite often, odd edge cases and exceptions turn out to reveal a fracture in the encoding policy itself and can motivate a reassessment of specific practices: for example, when an investigation of inconsistencies in the handling of stage directions led to the development of a more nuanced markup for that feature.² But even more significantly, in order to build and maintain a culture of empowered collaboration within a project of this kind, it is important to situate as many tools as possible within the process of transcription and encoding itself. Doing so gives students the opportunity to become familiar with new technologies (rather than isolating those technologies as distantly and intimidatingly “expert” systems in another part of the forest), and also positions their encoding work as an engagement with the entire existing collection that both intervenes in and is constrained by the project’s established practice. Instead of treating error correction as an assessment stage at the end of a private process, we seek to treat it as a tool for improving the encoding process itself and rendering it more dialogic.

These are useful principles, but what does it look like in practice to establish a complex error-correction tool that is deployed as part of the process of encoding texts in XML? In the brief case studies below, we examine some specific examples to explore what is involved in adapting a new tool to both the specificity of the data and the specificity of the working environment.

Quality assurance in practice

The collaborative development process the WWP has adopted for our automated error-testing routines enables us to bring in both specialized technical knowledge and a deep understanding of the WWP’s encoding. A typical example is the ISO Schematron³ schema that we use to look for spacing errors around phrase-level elements (elements for tagging phrase-level phenomena such as names, dates, and technical terms). This schema runs a series of tests to check for errors such as the following:

<title>The Tragedy of <persName>Hamlet</persName>,<roleName>Prince of <placeName>Denmark</placeName></roleName></title>

in which a missing space (following the comma) is difficult for human proofreaders to see because of the start and end tags in the encoding. This schema can identify a range of spacing-related errors and flag them with messages such as “this <placeName> is followed immediately by a <persName> without any intervening space,” “this <said> has no space immediately before,” or “this <persName> ends with a space.” That is, the schema identifies spacing errors in both textual content and markup before and after phrase-level elements, as well as within the elements themselves.
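The kind of test described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the WWP's Schematron schema: the function name, the short list of phrase-level elements, and the message wording are assumptions, and the sample has no namespace. The sketch looks at the text immediately before each phrase-level element and at the end of the element's own content.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of phrase-level elements; a real list would be longer.
PHRASE_LEVEL = {"persName", "placeName", "roleName", "name", "said"}


def find_spacing_problems(xml_text):
    """Return (tag, message) pairs for a few simple spacing problems."""
    root = ET.fromstring(xml_text)
    problems = []
    for parent in root.iter():
        previous = None  # the preceding sibling element, if any
        for child in parent:
            if child.tag in PHRASE_LEVEL:
                # Text immediately before this element: the parent's leading
                # text for a first child, otherwise the previous sibling's tail.
                before = parent.text if previous is None else previous.tail
                before = before or ""
                if previous is not None and before == "":
                    problems.append(
                        (child.tag,
                         f"follows <{previous.tag}> without any intervening space"))
                elif before and not before[-1].isspace():
                    problems.append((child.tag, "has no space immediately before"))
                # Trailing space inside the element itself.
                if "".join(child.itertext()).endswith(" "):
                    problems.append((child.tag, "ends with a space"))
            previous = child
    return problems


SAMPLE = (
    "<title>The Tragedy of <persName>Hamlet</persName>,"
    "<roleName>Prince of <placeName>Denmark</placeName></roleName></title>"
)
print(find_spacing_problems(SAMPLE))
```

Run on the sample above, the sketch reports the <roleName> that directly follows the comma. It knows nothing about the many legitimate exceptions that a production schema has to recognize.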

This schema is fairly complex, because the kinds of problems it tests for can manifest in complex ways. It needs to be able to identify the relevant phrase-level elements and recognize how their contexts in the encoding impact spacing concerns (such as when spaces between elements are not necessary because they are nested inside of other elements—like items in a list or paragraphs—that will default to starting on a new line). It also needs to distinguish nonproblematic cases where spaces are lacking (such as when an abbreviation in the original text is immediately followed by an editorial expansion). It must correctly handle punctuation: for example, commas should have following but not preceding spaces; em dashes should not have preceding or following spaces. And it must also correctly handle the WWP’s encoding for superscripts, i/j and u/v shifts, end-of-line hyphens, possessives, characters marking the anchor points of notes, distinctive initial capitals, and many other phenomena that impact whether or not a space is needed.
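The two punctuation rules stated above (commas take a following but not a preceding space; em dashes take neither) can be illustrated with a small Python sketch that works on plain text after the tags have been set aside. The rule labels and the function name are hypothetical, and the sketch deliberately ignores the other phenomena listed above (superscripts, end-of-line hyphens, note anchors, and so on), so it would over-report on real texts.

```python
import re

# Each rule pairs a hypothetical label with a pattern that matches a violation.
RULES = [
    ("space before comma", re.compile(r"\s,")),
    ("no space after comma", re.compile(r",(?=\S)")),
    ("space before em dash", re.compile(r"\s\u2014")),
    ("space after em dash", re.compile(r"\u2014\s")),
]


def check_punctuation(text):
    """Return the labels of every rule that the text violates."""
    return [label for label, pattern in RULES if pattern.search(text)]


print(check_punctuation("Hamlet,Prince of Denmark"))
print(check_punctuation("To be or not to be\u2014Shakespeare"))
```

Keeping the rules in a table like this makes it easy to add or relax a rule once a class of false positives has been identified.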

This schema was initially developed by the WWP’s senior XML programmer-analyst, Syd Bauman; it was then tested by WWP staff, who worked with Bauman to edit the schema file, and it is now used by student encoders as part of our routine publication procedures. As is often the case, we began our work using reports generated by Bauman, but we have now placed the schema file under version control so that encoders and staff can use it directly to validate the files they are working on.

For proofing routines like this one with numerous variables, it often takes several iterations to identify and address all of the ways that the markup and transcribed texts are interacting, with multiple staff members contributing their detailed understanding of the project’s encoding practices. In this case, the schema as initially developed by Bauman retrieved hundreds of errors, many of them false positives. In order to prepare this schema to work with the full range of texts in our collection, Bauman set up summary reports to identify the most common types of spacing errors, roughly half of which were either false positives or genuine errors that could be addressed globally.⁴ Based on these reports, other staff members were then able to determine the cause of the false positives and work with Bauman to refine the schema’s operation.

One adjustment that we made in the refinement phase of schema development can illustrate the complexity at stake: the WWP has two different ways of encoding certain kinds of punctuation characters such as dashes, quotation marks, and brackets. When these characters are treated as “delimiters” marking out the boundaries of elements, they are recorded using a @rend attribute as part of that element’s “rendition,” or appearance.⁵ In other cases, the characters are simply transcribed as part of the text. The initial version of the schema correctly handled transcribed em dashes but incorrectly flagged missing spaces between elements that had em dashes encoded as rendition. The updated schema can now handle cases such as the following:

<quote>To be or not to be</quote><bibl rend="pre(—)"><author><persName>Shakespeare</persName></author></bibl>

which encodes this text without issue:

To be or not to be—Shakespeare

Having identified and addressed this issue, WWP staff recognized that we needed to review other ways that our encoding with @rend might affect our testing for spacing errors, after which we made several additional revisions to the schema. As this example suggests, creating a schema like this one requires understanding the encoding on two levels. The first is identifying relevant categories of interactions between the encoding and the schema (here, the class of cases in which delimiters encoded with @rend mean that spacing is not required), which might appear with several different elements and be flagged by different kinds of error messages. The second is recognizing the broader implications of particular kinds of errors (here, that this class of errors revealed the need for a more thorough review of our encoding of rendition). Without this review, our systematic identification of errors would have been far less accurate: it would have falsely flagged correct encoding, as above, and it would also have failed to recognize errors. For example, because we regularize space around em dashes to zero across the Women Writers Online (WWO) collection, a space between the <quote> and <bibl> elements above would, in fact, be an error. These details may seem minor, but they scale up very rapidly, and it is crucial that we consistently follow the regularization and other editorial practices outlined in our documentation,⁶ not least so that readers of WWO can rely on the information represented there.
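The interaction between @rend delimiters and spacing checks can be sketched as follows. This Python illustration is an assumption-laden stand-in for the Schematron logic: it reads only the pre(...) form shown in the example above, the function names and messages are hypothetical, and the sample is wrapped in a <p> purely so that it parses as a single document.

```python
import re
import xml.etree.ElementTree as ET

EM_DASH = "\u2014"
PRE_DELIMITER = re.compile(r"pre\(([^)]*)\)")


def pre_delimiter(element):
    """Return the delimiter recorded as rend="pre(...)", or '' if there is none."""
    match = PRE_DELIMITER.search(element.get("rend", ""))
    return match.group(1) if match else ""


def check_adjacent(xml_text):
    """Check spacing between sibling elements, taking rend delimiters into account."""
    root = ET.fromstring(xml_text)
    messages = []
    for parent in root.iter():
        previous = None
        for child in parent:
            if previous is not None:
                between = previous.tail or ""
                dash = pre_delimiter(child) == EM_DASH
                if between == "" and not dash:
                    # Elements touch and no delimiter explains the missing space.
                    messages.append(
                        f"<{child.tag}> follows <{previous.tag}> "
                        "without any intervening space")
                elif dash and between != "" and between.strip() == "":
                    # Space around em dashes is regularized to zero.
                    messages.append(
                        f"<{child.tag}> has space before its em dash delimiter")
            previous = child
    return messages


SAMPLE = (
    "<p><quote>To be or not to be</quote>"
    '<bibl rend="pre(\u2014)"><author><persName>Shakespeare</persName>'
    "</author></bibl></p>"
)
print(check_adjacent(SAMPLE))
```

The same attribute therefore switches the test in both directions: it suppresses the missing-space message when the elements touch, and it raises a different message when a space has been left in.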

Because encoding for WWO demands this very high level of engagement with the markup, in our encoder training we emphasize not just the basics of transcription and markup but also advanced navigation of the digital collection with XPath, regular expressions, and other searching mechanisms. We also work to help encoders develop a strong conceptual understanding of how the project uses validation and transformation routines to manage our collection. This approach means that encoders can use many different routines as part of their individual proofing processes and can also contribute to our ongoing collection improvements and consistency checking. For instance, after a group of encoders learned how to use the phrase-level spacing schema along with several others,⁷ one encoder asked if Schematron could also look for elements that should always have markup specifying whether or not they start on a new line to flag any cases where this markup is missing. It can—and we now have a new schema in our proofing routines that does just that.

This is just one of many cases where encoders have driven improvements in our work processes. For one final example, an encoder led a push among her colleagues to search the published WWO texts using XPath in all cases where they were uncertain which (if any) of our elements for named entities should be used to encode a particular word or phrase, such as in a recent discussion on whether <persName>, <name>, or some other element would best encode “Britannia.” Encoders have long been resourceful in searching WWO for examples when they have questions, but this encoder made consulting the collection using XPath a formal practice that all encoders now follow as a starting point to our group discussions about handling edge cases and reviewing past encoding for consistency. Quality assurance has thus found its way into the encoders’ own direct engagement with the collection. What’s more, the encoder suggested that we investigate creating a transformation that would let us review the elements for which these questions tend to arise (such as <name>, <persName>, <placeName>, and <title>) and then check their contents to make sure that challenging words and phrases (such as “Britannia,” “Rajah,” “Paradise,” or “Chiron”) are encoded using the same elements. We have now begun development of an XSLT transformation that can locate the same words or phrases (accommodating minor spelling and orthographic variations) and display the elements in which they are encoded so that we can review and address any inconsistencies. As with the phrase-level spacing schema, the development of this test will involve intensive collaboration between encoders and technical staff.
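The consistency review described above can be sketched in Python rather than XSLT. This is a rough illustration under stated assumptions, not the transformation under development: the function names are hypothetical, normalization is reduced to case and whitespace folding as a stand-in for real handling of spelling variation, and the sample uses no namespace (namespaced files would need qualified tag names).

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# The elements named in the text as ones for which questions tend to arise.
NAME_ELEMENTS = {"name", "persName", "placeName", "title"}


def normalize(text):
    """Fold case and whitespace; a stand-in for real orthographic matching."""
    return " ".join(text.split()).lower()


def inconsistent_encodings(xml_text):
    """Map each phrase tagged with more than one element to those elements."""
    root = ET.fromstring(xml_text)
    tags_by_phrase = defaultdict(set)
    for element in root.iter():
        if element.tag in NAME_ELEMENTS:
            phrase = normalize("".join(element.itertext()))
            if phrase:
                tags_by_phrase[phrase].add(element.tag)
    return {
        phrase: sorted(tags)
        for phrase, tags in tags_by_phrase.items()
        if len(tags) > 1
    }


SAMPLE = (
    "<text><p><persName>Britannia</persName> rules, and "
    "<name>Britannia</name> sees <placeName>Paradise</placeName>.</p>"
    "<p><placeName>Paradise</placeName> again.</p></text>"
)
print(inconsistent_encodings(SAMPLE))
```

The output lists only phrases that appear under more than one element, which is the set that would go to encoders for review; deciding which encoding is right remains a matter of textual judgment.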

Conclusion

As these examples illustrate, pedagogy plays an important role in the success of these strategies, not only in supporting encoders’ use of complex tools, but (much more importantly) in acculturating them early on as members of a team in which content expertise and technical expertise are very closely intertwined, and for whom no part of the project’s work is invisible or off limits. It makes a significant difference that the staff responsible for leading tool development also work closely with the encoding staff and are involved in that pedagogy. Understanding the tools themselves and their development as part of the learning process helps ensure that the student encoders are not positioned as perpetual novices who need protection from technology, but as curious learners and contributors with reciprocal knowledge to offer.

This is an area where good practice is easy to see and hard to follow, as the WWP’s own experience has shown. For instance, the WWP successfully used emacs (a powerful command-line text editor) as its encoding environment for many years. But although emacs is an ideal environment for creating tools for error discovery and correction (and we developed several), we did not successfully ground them in the encoders’ own core expertise. As a result, although these tools filled a real need, once the few encoders who first mastered them had moved on, the tools fell into disuse.

An important side effect of this pedagogical reciprocity therefore is to increase the transparency (and hence the cultural sustainability) of the tool within the working environment: because its users fully understand how the tool operates, it is less susceptible to accidental misuse, or to disuse through loss of a single expert individual. The challenge of maintaining the integrity of this material (from a data perspective and also from the perspective of scholarly quality) is considerable, but is also closely connected with the pedagogical and collaborative ethos of the project—we want our student colleagues to be strongly empowered and broadly knowledgeable rather than to have them work with black-box tools fixing isolated problems. Tools, training, documentation, and communication all play important and interconnected roles here.

Sarah Connell

Sarah Connell is the assistant director of the Women Writers Project and of the NULab for Texts, Maps, and Networks at Northeastern University. Her recent publications include two chapters on text encoding and transformation, co-authored with Julia Flanders and Syd Bauman for a textbook on the digital humanities; an article, “Meta(data)morphosis,” co-authored with Ashley Clark; and “The Poetics and Politics of Legend: Geoffrey Keating’s Foras Feasa ar Éirinn and the Invention of Irish History,” published in the Journal for Early Modern Cultural Studies. Her current research activities include a text encoding and analysis project, Making Room in History, which examines...

Julia Flanders

Julia Flanders is professor of the practice in the Northeastern University English Department and the Director of the Digital Scholarship Group, where she also directs the Women Writers Project and the TEI Archiving, Publishing, and Access Service. She also serves as editor-in-chief of Digital Humanities Quarterly, and has served as president of the Association for Computers and the Humanities and chair of the TEI Consortium. With Fotis Jannidis, she is the co-editor of a forthcoming book titled "The Shape of Data," and with Neil Fraistat she is the co-editor of the Cambridge Companion to Textual Scholarship. Her current research focuses...