Introduction
Over the past thirty years, the Women Writers Project (WWP) has developed a digital collection of full-text transcriptions of texts by pre-Victorian women writers, following the standards set by the Text Encoding Initiative (TEI) Guidelines. The TEI Guidelines (http://www.tei-c.org/) specify an XML language for creating high-quality digital representations of research materials in the humanities, social sciences, and linguistics, and have served since 1994 as a major standard for digital humanities practice. This collection is distinctively poised between digital genres: with over 400 texts (approximately 11 million words), it is sizeable and well beyond the scope of a typical scholarly edition, but the level of detail and human attention represented in the encoding distinguishes it from typical large-scale text digitization efforts. The use of TEI and XML transforms the basic textual content into something much more powerful and complex by adding information about the texts’ structures, content, genres, and interconnections. Given such size and complexity, the management of error and inconsistency while encoding these texts in XML is a crucial task. Applying the TEI Guidelines to a wide range of genres over a long timespan invites variances in practice, and the processes of hand transcription and encoding inevitably produce some level of error even in the hands of the most highly trained staff. While consumers of large-scale digital collections have become more acclimated to error as the cost of scale (Google Books and other optical character recognition–based collections being significant examples), the WWP’s audience and usage model rely on a high level of accuracy and consistency, despite the size of the collection.
Finding and fixing errors in such a collection requires automation, and there are a variety of XML technologies (such as XQuery, XPath, and XSLT) that offer powerful tools for discovering patterns of variation and for making global changes to complex encoded data. However, there are several reasons for not treating error correction as simply a technical challenge, an automated cleanup stage in a workflow that passes the text from (humanist) transcribers to (technologist) data-cleanup experts. First, in practical terms the complexity of decision-making at the level of transcription and encoding means that although patterns of inconsistency are easy to detect with automated tools, they need to be interpreted by those with textual expertise before they can be read as “error.” Quite often, odd edge cases and exceptions turn out to reveal a fracture in the encoding policy itself and can motivate a reassessment of specific practices: for example, an investigation of inconsistencies in the handling of stage directions led to the development of a more nuanced markup for that feature.
In this case, our documentation at the time instructed encoders to select a value of “mixed” for the @type attribute on the <stage> element in cases where more than one value might apply (as in “Enter Sophonsiba, speaking loudly as follows,” which contains both an entrance and a description of how a character’s lines are delivered). However, encoders found the “mixed” value inadequate for representing the specificity of their documents, and we discovered that some had instead simply chosen whatever more specific value seemed primary (thereby losing the idea of plurality that “mixed” was supposed to convey). Recognizing this, we decided to revise our encoding to allow multiple values on @type and reviewed our stage directions to update them according to this new provision; the review process also led us to add three new values for @type. For our current handling of stage directions, see http://wwp.neu.edu/research/publications/documentation/internal/#!/entry/stage_element.
But even more significantly, in order to build and maintain a culture of empowered collaboration within a project of this kind, it is important to situate as many tools as possible within the process of transcription and encoding itself. Doing so gives students the opportunity to become familiar with new technologies (rather than isolating those technologies as distantly and intimidatingly “expert” systems in another part of the forest), and also positions their encoding work as an engagement with the entire existing collection that both intervenes in and is constrained by the project’s established practice. Instead of treating error correction as an assessment stage at the end of a private process, we seek to treat it as a tool for improving the encoding process itself and rendering it more dialogic.
These are useful principles, but what does it look like in practice to establish a complex error-correction tool that is deployed as part of the process of encoding texts in XML? In the brief case studies below, we examine some specific examples to explore what is involved in adapting a new tool to both the specificity of the data and the specificity of the working environment.
Quality assurance in practice
The collaborative development process the WWP has adopted for our automated error-testing routines enables us to bring in both specialized technical knowledge and a deep understanding of the WWP’s encoding. A typical example is the ISO Schematron schema (ISO Schematron is a rule-based XML schema language for validating XML documents) that we use to look for spacing errors around phrase-level elements (elements for tagging phrase-level phenomena such as names, dates, and technical terms). This schema runs a series of tests to check for errors such as the following:
<title>The Tragedy of <persName>Hamlet</persName>,<roleName>Prince of
<placeName>Denmark</placeName></roleName></title>
in which a missing space (following the comma) is difficult for human proofreaders to see because of the start and end tags in the encoding. This schema can identify a range of spacing-related errors and flag them with messages such as “this <placeName> is followed immediately by a <persName> without any intervening space,” “this <said> has no space immediately before,” or “this <persName> ends with a space.” That is, the schema identifies spacing errors in both textual content and markup before and after phrase-level elements, as well as within the elements themselves.
This schema is fairly complex, because the kinds of problems it tests for can manifest in complex ways. It needs to be able to identify the relevant phrase-level elements and recognize how their contexts in the encoding impact spacing concerns (such as when spaces between elements are not necessary because they are nested inside of other elements—like items in a list or paragraphs—that will default to starting on a new line). It also needs to distinguish nonproblematic cases where spaces are lacking (such as when an abbreviation in the original text is immediately followed by an editorial expansion). It must correctly handle punctuation: for example, commas should have following but not preceding spaces; em dashes should not have preceding or following spaces. And it must also correctly handle the WWP’s encoding for superscripts, i/j and u/v shifts, end-of-line hyphens, possessives, characters marking the anchor points of notes, distinctive initial capitals, and many other phenomena that impact whether or not a space is needed.
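To make this concrete, a minimal ISO Schematron sketch of two such tests might look like the following (this is an illustration rather than the WWP’s actual schema, which covers many more elements and contexts):

<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <pattern>
    <rule context="tei:persName | tei:placeName | tei:roleName">
      <!-- if the very next node is another name element, no text node
           (and thus no space) separates the two -->
      <report test="following-sibling::node()[1]/self::tei:persName">this
        <name/> is followed immediately by a persName without any
        intervening space</report>
      <!-- the element's own content ends in whitespace -->
      <report test="matches(string(.), '\s$')">this <name/> ends
        with a space</report>
    </rule>
  </pattern>
</schema>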
This schema was initially developed by the WWP’s senior XML programmer-analyst, Syd Bauman; it was then tested by WWP staff, who worked with Bauman to edit the schema file, and it is now used by student encoders as part of our routine publication procedures. As is often the case, we began our work using reports generated by Bauman, but we have now placed the schema file under version control so that encoders and staff can use it to validate the files they are working on directly.
For proofing routines like this one with numerous variables, it often takes several iterations to identify and address all of the ways that the markup and transcribed texts are interacting, with multiple staff members contributing their detailed understanding of the project’s encoding practices. In this case, the schema as initially developed by Bauman retrieved hundreds of errors, many of them false positives. In order to prepare this schema to work with the full range of texts in our collection, Bauman set up summary reports to identify the most common types of spacing errors, roughly half of which were either false positives or genuine errors that could be addressed globally. (For example, we resolved all of the cases in which spaces were missing between <placeName> and <persName> elements by simply searching for a <placeName> end tag followed immediately by a <persName> start tag and then adding a space.) Based on these reports, other staff members were then able to determine the cause of the false positives and work with Bauman to refine the schema’s operation.
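The fix just described was a literal search-and-replace over the tags; an equivalent, more robust approach for context-sensitive corrections is an XSLT identity transform. As a sketch (assuming XSLT 2.0 and the TEI namespace; this is illustrative, not the script we actually ran), inserting the missing space might look like:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0" version="2.0">
  <!-- identity template: copy everything through unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy><xsl:apply-templates select="@* | node()"/></xsl:copy>
  </xsl:template>
  <!-- where a persName immediately follows a placeName, emit a
       single space before copying the persName -->
  <xsl:template match="tei:persName[preceding-sibling::node()[1]/self::tei:placeName]">
    <xsl:text> </xsl:text>
    <xsl:copy><xsl:apply-templates select="@* | node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>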
One adjustment that we made in the refinement phase of schema development can illustrate the complexity at stake: the WWP has two different ways of encoding certain kinds of punctuation characters such as dashes, quotation marks, and brackets. When these characters are treated as “delimiters” marking out the boundaries of elements, they are recorded using a @rend attribute as part of that element’s “rendition,” or appearance. (The WWP records many aspects of our texts’ rendition, including italicization, casing, indentation, alignment, underlining, and so on.) In other cases, the characters are simply transcribed as part of the text. The initial version of the schema correctly handled transcribed em dashes but incorrectly flagged missing spaces between elements that had em dashes encoded as rendition. The updated schema can now handle cases such as the following:
<quote>To be or not to be</quote><bibl rend="pre(—)"><author><persName>Shakespeare</persName></author></bibl>
which encodes this text without issue:
To be or not to be—Shakespeare
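In Schematron terms, the refinement amounts to exempting elements whose @rend records a leading delimiter. A simplified sketch of the revised test (assuming the pre( ) rendition keyword shown above) might read:

<rule context="tei:quote">
  <!-- flag a bibl that follows with no intervening space, unless its
       @rend supplies a leading delimiter such as an em dash -->
  <report test="following-sibling::node()[1]/self::tei:bibl[not(matches(@rend, 'pre\('))]">this
    quote is followed immediately by a bibl without any intervening
    space</report>
</rule>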
Having identified and addressed this issue, WWP staff recognized that we needed to review other ways that our encoding with @rend might affect our testing for spacing errors, after which we made several additional revisions to the schema. As this example suggests, creating a schema like this one requires understanding the encoding on a level that includes not only identifying relevant categories of interactions between the encoding and the schema (here, the class of cases in which delimiters encoded with @rend mean that spacing is not required), which might appear with several different elements and be flagged by different kinds of error messages, but also recognizing the broader implications of particular kinds of errors (here, that this class of errors revealed the need for a more thorough review of our encoding of rendition). Without this review, our systematic identification of errors would have been far less accurate, falsely flagging correct encoding, as above, but also failing to recognize errors—for example, because we regularize space around em dashes to zero across the Women Writers Online (WWO) collection, a space between the <quote> and <bibl> elements above would, in fact, be an error. These details may seem minor, but they scale up very rapidly, and it is crucial that we consistently follow the regularization and other editorial practices outlined in our documentation (see http://wwp.neu.edu/research/publications/documentation/internal/#!/entry/regularization_narrative), not least so that readers of WWO can rely on the information represented there.
Because encoding for WWO demands this very high level of engagement with the markup, in our encoder training we emphasize not just the basics of transcription and markup but also advanced navigation of the digital collection with XPath, regular expressions, and other searching mechanisms. We also work to help encoders develop a strong conceptual understanding of how the project uses validation and transformation routines to manage our collection. This approach means that encoders can use many different routines as part of their individual proofing processes and can also contribute to our ongoing collection improvements and consistency checking. For instance, after a group of encoders learned how to use the phrase-level spacing schema along with several others (including one that checks whether linked instances of quotation and direct speech are correctly pointing to each other, and another that looks for “erroneously empty elements,” or cases where elements that should always have content are empty), one encoder asked if Schematron could also look for elements that should always have markup specifying whether or not they start on a new line, flagging any cases where this markup is missing. It can—and we now have a new schema in our proofing routines that does just that.
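Such a rule is easy to sketch. Assuming, hypothetically, that new-line behavior is recorded as a break(yes) or break(no) keyword in @rend and that <lg> and <head> are among the elements that require it, the check might read:

<rule context="tei:lg | tei:head">
  <!-- hypothetical rule: these elements must say whether they
       begin on a new line -->
  <assert test="matches(@rend, 'break\((yes|no)\)')">this <name/>
    is missing break information in @rend</assert>
</rule>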
This is just one of many cases where encoders have driven improvements in our work processes. For one final example, an encoder led a push among her colleagues to search the published WWO texts using XPath in all cases where they were uncertain which (if any) of our elements for named entities should be used to encode a particular word or phrase, such as in a recent discussion on whether <persName>, <name>, or some other element would best encode “Britannia.” Encoders have long been resourceful in searching WWO for examples when they have questions, but this encoder made consulting the collection using XPath a formal practice that all encoders now follow as a starting point to our group discussions about handling edge cases and reviewing past encoding for consistency. Quality assurance has thus found its way into the encoders’ own direct engagement with the collection. What’s more, the encoder suggested that we investigate creating a transformation that would let us review the elements for which these questions tend to arise (such as <name>, <persName>, <placeName>, and <title>) and then check their contents to make sure that challenging words and phrases (such as “Britannia,” “Rajah,” “Paradise,” or “Chiron”) are encoded using the same elements. We have now begun development of an XSLT transformation that can locate the same words or phrases (accommodating minor spelling and orthographic variations) and display the elements in which they are encoded so that we can review and address any inconsistencies. As with the phrase-level spacing schema, the development of this test will involve intensive collaboration between encoders and technical staff.
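To give a flavor of this kind of consultation, an encoder checking how “Britannia” has been handled elsewhere in the published collection might run an XPath 2.0 expression along these lines (a sketch; the element list and the spelling pattern would be adjusted to the question at hand, and the planned XSLT transformation generalizes this kind of query):

//(tei:name | tei:persName | tei:placeName | tei:title)
    [matches(normalize-space(.), '^Britann?ia$', 'i')]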
Conclusion
As these examples illustrate, pedagogy plays an important role in the success of these strategies, not only in supporting encoders’ use of complex tools, but (much more importantly) in acculturating them early on as members of a team in which content expertise and technical expertise are very closely intertwined, and for whom no part of the project’s work is invisible or off limits. It makes a significant difference that the staff responsible for leading tool development also work closely with the encoding staff and are involved in that pedagogy. Understanding the tools themselves and their development as part of the learning process helps ensure that the student encoders are not positioned as perpetual novices who need protection from technology, but as curious learners and contributors with reciprocal knowledge to offer.
This is an area where good practice is easy to see and hard to follow, as the WWP’s own experience has shown. For instance, the WWP successfully used emacs (a powerful, extensible text editor) as its encoding environment for many years. But although emacs is an ideal environment for creating tools for error discovery and correction (and we developed several), we did not successfully ground them in the encoders’ own core expertise. As a result, although these tools filled a real need, once the few encoders who first mastered them had moved on, the tools fell into disuse.
An important side effect of this pedagogical reciprocity is therefore to increase the transparency (and hence the cultural sustainability) of the tool within the working environment: because its users fully understand how the tool operates, it is less susceptible to accidental misuse, or to disuse through the loss of a single expert individual. The challenge of maintaining the integrity of this material (from a data perspective and also from the perspective of scholarly quality) is considerable, but it is also closely connected with the pedagogical and collaborative ethos of the project—we want our student colleagues to be strongly empowered and broadly knowledgeable rather than to have them work with black-box tools fixing isolated problems. Tools, training, documentation, and communication all play important and interconnected roles here.