If curating research data is a good idea, then curating the software that researchers use to process it is a good idea too. After all, it’s typically not the data files you look at when you read a paper, but data visualizations and descriptions of the results of analysis. All of this graphing and analysis is done through software, sometimes quite complex or extensive software. At minimum, the preserved code documents the long chain of implications that leads to a conclusion. But you might also want to run the software again for a variety of good reasons, including replicating the results.

Much has been said on the value of replication for the progress of science, especially over the last few years. It may not have surprised skeptics to learn that only 39 of 100 psychology experiments could be replicated,1Monya Baker, “Over Half of Psychology Studies Fail Reproducibility Test.” Nature, August 27, 2015. because there is huge scope for subtle differences in experimental setup. It should be much more worrisome that only half of 67 economics papers could be replicated starting with the authors’ original data and code.2Andrew C. Chang and Phillip Li, “Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say ‘Usually Not,’” Finance and Economics Discussion Series 2015-083 (Washington: Board of Governors of the Federal Reserve System, 2015), http://dx.doi.org/10.17016/FEDS.2015.083. Clearly, even “pure” statistical analysis—there was no new data collection here—is a great deal more fragile than we had hoped.

“Reproducible” could mean many things. Victoria Stodden has proposed distinguishing between empirical, computational, and statistical reproducibility.3Victoria Stodden, “2014 : What Scientific Idea Is Ready for Retirement?” https://www.edge.org/response-detail/25340 Here I will focus on computational reproducibility, meaning the ability to rerun the calculations described in a paper, using the original code and data, and get the same answers as the original investigators. It might seem that this is trivial—after all, computers are deterministic machines. However, recent attempts demonstrate that duplicating a computation is often quite difficult and likely to fail.

I offer an explanation of this finding, and possible solutions, by appealing to software engineering. Because that is what a modern statistical analysis is: the application of (often custom) software to (often original) data to produce an interpretable result. And software engineers know just how fragile this process is, to the point where “But it works on my machine!” is an inside joke. Software inevitably breaks when you try to run it in a new context.

But professionals do have to deliver software that works correctly on their customers’ computers—for example, your laptop or phone—so they’ve painstakingly developed tools for configuration management, revision control, testing, and deployment. These tools could be applied to the problem of research reproducibility.

In short, we should consider a data analysis to be a living piece of software rather than a set of files. Just as it is possible for anyone to read a journal paper years after publication, it should be possible for anyone to rerun the statistical analysis that led to that paper. Rerun it, inspect the workings, repurpose it—a statistical analysis is better understood as a machine than a document.

 

But it is not an easy thing to preserve software. There are both shallow and deep problems here. The shallow problems start with writing your analysis as code. Typing into a spreadsheet or entering commands interactively is not code; what you want is a single programmatic script that performs the entire computation, from the rawest of raw data all the way to the page proofs, in one shot. If you, the author, cannot reproduce your work, no one else ever will. This method of working requires some discipline and adjustment, but no magic, and there are now good guides that explain how to begin working this way in gradual steps.4E.g., Karl Broman, “Initial Steps toward Reproducible Research,” http://kbroman.org/steps2rr/; ROpenSci, “Reproducibility in Science,” http://ropensci.github.io/reproducibility-guide/. These guides owe a great deal to software practice, borrowing concepts and tools such as modern version control.
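To make this concrete, here is a minimal sketch of such a driver script in Python. Every name in it (the data file, the stage functions, the toy summary statistics) is a hypothetical placeholder rather than anything prescribed above; the point is simply that a single command takes the rawest data to the finished output.

```python
"""Sketch of a one-shot driver script: raw data in, finished results out.
All names here (raw_measurements.csv, the stage functions) are hypothetical."""
import csv
import statistics
from pathlib import Path

RAW = Path("data/raw_measurements.csv")
OUT = Path("results")


def load_raw(path):
    """Read the raw data file and return the values to be analyzed."""
    with path.open() as f:
        return [float(row["value"]) for row in csv.DictReader(f)]


def analyze(values):
    """Compute the summary statistics reported in the paper."""
    return {"n": len(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values)}


def report(summary, out_dir):
    """Write the results that the manuscript draws on."""
    out_dir.mkdir(exist_ok=True)
    lines = [f"{key}: {value}" for key, value in summary.items()]
    (out_dir / "summary.txt").write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    # One command reproduces everything downstream of the raw data.
    report(analyze(load_raw(RAW)), OUT)
```

Run from the command line, it rebuilds every derived number from scratch, which is exactly the property that lets someone else, or your future self, reproduce the work.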

The next step is bundling up the components of a computation and archiving them. This includes all the code, data, configuration, documentation, etc. The minimal approach is just to append a zip file to the paper, but you can make things a lot more accessible than that. Something like an archival version of GitHub would fill a gap here. Beyond the crucial task of tracking file versions, GitHub also has extremely important affordances like online file browsing, search, bug tracking, and collaborative development tools.
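As a rough sketch of that bundling step, assuming a small project whose components are the made-up file names below, the Python standard library is enough to produce a single archive plus a checksum manifest that future readers can use to verify the bundle is intact.

```python
"""Sketch: bundle code, data, and documentation into a single archive,
with a SHA-256 manifest for later integrity checks. File names are
illustrative placeholders."""
import hashlib
import json
import zipfile
from pathlib import Path

COMPONENTS = ["run_analysis.py", "data/raw_measurements.csv",
              "README.md", "requirements.txt"]

manifest = {}
with zipfile.ZipFile("paper-archive.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in COMPONENTS:
        zf.write(name)
        # Record a digest so future readers can check each file is intact.
        manifest[name] = hashlib.sha256(Path(name).read_bytes()).hexdigest()
    # Store the manifest inside the archive itself.
    zf.writestr("MANIFEST.json", json.dumps(manifest, indent=2))
```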

But if you’ve ever tried to dig up and run a piece of software that’s more than a few years old, you know that there’s a much harder problem: all code runs on a deep “stack” of software, hardware, and network resources. Support libraries, operating systems, and downloadable resources all change, and even our hardware eventually shifts. To run an old program, you have to reconstruct the entire context of the computation.

This is the hardest instance of the general problem of preservation of digital artifacts. The Internet Archive—by far the largest archive in the world—chose to standardize on HTML and a few digital media formats for their collection. But software is not a document. Again, software engineering practice suggests solutions. Modern software depends on ever more sophisticated “deployment” processes, which smooth the transition from the developer’s work environment to the production server or end-user device. This often involves virtualized or containerized environments, such as Python’s virtualenv or the popular Docker container system. Such systems seek to package up as much of the software execution context as possible, restricting interactions with the underlying computer to a well-defined, as-minimal-as-possible API.

Any computing system that implements this API will then support any software that has been containerized in this way. There is no way to avoid maintaining some software environment if you wish to keep old code running, but containerization standards can make this environment as minimal and universal as possible. The entire execution context—the operating system, libraries, configuration settings, and other resources required to run the actual computation—is packaged inside the container as part of the (shallow) task of packaging up the software for submission and archiving.
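A full container is beyond a short example, but the spirit of the lighter-weight virtualenv approach mentioned above can be sketched: create an isolated environment, install the declared dependencies into it, and record the exact versions that were resolved so the environment can be rebuilt later. The directory layout below assumes a POSIX-style system, and the file names are illustrative.

```python
"""Sketch: isolate the analysis environment with venv and pin its contents.
Assumes Python 3 on a POSIX-like system; file names are illustrative."""
import subprocess
import sys
from pathlib import Path

ENV_DIR = Path("analysis-env")

# Create a clean, isolated virtual environment for the analysis.
subprocess.run([sys.executable, "-m", "venv", str(ENV_DIR)], check=True)

# pip installed inside the new environment (POSIX directory layout).
pip = ENV_DIR / "bin" / "pip"

# Install the analysis's declared dependencies into the clean environment.
subprocess.run([str(pip), "install", "-r", "requirements.txt"], check=True)

# Record the exact resolved versions so the environment can be rebuilt later.
frozen = subprocess.run([str(pip), "freeze"], check=True,
                        capture_output=True, text=True)
Path("requirements.lock").write_text(frozen.stdout)
```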

Not that any truly archival software container standard exists yet. Though virtualization and emulation schemes date back to the dawn of computing, modern containerization is a relatively young technology; Docker has become prominent only in the past two or three years. And Docker containers were never designed to be run fifty years from now. In fact, I doubt that containers produced today will run seamlessly in ten years. But there seems to be, in principle, nothing that would prevent an archival container design.

A final step, widely used in professional software development, would be built-in testing. Modern software is usually too complex to reliably predict the full effects of any change to the code or execution context, however small. To combat this fragility, software engineering best practice includes substantial test suites, code whose only purpose is to exercise the software to prove that all functions are operating as intended. It would be very valuable for future scholars to be able to know that the crusty old code attached to a classic paper really is running as intended, and to know that modifications or adaptations have not broken previously working analyses. Software is a living thing, never really “done” because it is always open to the possibility of being run again or usefully modified, and comprehensive test suites make for long-lived software.
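As an illustration, a test for the hypothetical analyze() function from the earlier driver-script sketch could look like the following, using only the standard-library unittest module; a real analysis would carry many such tests, exercising every stage of the pipeline.

```python
"""Sketch: a test that exercises the analysis code on a known input.
analyze() is the hypothetical function from the driver-script sketch."""
import statistics
import unittest


def analyze(values):
    """Same toy summary statistics as in the driver-script sketch."""
    return {"n": len(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values)}


class TestAnalyze(unittest.TestCase):
    def test_known_input_gives_known_summary(self):
        # For [1, 2, 3]: mean is 2 and the sample standard deviation is 1.
        summary = analyze([1.0, 2.0, 3.0])
        self.assertEqual(summary["n"], 3)
        self.assertAlmostEqual(summary["mean"], 2.0)
        self.assertAlmostEqual(summary["sd"], 1.0)


if __name__ == "__main__":
    unittest.main()
```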

I can’t help feeling that it is still very early days for the era of reproducible research. We are just starting to distinguish between different important meanings of this term, and we’ve only just learned that computational reproducibility is much harder than it appears. All applied statistical analysis involves a healthy dose of statistical computation, performed with a bespoke software machine. If we want to preserve the work and make it replicable, we need to keep those machines in working order. Software engineering practice offers solutions to the very important problem of running your code more than once.