Living in the San Francisco Bay Area, one quickly develops an allergy to any claim of a “revolution” in a particular field. But it is now abundantly clear to librarians, archivists, computer scientists, and many social scientists that we are in a transformational age. Terabytes of textual and video data are being created or scanned into existence every day. While these data include silly tweets, they also include the archives of national libraries, news accounts of activities around the world, journal articles, online conversations, vital email correspondence, surveillance of crowds, videos of police encounters, and much more. If we can understand and measure meaning from all of these data describing so much of human activity, we will finally be able to test and revise our most intricate theories of how the world is socially constructed through our symbolic interactions.

But that’s a big “if.” Natural language and video data, compared to other data computer scientists have been pushing around for decades, are incredibly difficult to work with. Computers were initially built for data that can be precisely manipulated as unambiguous electrical signals flowing through unambiguous logic gates. The meaning of the information encoded in our human languages, gestures, and embodied activities, however, is incredibly ambiguous and often opaque to a computer. We can program the computer to recognize certain “strings” of letters, and then to perform operations on them (much like the operator of Searle’s Chinese room), but no one yet has programmed a computer to experience our human languages as we do. That doesn’t mean we don’t try. There are three basic approaches to helping computers understand human symbolic interaction, and language, in particular:

  1. We can write rules telling them how to treat all the different multi-character strings (i.e., words) out there.
  2. We can hope that general artificial intelligence will just “figure it out.”
  3. We can show computers how we humans process language, and train them through an iterative process, to read and understand more like we do.

The first two approaches are doomed, and I’ll say more about why. The third approach provides a way forward, but it won’t be easy. It will require that researchers like us recruit hundreds or thousands of people (i.e., crowds) into our processes. So, unpacking this post’s title: our ability to make sense of and systematically analyze the dense, complex, manifold meaning inhering in now ubiquitous and massive textual and video data will depend on our ability to enlist the help of many other humans who already know how to understand language, situations, emotion, sarcasm, metaphor, the pacing of events, and all the other aspects of being an agentic organism in a socially constructed world—all the stuff of social life that computers just won’t ever understand without our help.

Not enough rules

The great (and horrible) thing about computers is that—as long as you use the magic words of their “artificial languages”—they will do exactly what you tell them to do. For many, this fact leads to the quick conclusion that we can just write rules telling computers how to process all of our more ambiguous “natural languages.” Feed it a dictionary. Feed it a thesaurus. Tell it how grammar works. Then, they imagine, the computer will be able to speak and write as we do … Would that it were so easy.

Unfortunately, the natural languages we use to communicate every day are so much more ambiguous than the artificial languages computers read that it is only a modest exaggeration to suggest that writing rules allowing a computer to pass a Turing test (i.e., to so aptly converse with a human that it could fool that human into believing it too was human) would require us to write almost as many rules as there are natural language sentences. Consider, for example, the seemingly easy challenge of parsing an address field from a thousand survey forms. The first several characters before a space are the street number, right? And then the characters after the space are the street name, no? Well… sadly, the natural world is not so well organized, even for highly structured data like addresses. Sometimes addresses start with a building name, not the street address. Sometimes, too, contrary to what we might think, addresses include two separate numeric strings, or even alphabetical characters in the street number string. In fact, there are over 40 exception rules necessary to reliably parse something as simple as the address field of a standard form.
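
To make that fragility concrete, here is a minimal sketch (the rule and the example addresses are my own invention, not drawn from any survey project) of the naive “digits, then space, then street name” rule and the kinds of addresses that immediately break it:

```python
import re

# Naive rule: "the digits before the first space are the street number,
# everything after is the street name."
NAIVE_ADDRESS = re.compile(r"^(?P<number>\d+)\s+(?P<street>.+)$")

addresses = [
    "1600 Pennsylvania Avenue",      # fits the rule
    "Sutter Annex, 450 Sutter St",   # building name comes first
    "2-14 Mercer Street",            # two numeric strings
    "221B Baker Street",             # letter inside the street number
    "One Embarcadero Center",        # spelled-out number
]

for addr in addresses:
    match = NAIVE_ADDRESS.match(addr)
    if match:
        print(f"{addr!r:35} -> number={match['number']!r}, street={match['street']!r}")
    else:
        print(f"{addr!r:35} -> rule fails entirely")
```

One rule handles the first address; every other address needs its own exception, and the exceptions multiply from there.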

Indeed, the computer’s stupid-perfect following of instructions has inspired a genre of blog posts entitled “Falsehoods Programmers Believe About ______.” A Google search of this phrase should provide readers with ample humility about the plausibility of writing rules to teach computers natural language. If relatively simple tasks like parsing addresses, times, names, and geographic locations from structured forms generate so much frustration, imagine the difficulties inherent in parsing sentences like: “She saw him on the mountain with binoculars.” Did he have the binoculars? Was she on the mountain? Perhaps a sentence three paragraphs earlier explained that she was carrying the binoculars while walking along the beach. But when should the computer compare information across such distant sentences?
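
As a hedged illustration (assuming spaCy and its small English model, en_core_web_sm, are installed), a modern dependency parser simply commits to one reading of that sentence, right or wrong:

```python
import spacy

# A minimal sketch: the parser must commit to exactly one attachment for each
# prepositional phrase, even though the sentence supports several readings.
nlp = spacy.load("en_core_web_sm")
doc = nlp("She saw him on the mountain with binoculars.")

for token in doc:
    if token.dep_ == "prep":
        phrase = " ".join(t.text for t in token.subtree)
        print(f"'{phrase}' attaches to '{token.head.text}'")
```

Whichever attachment the model picks, it picks it without any access to the beach scene three paragraphs earlier.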

By the time even the most patient rule-writer has directed a computer to read just one newspaper, accounting for all the “what they really meant to say” situations, the monumental effort will have produced countless contradictory rules along with many that are torturously complex. Moreover, they’re likely to be poorly designed for the next newspaper, let alone War and Peace, a Twitter feed, or transcripts of local radio news.

Cognitive linguists would argue that the problem with the rule-writing approach is its distance from humans’ actual processing of language. The goal should not be to train the computer to behave like the operator of Searle’s Chinese room, but to train it to understand Chinese (or any natural language) like a fluent speaker. If our ultimate goal is to build computer programs to process terabytes of textual data as humans do, shouldn’t we be attempting to train computers to read them (and even their ambiguities) as we do?

Go is easy

People have become very excited lately by the development of “deep learning” artificial intelligence technology. Heralded for its ability to defeat humans in complex games like Chess and Go, the technology is also spookily appealing in its mimicry of the actual human brain. It does not include ancient structures like the hippocampus, nor is it directly connected to a breathing, walking, eating mammal. But it does use simulated neurons and neural connections to learn much like we humans do. Our brains often (though not always) learn through a process of neural network potentiation, roughly analogous to the back-propagation of feedback used in artificial networks. To sketch that out very simply: some network of neurons fires together in our brains whenever we think a particular thought, imagine a specific memory, or perform a singular task. If that firing does something sensible or useful for us, a chemical propagates back through all the neurons of the network to encourage those neurons to fire together in the future. To learn how to add numbers through this mechanism, for example, is to increase the (chemical) potential that a network of neurons performing the addition function will fire whenever we see two numbers with a “+” sign between them. The computer brain behind “deep learning” behaves similarly. As it gets positive or negative feedback about its performance on some task, it increases or decreases the probability that it will perform similarly the next time it faces a similar task. (More on this below.)
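
Here is a deliberately tiny sketch of that feedback loop, with a single simulated neuron and invented numbers rather than anything like a real brain or a real deep network:

```python
# A single simulated neuron: its "potential" to fire is a weighted sum of its inputs.
# Positive feedback nudges the active connections up; negative feedback nudges them down.

def fires(weights, inputs, threshold=0.5):
    return sum(w * x for w, x in zip(weights, inputs)) > threshold

# Tiny training set: fire (1) only when both inputs are present.
examples = [((1, 1), 1), ((1, 0), 0), ((0, 1), 0), ((0, 0), 0)]

weights = [0.0, 0.0]
learning_rate = 0.1

for epoch in range(20):
    for inputs, wanted in examples:
        feedback = wanted - int(fires(weights, inputs))  # +1 reward, -1 punishment, 0 if already right
        weights = [w + learning_rate * feedback * x for w, x in zip(weights, inputs)]

print(weights)                    # settles around [0.3, 0.3]
print(fires(weights, (1, 1)))     # True: the network now "fires together" for this pattern
print(fires(weights, (1, 0)))     # False
```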

People have become so excited about “deep learning” technology and its potential for parsing language data because it recently did something that seems very hard indeed: it beat the World Champion of Go, the most complex strategic game invented by humans. If a computer can beat one of our smartest humans at a very complex game, the reasoning goes, surely a computer can read the New York Times and give us a juicy hot take on the latest scandal. Sadly, no.

The success of “deep learning” depends crucially on domain constraints that do not resemble those of our wide-open social world. In the simple world of Go, there is a clear winner and loser. The players can make only one of a limited set of legal moves per turn. And the space of possible actions (while more complex and dynamic than in Chess or other games) is orders of magnitude smaller than in the vast social world. To understand why this matters, it’s helpful to first have an (at least hand-wavy) understanding of how AlphaGo, the winning computer, learned to play the game.

As explained above, “deep learning” does its learning through simulated neural networks. The AlphaGo computer actually uses two such learning networks. One has the task of evaluating board positions, figuring out which positions are most likely to lead to a win. The second has the task of gaming out (or simulating) the best move AlphaGo could make from any given position. These two networks communicate to determine AlphaGo’s best move from the best position, a thought process likely to seem familiar to anyone who has played the game. But writing rules for each of these neural networks, and for their coordination on a single turn, was not enough to make AlphaGo particularly good at the game.

Just as our brains learn (i.e., potentiate the coordinated firing of neurons) based upon feedback, AlphaGo’s “deep learning” system also required feedback—a lot of it—to develop proficiency at the game. That feedback came in two forms. First, it learned by comparing itself to excellent human players. When shown a Go board, its two neural networks would settle upon a move. Then it would learn what an identically situated masterful human player did in the past. If it chose the same move as the human, it was “rewarded” slightly, potentiating the two neural networks to perform similarly in future scenarios. Otherwise, it was “punished” slightly so that it would be less likely to make the same mistake again. This sort of learning is called “supervised machine learning” because humans (or at least data they have generated) stand over the shoulder of the machine and let it know when it is right or wrong.
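
A toy version of that supervised step might look like the sketch below. The board encoding, the expert moves, and the simple lookup-table policy are all invented for illustration; the real system uses deep neural networks trained on millions of positions:

```python
# Toy imitation learning: the policy is rewarded when it matches the human expert
# and punished (and nudged toward the expert's choice) when it does not.
from collections import defaultdict
import random

policy = defaultdict(float)        # (board, move) -> preference score
learning_rate = 0.1

# Invented data: each item is (board position, move the human expert played there).
expert_games = [
    ("x.o......", 4),
    ("x.o......", 4),
    ("x.o...o.x", 1),
]

for board, expert_move in expert_games:
    candidate_moves = [i for i, cell in enumerate(board) if cell == "."]
    our_move = max(candidate_moves, key=lambda m: (policy[(board, m)], random.random()))
    if our_move == expert_move:
        policy[(board, our_move)] += learning_rate     # slight "reward"
    else:
        policy[(board, our_move)] -= learning_rate     # slight "punishment"
        policy[(board, expert_move)] += learning_rate  # nudge toward the expert's choice

print(dict(policy))
```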

But even this training through millions of games played by many human masters was not enough to make AlphaGo great. Next, AlphaGo was programmed to train by playing against itself. In this step, the computer had no more humans to rely upon. It just knew the game very well, all the strategies it had learned, and, crucially, what it meant to score points and win or lose. After several million games against itself, it learned to keep pursuing the strategies that allowed it to win, while eschewing the strategies that caused its clone to lose. This sort of learning—harkening back to behavioral social scientists like B.F. Skinner—is called reinforcement learning. Even without human input, the rules for scoring in any well-defined game can be translated into “objective” or “loss” functions that provide feedback to the machine, reinforcing those behaviors more likely to lead to the objective of a win.
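
Stripped of Go entirely, the reinforcement idea reduces to something like the following sketch. The “strategies” and win rates are invented; the point is only that a win/loss score is the sole feedback:

```python
# Minimal reinforcement learning: the only feedback is the final win/loss "score",
# translated into a reward that strengthens or weakens whichever strategy was used.
import math
import random

strategies = ["aggressive", "territorial", "defensive"]
preference = {s: 0.0 for s in strategies}

# Invented stand-in for "play a full game against yourself and see who wins."
win_probability = {"aggressive": 0.40, "territorial": 0.65, "defensive": 0.50}

def choose(preference):
    # Pick a strategy with probability proportional to exp(preference) (a softmax).
    weights = [math.exp(preference[s]) for s in strategies]
    return random.choices(strategies, weights=weights)[0]

learning_rate = 0.05
for game in range(5000):
    strategy = choose(preference)
    won = random.random() < win_probability[strategy]
    reward = 1.0 if won else -1.0
    preference[strategy] += learning_rate * reward

print(preference)   # "territorial" should end up with the largest preference
```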

By now readers probably have an inkling why Go is so easy compared to parsing a conversation or a news article. Even in formal political debates, there is no clear winner or loser, no clear method for scoring points. Neither do there seem to be obvious objective or loss functions that one could write in order to help a computer understand how to be a good conversationalist. Even a sensemaking task like accurately parsing a news article doesn’t seem to be one that can be boiled down to a concise list of rules. The social world is not a game, or at least not a single game (or well-defined list of games) with recognizable rules that players are consistently incented to follow.

As NYU cognitive psychologist and AI researcher Gary Marcus has put it: “In chess, there are only about 30 moves you can make at any one moment, and the rules are fixed. In Jeopardy! [where the computer Watson has also bested human champions] more than 95 percent of the answers are titles of Wikipedia pages. In the real world, the answer to any given question could be just about anything, and nobody has yet figured out how to scale AI to open-ended worlds at human levels of sophistication and flexibility.” One of the foundational thinkers of AI, Gerald Sussman, put it even more succinctly: “you can’t learn what you can’t represent.”

(Researcher-directed) crowds to the rescue

We cannot write enough rules to teach a computer to read like us. And because the social world is not a game per se, we can’t design a reinforcement-learning scenario teaching a computer to “score points” and just “win.” But AlphaGo’s example does show a path forward. Recall that much of AlphaGo’s training came in the form of supervised machine learning, where humans taught it to play like them by showing the machine how human experts played the game. Already, humans have used this same supervised learning approach to teach computers to classify images, identify parts of speech in text, or categorize inventories into various bins. Without writing any rules, simply by letting the computer guess, then giving it human-generated feedback about whether it guessed right or wrong, humans can teach computers to label data as we do. The problem is (or has been): humans label textual data slowly—very, very slowly. So, we have generated precious little data with which to teach computers to understand natural language as we do. But that is going to change.
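
In practice, “letting humans label a little so the machine can label a lot” can be as simple as the following sketch (assuming scikit-learn is installed; the snippets and labels are invented):

```python
# Supervised learning on human-labeled text: the machine never sees a rule,
# only examples of how humans labeled similar snippets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented human-labeled snippets.
snippets = [
    "Police arrested two protesters near the plaza",
    "Officers detained several marchers after the rally",
    "Organizers announced a march for Saturday afternoon",
    "Activists plan a rally at city hall next week",
]
labels = ["arrest", "arrest", "announcement", "announcement"]

model = make_pipeline(CountVectorizer(stop_words="english"), LogisticRegression())
model.fit(snippets, labels)

print(model.predict(["Police detained a protester at the march"]))
```

With only four training examples nothing is guaranteed, but the new sentence will most likely come back labeled as an arrest, and no rules were ever written.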

My involvement in data science began when I was trying to ask and answer complex questions about police and protester interactions from a rather large body of textual data—over 8,000 news reports describing all of the events of Occupy campaigns spread across 184 US cities and towns. The available approaches to this task—using automated NLP algorithms or labeling documents by hand—were simply inadequate. Automatic natural language processing algorithms were not sophisticated enough to label all the information I wanted from the news reports. They were particularly poor at identifying the words, clauses, and sentences describing distinct protest events. Information about a protest march (or any event or social situation for that matter) is often scattered across many non-contiguous sentences and clauses.

But, since there is no natural language grammar clearly identifying the social and temporal boundaries of an event, the best existing automated “event identifier” algorithms settle for something far less valid. They just use a part-of-speech tagger to find the first subject, verb, and object (who does what to whom) in an article, and then call that “the event” described by the article. So, a news article starting with the sentence: “Police arrested two protesters at a rally attended by 10,000 students, union members, and activists” would be recorded as an article about the arrest of two protesters by police. That simply would not do.
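
A hedged sketch of that subject-verb-object shortcut (again assuming spaCy and en_core_web_sm; this is a simplified stand-in, not any particular published event coder):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Police arrested two protesters at a rally attended by "
          "10,000 students, union members, and activists.")

# Simplified "event identifier": take the first verb that has both a subject and an object.
for token in doc:
    if token.pos_ == "VERB":
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
        if subjects and objects:
            print("who:", subjects[0].text, "| did what:", token.lemma_,
                  "| to whom:", objects[0].text)
            break

# Everything about the 10,000-person rally never makes it into the record.
```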

I wanted to know everything that was happening: specifically how seemingly tiny on-the-ground altercations might translate into operational and strategic blunders that could define the overall trajectory and outcomes of a city’s campaign. No detail was too small. But I was told that my ambitions were too great. Earlier projects attempting to systematically hand-label so many documents by so many variables had taken a decade to complete, and they still had to reduce the resolution and richness of their data to a couple dozen variables.

The single greatest factor prolonging such large-scale text-labeling projects has been workforce training and turnover. The typical project requires that principal investigators painstakingly train a dozen or so undergraduates in the relatively esoteric task of hand-labeling according to the researcher’s conceptual scheme. The typical research assistant, once sufficiently trained, will then hand-label a couple hundred documents, achieve mastery over the task, and either become bored and move on or graduate. The project lead, only partway through her work, has little choice but to train and manage wave after wave of RAs, often over many years.

Determined not to suffer this fate, I tried and failed and tried and failed and finally succeeded in devising a way to enlist volunteers and paid crowd workers into text labeling tasks. The eureka moment came as I realized that my coding scheme, with over a hundred variables, could actually be divided into a separate coding scheme for each unit of analysis we were studying. (As a quick review: a “unit of analysis” is a type of object described by “variables” and “attributes.” So an “individual human” unit of analysis is described by, among others, variables like “hair color” and attributes like “brown, black, blonde, or red.” All of the variables describing a unit of analysis are organized in one branch of a coding scheme, and are likely to be quite different from the variables describing some other unit of analysis. “Hair color,” for instance, is a variable that does not describe an “event” unit of analysis.)
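
In code, the decomposition might look something like this sketch (the units, variables, and attributes below are illustrative, not my actual Occupy coding scheme):

```python
# One small coding scheme per unit of analysis, rather than one giant scheme.
coding_schemes = {
    "individual": {
        "hair color": ["brown", "black", "blonde", "red"],
        "role":       ["protester", "police officer", "bystander", "journalist"],
    },
    "event": {
        "event type": ["march", "rally", "arrest", "eviction"],
        "crowd size": ["dozens", "hundreds", "thousands", "unknown"],
    },
}

# Each crowd worker only ever sees the branch that matches the text unit in front of them.
for unit, variables in coding_schemes.items():
    print(unit, "->", list(variables))
```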

The key to organizing work for the crowd, I had learned from talking to computer scientists, was task decomposition. The work had to be broken down into simple pieces that any (moderately intelligent) person could do through a web interface without requiring face-to-face training. I knew from previous experiments with my team that I could not expect a crowd worker to read a whole article, or to know our whole conceptual scheme defining everything of potential interest in those articles. Requiring either or both would be asking too much. But when I realized that my conceptual scheme could actually be treated as multiple smaller conceptual schemes, the idea came to me: Why not have my RAs identify units of text that corresponded with the units of analysis of my conceptual scheme? Then, crowd workers reading those much smaller units of text could just label them according to a smaller sub-scheme. Moreover, I came to realize, we could ask them leading questions about the text to elicit information about the variables and attributes in the scheme, so they wouldn’t have to memorize the scheme either. And by highlighting the words justifying their answers, they would be labeling text according to our scheme without any face-to-face training. Bingo.

To illustrate this approach using our examples from above, a first round of annotators might highlight all the words and phrases, contiguous or not, delineating the separate events/situations appearing in documents. Those annotators, for instance, might pick out all the text describing a woman’s walk on the beach, or all the text describing a particular protest march. A second round of annotators would then be tasked with the comparatively easy job of identifying—simply by answering reading-comprehension-style questions—all the interesting details (variables/attributes) that could occur within such events/situations. Because all of the words and phrases delineating the event were already labeled in the first step, these annotators would easily be able to recognize that the woman walking on the beach was holding the binoculars and using them to observe the man on the mountain, or that the arrest of a few protesters occurred within the context of a much larger protest event.
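
The records coming off such an assembly line might look roughly like the sketch below (the file name, character offsets, and field names are all invented for illustration):

```python
# Round 1: annotators highlight the (possibly non-contiguous) text of each event.
# Round 2: annotators answer reading-comprehension questions about just that text,
#          highlighting the words that justify each answer.
first_round = {
    "document": "example_protest_article.txt",          # invented file name
    "events": [
        {"event_id": 1, "spans": [(0, 92), (310, 355)]},  # offsets of the larger rally
        {"event_id": 2, "spans": [(93, 160)]},            # offsets of the arrest
    ],
}

second_round = [
    {
        "event_id": 2,
        "question": "Who was arrested?",
        "answer": "two protesters",
        "justification_spans": [(101, 116)],
    },
    {
        "event_id": 1,
        "question": "How many people attended?",
        "answer": "10,000",
        "justification_spans": [(330, 336)],
    },
]

print(len(first_round["events"]), "events;", len(second_round), "answers")
```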

Since imagining this assembly line process, I have been traveling a long road of software prototyping and development. But in a matter of months, social scientists will be able to deploy this approach on giant bodies of textual data. (You can follow or contribute to our progress here and here.)

Legal scholars will be able to trace judges’ reasoning across cases and through time. Political psychologists will be able to examine, at scale, the rhetoric of politicians’ speeches. Conversation analysts will be able to understand, quantitatively, the qualitatively different turns of discourse that encourage people to change their minds, dig in their heels, or seek compromise solutions. Constructivist scholars will be able to trace the evolution of gender and race categories. Symbolic interactionists will be able to empirically elaborate theories of dating, collaboration, religious ritual, and boardroom meetings. And scholars like me will be able to dig into the details of police and citizen interactions to find ways to de-escalate conflicts. Moreover, teachers will be able to engage their students with homework assignments that directly apply theory from a lecture to real-world data. Simply by answering reading-comprehension-style questions about some snippet of text (or video), then labeling the text (or video) that justifies their answers, students will be contributing to science as they learn to see the world through new sociological lenses.

This approach promises more, too. The databases generated by crowd workers, citizen scientists, and students can also be used to train machines to see in social data what we humans see comparatively easily. Just as AlphaGo learned from humans how to play a strategy game, our supervision can also help it learn to see the social world in textual or video data. The final products of social data analysis assembly lines, therefore, are not merely rich and massive databases allowing us to refine our most intricate, elaborate, and heretofore data-starved theories; they are also computer algorithms that will do most or all social data labeling in the future. In other words, whether we know it or not, we social scientists hold the key to developing artificial intelligences capable of understanding our social world.

So, let this blog post serve as a call to action. Re-potentiate those neural networks that fired so brightly when you first read Goffman, Blumer, Skinner, Mead, Husserl, Schutz, Berger and Luckmann, and/or Garfinkel. Their theories, till now, have been far too intricate for us to empirically quantify, much less revise and extend. But with a deluge of social data and new crowd-based methods for parsing it all, we can begin to create rich and complex models allowing us to better understand the microsocial units and mechanisms through which we humans co-create and reproduce our realities. Start now: imagine and catalogue all the factors determining the social behavior encapsulated in some set of documents or videos, and then go about obtaining and parsing them. The work will be difficult and time-consuming, to be sure. But with crowds doing the bulk of it, and machines waiting to take over all the future processing of our social data, the upside is considerable.

At stake is a social science with the capacity to quantify and qualify so many of our human practices, from the quotidian to the mythic, and to lead efforts to improve them. In decades to come, we may even be able to follow the path of other mature sciences (including physics, biology, and chemistry) and shift our focus toward engineering better forms of sociality. All the more so because it engages the public: a crowd-supported social science could enlist a new generation in the confident and competent re-construction of society.