One puzzling question about the viral spread of “fake” news—news articles that are clearly fabricated—is why people believe and distribute them. Why isn’t it clear what’s fact and what’s not?

The distinction, as it turns out, is quite hard to draw, because facts are not conveyed through special language. There is no linguistic checkmark or test of authenticity that flashes at us when something is untrue. The language of facts draws from the same dictionaries and grammars as the language of falsehoods. There may, however, be stylistic and genre differences: subtle cues that can point to the intention of the writer and their familiarity with the language of journalism.

Drawing on insights from genre and style in linguistics and applying methods from corpus and computational linguistics, colleagues and I are studying the language of fake news and misinformation. We have found that although fake and fact-based news stories can be easily confused, large-scale text analyses point to interesting differences. Some of those differences have to do with the informal and conversational style of modern news stories, which may be a clue to their authenticity.

Authenticity in the news

The genre of news articles ranges from in-depth investigative journalism to listicles. The increase in clickbait, humorous articles, and puff pieces in mainstream news outlets makes it harder to distinguish serious journalism from attempts at disinformation and misinformation. The expansion of the opinion section in most mainstream papers has also meant that readers encounter opinion mixed with hard news as they read a newspaper, whether on paper or online. This is part of a process of informalization: a shift toward a more conversational style in news discourse and toward including significant amounts of evaluation in news reporting.¹ One of the most important characteristics of conversational and oral discourse is that it is more involved,² that is, it features the perspective and opinion of the writer more prominently.

Against this backdrop, it is not surprising that readers cannot easily distinguish the facts and events reported from the perspective of those doing the reporting. It is a natural consequence of the shift toward a more involved and informal style of news writing. Of course, there are many gradations in this shift. The traditional “quality” broadsheet newspapers have engaged in such informal styles much less than tabloids, local news outlets, and some online-only publications.

The language of misinformation

Consider the example below, the beginning of a news article labeled as fake by Snopes and part of a dataset we collected. There is nothing in the language itself that would indicate that this is not fact-based. Snopes used external information to determine that it was fake, including the site where the article originated and the photograph accompanying it, which was found in a different news article about a man who had lost an arm to an alligator attack.

An environmental activist was almost killed yesterday in the Indian Ocean, after the great white shark he was trying to hug suddenly attacked him and bit his arm off.
21-year old Darrell Waterford, from Eugene in Oregon, was participating in a promotional video for Greenpeace, some 100 nautical miles away from the Australian city of Perth.

The lede follows a typical news story structure, identifying the protagonist with a descriptor (an environmental activist), followed by the name and further details in the first paragraph of the article. The rest of the language is consistent with more formal newspaper language. It includes detailed descriptions such as name, age, and place of origin. At the same time, it includes informal language (some 100 nautical miles), common in the type of human-interest story this purports to be.

At the other end of the fake news spectrum, we find articles that look more suspicious upon first reading. Consider our next example:

[Headline:] Loretta Lynch: “Confederate Flag Tattoos Must All Be Removed IMMEDIATELY”
History in general is filled with various artifacts that can either represent the good or bad sign of humans in general. The American flag is a symbol of good because it showed the uniting of colonies and the beginning of the United States of America. However there are symbols of bad times, like the Nazi flag. No matter what side they are on, they represent some sort of historical significance.

The capitals in the headline are unusual, as is the very general first sentence. The structure here is: thesis (artifacts can be good or bad), evidence (good: American flag; bad: Nazi flag), then conclusion. This is a common argumentative style, characteristic of debates. It is not, however, typically found in news articles, although it may appear in opinion pieces. Furthermore, the placement of however at the beginning of the sentence, without a following comma, points to a writer who is not entirely familiar with standard style conventions and to a piece that has not been reviewed by an editor.

While the flaws in this second example are subtle, they should constitute tells for any astute reader, one who is a regular consumer of mainstream news media. It is this kind of linguistic and stylistic analysis that can give us clues about some fake news articles. Although it will not help with articles that follow the conventions of the genre, such as the first example, it will winnow the field of fake content.
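
As a toy illustration of how such cues can be operationalized, consider the Python sketch below. It flags two of the signals just discussed: all-caps words in a headline and a sentence-initial “however” with no following comma. The rules, function name, and examples are invented for illustration; real detection would require far more robust patterns.

    import re

    def style_flags(headline: str, body: str) -> list[str]:
        """Flag two illustrative stylistic cues of unedited news text."""
        flags = []
        # Words of three or more consecutive capital letters in the headline.
        if re.search(r"\b[A-Z]{3,}\b", headline):
            flags.append("all-caps word in headline")
        # Sentence-initial "However" followed directly by a lowercase word,
        # i.e., with no comma after it.
        if re.search(r"(?:^|[.!?]\s+)However\s+[a-z]", body):
            flags.append("sentence-initial 'However' without a comma")
        return flags

    print(style_flags(
        "Confederate Flag Tattoos Must All Be Removed IMMEDIATELY",
        "However there are symbols of bad times, like the Nazi flag.",
    ))
    # ['all-caps word in headline', "sentence-initial 'However' without a comma"]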

Fighting misinformation: The many paths

Current efforts to combat the fake news problem take three main approaches: educating the public, carrying out manual checking, and performing automatic classification. Educating the public involves encouraging readers to check the source of a story, analyze its distribution (who has shared it, how many times), or run it by fact-checking websites. This is certainly necessary, but it will not be enough on its own, and it places a heavy burden on the individual.

Organized manual checking before or after publication is a possibility, but it is not a realistic solution either, given what we now know about how fast and wide misinformation spreads.³ Computational linguistic and machine learning approaches perform automatic classification and can complement the efforts of fact-checking sites such as Snopes, Politifact, or Public Editor. (Note that, in a strange twist, the co-founder of Snopes has admitted to plagiarizing some of the stories on the site.) Our lab is working on text classification methods based on linguistic features to complement methods that rely on the source of the story or its distribution network.

Text classification for fake news detection

The text classification approach relies on Natural Language Processing to distinguish one type of text from another. Text classification has been successfully applied to spam detection, sentiment analysis, social media monitoring, and authorship attribution. It typically uses supervised machine learning—a form of artificial intelligence—on large, labeled datasets to learn characteristics of the data. For instance, a spam detection system is first fed a large number of email messages already labeled as “spam” and “not spam” and applies an algorithm to learn how to classify new messages.
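
As a minimal sketch of that pipeline, the following Python code trains a spam classifier on a handful of invented, labeled messages using scikit-learn’s TfidfVectorizer and LogisticRegression. A production system would train on thousands of examples; the messages and labels here are made up for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Invented training data: 1 = spam, 0 = not spam.
    messages = [
        "WIN a FREE prize NOW, click here immediately",
        "Meeting moved to 3pm, agenda attached",
        "Limited offer, claim your reward today",
        "Can you review the draft before Friday?",
    ]
    labels = [1, 0, 1, 0]

    # Turn each message into word-frequency features, then fit a classifier.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(messages, labels)

    # Classify a new, unseen message.
    print(model.predict(["Claim your free reward now"]))  # likely [1]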

Two key issues in that description are “large” and “labeled.” Modern machine learning models, especially those deploying deep learning methods, are particularly data hungry. They need very large datasets to extract features that are relevant to one class or the other (spam vs. not-spam). Those datasets need to be accurately labeled; we need human input to know what counts as a spam message.

The need for large amounts of labeled data has been a stumbling block in fake news research. When we first embarked on this project, we assumed that data collection would not be an issue. After all, we had been repeatedly told that fake news and misinformation were freely and widely circulating online.

“We need more data, and we know large amounts of it rests with social media platforms and large tech companies.”

The reality is quite different. While researchers have been compiling datasets for years, none of those are large enough or accurate enough for the seemingly simple problem of deciding whether a news article contains misinformation or not. We collected news articles from fact-checking organizations, but the process was painful, not entirely accurate, and resulted in a mid-sized dataset of about 3,000 articles.⁴ We need more data, and we know large amounts of it rest with social media platforms and large tech companies.

In the meantime, and even with a mid-sized dataset, we have made decent progress in our attempts at distinguishing fake news articles based on their stylistic characteristics. We have found that fake news articles tend to be shorter than fact-based news. They tend to contain more adverbs, more negative words, and more words related to sex, death, and anxiety. They show different patterns of pronoun use, with “they” more frequently used (perhaps a result of “othering”), whereas fact-based news shows a higher frequency of the first-person pronoun “I.” Surprisingly, fact-based articles have more punctuation and more apostrophes, perhaps because they are written in an informal style (using “don’t” instead of “do not”). These patterns could lead to better identification of the style of fake news.
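
To make this concrete, here is a toy Python sketch of the kind of feature extraction involved. The word list, the feature names, and the simple “-ly” adverb heuristic are illustrative stand-ins, not the lexicons or measurements used in the actual studies.

    import re

    # Illustrative stand-in for a sentiment/affect lexicon.
    NEGATIVE_WORDS = {"attack", "kill", "death", "never", "bad"}

    def stylistic_features(text: str) -> dict:
        """Count a few of the stylistic signals described above."""
        words = re.findall(r"[a-z']+", text.lower())
        n = max(len(words), 1)  # avoid division by zero on empty input
        return {
            "length": len(words),
            "adverb_rate": sum(w.endswith("ly") for w in words) / n,  # crude heuristic
            "negative_rate": sum(w in NEGATIVE_WORDS for w in words) / n,
            "they_rate": words.count("they") / n,
            "i_rate": words.count("i") / n,
            "apostrophes": text.count("'"),
        }

    print(stylistic_features("However, they never apologized. I don't know why."))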

Machine learning has recently acquired an image problem. We have learned that models trained on naturally occurring data suffer from the same biases as the society that produced them.⁵ In addition to these unintended, existing societal biases, many machine learning models have an intentional human bias, motivated by the desire to increase engagement. There are also serious concerns about the environmental costs of training the large models needed for accurate results.⁶

Nevertheless, machine learning shows promise in the fight against misinformation. But first, we need more data to solve a problem caused by an abundance of false data. Social media companies could do much more by confidentially sharing data with researchers.

The next frontier

More data will solve some of the problems in text classification for fake news detection. Text classification can help filter out some of the most egregious examples of fake news, just like it helps detect crude cases of spam in email or clear instances of abusive messages online. But what if fake news writers become more sophisticated, just like some spammers have? Then we are still faced with the problem of authenticity.

Wendy Chun has pointed out that we expect authenticity, rather than facticity, in our news stories. When writers of misinformation and disinformation learn to sound authentic, we will have few technical tools left in the fight against misinformation. Education and common sense will then be our only defenses.

Banner photo: Peter Lawrence/Unsplash.

References:

1. Oxford University Press, 2017.
2. Douglas Biber, Variation across Speech and Writing (Cambridge University Press, 1988).
3. Soroush Vosoughi, Deb Roy, and Sinan Aral, “The Spread of True and False News Online,” Science 359, no. 6380 (2018): 1146–1151.
4. Fatemeh Torabi Asr and Maite Taboada, “Big Data and Quality Data for Fake News and Misinformation Detection,” Big Data & Society 6, no. 1 (2019).
5. New York: Penguin Random House, 2017.
6. Emily M. Bender et al., “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?,” FAccT ’21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (March 2021): 610–623.