Michael Feuer’s reflection on the popularization of the word “models” during the current global pandemic challenges us to revisit the origins, meaning, imperfections, and impact of the models we use in the social and behavioral sciences, because our ability to judge whether a model is useful will shape personal and policy decisions during the current crisis. Here I examine some of the general themes from Feuer’s essay in the context of assessment and psychometric modeling, particularly his questions “…how reliable are these models? How good are models generally?” while adding a third: “are these models fair?” These three questions exemplify the pillars of assessment: validity, reliability, and fairness.1

Inequality reflected in testing

“Knowledge is power.”2 Data are a form of knowledge. Models of all types help us interpret our data and provide us with new information. Today, more than in the past, data are abundant, transferable, and valuable. Knowledge can help stop a pandemic, advance racial justice, protect a nation, and inform policy in government, business, and education. Data can also be co-opted for disingenuous purposes, often for personal gain. We hear in the news that facts or models are manipulated, information is stolen, and data are deliberately suppressed. For example, during the Covid-19 pandemic, some have suggested that we stop or minimize testing so that fewer cases are discovered, an approach that could be viewed as a strategy to manage public perception of the pandemic’s severity rather than the pandemic itself. Fortunately, the public generally understands that sticking our heads in the sand because Covid-19 tests are not perfect would be foolish. Additionally, we know that halting Covid-19 testing will not make the health and social inequalities faced by racial and ethnic minority groups go away. Inequality exists in housing, education, income, and access to healthcare, and it is reflected in our testing. Covid-19 testing is not responsible for the higher infection rates among minority populations; rather, it shines a light on these problems in our society.

“It is as easy to say an assessment is biased and blame the test for the troubles in education as it is to blame testing for the severity of a pandemic.”

The ideas of blaming the test or ignoring the inequality it reveals are not new. While there are differences between medical and educational testing, such as instrumentation, methods, and scales of measurement; permissible levels of sensitivity and specificity; claims of causality; and the types of actions we might take based on the results we get, there are also many similarities. Most important is the parallel in what we expect from testing: we expect results to be reliable and valid for their intended uses and to help inform the decisions we make. Testing in the field of education faces the same set of challenges, and this has been brought to the fore during the pandemic with widespread calls to suspend or remove the requirement for university entrance exams. It is as easy to say an assessment is biased and blame the test for the troubles in education as it is to blame testing for the severity of a pandemic. Although I do not believe universities decide to remove, suspend, or retain an entrance exam frivolously, it is easy to react and make a rash decision. These decisions, although made with haste in a pandemic, seem to be made with good intentions and the community in mind: they could reduce barriers of social inequality and help sustain the university.

Should we keep our testing systems or eliminate them? One issue that should trouble us is that despite perceived racial and socioeconomic testing bias in education, we hear claims that these exams are not biased.3 However, we do know that tests function differently for different groups of persons in our society, so how can it be true that tests are not biased? Let’s explore this question from a 10,000-foot perspective using the central tenets of testing (i.e., reliability, validity, and fairness), and conclude by considering changes to education and testing.

Reliability, validity, and fairness, oh my

We’re going to visit briefly the concepts of reliability, validity, and fairness in testing, starting with the question, how reliable are these models? Various models are used to provide evidence that scores on instruments are consistent. Reliability in testing considers psychometric/statistical model fit, test consistency and precision, and the error of measurement of the scores. In psychometric models, measurement error indicates how far away true scores are from measured scores. A simple and classical model would have us consider the “what if” scenario of testing someone an unlimited number of times, without improvement, over all possible situations. We could think of the average of those scores as the true score someone should attain, impacted by random chance on any given testing occasion to create an observed score. More encompassing paradigms of measurement can explain and predict phenomena more precisely, but I will spare the details, as they are not really the central issue in public debates about testing in education.
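To make the classical model concrete, here is a minimal sketch in conventional classical test theory notation; the symbols and the numerical example that follows are standard textbook conventions and illustrative values, not figures from any particular exam.

```latex
% Classical test theory: an observed score X is a true score T plus random error E
X = T + E, \qquad \mathbb{E}(E) = 0, \qquad \operatorname{Cov}(T, E) = 0
% Reliability is the share of observed-score variance attributable to true scores
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
% The standard error of measurement follows directly from the reliability estimate
\sigma_E = \sigma_X \sqrt{1 - \rho_{XX'}}
```

For instance, a test with a reliability of .90 and a score standard deviation of 100 points carries a standard error of measurement of roughly 32 points, the kind of uncertainty responsible test users are expected to keep in view.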

“Just because a test is reliable doesn’t make it valid.”

The short answer to our question on reliability, especially for intensely studied assessments like entrance exams, is that they are consistent. Reliability is a necessary but not sufficient property of test scores. Intuitively, it’s easy to understand that a test that is not reliable doesn’t help us much. If our Covid-19 tests produced random results, we couldn’t use them. Fortunately, different types of tests for this virus are now reliable, with established sensitivity for detecting positive Covid-19 results.4 However, here’s something that makes the discussion of testing interesting as it pertains to the value of an assessment: just because a test is reliable doesn’t make it valid. Take phrenology, for instance, a pseudoscience that measures the lumps on a head to predict mental characteristics.5 The method is reliable. Take my lumpy head as an example; if a few people felt around for the bumps on my head, we would get a consistent result across raters pinpointing the bumps on a phrenology chart. The instrument and method are reliable. But the assessment is not valid or fair. It doesn’t measure what it purports to measure, the claims are not supported, and there is racial and gender bias inherent in the charts. This distinction, between a test that is merely reliable and one that is also valid and fair, is more akin to the issues we want to explore in considering tests in education.
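To illustrate the point in the accompanying note about repeated screening, here is a minimal sketch; the single-test sensitivity value and the assumption that repeated tests are independent are mine for illustration, not published figures for any real Covid-19 assay.

```python
# Illustrative sketch: how repeating a screening test raises effective sensitivity.
# The per-test sensitivity below is an assumed value, and treating repeated tests
# as independent is a simplifying assumption; real repeat tests on the same person
# are correlated, so the gain would be smaller in practice.

def combined_sensitivity(sensitivity: float, repeats: int) -> float:
    """Probability that at least one of `repeats` independent screens
    flags a true positive, given the per-test sensitivity."""
    return 1.0 - (1.0 - sensitivity) ** repeats

single = 0.80  # assumed sensitivity of one rapid screening test
for n in (1, 2, 3):
    print(f"{n} screen(s): effective sensitivity = {combined_sensitivity(single, n):.2f}")
# 1 screen(s): effective sensitivity = 0.80
# 2 screen(s): effective sensitivity = 0.96
# 3 screen(s): effective sensitivity = 0.99
```

The sketch shows why a lower-reliability rapid screen can still be useful when it is repeated, while a slower gold-standard test earns its accuracy in a single administration.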

Moving on, it’s difficult, if not impossible, to disentangle validity and fairness, as fairness is fundamental to validity. Let’s start with the question, how good are models generally? Evidence for the validity of test scores is gathered throughout an exam’s lifecycle. When we say a test is valid, what a test developer means is that scores on the test are valid for a specific purpose or interpretation, one that has been researched to provide evidence for the claims being made.6 All kinds of evidence and consequences are considered. For example, you can examine associations and predictions with criterion values, explore underrepresentation or irrelevance of the construct being studied, find support for content from experts and stakeholders, or study the response processes involved in taking the test. Now, if someone uses a test beyond its intended purpose, those who make the new claims are responsible for supporting them. This is why college entrance exams cannot responsibly be used as hiring criteria; if a test is not used for its designed purpose, the claims made from it will be unsupported.

For a university entrance exam, it takes years to build an assessment: the domain of interest is researched, explored, and deconstructed so that it can be reconstructed as an assessment. Subject matter experts and stakeholders (faculty, administrators, parents, teachers, students, etc.) help determine what to measure (knowledge, skills, ability, understanding) and how it is measured (whether the exam’s structure maps to what students actually learn), and research must support that the exam is fair for all students, in the complicated world of education, for as long as the test is used. The interpretations and claims made about persons are subject to change and need to be updated over time while reflecting the values we hold in society. Even after all this work and effort to provide evidence for the validity of test scores, we still often find differences in how the test apparently functions for different groups of persons. So, how can the test be fair?

“Assessment models are not perfect, but fairness helps crystallize the errors and weaknesses of our models.”

In testing, we can “interpret fairness as responsiveness to individual characteristics and testing context so that test scores will yield valid interpretations for intended uses.”7 Assessment models are not perfect, but fairness helps crystallize the errors and weaknesses of our models. When models fit well, we have more evidence to support the inferences we seek to make regarding persons and outcomes. However, sometimes a test or item does not work as expected for an individual or group (e.g., based on race or gender). One mechanism for assessing fairness starts by asking whether items, tasks, or the complete assessment function differentially for different groups of persons. If there are differences among groups, evidence of potential bias exists, and one must follow up on each source of potential bias to explain the cause of those differences or determine whether bias is real and then correct the problem. When faced with bias, an organization should investigate each instance independently.
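As a concrete sketch of what “functions differentially” means in practice, here is a minimal Mantel-Haenszel style screen for differential item functioning (DIF); the function name, record format, and flagging threshold mentioned in the comments are illustrative assumptions, not part of any operational testing program.

```python
# Minimal Mantel-Haenszel sketch for differential item functioning (DIF).
# Examinees are stratified (matched) on total test score; within each stratum
# we compare the odds of answering a given item correctly for a reference group
# and a focal group. Values of delta near zero suggest no DIF; in the common
# ETS classification, |delta| of roughly 1.5 or more flags an item for review.
import math
from collections import defaultdict

def mantel_haenszel_dif(records, item):
    """records: dicts with 'group' ('ref' or 'focal'), 'total' (matching score),
    and item responses keyed by item name (1 = correct, 0 = incorrect)."""
    strata = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0})
    for r in records:
        cell = strata[r["total"]]
        correct = r[item] == 1
        if r["group"] == "ref":
            cell["A" if correct else "B"] += 1   # reference correct / incorrect
        else:
            cell["C" if correct else "D"] += 1   # focal correct / incorrect

    num = den = 0.0
    for cell in strata.values():
        n = sum(cell.values())
        if n == 0:
            continue
        num += cell["A"] * cell["D"] / n
        den += cell["B"] * cell["C"] / n
    if den == 0.0:
        raise ValueError("not enough data to estimate the common odds ratio")
    alpha = num / den                 # common odds ratio across score strata
    delta = -2.35 * math.log(alpha)   # ETS delta metric
    return alpha, delta
```

Importantly, a flagged item is evidence of potential bias, not proof of it; the follow-up review by content experts described above decides whether the statistical difference reflects construct-irrelevant material.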

Of course, beyond measurement bias, we can view fairness in other ways. For example, fairness might start with having access to the test and the opportunity to demonstrate ability without needless barriers. When we hear that a university is not requiring entrance exams during the pandemic, it is not that the statistical model has changed; rather, the university views the assessment argument as no longer supported, the circumstances having shifted in this global crisis. For example, even when testing organizations move access online, students may not have equal access to testing. Other arguments used in the university entrance process are also in question. One example is that students are in different learning environments, and GPAs are affected beyond the issues already inherent in income disparity. All these issues around fairness in testing affect fairness in education.

Approaches and solutions

This brings us back to our concerns; even if a test is not technically biased, differences exist and serve as a barrier to getting an education. Your education and testing are impacted by something as simple as your zip code. So, what should we do? The idea that we must choose between keeping our testing systems or getting rid of them is a false dichotomy. As it pertains to testing, we could change the way we test in the school system and for entrance exams, using the current assessments as a bridge to the future. Arguments we make about persons and the very reasons for testing can change. There is a need for a modern and unified classroom assessment system. Embedding a new type of testing in education can be a catalyst to expand beyond the traditional industrial revolution-style model that locks in inequality. The long-term solution will see conventional testing ultimately phased out and a new assessment approach developed that links closely to the education systems in the “classroom” to help fix the zip code problem with primary education.8 Advancements in digital tools, software, and psychometric theory can merge formative classroom assessment with beneficial elements of large-scale assessment, assessment design, growth modeling, and automatic item generation technology to yield on-demand assessments for students and teachers that are seamless with the classroom and can help us understand student ability, improve teachers as evaluators, and consider social impact.

The shorter-term solution reminds me of the modeling ideas linked to George Box, “All models are wrong but some are useful,” and of his point that simple, evocative models focus our attention on the right kinds of errors.9 Understanding how to interpret errors in models is never more important than when we need to use them to make critical decisions in our lives or for social change.10 On this topic, Box offers us this warning: it is “inappropriate to be concerned about mice when there are tigers abroad.” We use tests because they provide evidence for decisions we need to make about persons we do not know, especially when large numbers of decisions must be made.

“The real barrier to college is not a test; it is money and educational preparedness, which some argue is also money.”

I’ll suggest two short-term solutions for entrance into college. First, when assessing a prospective student, it’s necessary to adjust the claims for the entire entrance process, not just the test. Consider how to interpret test results, GPA, community service, and essays in light of the current pandemic and issues of social and economic inequality. Second and more importantly, the tiger in the room: make college free for everyone.11 The entire framework of public education and standardized testing would change overnight with this one short-term solution. The real barrier to college is not a test; it is money and educational preparedness, which some argue is also money. Even when accepted to flagship universities, students from low- and middle-income families often cannot afford to attend. With this approach, entrance examinations would become a diagnostic tool for placement into prerequisites instead of an apparent barrier to entry.

Final thoughts

Models are indeed useful, but even in the best of times, they are not perfect. I will echo Voltaire’s warning from Michael Feuer’s essay that we should not let the “perfect” be the enemy of the “good.” Universities can use testing without being driven by test results or dismissing them. In this way, the pandemic is a lesson in what test users should consider, in any period, when responsibly using, removing, or changing the use of an assessment. As individuals, we may need to change how we interpret scores on a test during a pandemic. As governments or institutions, we may need to make hard policy choices, knowing the impact of retaining an exam and considering its alternative interpretation, or of removing the assessment and losing what might be valuable, although imperfect, evidence. The current situation shines a light on the need for systematic reform in testing and education (and healthcare) systems, with these changes leading the way to social justice.

References:

1
Validity, reliability, and fairness are the three chapters comprising the “Foundations” in Standards for Educational and Psychological Testing (American Educational Research Association, 2014).
2
The phrase “knowledge itself is power” is often attributed to Sir Francis Bacon but can also be found in earlier Persian texts.
3
Lynn Letukas, Nine Facts about the SAT That Might Surprise You (The College Board, 2015).
4
Test sensitivity is the true positive rate, while test specificity is the true negative rate. Some Covid-19 tests are rapid-response tests with reduced reliability, intended for screening; repeated tests are needed in that case to provide reliability. Other tests are the gold standard and provide more accuracy at the cost of a longer response time.
5
Phrenology is an example I like to use in the classroom of a reliable assessment that is not valid.
6
I reference the Standards here and elsewhere because their purpose is to support and provide guidelines for the development of tests. Standards for Educational and Psychological Testing (American Educational Research Association, 2014).
7
Standards for Educational and Psychological Testing (American Educational Research Association, 2014).
8
Teacher Empowered Assessment (TEA) is a theoretical framework for solving the problems surrounding the burden of standardized testing on students and teachers by linking a system of assessment. William Dardick and Jaehwa Choi, “Teacher Empowered Assessment: Assessment for the 21st Century,” Journal of Applied Educational and Policy Research 2, no. 2 (2016): 87–98.
9
Box makes comments of this type in numerous sources; here is one that is commonly cited. George E. P. Box, “Science and Statistics,” Journal of the American Statistical Association 71, no. 356 (1976): 791–799.
10
“Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.” Box, “Science and Statistics.”
11
Beyond free traditional four-year college, I include community college, trade schools, and support for certifications, licensure, training programs, and test preparation courses.