We are a peculiar bunch, those of us involved in educational testing. We’ve never been clear on the critical differences between calling oneself a practitioner in measurement, psychometrics, assessment, or just plain testing. And we can disagree on key issues like the role of consequences in validity or on how to set achievement standards (i.e., cutscores); and many of us think that vertical scales are neither vertical nor scales. But one point on which there has been near universal agreement is the primacy of equating as a method for linking educational assessments.
Equating, It’s Been A Privilege
We have all read and accepted as gospel a version of Linn’s Linking Results of Distinct Assessments, either the 1993 AME article or the 1992 paper he presented at a conference on Implementing Performance Assessment: Promises, Problems, and Challenges, held as a presession to the annual Boulder conference. (iykyk) Those among us with a more technical bent may have even read Mislevy’s 1992 ETS report, Linking Educational Assessments: Concepts, Issues, Methods, and Prospects.
Whatever the source, the conclusion was clear: Equating is the most rigorous form of linking two assessments. Equating allows us to toss around words like interchangeable and comparable with the appropriate level of conviction in our voices.
We have been told that the other forms of linking assessments described in Linn’s paper, such as Calibration, Statistical Moderation, Prediction, and Social Moderation, cannot support such claims. Consequently, they have been, for lack of a better term, othered in the testing community. That is, they are looked down upon, marginalized, considered lesser than, viewed as being on the outside looking in.
Well, friends, it’s high time that we rethink that posture. It’s time to destigmatize the so-called lesser forms of linking. In raising up these other forms of linking, we are inevitably forced to wrest equating from its pedestal.
Not that equating has done anything wrong, per se.
- This is not a topple the Confederate monuments, lop off Columbus’ head, hide TJ and TR in a back-room kind of moment.
- It’s not that the conclusions reached and assertions made regarding linking by Linn, Mislevy, and other luminaries in our field were wrong.
- It’s not because the trend toward adaptive testing and building “test forms” from precalibrated item banks has obviated the need for equating (see Hambleton and Swaminathan, 1985, p. 223).
- It’s not even that the concept of equating described by Linn and Mislevy has always been an unattainable ideal – something to strive for that could never be reached (see Livingston, 2004).
No, my friends, if I may still call you friends, the issue that we must face regarding equating is the very same issue that prompted Mislevy and Linn to write their pieces in the early 1990s. The same issue that prompted the need for the Boulder conference to be expanded to include a discussion of Implementing Performance Assessment: Promises, Problems, and Challenges. As was the case in the early 1990s, external pressures are demanding that we move beyond student testing to student assessment. You can call it performance assessment, authentic assessment, or personalized, culturally responsive and sustaining, instructionally useful assessment in the service of learning.
What you cannot call it is a test.
And equating, as we have operationalized and mechanized it over the past three to four decades, is about equating tests; that is, on-demand, standardized, stand-alone, tightly defined, and tightly controlled tests.
Assessment and Testing
Despite Linn titling his piece linking distinct assessments and Mislevy titling his linking educational assessments, student assessment is not simply a test or testing. We can cloud the issue by sticking an ‘s’ on the end of assessment, but the words assessment and test are not interchangeable. It is not a matter of indifference whether you are administering a test or engaging in student assessment.
Tests and testing can be valuable components of student assessment, but as we consider the current definitions of standards and the demands on curriculum, instruction, and student learning, tests and testing can no longer serve as even a suitable proxy for the whole of student assessment.
What Linn and Mislevy were referring to regarding equating is linking tests. As Linn noted, “equating is the best understood and the most demanding type of link of one test to another.” [emphasis added] I will be gracious and say that we know how to do that and that we equate tests efficiently and well on a regular basis; at least, that is, until we don’t. But for the most part, we have a pretty good sense of when equating is going to be problematic, and we figure out ways to work around that.
Adding writing to reading tests, for example, tends to make equating a bit problematic.
Where equating really starts to get problematic, however, is when we start to talk about student assessment: complex, messy, and decidedly not unidimensional student assessment. The kind of student assessment that is front and center today.
That’s why we now need to lean on the other methods of linking.
Strengthening Our Links and Our Arguments
I will argue that equating is at its core about tests, scores, and scales – ideally, common scales and interchangeable scores.
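To make that concrete, here is a minimal sketch of what equating does in practice: it places scores from one test form onto the scale of another so the results can be treated as interchangeable. The sketch below uses single-group linear equating (matching means and standard deviations); the function and data names are purely illustrative, and operational equating designs (random groups, common-item nonequivalent groups, equipercentile methods) are considerably more involved.

```python
import statistics

def linear_equate(x_score, form_x_scores, form_y_scores):
    """Place a Form X raw score on the Form Y scale by matching means and SDs.

    Illustrative single-group linear equating:
        l_Y(x) = mu_Y + (sd_Y / sd_X) * (x - mu_X)
    """
    mu_x = statistics.mean(form_x_scores)
    sd_x = statistics.stdev(form_x_scores)
    mu_y = statistics.mean(form_y_scores)
    sd_y = statistics.stdev(form_y_scores)
    return mu_y + (sd_y / sd_x) * (x_score - mu_x)

# Hypothetical raw scores from the same examinees taking both forms
form_x = [12, 15, 18, 22, 25, 28, 31]
form_y = [14, 18, 20, 24, 28, 30, 34]

# A Form X score of 20 expressed on the Form Y scale
print(round(linear_equate(20, form_x, form_y), 1))
```

Notice what even this toy version presumes: tightly defined forms built to the same specifications, a single construct, and scores that live on one common scale.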
Linking to support student assessment, in contrast, is in a broader sense about achievement standards and claims and inferences related to those standards. Over the past several decades we have accumulated an extensive body of work on comparability that establishes that common scales and scores are not necessary to support claims of comparability made at the achievement standards level.
What is necessary to support claims of comparability is evidence and arguments.
The task before us is to begin to take a closer look at those othered forms of linking and to identify what kinds of evidence will be necessary to serve our various assessment (and accountability) purposes going forward. What does a comparability argument look like when we are compiling evidence from multiple sources rather than simply comparing test scores? As Linn suggested in his 1993 article, it’s likely that some type of hybrid approach involving professional judgment, social moderation, statistical checks, and perhaps statistical adjustments will be necessary to build a solid argument for comparability.
We don’t have a solid track record in building strong arguments (see validation, theories of action, virtually any technical report), but to be fair, that’s not really where we have focused our attention or devoted our resources. I’m confident that we will be up to the challenge.
A Closing Thought on Tests, Testing, and Student Assessment
What I wonder, however, is whether we will have the appropriate Standards in place to guide our work in student assessment.
In 1955, we had Technical Recommendations for Achievement Tests.
That was followed in 1966 by the Standards for Educational and Psychological Tests and Manuals.
The 1974 revision kept the focus squarely on tests with the Standards for Educational and Psychological Tests. We made the big leap from Tests to Testing in 1985 with the publication of the Standards for Educational and Psychological Testing.
Our focus on testing continued with subsequent revisions to the Standards in 1999 and 2014.
Every indication to this point is that the revision of the Standards currently underway will also focus on testing.
As we make the quantum leap from Testing to Student Assessment, my best guess is that we will need new Standards specifically related to making complex judgments about complex student performance (i.e., student assessment) as well as Standards dealing with the appropriate processes for and uses of student assessment.
We will probably also need to work with new partners to develop those Standards. Partners who are used to dealing with squishy things like qualitative data and people. I know, that’s a scary thought for many of us psychometrics, measurement, and testing folks.
But here’s a scarier thought. AI is also going to be a key partner in this process of student assessment. And as I ponder the influence of AI, I cannot help but wonder which group of assessment partners is most susceptible to being replaced by AI. Sadly, I have to admit that it’s not looking good for those of us who have devoted our lives to manipulating numbers.
The good news is that we have been warned. For the past five years or so, we have been revisiting the ghosts of testing’s past. We full-throatedly acknowledge the limitations of our present testing practices. We know what the future requires of us. It’s not too late to change. It’s still Christmas morning. The big turkey is still hanging in the window.
It will be a struggle, but I’m confident that we can turn our attention toward supporting student assessment.
God bless us, everyone!