Equating in the early part of the 21st century
Our field is facing a crisis brought on by implausible values. The values which threaten us, however, are not the assessment results questioned above. Those are only the byproduct of the values which our field and society have adopted with regard to K-12 large-scale assessment.
That is, the values which lead us to wait more than a year for the results of the 2017 NAEP Reading and Mathematics tests while expecting nearly instantaneous, real-time results from state assessments like Smarter Balanced, PARCC, and the custom assessments administered by states across the country.
NAEP, the “nation’s report card,” is an assessment program that does not report results at the individual student, school, or even district level (in most cases). NAEP results have no direct consequences for students, teachers, or school administrators.
State assessment results for individual students are sent home to parents. School and district results are reported and analyzed by local media and may have very real consequences for teachers, administrators, and central office staff.
It is a paradox that equating for annual state assessment programs with a dozen tests and multiple forms is often carried out within a week while results for the four NAEP tests administered every other year can be delayed indefinitely with the explanation that
“Extensive research, with careful, detailed, sophisticated analyses by national experts about how the digital transition affects comparisons to previous results is ongoing and has not yet concluded.”
Of course, it is precisely because NAEP results have no real consequences or use that we are willing to wait patiently, or indifferently, until they are released. Can anyone imagine a state department of education posting a Twitter poll such as this?
The reality, however, is that the NAEP model comes much closer to the time and care that should be devoted to equating (or linking) large-scale state assessment results across forms and years than current practice with state assessments does.
Everything you wanted to know about equating but were afraid to ask
To a large extent, equating is still one of the black boxes of large-scale assessment. It is that thing that the psycho-magicians do so that we can claim with confidence that results are comparable from one year to the next – not to mention valid, reliable, and fair.
Well, let’s take a quick peek inside the black box.
There are two distinct parts to equating – the technical part and the conceptual/theoretical part.
In reality, the technical part is pretty straightforward; at most it is a DOK Level 2 task. There are pre-determined procedures to follow, most of which can be automated. It’s so simple that even a music major from a liberal arts college can pick it up pretty quickly (self-reference). That’s what makes it possible to “equate” dozens of test forms in a week; or made it possible for a former state department psychometrician to boast that he conducted 500 equatings per year.
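To see just how mechanical that work is, here is a minimal sketch of linear (mean-sigma) equating, one of the pre-determined procedures referred to above. Everything in it is illustrative: the score distributions are invented, and an operational equating would involve anchor items, smoothing, and standard errors.

```python
import statistics

def linear_equate(x_scores, y_scores):
    """Mean-sigma linear equating: map a Form X raw score onto the
    Form Y scale by matching the two forms' means and standard
    deviations. Purely illustrative of the mechanical step."""
    mu_x, sd_x = statistics.mean(x_scores), statistics.pstdev(x_scores)
    mu_y, sd_y = statistics.mean(y_scores), statistics.pstdev(y_scores)
    slope = sd_y / sd_x
    intercept = mu_y - slope * mu_x
    return lambda x: slope * x + intercept

# Hypothetical raw-score distributions for two forms
form_x = [10, 12, 15, 18, 20, 22, 25]
form_y = [12, 14, 17, 20, 22, 24, 27]
equate = linear_equate(form_x, form_y)
```

With these made-up distributions, a Form X score of 10 maps to 12 on the Form Y scale. The procedure is entirely formulaic, which is exactly the point.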
Unfortunately, the technical part leaves you few options when the results just don’t make any sense.
That brings us to the conceptual and theoretical part of equating, which involves few, if any, complicated equations but is by far the more complex part.
As a starting point, it’s important that we don’t confuse the concepts and theory behind the technical aspects of equating with the theoretical part of equating. That’s a rookie mistake or a veteran diversion.
The concepts and theories that should concern us are those related to how students will perform on two different test forms or sets of items; or on the same test form taken on two different devices; or on a test form administered with accommodations; or on a test form translated into another language or adapted into plain English; or on test forms administered under different testing conditions; or on test forms administered at different times of the year or at different points in the instructional cycle. The list goes on and on.
Unfortunately, we know far less about each and every one of those conditions than we do about the technical aspects of equating.
In the past, our go-to solution was to develop test forms that required as little equating as possible. That approach, sadly, is no longer viable. We have now moved beyond equating test forms to applying procedures that add new items to a large item pool; that is, to placing the items on a “common scale” with the other items in the pool.
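The pool-based step just described can be sketched in its simplest Rasch-style form: mean/mean linking through common (anchor) items. All the numbers here are hypothetical, and operational programs typically use more elaborate methods (mean/sigma, Stocking-Lord characteristic-curve linking).

```python
def mean_mean_link(anchor_base, anchor_new, new_item_difficulties):
    """Shift new-form Rasch item difficulties onto the base scale by
    the average difference observed on the common anchor items
    (mean/mean linking, with the slope fixed at 1)."""
    shift = (sum(anchor_base) / len(anchor_base)
             - sum(anchor_new) / len(anchor_new))
    return [b + shift for b in new_item_difficulties]

# Hypothetical difficulties: the anchors as calibrated on the base
# scale versus the same anchors as calibrated with the new form.
anchor_base = [-1.0, 0.0, 1.0]
anchor_new = [-0.8, 0.2, 1.2]
rescaled = mean_mean_link(anchor_base, anchor_new, [0.5, 1.5])
```

Here the anchors came out 0.2 logits easier-looking in the new calibration, so the new items are shifted down by 0.2 to land at roughly 0.3 and 1.3 on the base scale.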
It was also tremendously helpful that in the past we didn’t really expect any change in performance at the state level from one year to the next. That is, we had a known target, or a fixed point, against which to compare the results of our equating. If the state average moved more than a slight wobble, we went back to find the problem in our analyses. It was a simpler time.
Where we go from here
We cannot return to that simpler time, but neither can we abandon some of its basic principles.
When developing new technology, as we are doing now with large-scale and personalized assessment, it is important to have a known target against which to evaluate our results. When MIT professor Harold ‘Doc’ Edgerton was testing underwater SONAR systems, he reportedly said that one of the advantages of testing the systems in Boston Harbor was that the tunnels submerged in the harbor didn’t move. He knew where they were, and they were always in the same place.
We need the education equivalent of a harbor tunnel against which to evaluate our beliefs, theories, procedures, and results. We are now in a situation where the amount of change in student performance from one year to the next is determined solely by equating. There is no way to verify (or falsify) equating results outside of the system. That is not a good position from which to operate a high-stakes assessment program, particularly at a time when so many key components of the system are in transition.
Finding such a fixed target is not impossible, but it is not something that can be done on the fly. We cannot continue to move from operational test to operational test.
Our current model of test development for state assessment programs rarely includes any opportunity for pilot testing. That has to change.
We need to rely less on the technical aspects of equating and invest more in understanding the concept of equating.
We need a better understanding of student learning, student performance, and how student performance changes over time before we build our assessments and equating models.
We need to be humble and acknowledge our limitations. A certain degree of uncertainty is not a bad thing, if its presence is understood.
Finally, we need to move beyond the point where whenever I think about equating, this scene from Apollo 13 immediately comes to mind.