To paraphrase Hamilton’s Schuyler Sisters, look around, look around, how lucky we are to be alive right now. History is happening all around us, and we in testing just happen to be in the middle of it.
Our testing world, like the rest of the world around us, is changing more quickly and in more ways than we could have imagined even five years ago.
There are lots of unknowns, and, as we have had drummed into us since Assessment 101 or our first course in Statistics, change is unreliable and downright scary.
But growth requires change and the best things in life are often scary. So, rather than be scared, let’s reflect on all in testing that we have to be thankful for this Thanksgiving.
Personally, I came up with five.
Tell Me Why
I’m thankful that the 2020s are shaping up to be a Why? decade – as in the 5 W’s and an H: who, what, when, where, why, and how. The late 1980s focused on How? as the case was made that it was time to stop relying exclusively on norm-referenced tests (NRT) and multiple-choice items and to introduce criterion-referenced tests (CRT) and achievement levels, along with a variety of constructed-response item formats and performance tasks. The big question in the 1990s was Who? as we decided that it was no longer acceptable to systematically exclude students with disabilities, English learners, and, to be honest, low performers in general from state testing. Once we had decided that everyone should be tested, the 2000s with NCLB went really big on When? – and the answer was all the time: every grade level, every year. The 2010s, with the Common Core State Standards and the long-awaited transition to computer-based testing, were focused on What? and, once again, How?
Through it all, Why? – inarguably the most important question when designing and implementing a testing program – was left largely unasked and unanswered, or at least not answered well. Sure, we had the fallback answer, It’s the law, but aside from vague allusions to equity, excellence, a nation at risk, and honesty gaps, there was not really a coherent answer as to why it was the law. States, of course, filled the void with their own reasons for testing, wish lists of purposes and uses that tended to focus more on modeling and motivation than on measurement.
Now, as we struggle mightily to come up with the who, what, when, where, and how, we are finally being forced to address the question Why? Why are we testing? Answering the Why? question, of course, will not be easy, and it will inevitably lead us to reconsider who, what, when, how, and, for the first time, even where we test. Yes, it will be messy for a while.
But that’s the way this is supposed to work and that’s something to be thankful for.
It’s Only Fair
It took me too long to come around on this one, but I’m thankful that fairness seems to be assuming its rightful place as foremost in the pantheon of testing concerns. Validity, with its trusty sidekick, Reliability, in tow, may have been preeminent when the main concern was constructing measures and designing tests, but ever since the field realized decades ago that testing was its primary focus, there has been nothing more important than fairness. We’ve danced around the issue and have been reluctant to knock validity off its pedestal, but that refusal to face the truth has only muddied the waters. The time has come to give fairness its due, and I am thankful for that.
Note to the Standards committee: I fully expect the new revision of the testing Standards to lead with Fairness.
The Next Generation
I am thankful that we are finally making real progress in moving through the four generations of “computerized educational measurement” that Bunderson et al. laid out in their 1989 framework.
Generation 1: Computerized testing: administering conventional tests by computer.
Generation 2: Computerized adaptive testing (CAT): tailoring the difficulty or contents of the next piece presented or an aspect of the timing of the next item on the basis of examinees’ responses.
Generation 3: Continuous measurement (CM): using calibrated measures embedded in a curriculum to continuously and unobtrusively estimate dynamic changes in the student’s achievement trajectory and profile as a learner.
Generation 4: Intelligent measurement (IM): producing intelligent scoring, interpretation of individual profiles, and advice to learners and teachers, by means of knowledge bases and inferencing procedures.
As referenced above, it took those of us in large-scale state testing decades to get to Generation 1. Having reached Generation 1 by 2015, however, we were ready to begin to test the waters of Generation 2, and that’s where we find ourselves now. With the emergence of AI tools, we now have the means at our disposal to view Generation 3 as a realistic possibility – and that’s exciting and something to be thankful for.
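For the record, here is roughly what Generation 2 adaptivity amounts to in practice. What follows is a minimal sketch, not a production engine: the 2PL model, the made-up five-item pool, the grid-search ability estimate, and the fixed four-item loop are all simplifying assumptions, and a real CAT adds exposure control, content constraints, and a proper stopping rule.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL model: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def likelihood(theta, responses):
    """Likelihood of the observed (a, b, score) triples at a given theta."""
    value = 1.0
    for a, b, score in responses:
        p = p_correct(theta, a, b)
        value *= p if score else (1.0 - p)
    return value

# Hypothetical item pool: (discrimination a, difficulty b) pairs, made up.
pool = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5), (1.3, -0.5)]
grid = [g / 10 for g in range(-30, 31)]  # candidate thetas, -3.0 to +3.0

true_theta = 1.0                 # simulated examinee, unknown to the engine
theta_hat, responses, used = 0.0, [], set()

for _ in range(4):
    # Select the unused item with maximum information at the current estimate.
    i = max((j for j in range(len(pool)) if j not in used),
            key=lambda j: information(theta_hat, *pool[j]))
    used.add(i)
    a, b = pool[i]
    score = random.random() < p_correct(true_theta, a, b)  # simulate a response
    responses.append((a, b, score))
    # Re-estimate ability by grid-search maximum likelihood (crude but clear).
    theta_hat = max(grid, key=lambda t: likelihood(t, responses))
    print(f"item {i}: {'right' if score else 'wrong'}, theta_hat = {theta_hat:+.1f}")
```

The point is the loop itself – estimate, select the most informative item, administer, re-estimate – which is the “tailoring” Bunderson et al. had in mind.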
A caution, however, as we think about Generation 3. The psychometricians among us (along with other measurement, assessment, and testing folks) are predisposed to see the phrase “calibrated measures” and think a) that it is most important because it comes first, and b) that we know what it means. Neither is true. The concepts “embedded in a curriculum,” “continuously and unobtrusively,” “estimate dynamic changes,” “achievement trajectory,” and “profile as a learner” are all so much more important – yes, they get to the Why?
But what about calibrated measures? That brings me to the last testing thing that I find myself thankful for this Thanksgiving.
The Revolution May Not Be Parametric
Calibrated measures are more than test items and IRT scales.
We thank you, IRT. You served us well to the extent that we used you (and we barely scratched the surface of IRT in state testing), but it’s time to move on, or at least to add on.
As with the discussion of fairness above, we’ve known for such a long time that an individual student’s response to an item depends on so much more than the characteristics of the item and the person’s ability on the construct we are trying to measure. That’s true no matter how many parameters we employ. When we try to account for everything important with just the item and the individual student’s ability, we inevitably fall short. When we try to control for factors post hoc, we bury signal within noise and vice versa.
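To put a concrete face on “how many parameters”: the three-parameter logistic (3PL) model, a workhorse of state testing, compresses the entire student-item encounter into one ability and three item parameters.

```latex
% 3PL model: the probability that a student with ability \theta answers
% item i correctly, given the item's discrimination a_i, difficulty b_i,
% and pseudo-guessing parameter c_i.
P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}
```

Everything else about the encounter – the classroom, the curriculum, the day the student is having – is consigned to error, and that is exactly the falling short I mean.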
And we’ve known for such a long time that we really should and need to be “measuring” things that don’t fit nicely within a traditional IRT model – unidimensional or multidimensional.
Fear not. There are lots of other statistical, data science, psychometric, and measurement tools out there to help us model, understand, and provide feedback on student achievement. Skeptical? Look at the course catalog for any reputable graduate program in measurement from the past ten years and then get back to me. Read the foundational competencies of educational measurement documents and then get back to me. Talk to a teacher and get back to me.
Where You Lead, We Will Follow
Finally, as I have been arguing in my recent posts, the focus in education has shifted from testing to student assessment. Most importantly, as a field, we seem prepared to embrace that shift and to accept a supporting role.
Testing can be a key component of program evaluation. Testing can be a key component of student assessment. Testing can be a key component of instruction. However, it is only a component. More importantly, it cannot really function on its own, outside of program evaluation, student assessment, and instruction. I don’t know when we lost sight of that fact of testing life. I suspect it had something to do with a cultural obsession with specialization, with breaking complex things into ever smaller parts. So we built programs and courses devoted to tests and testing, devoid of necessary context. We seem ready for a course correction.
And that is something we all should be thankful for.
Happy Thanksgiving!