To educate is to change.
Instruction and learning are about change.
Educational measurement is defined by variance. Literally. The fundamental concepts in the field are expressed in terms of variance. One of the first techniques that we learn as eager young graduate students is analysis of variance. Without variance, our lives as psychometricians, much like the concept of reliability, would be undefined.
Yet despite everything else in our field being based on change, virtually everything we do in assessment rests on an assumption of invariance. The development of tests, the creation of scales, and the interpretation of test results are all dependent upon the concept of item invariance.
It’s just one of the many contradictions, logical inconsistencies, and discontinuities in our field.
Now, I know what you are thinking. “That’s not fair. Score variance and item invariance are not contradictory or mutually exclusive. You’re just playing with words to make a point.” Truthfully, I have no idea what you’re thinking. And duh, this is a blog; of course I am playing with words. But there is a point.
All I Need Is One Fixed Point
A fixed point can come in handy. Doc Edgerton used the Boston Harbor tunnels as a fixed point when he was developing sonar. The Apollo 13 astronauts used the sun as a fixed point when they had to make a critical course correction without a computer. (In the movie, Ron Howard changed it from the sun to the earth for dramatic effect. We all know that the earth, like an item parameter, is not a fixed point, but I am getting ahead of myself.)
In educational measurement, we have no harbor tunnels and no sun. We develop tests to measure constructs that are latent. Students’ true scores on those tests (or latent constructs) cannot be observed. (Pun intended.) We yearn for a fixed, or invariant, point of our own.
There is, of course, a real need for measurement invariance; that is, to be able to conclude that a construct is being measured in the same way and equally well across different tests or test forms. Put another way, we need to be assured that different test forms are measuring the same construct and support consistent inferences about student ability, achievement, etc. (We are also concerned about measurement invariance across groups, but let’s hold off on that for just a bit.)
For a time, we were able to use the state mean as a proxy for a fixed point (like Ron Howard used the earth). Nobody expected the state mean in content areas such as English language arts, mathematics, science, or social studies to really change from one year to the next. There might be some variation year-to-year within schools or between districts due to a new instructional program or some external factors, but certainly not enough to change the state mean.
I mean, we were chided and made to stand in the corner for the Lake Wobegon Effect – average scores across the state improving from one year to the next.
It was a simpler time.
But then came Education Reform with its expectations for change, holding educators accountable for improvement. People expected, even demanded, that the state mean change from one year to the next – and at every grade level. We were screwed. We needed a hero.
Along came IRT.
From Measurement Invariance to Item Invariance to Thinking Set in Stone
IRT gave us measurement invariance, item invariance, and NEAT ways to link test forms. We ran with it. We started with a theory of item invariance, an assumption that could be tested in our data. A few Rasch types in Chicago and Berkeley continued to think in terms of theory and testing assumptions. The rest of us in large-scale testing, not so much. There was item fit, person fit, we developed some procedures, and yada yada yada, the concept of parameter drift was created – developed in a psychometric lab and spread by loose-lipped academics mixing with operational psychometricians at a conference reception with a buffet table and open bar.
Parameter drift was bad.
Parameter drift was evidence that our theory about the structure of items within the latent construct was wrong (or that there was a problem with the items or the people).
But wait, what theory? We never had a theory. We just modeled the data at a particular point in time.
Then parameter drift must be evidence that the original model no longer fit the data.
So what? (Don’t you just love stopping a conversation with the “So What?” question?)
Let’s Slow Things Down
What exactly was it that we modeled?
In state testing, the norm is for the model to reflect the performance of virtually all of the students in the state shortly after a new set of content standards has been introduced.
- We fully expect instruction to change as teachers become more familiar with the standards.
- We fully expect student performance to change as students become more familiar with the standards.
- We fully expect the improved instruction on the standards and improved student learning at previous grade levels to act in combination with the above to result in a change in student performance.
With all of this change taking place, why would we have any expectation that our model of the data would look the same in two years, five years, ten years? (Because we think we have to, that’s why.) That’s foolish.
Of course, the relationships between the items, and consequently the item parameters, are going to change. That’s the way real life and real knowledge and skills work.
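To make that concrete, here is a minimal Python sketch, not anyone's operational procedure. It assumes a crude stand-in for item difficulty (the negative logit of the proportion correct, centered on the form mean) rather than a real IRT calibration, and the item names and proportions are invented for illustration. Centering removes the across-the-board gain, so what is left is the change in how the items sit relative to one another.

```python
import math


def centered_logit_difficulties(p_correct):
    """Crude difficulty proxy: -logit(proportion correct), centered on the form mean.

    This is an assumption made for illustration, not a calibrated IRT difficulty.
    """
    raw = {item: -math.log(p / (1.0 - p)) for item, p in p_correct.items()}
    mean = sum(raw.values()) / len(raw)
    return {item: d - mean for item, d in raw.items()}


def drift(p_year1, p_year2):
    """Change in each item's centered difficulty proxy from year 1 to year 2."""
    b1 = centered_logit_difficulties(p_year1)
    b2 = centered_logit_difficulties(p_year2)
    return {item: b2[item] - b1[item] for item in b1}


# Hypothetical proportions correct for three items in two administrations.
year1 = {"item_A": 0.50, "item_B": 0.50, "item_C": 0.50}
year2 = {"item_A": 0.60, "item_B": 0.60, "item_C": 0.85}

for item, change in drift(year1, year2).items():
    print(f"{item}: {change:+.2f}")
```

In this made-up example every item gets easier in absolute terms, but item_C moves much more than the other two. Its relative difficulty drops while the others appear to move up, which is the kind of pattern a drift screen would flag.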
In real life, things are difficult until they aren’t. Difficulty (and discrimination) doesn’t stay the same. Further, real increases in knowledge and skills look more like jumps in a step function than a gradual, smooth movement along an ogive.
- I can’t catch a pop-up until I can. Then I never miss one.
- I can’t play across the break on the clarinet no matter how much I practice and then I can all the time. (Play the sunset.)
- My pie crust is too crumbly or as hard as a rock, or heaven forbid, I have the dreaded soggy bottom, then it is perfect every time.
- I can’t solve work problems or mixture problems or limits, and then I never get any wrong.
Some knowledge and skills may have been more difficult than others to acquire or were acquired sequentially (which is a distinction with interesting implications) but having been acquired it would be difficult to distinguish among them.
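The contrast between an ogive and a step function can be sketched in a few lines of Python. The logistic form below is the usual two-parameter IRT curve; the function names, the threshold b, and the slope values are chosen purely for illustration. The last loop shows that cranking up the discrimination parameter a makes the ogive behave more and more like a step.

```python
import math


def ogive(theta, b=0.0, a=1.0):
    """Two-parameter logistic curve: probability of success rises smoothly with ability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def step(theta, b=0.0):
    """Step function: failure below the threshold b, success at or above it."""
    return 1.0 if theta >= b else 0.0


# Compare the two pictures of learning at a few illustrative ability values.
for theta in (-2.0, -0.5, 0.5, 2.0):
    print(f"theta={theta:+.1f}  ogive={ogive(theta):.2f}  step={step(theta):.0f}")

# With a very steep slope (a=50 is an arbitrary choice), the ogive is nearly a step.
for theta in (-0.5, 0.5):
    print(f"theta={theta:+.1f}  steep ogive={ogive(theta, a=50.0):.3f}")
```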
Be Purposeful About What We Model
We have become complacent with modeling based on testing the “population” (i.e., all students in a grade level) and focusing on “measuring” individual student ability. Actually, we have become sloppy and boring, and have nearly rendered ourselves irrelevant. We no longer think about what we are modeling and why.
At one point in the development of tests, it was common to attempt to secure a nationally representative sample to support inferences about student performance at a particular point in time.
Between 2012 and 2015, assessment consortia made the politically expedient, but technically horrific, decision to attempt to model a consortium-representative sample (as if a consortium were a real and stable thing that anyone was interested in using as a reference point for inferences).
As described above, we now routinely model the performance of all of the students in the state at the end of a year of instruction shortly after standards are introduced.
Like the situations described above, that state data/model provides a snapshot at a particular point in time.
We can use that model to “measure” how much student and school performance has changed from that point in time (at least for a while), but it makes no sense to continue to expect that model to apply when conditions change.
It would make sense, and, dare I say, be very useful, to understand how and why the model is changing.
What does the model look like when the standards have been in place and teachers and students have had time to adjust to them?
What does the model look like for students who have attained the desired competencies in a grade level?
What does the model look like after a pandemic disrupts learning across three school years?
What does the model look like when learning is occurring over the course of a year? Modeling at the beginning, middle, and end of the year is different from applying an end-of-year model to score student performance throughout the year. (Didn’t you ever wonder why fall-spring and spring-spring performance never lined up on NRT? It wasn’t because of summer learning loss.)
If It Doesn’t Fit You Can Learn Quite a Bit
Consider what it would mean if instead of modeling the performance of all students in the state we selected a sample of students who were prepared to take a course aligned to the grade-level standards and received adequate instruction throughout the year. (Seriously, this is actually a thing for some types of tests. Look it up.)
It’s a safe bet that there would be lots of students, and more importantly schools, that don’t fit the model. And isn’t that what we really want to know?
What might misfit, or item parameter drift, for those students and schools reveal?
It may reveal that there are alternate approaches to instruction that produce equally desirable outcomes. Perhaps. Or as my long-time associate Tia Fechter explored in her 2010 doctoral dissertation, item parameter drift may be an indication of differential opportunity to learn – particularly as we attempt to model and measure more complex learning.
Wouldn’t it be useful for educators and policymakers to know if either were true? I know that it would be a lot more interesting (and fun) for people really interested in using measurement to support education.
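For a rough sense of what "doesn't fit" could look like in numbers, here is a small Python sketch using the outfit mean square, a common Rasch fit index (the average squared standardized residual, with values near 1 expected when responses behave as the model predicts). Everything specific here is an assumption for illustration: the abilities are treated as known rather than estimated, and the difficulty, abilities, and response patterns are invented.

```python
import math


def rasch_p(theta, b):
    """Rasch model probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def outfit_mean_square(responses, thetas, b):
    """Mean squared standardized residual for one item across a group of students."""
    total = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)


# Hypothetical group of six students and one item of difficulty 0 on the reference scale.
thetas = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5]
b = 0.0

expected_pattern = [0, 0, 0, 1, 1, 1]    # lower-ability students miss, higher-ability succeed
surprising_pattern = [1, 1, 1, 0, 0, 0]  # the reverse of what the reference model predicts

print("expected pattern:  ", round(outfit_mean_square(expected_pattern, thetas, b), 2))
print("surprising pattern:", round(outfit_mean_square(surprising_pattern, thetas, b), 2))
```

A group whose responses look like the second pattern is not necessarily a problem to be screened out. It may be exactly the school worth a closer look.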
Vive La Variance!
Over the past decade, advances in technology and resulting improvements in infrastructure have exponentially increased our ability to model changes in performance over time under a variety of conditions – both for grade-level cohorts and for longitudinal student data.
If we want to use large-scale testing to provide actionable information to educators and policymakers, this type of information from modeling is the best that our field has to offer – and I mean that in a good way.
We can provide information about what proficiency on the standards looks like as a student progresses from novice to expert. We can provide information about how relationships among items, knowledge, and skills change as a school successfully implements an appropriate and balanced system of curriculum, instruction, and assessment.
We can build models and we can help people interpret them.
As I have argued for a long time, teachers don’t need us to tell them whether a student is (or isn’t) proficient. They already know which students are proficient, as do most students and parents.
Educators and policymakers need us to model learning over time within and across years.
We can do that, but only if we aren’t invariant.