For the past three decades we have weathered the storm of test-based accountability in a skiff called IRT. Buttressed only with a few psychometric tools, loosely defined unidimensional latent constructs, and the promise of measurement invariance we have been buffeted by wind and waves, sometimes taking on water, sometimes finding ourselves run aground in shallow waters.
To this point, however, we have learned to live with a rain cloud constantly overhead. We have managed to avoid the icebergs and to find safe harbor whenever the gales of November came early. Much like he who shall no longer be named, we may have no idea where we are in a place thousands of miles from our intended destination, but we are happy nonetheless to drop anchor and unload our cargo of unintended consequences.
But now it appears that our self-contained, and largely self-created, depression has absorbed, or been absorbed by, the tempest raging throughout society, producing a perfect storm that has raised questions about everything from why we test to what we test to whom, when, where, and how we test. As it must be with storms, the havoc this storm will wreak on large-scale testing, and assessment in general, is largely out of our control.
We can take stopgap measures to keep our heads above water during this storm, but as we look ahead toward recovery and being better prepared in the future, it is incumbent on our field to take a close, hard look not only at how we use assessment and large-scale testing, but also at the structural foundation of IRT.
Let’s begin with two core tenets upon which we claim our belief in Item Response Theory is based:
- Student performance on an item (i.e., the probability of a correct response) is dependent solely on the ability of the student and the characteristics of the item.
- IRT models yield invariant item parameter estimates.
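The first tenet is usually formalized in a model such as the two-parameter logistic (2PL), in which the probability of a correct response depends only on the student's ability and two item characteristics. A minimal sketch (the function name is mine, not a standard API):

```python
import math

def p_correct(theta, a, b):
    """2PL IRT model: probability of a correct response given
    student ability theta, item discrimination a, and item
    difficulty b -- and, per the tenet, nothing else."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A student whose ability equals the item's difficulty has a 50%
# chance of success, regardless of the item's discrimination.
print(p_correct(0.0, a=1.2, b=0.0))  # 0.5
```

Everything else a sociocognitive view might care about, such as context, opportunity to learn, and test conditions, is absent from the function signature; that is precisely the point at issue.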
It is difficult for me to think of any concept that is more antithetical to prevailing societal views than that student performance is dependent solely on the ability of the student and the characteristics of the item. No more need be said about that.
Invariance, however, would be a close second. Education Reform is predicated on the belief that existing structures can be changed, should be changed, and will change. It may be comforting to think that there is a fixed scale against which that change can be measured, but in the social sciences such scales are rare, if they exist at all (and educational measurement is a social science – on its best days).
We can make comparisons to the way things used to be, and such comparisons can be useful. It is usually much more useful for a field such as ours, however, to be able to describe things as they are now and ideally to be able to predict how they are likely to be in the future if current conditions do not change, and more importantly if and when those conditions do change.
You may argue that the field and measurement theory have moved beyond my simplistic restatement of those two core principles. There is certainly some truth to that. The recent work of Mislevy on the sociocognitive foundations of educational measurement raises serious questions about how we deal with the many forces affecting student performance on individual items and tests. And there are certainly more sophisticated statistical modeling techniques and similar approaches that attempt to account for other factors influencing, or causing, student performance. But practice in large-scale testing often lags well behind theory.
Also, like most religions and nations, we continue to cite these founding principles as gospel regardless of how often and how well we actually apply them. Although that practice might be relatively harmless when things are functioning well, or at least normally, it can become problematic when we lean on those core principles in times of trouble.
It will not be particularly helpful if we operate for even a short time under the belief that the only options available to us in our current efforts to improve tests and testing are to alter our definition of student ability or to attempt to develop items that account for everything that might influence an individual student’s performance.
Even more harmful, I believe, is holding onto the concept that invariance – as currently defined and applied – is a desirable property. Yes, we would like to be able to claim (and expect) measurement invariance in the sense that student scores on the same test or different test forms are comparable. We would like to be assured that we are measuring a construct equally well for all groups of students or at least understand whether the construct functions in the same way for all groups of students.
Invariance of items on a scale, however, is a totally different matter for several reasons.
- First, the large-scale testing wing of our field long ago eschewed the use of Item Response Theory to develop “test scales” (in the same sense of the term as other psychological or physical measurement scales) in favor of simply modeling existing data.
- Second, as described briefly above, it is not clear that invariance over time is something that we should expect or want in a large-scale testing setting.
- Third, it is clear that typical practitioners in large-scale testing have a limited understanding of invariance, the factors that affect it, and how to handle and interpret departures from it from a strictly statistical/psychometric perspective; and as a field we have even less of an understanding of the implications of all of the above from an educational perspective.
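One purely statistical reason invariance is slippery, even before the educational questions arise, is that IRT parameters are identified only up to a linear transformation of the latent scale: two separate calibrations can assign "the same" item different parameter values while implying identical response probabilities. A minimal sketch under the 2PL form (names and values are illustrative):

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Transform the latent scale: theta* = A*theta + B.
# The item parameters absorb the change: a* = a / A, b* = A*b + B.
A, B = 1.5, -0.3
theta, a, b = 0.8, 1.2, 0.5

p_original = p_2pl(theta, a, b)
p_rescaled = p_2pl(A * theta + B, a / A, A * b + B)

print(abs(p_original - p_rescaled) < 1e-12)  # True: same probabilities
```

This is why separate calibrations must be linked onto a common scale before item parameters can be compared at all, and why a practitioner who sees parameter differences across administrations cannot, without that linking step, say whether the items drifted or the scale did.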
I will address these topics more fully in a subsequent post this month, but the bottom-line takeaway is that in the long run a continued focus on invariance will do more harm than good.