Predicting, The Future of State Testing

My focus this week is on the future of state testing. Like many state tests, my thoughts contain a mix of wishful thinking, faulty forecasting, harsh reality, and not enough deep thinking.

For those of you who prefer a quick read, a succinct summary of my prediction for the future of state testing is provided by the title of the post.

Predicting, The Future of State Testing

I’m sure that some of you read the title and wondered, “What is that comma doing there?”

Others read the title and text above and are still asking themselves, “What comma?”

Is it a typo? No.

Is it typical poor American grammar? No, I can usually make it through the title before becoming handcuffed by grammar and paying a visit to Grammarly.com (although I do miss Miss Grammar).

Lynne Truss fans undoubtedly have already made the connection between the header image and my quirky use of punctuation.

In a nutshell, my prediction for the future of state testing is that predicting is the future of state testing.

Of course, as one steeped in the traditions of educational measurement with decades of state testing experience, I use the term “predicting” simultaneously and interchangeably in the narrowest and broadest, most highly technical yet comfortably colloquial sense of the word. You would expect and accept nothing less.

I am not talking about “looking deep into a crystal ball” predicting or “Magic 8 Ball” predicting – although I would never question the power of either.

I am talking about the type of predicting that has been at the root of educational measurement, large-scale assessment, and state testing since the very beginning.

So, before looking at the future of testing, let’s start at the very beginning – a very good place to start.

And if you will allow me to put a slight twist on the words of a fellow Boston Latin School alum, if we can remember the past, perhaps we will be able to repeat it.

Predicting, The Past

In the beginning there was criterion-related validity.

Criterion-related validity was an index, an indicator, a measure (take your pick) of the extent to which performance on a test instrument predicted performance on/in something of interest (i.e., the criterion).

Criterion-related validity was divided into concurrent validity (predicting current performance) and predictive validity (predicting future performance). But there was always a criterion and always a prediction.
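For readers who like to see the machinery, the classic criterion-related validity coefficient is nothing more exotic than the correlation between scores on the test instrument and scores on the criterion measure – gathered at roughly the same time for concurrent validity, or at a later point for predictive validity. Below is a minimal sketch of that calculation; the scores, scales, and sample are invented purely for illustration.

```python
import numpy as np

# Hypothetical data: scores on a state test and, a year later,
# scores on a criterion measure (e.g., a grade or a later assessment).
test_scores = np.array([480, 512, 535, 467, 550, 501, 523, 495])
criterion_scores = np.array([2.7, 3.1, 3.4, 2.5, 3.8, 3.0, 3.3, 2.9])

# Criterion-related (here, predictive) validity coefficient:
# the Pearson correlation between test and criterion.
validity_coefficient = np.corrcoef(test_scores, criterion_scores)[0, 1]
print(f"Predictive validity coefficient: {validity_coefficient:.2f}")
```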

And it was good.

Well, it was pretty good. We thought it could be better.

Initial efforts to improve testing and criterion-related validity focused exclusively on improving the test instrument. It didn’t take long, however, for our forebears to turn their attention to the criterion.

Because in educational measurement the criterion is almost always a complex set of behaviors without an established standard or scale, the selection and quality of the indicator of the criterion are just as important as (perhaps even more important than) the technical quality of the test instrument.

So, efforts proceeded simultaneously on two fronts: improving the test instrument and improving our indicator of the criterion.

It doesn’t take a PhD in psychometrics to see how the distinction between these two lines of work quickly became blurred. Both were trying to develop a better indicator of the same complex phenomenon. Eventually, quite quickly in historical terms, the distinction between test and criterion evaporated, and the lines of work converged, resulting in Messick 1989 (to be clear, the criterion still exists in Messick’s concept of validation, it’s just harder to see).

The criterion is dead. Long live the construct.  

In the 30-plus years since Messick 1989, educational measurement has focused its efforts on the construct – better understanding, better defining, and better measuring the construct. Content-oriented organizations like NCTM developed content standards, and shortly thereafter states did the same. State tests were developed to assess student achievement of those standards. Measurement’s construct underrepresentation and construct-irrelevant variance merged with testing’s alignment.  Alignment gave birth to alignment studies.

And through it all, we lost the concept of the test being separate from the criterion.

And it was not good.

Because, for better or worse, virtually all educational measurement is indirect measurement. We have bastardized the concept by relegating the term indirect measurement to surveys and self-reports of attitudes, thoughts, and perceptions of student learning, and to indicators of correlates of student learning – reserving the term direct measurement for our cognitive, academic tests.

But no amount of wishing to make it so or purging of dead, white, male eugenicists and their regression-based statistical methods will make educational measurement any more direct.

There will always be a criterion that is separate from, magnitudes more complex than, and exponentially more important than what we are measuring with our test instruments or data we are collecting through other means.

We will always be making inferences and claims about student performance that go well beyond the information collected through large-scale test instruments like a state test. We will always generalize from a score to a criterion of interest whether that score is derived from a single end-of-year test, a through-year set of tests, or the entire set of tests administered across grades 3 through 8.

In other words, we will always be predicting.

And that is good.

And the sooner we come to grips with that reality, the better off and more useful we will be as a field.

Predicting, The Future

What is the future of state testing?

I don’t know. I have no crystal ball and no flux capacitor. And I sold my Magic 8 Ball in a yard sale years ago.

However, when I read work as diverse as Mislevy’s Sociocognitive Foundations (2018), Mehta and Fine’s In Search of Deeper Learning (2019), and Randall’s “Color-Neutral” is Not a Thing (2021), a common thread that I see is the ongoing search and call for both better criteria and better tests (or other tools) to assess them. The same theme emerges in the literature on engagement and performance-based assessment and also in anti-testing screeds — better criteria, better tests.

Tests and criteria are not one and the same, even if both are intended to be indicators of the same underlying construct. It’s not a question of improving the tests or improving the criteria. As it was in the beginning, it’s both.

Even when I read more atheoretical, data science-y type things, such as young children’s response patterns during an online game serving as an early predictor of dyslexia, I see the potential that exists in re-acquiring a better understanding of the concepts of a criterion and criterion-related validity – now divided into concurrent, predictive, and retrospective validity – if for no other reason than to protect ourselves and others from spurious relationships in our data.

But it is when I read Computational Psychometrics (von Davier et al., 2021) that I begin to see a vision of what the future might look like and how models might be used to better understand and inform instruction and student learning – and it’s a good thing.

I have long argued that state testing should be viewed as a data collection problem to be solved rather than a measurement problem. When I think about the combination of new computational models being developed and ongoing advances in technology breaking down the barriers to continuous data collection from schools and classrooms, I am cautiously giddy about the future of “state testing”.

Will there still be on-demand, end-of-year or through-year tests? Yes, I think that those tests will be around for some time to come because they serve a purpose. In an ideal future world, however, their import will be greatly diminished. They will contribute one piece of information to inform better and more complete models that predict student learning and other key criteria.
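As a purely hypothetical sketch of what “one piece of information among many” might look like, consider a simple model in which the end-of-year test score is just one of several predictors of a criterion of student learning. The data, the feature names, and the choice of an ordinary least-squares model are all invented for illustration; the computational models the field is actually developing are far richer than this.

```python
import numpy as np

# Hypothetical predictors for a handful of students: end-of-year test score,
# classwork completion rate, and a practice-platform engagement index.
X = np.array([
    [512, 0.85, 0.62],
    [480, 0.70, 0.40],
    [535, 0.92, 0.75],
    [467, 0.55, 0.35],
    [550, 0.95, 0.80],
    [501, 0.78, 0.58],
])
X = np.column_stack([np.ones(len(X)), X])  # add an intercept term

# Hypothetical criterion of student learning (e.g., performance on a later, richer task).
y = np.array([3.1, 2.6, 3.5, 2.3, 3.7, 3.0])

# Fit a simple linear model by least squares; the test score is one predictor among several.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef
print("Coefficients (intercept, test, classwork, engagement):", np.round(coef, 3))
print("Predicted criterion scores:", np.round(predicted, 2))
```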

The focus of state testing will shift from tests and test scores to models and the predictive power they bring to explaining, informing, and improving student learning.

And I predict that will be very good.

For at that point, educational measurement and state testing will finally begin to realize their potential to support P-12 education.

Published by Charlie DePascale

Charlie DePascale is an educational consultant specializing in the area of large-scale educational assessment. When absolutely necessary, he is a psychometrician. The ideas expressed in these posts are his (at least at the time they were written), and are not intended to reflect the views of any organizations with which he is affiliated personally or professionally.
