It’s time for educational measurement, or at least assessment specialists, to say goodbye to latent traits, hidden constructs, and true scores. They are useful concepts, in theory. In practice, well, they really haven’t helped us provide useful information to stakeholders, and that is the name of the game.
Admit it, you were a bit skeptical the first time they told you about correcting for attenuation and correlations greater than 1.0. And that was before you started to hear rumblings about Spearman and his cronies.
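For readers who never got the joke: Spearman's correction for attenuation divides an observed correlation by the square root of the product of the two tests' reliability estimates, and with perfectly plausible-looking inputs the "corrected" correlation sails past 1.0. A minimal sketch (the numbers below are invented purely for illustration):

```python
import math

def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Spearman's correction for attenuation: the observed correlation
    divided by the square root of the product of the two tests'
    reliability estimates."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Invented example: an observed correlation of .60 between two tests
# whose reliability estimates are .50 and .60.
print(round(disattenuate(0.60, 0.50, 0.60), 3))  # 1.095 -- greater than 1.0
```

Nothing in the arithmetic stops the result from exceeding 1.0 whenever the reliability estimates are low enough, which is precisely the sort of thing that made people skeptical.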
There’s no shame in focusing our efforts on simply understanding and explaining what we can see.
Einstein did it.
I mean, what is the general theory of relativity if not a focus on understanding and explaining observed scores in terms of observed scores? Or, as Einstein put it,
I am anxious to draw attention to the fact that this theory is not speculative in origin; it owes its invention entirely to the desire to make physical theory fit observed fact as well as possible.
There was only one lightning strike. And I’m still fairly certain that time doesn’t actually speed up as altitude increases; at least as certain as I am that student achievement doesn’t actually slow down as grade level increases.
If observed scores are good enough for Einstein…
It’s not just observed scores, of course. Those of us in psychological measurement envy what researchers in physical science have. Who among us hasn’t fantasized about being able to whip out a long, hard meter stick and measure something with authority?
But as the poet said, our hands (and minds) were meant for different work. Denied a simpler fate, we toil in the soft sciences, measuring soft skills, with soft measures, and softly report results with little conviction. The only time we exude confidence is when hedging our bets with 95% confidence intervals.
Baby, What a Big Surprise
Here’s the shocker.
Letting go of true scores will set us free to both find ourselves and better serve our stakeholders – you know, those people who need and use test results, the people in whose physical reality we practice our craft.
What will we find if we shift our focus to observed fact?
As it turns out, the things that educators and policymakers have been most interested in all these years don’t have a whole lot to do with latent traits, constructs, true scores, or even scale scores.
In the classroom, the decisions that educators need to make are about mastery of standards and attainment of competencies. Their decisions are based on students’ ability to demonstrate and apply specific knowledge and skills. Those demonstrations of performance require low levels of inference, transfer, or generalization. We can help with the development of tasks that elicit those behaviors.
Educators seek the best way to combine information across time and tasks to arrive at an overall judgment of student proficiency. We can help each other with that.
Speaking of student proficiency, those classifications of proficiency that policymakers have craved for the past two decades – our bread and butter – well, it turns out that focusing on what we can observe rather than some vague, aligned, yet undefined, underlying construct makes that task easier, too.
Imagine the possibilities if we start with detailed descriptions and examples of what proficient students know and are able to do rather than starting by mapping individual items to individual standards and then trying to divine what borderline proficiency looks like in a sweaty hotel meeting room.
Mislevy pointed us in the right direction with evidence-centered design, but we remained stuck on individual tasks. My colleagues who advocate relentlessly for writing and making better use of detailed achievement level descriptions as part of instruction as well as assessment have seen the light and are fighting the good fight.
Starting with rich descriptions, we can design and develop collections of tasks that elicit evidence of student performance consistent with those descriptions of proficiency, or competencies, or mastery. Will that be accomplished with a single, on-demand test? Unlikely. Is that a problem? Shouldn’t be.
Viewed from this perspective, classifying students (or their performance, if you prefer) as Proficient is not a heavy lift. The same holds true for claims of competence or mastery, identifying misconceptions, or locating students along our illusive learning progressions.
It Seems My Life Was Nothing More Than Wasted Time
No, that’s not what I’m saying. Don’t be so hard on yourself. There is still a need for practitioners to understand concepts such as validity, reliability, fairness, and even measurement error.
We simply must conceive of those concepts in terms of how observed scores behave rather than in terms of true scores. We need to focus on what we know rather than what we don’t know, on helping people understand what they can do with test results rather than what they cannot do. We can do that.
We can discuss sufficiency, consistency, validity, and yes, even measurement error from the perspective of observed scores, and we can do it in a way that makes sense to and supports stakeholders. Imagine it. It’s easy if you try.
We won’t be starting over from scratch.
No reasonable professional educator expects to be able to make a judgment about student proficiency on the basis of a single item or a single test. They understand the need for repeated measurement.
They understand the implications of outlier performances even if they haven’t heard of regression to the mean.
Do they fully understand how an overall grade relates to student proficiency after combining scores across tests, projects, and assignments? Do we fully understand what claims and inferences can be made about student proficiency based on a total test score after combining performances across items and tasks – or across time with a through-year assessment? People in glass houses…. Perhaps we can help each other.
Educators understand that when student performance is very close to a borderline (e.g., Proficient/Not Proficient, A/B, Pass/Fail) there is a lot of uncertainty, and that the classification decision is pretty much a coin flip.
They also understand that uncertainty decreases and classification decisions become easier as student performance moves away from the borderline and toward the extremes. What they don’t understand is why we insist on telling them that uncertainty and measurement error increase as student performance moves toward the extremes.
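The educators' intuition holds up in a toy simulation built entirely from observed scores: repeatedly administer a 40-item test to students whose long-run proportion correct sits at, above, or below a 60% cut, and count how often each is classified as passing. Every number here is invented for illustration.

```python
import random

def classification_rate(p_correct: float, n_items: int = 40,
                        cut: int = 24, trials: int = 10_000,
                        seed: int = 1) -> float:
    """Fraction of simulated administrations in which a student with
    long-run proportion-correct p_correct scores at or above the cut."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(trials):
        score = sum(rng.random() < p_correct for _ in range(n_items))
        if score >= cut:
            passes += 1
    return passes / trials

# A student sitting right at the 60% cut: close to a coin flip.
print(classification_rate(0.60))
# A student well above the cut: the decision is easy.
print(classification_rate(0.80))
# A student well below the cut: also easy.
print(classification_rate(0.40))
```

The pass rate hovers near one half for the borderline student and approaches 1.0 or 0.0 as performance moves away from the cut, which is exactly what educators already expect.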
Aligning our claims and expectations about measurement error with those of educators and policymakers may be where shifting our focus from true to observed scores pays the greatest dividends. I have never understood why we insist on reporting error in terms of theoretical true scores when Lord (1980), applying item response theory to practical testing problems, showed us that error decreases at the extremes.
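One concrete illustration (an earlier Lord result, not the full IRT treatment) is the conditional standard error of measurement for a number-correct score under Lord's binomial-error model, sqrt(x(n - x)/(n - 1)): largest for middle scores and zero at the extremes, matching the educators' intuition rather than contradicting it.

```python
import math

def conditional_sem(x: int, n: int) -> float:
    """Lord's binomial-error estimate of the standard error of
    measurement for a number-correct score x on an n-item test."""
    return math.sqrt(x * (n - x) / (n - 1))

n = 50
for x in (0, 10, 25, 40, 50):
    print(x, round(conditional_sem(x, n), 2))
# Error peaks at the middle score (25) and shrinks to zero at 0 and 50.
```

Report error this way and the message to stakeholders becomes the intuitive one: the further a score sits from the middle of the scale, the less room there is for it to bounce around.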
You know, sometimes this stuff really is intuitive.
Do I believe that here today, with a single blog post, I’ve changed your mind about true scores? Probably not. But the year is still young. There’s much more to come. Together, we’ll find the way.