The question stuck in my craw this week is how we (i.e., the large-scale testing community) came to be so dependent on scaled scores, or scale scores if you prefer, as the primary method of reporting state test results. Tests that are supposed to be providing criterion-referenced information.
I could hold Andrew Ho and Dan Koretz over at Harvard accountable. However, all they did was point out the obvious flaws in the way that proficiency and achievement level classifications were being used in accountability systems, as is their wont.
It would be easy to lay the blame at the feet of one of my favorite foils, NAEP. NAEP not only lives and dies by the composite scales they created so long ago, but for years cast a pall over proficiency classifications with the prominent display of the fundamentally flawed caveat about NAEP achievement levels. To this day, decades into their use, NAEP achievement level results are still accompanied by the message, “NAEP achievement levels are to be used on a trial basis and should be interpreted and used with caution,” more a wringing than a ringing endorsement.
But still, NAEP did take the lead in attempting to create behavioral anchors for their score scale. They do deserve a lot of credit for that, because behavioral anchors are at the heart of reporting criterion-referenced test results.
It’s most likely, however, that our dependence on scaled scores is just another case of the unintended consequences that result from changing one aspect of a program but leaving others as is.
Scaled scores worked relatively well when assessment reports were chock-full of supporting content and interpretive context. It was not uncommon for state test reports to connect state test results to student, teacher, and administrator responses to questionnaires. Assessment programs like NAEP or the SAT still feature such context. Sadly, state test reports largely abandoned the context but kept the scaled scores.
And scaled scores without context are just numbers. There is no criterion being referenced.
It Don’t Mean a Thing …
It is difficult for me to overstate how little useful information is conveyed by a scaled score reported without context.
A scaled score is a little like flour in baking. Not much on its own, but a staple in most recipes. And people who know how to combine flour in just the right way with other ingredients can produce beautiful and oh so tasty treats.
Our NRT ancestors never served up scaled scores on their own. They created a whole menu of processed scores using scaled scores as their base. Sure, we know now that those scores weren’t necessarily good for us, but they satisfied our sweet tooth at the time.
Today we might find several different varieties of growth scores on the shelf, but not much else.
Accountability folks, who seem to drive everything we do, prefer scaled scores for their perceived precision and versatility.
Scaled scores offer the illusion of precision in comparison to achievement level classifications. Achievement classifications are coarse, it’s true, as Koretz and Hamilton (2006) described. Personally, I’ll take coarse classifications that tell me something about what students know and can do over filling in gaps along the achievement continuum with meaningless scaled scores.
As for the versatility of scaled scores, don’t get me started on the unnatural acts that accountability teams like to perform with scaled scores, such as computing means across different tests and scales. My heart weeps a little every time I see a state data dashboard that includes an average scaled score across grades 3 through 8 tests, sometimes even across content areas. (My reaction is similar if the assessment program employs a vertical scale, but for different reasons.)
Means are convenient, but tend to be meaningless, especially in a criterion-referenced world.
At some point we are going to have to accept that the day-to-day life, achievement, and instruction of individual students are non-parametric even if at the macro level the world fits nicely under normal curves.
When The Scale Is The Thing
The biggest problem with our over-reliance on scaled scores, however, is not their lack of interpretability.
No, the biggest problem with our over-reliance on scaled scores is that it makes us beholden to a scale (see NAEP), which in turn ties us to a particular test or a particular mode of testing (see IADA).
And as anyone who has ever tried to slim down can tell you, there is nothing worse than becoming beholden to a scale.
Although we have tended to do things ass backwards in psychometrics and large-scale testing, we are beginning to understand that achievement standards are independent of any particular test or any arbitrary scale. It took us a while to arrive at this point, so let me repeat it:
An achievement standard has nothing to do with a test or a scale.
A Continuum Not a Scale
Accepting that reality allows us to think of achievement standards in terms of the particular knowledge and skills within a well-defined domain that fall along what Glaser (1994) described as “a continuum of knowledge acquisition ranging from no proficiency at all to perfect performance.”
Defining and identifying key points along that continuum (i.e., achievement standards) allows us then to focus on determining where along the continuum a particular student falls at a particular point in time. That is, it allows us to focus on determining what knowledge and skills the student possesses and which they do not yet possess.
Now don’t go and get all deficit mindset, student agency, learning is a journey with open endpoints on me. At some point, instruction, whether student-centered or teacher-centered, is about knowing what comes next and figuring out how to get there – even if the ultimate destination is unknown.
A first step in determining where a student falls along the continuum is to agree on what type of evidence we would need to make that determination and how best to collect it.
Haven’t I Heard This Song Before?
If you have read this far, you are probably thinking that you have heard all of this before.
Isn’t the continuum of knowledge and skills within a well-defined domain simply another way of saying learning progression?
Isn’t beginning with a discussion of what type of evidence we need to see the basis of evidence-centered design and principled assessment design?
Yes, of course.
But the key point here is that when we start to talk about learning progressions, we tend to start with the scale rather than the behaviors (i.e., knowledge and skills) and that’s just wrong. As Mark Wilson taught us in his 2017 NCME presidential address, our responsibility is to adjust our theory (and our tests and scales) until it fits reality, not the other way around.
And when we are talking about evidence-centered design and principled assessment design, we are already in the process of designing a large-scale test and all the baggage and constraints that come along with it. That’s too late.
At some point, somebody will ask us to design and develop some type of large-scale test as one method of collecting evidence to determine where along the achievement continuum an individual student falls “as indicated by the behaviors he displays during testing” (Glaser, 1994). And we will perform that task to the best of our ability.
But we have to be ready to join the rest of the education community in understanding that our test does not define the achievement standard, is not the only way to collect evidence of where students fall along the achievement continuum, and for some students, our large-scale test, or any “test” is not the best way to collect evidence of where they fall along that continuum.
No Seriously, Haven’t I Heard This Song Before?
If you are still reading and you are of a certain age, you are no doubt thinking,
But Charlie, we’ve tried this before: portfolios, performance tasks, alternative test forms; and we failed miserably. What makes you think it’s going to be different this time?
Good question. I will close with a 3-part answer.
It has to be different this time.
Necessity and desperation together are always the #1 predictor of successful innovation. It will be different this time because it has to be. We don’t have the choice this time to fall back on our old methods.
The game has changed. 2023 is not 1990, 2001, or even 2015. First, we are a lot more concerned about individual students and providing a fair opportunity for each and every one of them to demonstrate what they know and are able to do than ever before. Second, we can no longer pretend that the skills that we are interested in measuring can be measured well enough through an on-demand, stand-alone, test instrument.
There have been significant advances in technology in all key areas.
Again, the game has changed since 1990, 2001, or even 2015. In 1990, we were still collecting paper portfolios and needed real people in one physical location to score them by hand. In 2001, we did not have individual student identifiers and systems for storing previous test scores and all sorts of other relevant data. In 2015, many states were still struggling to bring computer-based testing on line – and online. As they say, if we fail this time, it will be a failure of imagination.
We’ve learned from our mistakes.
After all, that’s what this game is all about. And it’s not just our technical and logistical mistakes. Of all the lessons learned, the most important lessons we have learned are the danger of hubris and our place in the world, or as Dylan Wiliam put it, “assessment should be the servant, not the master, of the learning.”