The Revolution Will Not Be Parameterized

Later this week, the Mastery Transcript Consortium (an ets* company), with its promise of promoting and celebrating learner agency, meaningful assessment, and a different set of metrics than we are used to seeing, holds its 2024 symposium in Denver.

Simultaneously, the 5th NCME special conference on classroom assessment, with the theme “Reclaiming the Promise of Balanced Assessment Systems: Achieving Deeper Learning at Scale for Both Students and Adults”, will be taking place in Chicago. The theme says it all – or at least it tries very hard to include everything that could possibly be said.

Next Tuesday and Wednesday, we have the AI-loaded program of the 49th annual conference of the International Association for Educational Assessment in Philadelphia. The integration of AI with the distinct international perspective on educational assessment for which the conference is known promises to be fascinating.

And the week ends with those eager beavers from the Center for Assessment, fresh off supporting the Classroom Assessment conference, holding their own annual RILS conference in Portsmouth, New Hampshire, taking stock and looking ahead at consequential uses of assessment; place your bets now on the number of times the phrase “instructional utility” is uttered across those two days. Figuring out how high to set that over/under line is as difficult as any assessment challenge that we will face.

It’s a safe bet, however, that over the course of these two weeks proficiency in English language arts and mathematics will take a back seat to competencies, creativity, collaboration, and the rest of those ever-elusive 21st century skills.

Calls for personalization will trump standardization.

Multiple data points collected throughout the year will be favored over a single data point collected at the end of the year.

On-demand performance will give way to extended, DOK-4-type tasks.

I’m fairly certain that what you will not hear a lot of over the next two weeks is discussion of scales, scaled scores, and the traditional IRT models that have served us well for the past 35 years or so. You may hear a lot about difficulty and discrimination, but not in our technical sense of those terms.
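
For anyone outside the field wondering what that technical sense is, here is a minimal sketch of one traditional IRT model, the two-parameter logistic, where difficulty is the point on the latent scale at which a student has a 50/50 chance on an item and discrimination is how sharply that chance changes as ability changes. The item parameters below are invented for illustration; they are not drawn from any real test.

```python
import math

def prob_correct_2pl(theta, a, b):
    """Two-parameter logistic (2PL) item response function.

    theta : student ability on the latent scale
    a     : item discrimination (how sharply the curve rises)
    b     : item difficulty (the ability level where P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two invented items, equally discriminating: an easier one (b = -1.0)
# and a harder one (b = 1.0).
for theta in (-1.0, 0.0, 1.0):
    p_easy = prob_correct_2pl(theta, a=1.2, b=-1.0)
    p_hard = prob_correct_2pl(theta, a=1.2, b=1.0)
    print(f"theta={theta:+.1f}  P(easy)={p_easy:.2f}  P(hard)={p_hard:.2f}")
```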

It no longer makes sense to try to explain a three-dimensional world with two-dimensional plots, univariate thinking, and unidimensional constructs. Truth is, that hasn’t made sense for a long time, but advances in technology have finally reached the point where it’s feasible to start thinking differently about assessment.

It’s a brave new, multidimensional world out there that requires a new way forward in educational assessment.

Where do we begin?  First step, closure.

Thank you, IRT; you served us well.

IRT replaced classical test theory in the development, implementation, and maintenance of large-scale assessment programs universally (with the possible exception of a certain college admissions test, no, not that one, the other one) because it solved a lot of our problems. So many testing problems were eliminated or alleviated by the apparent ease with which IRT allowed us to administer multiple test forms and report results on a common scale.

IRT software came along at just the right time, as first Dr. Cannell and then Dr. Koretz pointed out the problems associated with the then-common practice of administering the same test form over and over again.

The problematic reality, at least as I see it, however, is that while IRT addressed our problems, it did little to help us address their needs; that is, the needs of the constituents or stakeholders we were supposed to be supporting with our tests: teachers, students, and even policymakers.

The biggest problems with our use of IRT, among many, are arbitrariness, abstraction, and irrelevance, which play out in three critical areas:

  • Scaled scores don’t provide information about what students can and cannot do.
  • The distance between two scaled scores, or between a scaled score and an achievement target, is both arbitrary and abstract. Arbitrary – the same gap in performance could be 5 points, 10 points, or 50 points depending only on the reporting scale I decided to create (see the sketch after this list). Abstract – standard deviations and effect sizes. Need I say more?
  • As I explained in a post back in 2016, the IRT-supplied equal interval scale that we value so much does not reflect equal intervals on the most important factor to teachers and students, the amount of time and/or effort it takes to move between two points on the scale.
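
To make the “arbitrary” point concrete, here is a minimal sketch, with invented slope and intercept values, of how the same half-logit difference between a student and an achievement target reports as 5, 10, or 50 points depending solely on the linear reporting scale someone chose to create.

```python
def to_scaled(theta, slope, intercept):
    """Linear transformation from the theta metric to a reporting scale."""
    return slope * theta + intercept

# The same underlying gap between a student and an achievement target...
theta_student, theta_target = 0.0, 0.5

# ...reported on three invented scales that differ only in slope and intercept.
for label, slope, intercept in [("Scale A", 10, 500),
                                ("Scale B", 20, 200),
                                ("Scale C", 100, 1000)]:
    gap = (to_scaled(theta_target, slope, intercept)
           - to_scaled(theta_student, slope, intercept))
    print(f"{label}: the gap reports as {gap:.0f} scaled-score points")
```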

None of these problems is insurmountable, but whatever our reasons, we chose not to surmount them, or in many cases to even attempt to address them.

Note that a lot of this could have and might have been different if our assessment forebears had chosen the Rasch route; that is, chosen scales over scaling. Alas, they didn’t, and here we sit.
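
For readers who have not run into that shorthand, a rough gloss (the reading of “scales over scaling” is mine): in the Rasch model every item shares the same discrimination, so difficulty is the only item parameter, person and item measures sit on the same logit scale, and the raw number-correct score carries all of the information about ability. A minimal sketch:

```python
import math

def prob_correct_rasch(theta, b):
    """Rasch model: a 2PL with discrimination fixed at 1 for every item, so
    difficulty b is the only item parameter and persons and items share a
    single logit scale."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A student at theta = 0.7 has about a 2-in-3 chance on an item of average difficulty.
print(round(prob_correct_rasch(theta=0.7, b=0.0), 2))  # 0.67
```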

So, moving forward, how do we prioritize the needs of teachers, students, and, yes, even policymakers while maintaining critical technical quality?

Putting the Needs of Others Ahead of Our Own

We’ve been very good at backfilling, retrofitting, and acquiescing in an attempt to “give the people what they want.”

Shorter tests. Sure.  How short?

Subscores. No problem. How many?

Develop custom tests. OK. Provide off-the-shelf tests. Of course. Offer off-the-shelf tests that let me feel like I am developing a custom test. You got it.

Alternate forms. Extended time. Immediate results. Direct Writing. No Direct Writing.

The list goes on and on.

But now is the time to be proactive instead of reactive. To figure out how to build technically strong assessment instruments that measure what needs to be measured and support the inferences that need to be made.

And to figure out how to use information from those instruments and other relevant sources to build models that support technically strong assessment, evaluation, and accountability systems.

As a starting point, we need to acknowledge five educational assessment realities, which can also serve as design principles:

Embedded Assessment > External Assessment

We’ve understood since the late 1980s that a single, external, on-demand, end-of-year assessment is insufficient to measure the complex student performance that we want to measure, to support the inferences that we want to support. We jumped into performance assessment and portfolios in the 1990s with both feet and our eyes closed and suffered the consequences. But at least we jumped.

Advances in technology over the past 30 years make many of the logistical and technical problems that vexed us then eminently solvable. And that’s without even considering AI.

Progressions > Scales

Assessment is all about real content and skills that people can see and touch and understand. Latent traits won’t cut it. Well-researched and established constructs can guide our practice, but we are neither psychologists nor measurement theorists. Our job is not to measure constructs.

If and where they exist, we need behaviorally anchored learning progressions that will lend themselves to the development of a portfolio of exemplars of student performance at key points in the developmental and educational process.

Guiding Instruction > Guiding Policy

When we decided to focus test design and reporting on individual student scores, guiding instruction became more important than guiding policy. That might not have been the correct decision at the time, but it was inevitable and is the correct decision now.

The good news is that starting from a position of guiding instruction can still yield information that will help guide policy. As current practice has proven, the converse is not true.

Prediction > Precision

Prediction has become a four-letter word in assessment circles, but the fact remains that educational assessment always has been and always will be more about “prediction” than about “precise” measurement – particularly with regard to individual students.

Not prediction in the “crystal ball” sense, but prediction in the sense of supporting sound inferences based on the indicator (not the measure) that is the common test and test score. Inferences about the student’s current status based on their test performance (and other available information) and inferences about the student’s future status given their current status and context.  Those statements apply to mastery tests of individual or small clusters of standards as well as to comprehensive tests of proficiency in a content area such as English language arts or mathematics.

How did we ever allow our probabilistic science to be used to produce test scores precise to four decimal places?

Standardization > Personalization

Bet you didn’t see that one coming. The three pillars of standardization (content, administration, and scoring) remain critical. However,  as my longstanding colleague, Steve Sireci, explained in his 2020 article, we need a much more sophisticated understanding of what needs to be standardized and why.

Let’s begin with the premise that there is a common core of knowledge and skills that all educated students must obtain and possess. My personal belief has long been that with regard to these skills, there is a common, shared, standardized (if you will) outcome and that personalization comes into play in instruction and in how students demonstrate their knowledge and skills – but not personalization of the outcome itself.

It’s critical that we stop thinking that common “knowledge and skills” requires identical content across students. Except perhaps for the most basic low-level facts, it never has.  Ultimately, if linking personalized manifestations of a desired standardized outcome is our greatest challenge, we’ll be on easy street.

Beyond the common core and common outcomes, which for the most part should be covered by the eighth grade, students will embark on a variety of personalized pathways that will vary among students, lead to different outcomes, and require different assessments. The scenario I am describing is not new. It was the prevailing model of public education for the majority of the past 100 years.

It will not be our place or our responsibility to attempt to link those different outcomes or to scale levels of postsecondary success. It will be our responsibility to ensure that every pathway is supported by sound assessment.

The assessment solutions that emerge from these five principles may involve parameters of both the technical and practical variety. They will almost certainly also include nonparametric components. They will include high-quality assessment materials available to teachers and schools, along with high-quality curriculum and instructional materials. There will be many exemplars of student performance and desired outcomes. The assessment solutions will likely lean more heavily on complex analytical models than our current so-called measurement models.

There it is. I’ll leave you to it.


Published by Charlie DePascale

Charlie DePascale is an educational consultant specializing in the area of large-scale educational assessment. When absolutely necessary, he is a psychometrician. The ideas expressed in these posts are his (at least at the time they were written), and are not intended to reflect the views of any organizations with which he is affiliated personally or professionally.