Do Re Mi-ting Expectations

Let’s start at the very beginning
A very good place to start
When you read, you begin with A-B-C
When you sing, you begin with Do-Re-Mi
When you report state test results, you begin with …

What do you begin with when reporting state test results?

I know that testing begins with hopes, dreams, and aspirations (pernicious or propitious), which lead to theories of action, claims, inferences, and generalizations, but where does reporting test results begin? And what is the connection to music?

Some of my colleagues might argue that reporting starts with a scale. You create the scale, conduct standard setting, and begin to report results.

And once it has been created, preserving and protecting the scale becomes paramount. We place new items onto the scale, rejecting those that don’t fit or refining them until they do. The scale is what allows us to report comparable results across years. The scale gives us trends. The scale is what we use to quantify differences between groups and compute effect sizes. When our needs outgrow a single grade-level scale, we jury-rig a vertical scale.

Is there anything psychometricians fear more than scale drift? Well, besides having to mingle with people in an unstructured setting.

But what is a scale but an arbitrary contrivance? A scale is a tool that helps us organize and classify data in ways that support our needs – nothing more, nothing less. As Maria said, “Now children, Do-Re-Mi-Fa-So and so on are only the tools we use to build a song.”

Constructing scales to meet our needs

We build scales to organize things.

Our traditional music scales were built around sound intervals that were pleasant to hear: the fourth, the fifth, the minor fall, the major lift. But you don’t really care for music, do ya?

That’s OK, because you do care about large-scale assessment, and we can learn a lot about our test scales from musical scales.

The third, fourth, fifth, and octave each have physical properties which can be measured. The ratio of vibrations in notes an octave apart is 2:1, in a fifth, 3:2. Pythagoras understood that. Bob Dylan and Taylor Swift may or may not have as they started writing songs. It didn’t matter. Those properties were secondary to whether the sound of the octave or the fifth was pleasing to us.

We used the mathematical properties to help construct a musical scale that featured sounds that were pleasing to us. We did not start with the mathematical relationships and then try to convince people that the sounds were pleasant. We certainly didn’t worry about equal intervals.
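For readers who like to see the arithmetic, here is a minimal sketch of the gap between the "pleasing" ratios and an equal-interval scale. The octave (2:1) and fifth (3:2) ratios come from the text above; the fourth (4:3) and major third (5:4) are standard just-intonation values added here for illustration, and the function names are my own. Treat it as an illustration rather than a tuning reference.

```python
import math

# Frequency ratios for "pure" intervals. Octave and fifth are from the text;
# the fourth and major third are standard values assumed for illustration.
JUST_RATIOS = {
    "major third": (5, 4),
    "fourth": (4, 3),
    "fifth": (3, 2),
    "octave": (2, 1),
}

# Semitones spanned by each interval in 12-tone equal temperament.
SEMITONES = {"major third": 4, "fourth": 5, "fifth": 7, "octave": 12}


def equal_tempered_ratio(semitones: int) -> float:
    """Ratio produced by stacking equal semitones, each a factor of 2**(1/12)."""
    return 2 ** (semitones / 12)


def cents(ratio: float) -> float:
    """Size of an interval in cents (1200 cents per octave)."""
    return 1200 * math.log2(ratio)


def compare_intervals() -> dict:
    """Map interval name -> (pure ratio, equal-tempered ratio, difference in cents)."""
    rows = {}
    for name, (num, den) in JUST_RATIOS.items():
        pure = num / den
        tempered = equal_tempered_ratio(SEMITONES[name])
        rows[name] = (pure, tempered, cents(tempered) - cents(pure))
    return rows


if __name__ == "__main__":
    for name, (pure, tempered, diff) in compare_intervals().items():
        print(f"{name:12s} pure={pure:.4f} equal={tempered:.4f} diff={diff:+.2f} cents")
```

Only the octave survives the move to equal intervals untouched; the other intervals are nudged slightly away from the simple ratios so that the steps can be made the same size.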

We understand that a simple “linear transformation” of that scale, changing to a different key, can not only change the sound and the feel of a piece of music but, more importantly, can dramatically affect how well two people with the same level of musical ability can interact with it (i.e., perform).

And we know that the seemingly minor adjustment of starting the same series of notes three semitones lower makes a major difference – only songs like these played in minor keys keep those memories holding on.

We are not afraid to accommodate modifications to the scale such as the blue note and we embrace rather than attempt to control for the mood effects of altered modes.

We acknowledge that the seven-tone scale that we grew up with and are comfortable listening to is not the only scale. People familiar with a different scale may bring a different perspective. You gotta make your own kind of music. Sing your own special song.

Scale construction in educational assessment has worked the same way. We identify benchmarks such as achievement level thresholds (Below Basic/Basic, Basic/Proficient, Proficient/Advanced) and we create a reporting scale to support communication of those benchmarks. The knowledge and skills that define what a student at the Proficient level can do and what a student at Basic level can do are independent of the reporting scale. Those knowledge and skills are also independent of the underlying theta scale or raw score scale.
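A small sketch may make the independence claim concrete. It assumes the reporting scale is a positive linear transformation of theta, which is a common choice but not the only one, and every number in it (the theta cut scores, the slopes, the intercepts) is hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

LEVELS = ["Below Basic", "Basic", "Proficient", "Advanced"]

# Hypothetical cut scores on the theta scale, as if produced by standard setting.
THETA_CUTS = [-1.0, 0.0, 1.2]


def to_scale(theta: float, slope: float, intercept: float) -> float:
    """Linear transformation from theta to a reporting scale (slope must be > 0)."""
    return slope * theta + intercept


def classify(score: float, cuts: list) -> str:
    """Return the achievement level; a score exactly at a cut earns the higher level."""
    return LEVELS[bisect_right(cuts, score)]


def classify_on_reporting_scale(theta: float, slope: float, intercept: float) -> str:
    """Transform the student's theta and the cuts the same way, then classify."""
    scaled_cuts = [to_scale(cut, slope, intercept) for cut in THETA_CUTS]
    return classify(to_scale(theta, slope, intercept), scaled_cuts)


if __name__ == "__main__":
    for theta in (-2.0, -0.3, 0.5, 2.5):
        on_theta = classify(theta, THETA_CUTS)
        on_50_10 = classify_on_reporting_scale(theta, slope=10, intercept=50)
        on_500_40 = classify_on_reporting_scale(theta, slope=40, intercept=500)
        print(theta, on_theta, on_50_10, on_500_40)
```

Whichever reporting scale is chosen, the student lands in the same achievement level, because the cut scores move with the scale. The numbers change; what the student knows and can do does not.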

There are literally an infinite number of scales, and types of scales, that could be “discovered” or created to help convey information about whether students possess the knowledge and skills necessary for their performance to be classified as Basic or Proficient, or for them to be certified or licensed to perform a particular task, or even to judge their musical ability. Some scales work better than others in a particular context. The best scale is the one that people understand and will interpret and use correctly.

Putting a Thumb on Scales

While working on my memoir last week, I found myself drifting back and forth between writing “scaled score” and “scale score”; so, I posted a Twitter poll (you remember Twitter). The results came in at 60% to 40% in favor of “scale scores” – by traditional standards, a landslide. The result was not surprising to me, but why the preference for “scale” over “scaled” when describing test scores? When I was in graduate school, back before IRT, test scores were raw scores that one scaled or transformed in some way before reporting results in order to make them more useful and informative to the intended recipients. As Will Lorié pointed out in a reply to my poll, scale score implies that the scores are on a scale, which further implies that said scale is important in its own right and is something that conveys useful information to the intended recipients.

That’s a big step up from an arbitrary contrivance. Unfortunately, it’s not a step that large-scale testing is prepared to take. Whether it’s achievement levels on a single test or learning progressions across the curriculum, we have a hard enough time identifying key benchmarks and putting them in the right order, never mind trying to understand how far apart they are, what happens between them, and what it takes for students to move from one benchmark to the next. Yet we have become beholden to arbitrary scales that purport to do all of those things. (purport – love that word)

Somewhere along the line in large-scale assessment, the properties of the scale became more important than the knowledge and skills that we were trying to describe (the content) and the people whose knowledge and skills we were trying to assess. I attribute this imbalance to three distinct, but interrelated, factors:

the growing separation of curriculum and instruction from assessment and the corresponding separation of psychometrics (and psychometricians) from the content on tests,
the shift from norm-referenced testing to criterion-referenced testing, and
the shift from classical test theory to item response theory (IRT).

Each of these factors, to some degree and in its own way, contributed to shifting the focus inward onto the assessment instrument and away from students and accurately representing the knowledge and skills that they could demonstrate. We became obsessed with measuring a latent trait or construct that could not be seen and was rarely well defined by anything other than the assessment instrument used to measure it. Consequently, we have paid too little attention to how performance on those assessment instruments relates to the observable behaviors of people.

Put another way, each of these factors shifted the focus of large-scale testing from assessment to measurement, and as Derek Briggs reminded us in his recent book, assessment/testing and measurement are not the same thing. Measurement needs scales. As it turns out, assessment doesn’t, at least not in the same way.

Scales can be useful tools to support educational assessment, but they cannot drive the decisions we make about assessment. As we work on re-imagining educational assessment, we have to remain open to the idea that we may have to leave behind some of our long-held views of measurement, scales, and the way that they support educational assessment.

Let’s start at the very beginning. A very good place to start.

Image by Clker-Free-Vector-Images from Pixabay


Published by Charlie DePascale