Meaningful Scales: Help Wanted

In their recent EM:IP article, Brookhart and Bonner address the question, “When Is Classroom Assessment Educational Measurement?” With an approach that evokes the old punchline “I only need to outrun the bear,” they keep their focus squarely on how classroom assessment fits within educational measurement as it is currently defined and commonly practiced, thereby acknowledging but deftly avoiding the larger and largely academic debates about whether much of what we call educational measurement is actually measurement.

A primary goal of the authors is to promote enhanced communication between practitioners in the fields of educational measurement and classroom assessment (historically quite distinct) so that they can each learn from the other. Specifically, they ask those of us on the educational measurement (i.e., large-scale testing) side of things to be more open to considering “how classroom assessment can inform and advance the more general assessment field” (p. 8/11).

The Great Divide

As I read through their examples of scales that are commonly used in the classroom, it occurred to me that one area in which classroom assessment seems to be head and shoulders above us is in the development of meaningful scales, or at least in deriving meaning from its scales. Now that statement might seem counterintuitive given that scales are the foundation of IRT-based large-scale testing and rarely acknowledged as a factor in classroom assessment. Nevertheless, the difference is stark:

  • Scales developed for classroom assessment typically begin with content.
  • In scales developed for large-scale testing, content typically is considered post hoc or, worse, as an afterthought.

I must note here that with regard to large-scale testing, I am referring to the typical use of IRT to model data and develop scales for scoring, not to the use of Rasch models to develop research-based scales or to work on the development of learning progressions.

In The Classroom, Content Is King

As described in the article, when “scales” are used in classroom assessment, they are often categorical classifications of student performance such as rubrics used for holistic scoring of essays, mastery/proficiency scales for standards-based grading, learning progressions for specific skills, or broader performance level descriptions. The scales typically contain 4-6 levels of classification, but fewer or more levels are possible depending on content and purpose.

In all cases, the basis for the scale is content; that is, content-based descriptions of student performance that demarcate critical points in achievement, mastery, or development.

The scales are ordinal in nature. There is no claim or expectation that the “distance” between any two points on the scale is equal.  
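To make that ordinal point concrete, here is a minimal sketch in Python of what such a scale looks like as a data structure (the four level names are hypothetical, not drawn from the article). The ordering of the levels carries meaning; the numeric codes attached to them do not.

    from enum import IntEnum

    class Mastery(IntEnum):
        """A hypothetical four-level standards-based grading scale."""
        BEGINNING = 1
        DEVELOPING = 2
        PROFICIENT = 3
        ADVANCED = 4

    # Order comparisons are meaningful on an ordinal scale...
    assert Mastery.PROFICIENT > Mastery.DEVELOPING

    # ...but arithmetic on the codes is not: nothing guarantees that the
    # "distance" from BEGINNING to DEVELOPING equals the distance from
    # PROFICIENT to ADVANCED.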

In large-scale testing, the equal interval scale is the foundation of all that we do, but if we’ve learned anything over the last two decades it is that reporting scaled scores or even changes in scaled scores is not enough.

So, where does that leave us?

Our Equal Interval Emperor Has No Clothes

The equal interval scale allows us to apply statistics to do some magical and some maniacal things with test scores. What it cannot do, however, is allow us to make useful inferences about what students can and cannot do until we are able to anchor that scale to content.

At the outset, we (i.e., the field) thought IRT solved this problem by placing students and items on the same scale. A Wright Map (Rasch) would provide clear indications of what students at certain levels of θ could and could not do. But there were those pesky probabilities to contend with (nobody likes to interpret probabilities), and furthermore, a 2PL or 3PL “Item Map” was just not the same.
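For readers who want to see those pesky probabilities in action, here is a minimal sketch of the Rasch item response function. Even when a Wright Map places a student directly on top of an item, the model only says the student has a 50% chance of answering it correctly; “can do” never quite becomes a yes-or-no statement.

    import math

    def rasch_p(theta: float, b: float) -> float:
        """Probability of a correct response under the Rasch model."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    # A Wright Map lines students and items up on the same scale, but the
    # claims remain probabilistic, even well above an item's difficulty.
    for gap in (-1.0, 0.0, 1.0, 2.0):
        print(f"theta - b = {gap:+.1f} -> P(correct) = {rasch_p(gap, 0.0):.2f}")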

So, we tried to anchor our scales to content by finding items that discriminated well at selected points along the scale. But with relatively flat discriminations and p values clustered within the “acceptable” range, it took a pretty large item bank or several years of test forms to identify a sufficient number of anchor items.
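A rough sketch of that search, assuming a 2PL model and made-up item parameters: an item can anchor a point on the scale only if it is informative there, and Fisher information peaks at θ = b with height proportional to a², so flat discriminations leave few candidates.

    import math
    from dataclasses import dataclass

    @dataclass
    class Item:
        name: str
        a: float  # discrimination
        b: float  # difficulty

    def info_2pl(item: Item, theta: float) -> float:
        """Fisher information for a 2PL item at ability theta."""
        p = 1.0 / (1.0 + math.exp(-item.a * (theta - item.b)))
        return item.a ** 2 * p * (1.0 - p)

    def find_anchors(bank, theta_cut, min_info=0.5):
        """Items informative enough at the target point to serve as anchors."""
        return [item.name for item in bank if info_2pl(item, theta_cut) >= min_info]

    # A hypothetical three-item "bank": only the highly discriminating
    # item located near the target point survives the screen.
    bank = [Item("It01", 0.6, -0.2), Item("It02", 1.5, 0.0), Item("It03", 0.8, 1.5)]
    print(find_anchors(bank, theta_cut=0.0))  # ['It02']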

As criterion-based achievement levels became the thing, we tried to go in the other direction. We created content-based descriptions of performance categories and then tried to find ranges of student performance along the scale that matched those content descriptions. Recent efforts such as principled assessment design and embedded standard setting attempt to make this process more deliberate and precise, but other factors still come into play when you move from individual items to a whole test and then put those items in front of real students.

As a field, we have reached the point where we understand the importance of connecting our large-scale IRT scales to content, but we haven’t quite figured out the best (i.e., most useful) way to do that (no, subscores are not the answer).

Are our equal interval scales part of the solution?

It Was Never About The Interval

At the time that we were learning that percentiles were not equal interval, we were probably told something like, “It's much easier to move 5 or 10 percentile ranks in the middle of the distribution (e.g., between the 45th and 55th) than at the extremes.” Using the normal curve, we were shown that it takes a much smaller increase in raw scores to move from the 45th to the 55th percentile than from the 1st to the 10th or the 90th to the 99th; that is, there is a greater distance to cover at the extremes.
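The arithmetic backs that up. Here is a quick sketch using the standard normal quantile function to show the distance, in standard deviation units, needed to cover each jump:

    from scipy.stats import norm

    # Distance (in SD units, assuming normally distributed scores)
    # required to move between percentile ranks.
    for lo, hi in [(45, 55), (1, 10), (90, 99)]:
        d = norm.ppf(hi / 100) - norm.ppf(lo / 100)
        print(f"P{lo} -> P{hi}: {d:.2f} SD")

    # P45 -> P55: 0.25 SD
    # P1  -> P10: 1.04 SD
    # P90 -> P99: 1.04 SD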

A similar phenomenon could be seen on our equal interval θ scale as our test characteristic curve (TCC) flattened at the extremes. You had to cover a much greater distance across the θ continuum at the extremes to see changes in the expected test score.
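Here is a small sketch of that flattening, using a hypothetical bank of 2PL items whose difficulties cluster near the middle of the scale. The same half-unit step in θ buys very different numbers of expected score points depending on where you start:

    import math

    # Hypothetical 2PL item bank: (a, b) pairs with difficulties
    # spread from -2.0 to +2.0 on the theta scale.
    items = [(1.2, b / 2.0) for b in range(-4, 5)]

    def tcc(theta: float) -> float:
        """Expected raw score: the sum of the item response probabilities."""
        return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for a, b in items)

    # The same 0.5-unit step in theta yields far fewer expected score
    # points at the extremes, where the TCC has flattened out.
    for t in (-3.5, 0.0, 3.0):
        gain = tcc(t + 0.5) - tcc(t)
        print(f"theta {t:+.1f} -> {t + 0.5:+.1f}: gain = {gain:.2f} points")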

As it turns out, however, we were duped.

Knowing the distance between two things is not sufficient information to determine what it will take to cover that distance.

It may take much less concentrated time and effort to cover great distances with a novice at the low end of the scale than it takes for an expert to move a small distance at the top. Our smooth continuous θ scale may operate more like a step function when individual students and real learning are considered.

The same principles can apply to intervals in physical measurement, a truth that has spawned many a meme.

Or, as many participants in the Boston Marathon learned earlier this week, traversing the half mile known as Heartbreak Hill roughly 20 miles into the race is not the same as covering a flat half-mile section or even climbing steeper hills earlier in the race.

For equal intervals to be “useful” in a practical sense, the interval has to be related to something relevant, and in most educational settings that relevant information is going to come from outside of the test. Working together with our colleagues in the classroom, perhaps we can wring more information from our intervals. Perhaps not. But we won’t know for sure until we ask for their help.

 


Published by Charlie DePascale

Charlie DePascale is an educational consultant specializing in the area of large-scale educational assessment. When absolutely necessary, he is a psychometrician. The ideas expressed in these posts are his (at least at the time they were written), and are not intended to reflect the views of any organizations with which he is affiliated personally or professionally.