Like any serious long-term relationship, my relationship with the NAEP scale has had its share of ups and downs, highs and lows, moments of clarity that led to pure joy, and moments of frustration and confusion that led to a blog post. Love-hate may be too strong to describe my feelings toward the NAEP scale, but it most definitely has been a topsy-turvy ride on the NAEP roller coaster.
I was a wide-eyed (and not-so-wide-waisted), impressionable graduate student, fresh off attending my first AERA conference in our nation’s capital, with no clue that large-scale tests and reporting scales would become my life’s work, when I became infatuated with the NAEP scale. The occasion was the Reading Anomaly. Sure, we had learned about anomalies like NRT reversals in mean scores from one grade to the next – between grades 5 and 6 in mathematics and grades 9 and 10 in reading, if I recall correctly. But the sheer amount of attention and intellectual firepower an unexpected downturn in NAEP performance received was breathtaking.
[Aside: As far as I know, those NRT mean score reversals still existed but were hidden beneath monotonically increasing (albeit flattening) achievement level thresholds across grade levels when the testing world shifted from percentiles to percent proficient.]
Early in my career in large-scale testing, I became addicted to NAEP’s scale anchoring process: the surprisingly straightforward and transparent approach used to develop content-based descriptions of student performance at 50-point intervals along the NAEP scale – that is, NAEP Performance Levels, not to be confused with NAEP Achievement Levels (more on those later).
Over the next fifteen years, I spent countless hours in search of those elusive items that discriminated between achievement levels on state assessments. It was through such efforts that I discovered that basic vocabulary questions routinely discriminated between the highest achievement levels on Reading tests, leading to the observation that, in large-scale testing, exposure to content may often be a better discriminator than depth of knowledge or rigor. For example, in mathematics, there is nothing cognitively complex about determining the first derivative of an algebraic expression, but only students exposed to calculus will know how to do it.
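The anchoring logic itself is simple enough to sketch in a few lines of code. The criteria below (a 65% response-probability floor at the anchor point, a ceiling one 50-point interval down, and a minimum gap between the two) are illustrative stand-ins rather than NAEP’s exact rules, and the item data are invented:

```python
# Sketch of a scale-anchoring screen. The thresholds are hypothetical
# stand-ins, not NAEP's actual anchoring criteria.
def anchors_at_level(p_at_level, p_below, rp=0.65, max_below=0.50, min_gap=0.30):
    """Flag an item as anchoring at a scale point if students at that
    point mostly answer it correctly while students one interval down
    (50 points lower on the scale) mostly do not."""
    return (p_at_level >= rp
            and p_below < max_below
            and (p_at_level - p_below) >= min_gap)

# Made-up items: (proportion correct at 250, proportion correct at 200)
items = {
    "vocab_basic": (0.82, 0.40),
    "calc_deriv":  (0.70, 0.25),
    "easy_recall": (0.90, 0.85),  # too easy below the level: not an anchor
}
anchor_items = [name for name, (p_hi, p_lo) in items.items()
                if anchors_at_level(p_hi, p_lo)]
```

Note that `easy_recall` discriminates poorly precisely because nearly everyone answers it correctly; the screen favors items that separate adjacent points on the scale, not merely hard or easy ones.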
The NAEP scale also maintained a sense of mystery. Not a single scale at all, but a composite of separately maintained subscales. And that 0-500 range spanning grade levels. “Is it really a vertical scale?” I asked countless ETS folks over the years. Nobody knows.
But there were rocky times, as well.
In the early 1990s, we waited anxiously for the release of the first state NAEP results, prepared to meet a client’s request to map student performance on their state assessment onto the NAEP scale, or vice versa. When results were released, however, we were shocked to discover that State NAEP results were not on the same scale as the regular NAEP results that we had come to know and love.
Oh, it used the same 0-500 numbers, but it was a different scale. Why? Why would they do that?
At the time, it was not clear that there was a difference in content between NAEP and the new State NAEP – and perhaps there wasn’t. There was, however, a different scale. Eventually State NAEP was rebranded as the “Main” NAEP administered every other year under NCLB and the OG NAEP became the Long Term Trend NAEP.
The main NAEP and LTT remain on different 0-500 scales, but every once in a while, the planets align and you get results like the average scale score on the 2019 Main NAEP Grade 4 Reading test and the 2020 LTT 9-year-olds Reading Test both coming in at 220 – and both falling to 215 on subsequent administrations. Of course, they are not the same 220 and 215, once more bringing life to Beaton and Allen’s (1992) opening sentence:
Constructing scales is an important and well-developed function of educational and psychological measurement, but communicating the meaning of scales to their users and to the general public has not always been effective.
Or to rephrase that sentiment in a way parallel to Beaton’s most famous maxim: If you want to convey meaning, don’t use a scale score.
In a normal world, those overlapping scores of 220 and 215 would cause all sorts of confusion.
- Lay people would see the scores and assume that the two tests measured the same thing.
- Assessment folks would be asked why we need two different NAEP tests when they give “the same result,” once again forcing them to explain (with a straight face) about trends and a scale that stretches back to the 1970s.
Alas, NAEP does not live in a normal world. With the possible exception of a few fine folks at Harvard and Stanford, nobody gives a rat’s ass about or remembers the actual numbers associated with the reporting of NAEP results. The general public remembers whether the results went up or down. The media might focus on the percent at or above Proficient. A few of us might remember whether the change was 1, 2, or 5 points. But exaggerating only slightly for effect, I’ll bet dollars to donuts that even a week after the release nobody (not even people named Lesley, Peggy, or Carey) could tell you the actual scale scores.
And in a very real way, that’s as it should be. The arbitrary NAEP scale has no inherent meaning, nor have its scores picked up any meaning of their own over the years. The scale’s primary purpose, as described in that same Beaton and Allen article, is being the mechanism that allows us “to compare groups of students both across and within assessment years.”
Those ubiquitous and wonderfully simple line graphs of NAEP results would convey just as much information with simply the lines, dots, and indicators of significance. No scale scores required.
Turning to Indirect Uses of the NAEP Scale
Although the underlying scale itself may be utterly forgettable, it has come in handy for some practical applications beyond simply maintaining the NAEP trend line. At the top of the list, I would place
- Mapping state achievement levels onto the NAEP scale, and
- Defining the NAEP Achievement Levels (Basic, Proficient, Advanced).
About two decades after we were foiled in our attempt to map state assessment results onto the NAEP scale (and ridiculed for trying), such mapping became chic. We were ahead of our time. Again, the actual NAEP scale scores were not important at all, only the relative position of the state achievement level thresholds displayed graphically in relation to the NAEP Basic and Proficient thresholds mattered.
My favorite outcome of the mapping process was the mapping of the NECAP states into different locations on the NAEP scale even though they administered the same test and applied the same achievement levels. That result demonstrated two important and fundamental truths about state testing. First, the mapping approach simply relied on the relationship between student performance on the two tests, and it cannot be assumed that the relationship is the same across all states. Second, whenever you draw samples (i.e., individual states from the population taking the NECAP test), there’s going to be some noise and variation. It was truly a rare teachable moment with test scales. Alas, by the time PARCC and Smarter Balanced arrived on the scene, each of their states was placed in the same location on the scale.
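The core of the mapping idea can be sketched with made-up data. The function below places a state cut score on the NAEP scale by matching percentile ranks; the real NCES procedure works from school-level NAEP estimates with sampling weights, so treat this purely as an illustration of why two states giving the same test with the same cut score can land in different places:

```python
# Toy percentile-matching sketch of the state-to-NAEP mapping idea.
# Data and method details are invented for illustration.
def map_cut_to_naep(pct_meeting_state_cut, naep_scores):
    """Place a state's cut score on the NAEP scale: find the NAEP score
    below which the same share of students falls as falls below the
    state cut."""
    s = sorted(naep_scores)
    k = round((1 - pct_meeting_state_cut) * len(s))  # count below the cut
    k = min(max(k, 0), len(s) - 1)
    return s[k]

# Two hypothetical states administer the same test with the same cut
# (60% meet it in both), but their NAEP distributions differ.
state_a_naep = [200, 210, 215, 220, 225, 230, 240, 250, 255, 260]
state_b_naep = [190, 200, 205, 210, 220, 225, 230, 235, 245, 250]
cut_a = map_cut_to_naep(0.60, state_a_naep)
cut_b = map_cut_to_naep(0.60, state_b_naep)
```

Because the mapping runs through each state’s own NAEP distribution, identical cut scores on the same state test need not map to identical NAEP values.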
Then there are the NAEP Achievement Levels: Basic, Proficient, Advanced. How can you not fall in love with achievement standards that have been the linchpin of Education Reform for the past quarter-century, the standard against which state achievement standards are judged, yet are the result of a process labelled “fundamentally flawed” for years and still reported as being used “on a trial basis” after nearly four decades. Not to mention the published response that the “fundamentally flawed” label elicited from titans in the assessment field that included words such as “one-sided,” “incomplete,” “faulty,” and “uninformed” before ending with
In our judgment, Chapter 5 of the NAS Report does not conform to generally accepted scientific standards of objectivity in conducting research or evaluation. In sum, we are surprised, dismayed, and disappointed with the inadequate scholarship reflected in the review of NAGB’s 8 years of research and development to set performance standards on NAEP.
SNAP! Man, I love this job.
The college-and-career preparedness project gets an honorable mention in NAEP scale applications simply for a) finding a use for the Grade 12 NAEP tests and b) having the size, cost, and duration reminiscent of the Big Dig in Massachusetts.
Maintaining the NAEP Scale – All the Livelong Day
The scale has survived events such as the addition/expansion of accommodations in the 1990s, the 2004 revision of the mathematics framework, and the 2017 shift from paper-and-pencil to computer-based testing. There have been little “jogs” here and there and at least one “leap of faith” (see 2017), but the scale survives.
And this is not like the College Board’s approach with the SAT: recenter, change the test, change the blueprint, change the underlying scale, but continue to report scores between 200 and 800 (not that there’s anything wrong with that). No, maintaining the NAEP scale has involved bridge studies, complex modeling, high-level statistics, and a staggering series of t-tests. It has already been suggested that the 2026 results might take a bit longer than expected so that the effects of the latest changes to the test can be evaluated thoroughly.
I have to admire the grit, perseverance, and chutzpah that goes into maintaining the NAEP scales – the state scale back to the 1990s and the LTT back to the 1970s. To quote Blake Shelton’s first big hit, Austin, “What kind of man would hang on that long? What kind of love that must be.”
No doubt, part of the reason is existential. If the chain is broken, it becomes that much more difficult for NAEP to justify its existence.
Part of it is psychometric swagger. Sure, I can maintain a scale for 25, 35, or 50 years. No problem.
And I’m sure that perhaps the biggest part is a belief a) that the scale is useful (perhaps essential) and maintaining it is worth the effort, and b) that it actually is feasible to maintain a scale across external changes, internal changes, and simply across time.
In my next post in this series, NAEP by the Numbers, I’ll address when and why it might be useful to move on from the current NAEP scale, what the future might look like with a new NAEP scale, and even the possibility of a scaleless future for NAEP.