It’s not very often that one weekend is framed by two such highly anticipated and consequential events. On Friday, Taylor Swift released her 10th studio album, Midnights. On Monday, we have the release of the 2022 State NAEP results, described by many as the most important/significant/consequential release in the history of NAEP. It’s almost too much for those of us residing, as Andrew Ho put it, at the intersection of these two worlds to handle.
For a solid decade from 2005 to 2015 we could look forward to one of two events each fall. In odd numbered years, State NAEP results. In even numbered years, a new Taylor Swift album. It was all so simple then. Life was good.
But then in the fall of 2016, no new album from Taylor Swift. Her reputation was shot, and she went into hiding for a year. It appeared that Taylor and NAEP were on a collision course in the fall of 2017, but this time it was NAEP who went into hiding – delaying the release of NAEP 2017 until April 2018.
With Taylor on a new timetable, a Taylor/NAEP convergence seemed inevitable in 2019, but Taylor said you need to calm down and released Lover in August, leaving the fall to NAEP.
Then came the pandemic, two new albums and an Album of the Year Grammy for Taylor, and anxious waiting for NAEP.
So, we have arrived at this weekend.
When two beautiful things come together like snow on the beach or chocolate and peanut butter, you just have to combine them – the gold standard and a gold record (Midnights is likely platinum and perhaps diamond, but you get the idea).
And so, I give you Midnights (NAEP edition), pairing Midnights tracks 1-12 with a significant event in NAEP history (both State NAEP and the LTT).
In some cases, NAEP led, in others it followed the lead of states, but across 50 years it has been a pretty good reflection of the state of education and assessment in the United States.
Are a couple of the pairings forced? Sure. But as when equating a state test, let’s marvel at the fit in 10 of 12 items and not perseverate over the two that are slightly off the line.
The table below shows the pairings in track list order. The descriptions of NAEP events that follow are in chronological order.
1969 – First NAEP administration (Sweet Nothing)
There’s not much to say about the first trial administration of a test other than you have to start somewhere.
As we ponder our unhealthy obsession with test scores, however, it’s important to note that NAEP was the third leg of the federal education reform stool. Title 1 had already provided funding for instructional programs and interventions that would provide the educational opportunities needed to close achievement gaps. Title 1 also included a mechanism for states to evaluate the effectiveness of those programs (a new statistic, the NCE, was created for this purpose). NAEP provided the high-level tool to monitor long-term progress.
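For readers who never met the NCE: the Normal Curve Equivalent is just a percentile rank re-expressed on an equal-interval normal scale with mean 50, where the standard deviation (roughly 21.06) is chosen so that NCEs of 1 and 99 coincide with percentile ranks of 1 and 99. A minimal sketch of the conversion:

```python
from statistics import NormalDist

def nce(percentile_rank: float) -> float:
    """Convert a percentile rank (1-99) to a Normal Curve Equivalent.

    NCEs place percentile ranks on an equal-interval normal scale with
    mean 50; the SD of ~21.06 is chosen so that NCE 1 and NCE 99 line up
    with percentile ranks 1 and 99.
    """
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return 50 + 21.06 * z

# The midpoint and endpoints match the percentile scale:
print(round(nce(50)))   # 50
print(round(nce(99)))   # 99
print(round(nce(1)))    # 1
```

Unlike raw percentile ranks, NCEs can sensibly be averaged across students, which is why they were created for Title 1 program evaluation.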
1983 – A New Design (Lavender Haze)
After a decade or so, NAEP needed A New Design for a New Era – or at least that’s what a few guys named Messick, Beaton, and Lord thought.
The new design brought us IRT, BIB spiraling, scale anchoring (item mapping), plausible values, and the beginning of a whole lot of ghost stories and mysteries appropriate for a late October weekend.
They built the first of many bridges over troubled water in an attempt to preserve their precious trend line, applying every statistical trick and loophole in the book to satisfy their needs while maintaining what some of a particular age and upbringing might refer to as a “technical trend line”.
Did they or didn’t they create a vertical scale? Why is there still confusion about that after all these years? Well, it looks like a duck, walks like a duck, and quacks like a duck, but sometimes they tell us definitively that it’s a goose.
And why do they make the LTT and State NAEP scales look so much alike and then tell us the tests are so totally different?
1983 – Acknowledging Students with Disabilities (Midnight Rain)
It was not uncommon in the 1980s for state tests and other large-scale assessments to exclude students with disabilities. With other exclusions, test participation rates of approximately 80% were routine.
In 1983, NAEP took the first step by including students with disabilities in their sampling plan and requiring states to provide a reason for excluding them. When State NAEP emerged a decade later, reports included exclusion rates and asterisks were placed next to scores with high rates of exclusion.
1985-86 – The Reading Anomaly (Karma)
It turns out you can only make so many design changes before they come back to bite you in the butt. And sometimes there is no bridge that will take you where you need to go.
As summarized in a report published two years later whose authors included Al Beaton, Bob Mislevy, and Rebecca Zwick:
Analysis of the data for the 1985-86 National Assessment of Educational Progress (NAEP) and its comparison with the results of previous NAEP assessments suggested a substantial decrease in performance in reading for 9- and 17-year-olds, but a slight improvement for 13-year-olds. NAEP staff examined a number of hypotheses about what might have caused such anomalous results. While some hypotheses, such as inaccuracies in sampling, scaling, and quality control, were ruled out beyond any reasonable doubt, results of the studies are inconclusive.
Rejected in this era between A Nation at Risk and the Charlottesville Education Summit was the conclusion that student performance actually declined.
1990 – NAEP Establishes Achievement Levels (Maroon)
And how the blood rushed into my cheeks
So scarlet, it was maroon
By the late 1980s the assessment world was turning away from NRT and group scores toward criterion-referenced tests and individual student scores. NAEP could do nothing about reporting individual scores, but legislation pushed the assessment to report results in terms of achievement levels.
Almost immediately, a report from the National Academy declared that the achievement levels and standard setting process were fundamentally flawed – a caveat that remained on NAEP websites until well into the 2000s. Thirty years later, the NAEP website still carries the message:
“NAEP achievement levels are to be used on a trial basis and should be interpreted and used with caution.”
A Reading Anomaly followed by fundamentally flawed achievement levels. Not a good stretch.
Nevertheless, NAEP persisted.
1990, 1992 – Trial State Assessment (Snow on the Beach)
State assessment was good. NAEP was good. Let’s put the two together.
Actually, it was more a case of NAEP responding to the national demand for state-level results following the aforementioned A Nation at Risk and 1989 summit.
Some states stepped forward for the new experiment.
So, we prepared ourselves to receive state-level NAEP results. We had no clue (or warning), however, that the state results were going to be on a different scale than the regular NAEP results. It also didn’t help that the difference was not immediately obvious.
But NAEP was able to fulfill a desperate need. Another new era in NAEP testing was born.
And somewhere down the road, the trial state NAEP became “Main NAEP” and good old NAEP was relegated to being “Long Term Trend NAEP”.
1994 – The 2nd State Administration (Vigilante Sh*t)
While the first report of state results was “interesting” for a number of reasons, multiple years of state results gave rise to a brand-new cottage industry that would have a significant and lasting effect on the field – comparing gains on state tests to gains on NAEP – with NAEP (the gold standard) held up as the truth. The RAND report on the “validity of gains” on state tests in Kentucky was one of the first such reports and the one which still causes me restless Midnights from time to time.
It must be exhausting always rooting for the anti-hero.
Oops. Wrong song.
No time or space for details here. Suffice it to say that comparing and interpreting the results from two different tests, particularly gain scores from two different tests, is not as straightforward as it might seem.
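One small illustration of why gains on two tests cannot be compared directly: a 12-point gain on a state test and a 3-point gain on NAEP are on entirely different scales. A common first step (and only a first step – it does nothing about differences in content, stakes, or tested population) is to express each gain in standard-deviation units. The numbers below are invented for illustration:

```python
def gain_effect_size(year1_mean: float, year2_mean: float, sd: float) -> float:
    """Express a mean score gain in standard-deviation units so that gains
    on differently scaled tests can be put side by side at all."""
    return (year2_mean - year1_mean) / sd

# Toy numbers: a 12-point state-test gain vs. a 3-point NAEP gain.
state_es = gain_effect_size(500, 512, sd=50)   # 0.24 SD
naep_es = gain_effect_size(240, 243, sd=35)    # ~0.09 SD
print(state_es, naep_es)
```

Even after this rescaling, a larger state-test effect size is not by itself evidence that the state gains are “inflated” – which is roughly where the arguments in the validity-of-gains literature begin.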
1996 – Accommodations (Bejeweled)
More than a decade after NAEP started collecting data on students with disabilities, active steps were taken to make the NAEP tests accessible with the addition of accommodations. Like most baby steps, these were not easily taken and resulted in a lot of stumbles and falls at first.
Of course, the change led to more bridge studies and the reporting of two sets of results in the transition. Ultimately, however, students with disabilities were more fully included in NAEP testing and the trendline was preserved.
2001 – No Child Left Behind (Anti-Hero)
NCLB dramatically increased the frequency of state testing, and its accountability requirements also raised the stakes associated with state testing significantly. We do not need to dwell on 100% Proficient, safe harbor, AYP and AMOs.
NCLB also established yet another new era for NAEP – a “higher stakes” era. State NAEP, now the Main NAEP, would administer Reading and Mathematics tests every other year to students in grades 4 and 8. Participation of sampled schools was now mandatory – although the fine print still allowed students and parents to opt out.
For some time we were left hanging about exactly how NAEP might be used to “confirm” state test scores. But NAEP fiercely resisted all attempts to be pulled formally into the accountability game.
Officially, NAEP remained above the fray, but comparisons of NAEP and state test results (i.e., discrepancies between NAEP and state test results) had a profound impact on the state of everything. (Cue the theme from Jaws.)
2009 – Mapping State Achievement Levels onto the NAEP Scale (Question…?)
As part of the mandate to confirm state test scores, NAEP began to map state achievement levels onto the NAEP scale. Using procedures conceptually similar to those Rich Hill used in Kentucky back in the 90s, the achievement level cut scores for each state test, in particular the high-stakes Proficient cut, were placed on the NAEP scale. The first report, mapping scores to the 2005 NAEP scale, was published in 2007. With a similar two-year lag, a report mapping state proficiency standards to the 2007 NAEP scale was published in 2009.
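Conceptually, this kind of mapping is an equipercentile linking: find the proportion of a state’s students at or above the state’s Proficient cut on the state test, then find the NAEP score that the same proportion of that state’s NAEP sample reaches. A minimal sketch under that assumption – the score distributions and cut score below are fabricated for illustration, not actual state or NAEP data:

```python
import random

def map_cut_to_naep(state_scores, naep_scores, state_cut):
    """Equipercentile-style mapping of a state cut score onto the NAEP scale.

    Compute the proportion of students at or above the state cut, then find
    the NAEP score with (approximately) the same proportion at or above it.
    """
    p_at_or_above = sum(s >= state_cut for s in state_scores) / len(state_scores)
    ranked = sorted(naep_scores)
    # The NAEP equivalent is the (1 - p) quantile of the NAEP distribution.
    idx = int((1 - p_at_or_above) * (len(ranked) - 1))
    return ranked[idx]

# Illustrative (fabricated) score distributions:
rng = random.Random(0)
state = [rng.gauss(500, 50) for _ in range(5000)]   # hypothetical state scale
naep = [rng.gauss(240, 35) for _ in range(3000)]    # hypothetical NAEP-like scale
naep_equivalent = map_cut_to_naep(state, naep, state_cut=520)
print(round(naep_equivalent))
```

Note what this does and does not estimate: the mapped location depends on the entire joint relationship between the two tests in that state, which is exactly why identical cut scores can land at different places on the NAEP scale for different states.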
What makes this report interesting is that it contained results for Maine, New Hampshire, Rhode Island, and Vermont – four states participating in the New England Common Assessment Program. They were a mini consortium of states administering the same test and using the same performance standards. All IRT analyses and standard setting were conducted using consortium-wide data.
Imagine the surprise when the proficiency standards for the four states ended up at different places on the NAEP scale. Not wildly different, but different.
Turns out, of course, that this outcome was not the result of an error in computation. It was not explained simply by sample size. Rather, the outcome reflected what the analysis was actually examining – the entire relationship between performance on two tests in each state. It was not necessarily noise. There are a number of factors that might influence that relationship.
That story, however, was not user-friendly and didn’t fit the narrative. In the next version of the report, the four states were located together on the scale.
By the time the RTTT state consortia were up and running, the mapping report had been further refined.
Of course, the different results for individual states were never an error. Therefore, as explained in the report, those results were also computed and were available for inspection.
Notably, the mapping reports also show changes in state proficiency standards relative to the NAEP scale across years – when there have been no changes to either the state or NAEP scales and achievement standards. This could raise questions about equating by the state, by NAEP, or both. It could reflect statistical noise. Or it could reflect changes in the relationship between performance on the two tests within a state at two different points in time.
It’s critical to keep all of those possibilities in mind when comparing NAEP and state test results.
2015 – The Next Generation Common Core Tests (You’re On Your Own, Kid)
The states were all in on the Common Core State Standards (really, they were).
The White House was all in on the Common Core State Standards (probably to the detriment of both).
NAEP, however, as it did during the NCLB era, took a hard pass.
Would the world of the CCSS look different now if NAEP had been on board? Hard to say.
Did the way things played out prove that NAEP made the right decision? Hard to say.
Are differences between state standards and the NAEP framework a problem? Oh, yeah.
2017 – NAEP’s Transition from Paper- to Computer-Based Testing (Labyrinth)
As mentioned above, the reporting of NAEP 2017 results was delayed from October 2017 to April 2018. The transition from paper-based testing in 2015 to computer-based testing in 2017 did not go as smoothly as planned – at least with regard to the results.
States across the country had already discovered the mode effect. NAEP was not immune, and the bridges built that year were collapsing under the weight of the change.
Now an assessment program can survive one Anomaly, but not two. Not even when the program is the gold standard and required by law.
Enter the labyrinth. No statistical test was left untested, no hypothesis (null or otherwise) was off the table to be accepted if it could help produce acceptable results.
Results were reported. The trendline was preserved – at least for a little longer.
That’s all I have to say about that.
Mastermind and a Master Plan
“If you fail to plan, you plan to fail. Strategy sets the scene for the tale”
Notably missing from my Midnights (NAEP edition) is track 13 – Mastermind. I am holding that spot in reserve until I have a chance to review the NAEP 2022 results – the most important release in NAEP history, or so I have heard.
I would like to think NAEP has a master plan. Looking at the NAEP assessment calendar, it appears that NAEP will start a new two-year cycle from 2022, administering these state NAEP tests again in 2024.
I urge them to consider returning to the original schedule and administering the tests again next year – even if a new framework isn’t ready. The reason for testing in 2023 is not to be able to assess the recovery from 2022 to 2023 – that outcome would likely be more noise than signal. The reason for testing in 2023 is to establish a more solid baseline for the recovery – to acknowledge the noise. More frequent testing at the outset is standard practice when establishing a baseline. There’s nothing wrong with following standard practice.
Image by Gerd Altmann from Pixabay