Friends, psychometricians, countrypersons, heed my call.
If, as the poet says, the evil that men do lives after them, then it’s time to face the music, man.
We got trouble my friend, right here, I say trouble, right here and it isn’t pretty.
Why sure, I’ve set standards on large-scale tests. Certainly mighty proud I say that I even developed a new standard setting method.
I’m always mighty proud to say it. I consider the hours I spent with colleagues and clients designing, pilot testing, modifying, implementing, validating, and then modifying some more, the Body of Work Method golden. Helped me cultivate horse sense and a cool head and a keen eye.
Did you ever take 25 teachers and try to get them to set three performance level cut scores on a 51-point high school mathematics test that had an average score of 24?
But just as I say it takes judgment, brains, maturity, and pedagogical content knowledge to synthesize performance level descriptors, test items, and student work to discern performance level cut scores, I say that any boob can look at impact data and pick three numbers that will be acceptable to policymakers and palatable to the general public. And they call that standard setting?
The first big step on the road to the depths of deg-ra-Day–, I say, deg-ra-dation, is the introduction of impact data after the second round of individual judgments and panel discussion. Soon, it’s after the first round. An’ the next thing you know, the next thing you know, Gary Phillips, in a pinch-back suit, is touting a method that incorporates NAEP benchmarks from the very beginning and provides real-time impact data after every judgment a panelist makes – or even thinks about making.
You got trouble friend, and it just isn’t pretty.
But friends, I do not come before you today to lament the state of standard setting on large-scale tests. No, standard setting is what it is, what it was, and what it shall be. As a wise man, Stuart Kahl, once told me, “Charlie, there’s only so many places that a panel of teachers can place three cutscores on a 51-point test.”
So, I participated in, observed, and reported on the results of standard setting meetings across this fine land that all incorporated impact data to some degree and went on with my life, a generally happy man.
But that’s not what I came to tell you about…
Now friends, there was only one or two things that I thought using impact data during standard setting could’ve done.
The first was that it could have produced “real standards” and performance level classifications that could be validated through examples of student proficiency obtained independently outside of the testing situation; all of which ultimately led to significant improvement in instruction and student learning and closed persistent achievement gaps, which wasn’t very likely, and I didn’t expect it.
The other thing was that it could have called attention to the fact that large percentages of students were not performing at the proficient level, particularly in certain subgroups, produced some unintended negative consequences that would be a nuisance, but we could deal with, and ultimately, water would find its own level and, in the absence of any real reform, there would be little impact on the lives of school administrators, teachers, and students until the next time we decided to set standards – which is what I expected to happen.
But it seems there was a third possibility that I hadn’t even counted upon and that is that the students we had labeled each year as Basic, Proficient, or Advanced, would become adults, leaders of society, and having been reared on our peculiar brand of setting standards would apply it to much more important aspects of daily life than individual student scores on state tests or even a school accountability rating. But it appears that is exactly what has happened.
I come before you today because it seems clear that this norm-referenced beast that we created and called standard setting has broken free of the constraints of standardized large-scale testing and threatens the normative expectations, rules, and standards of behavior that we have depended upon for two to four centuries (give or take, based upon your perspective on such things).
It started small. I first noticed it during stories on the network news (yes, we’ve previously established that I’m old and get my news from Lester Holt). Night after night, Lester would tell me there was “breaking news” on a particular policy, bill, or issue of the day and then he would throw it to a reporter for a 2-minute story that featured a 90-second pre-recorded interview of one “regular person” (maybe two) offering a personal (often emotional) account of how the issue had an impact on them.
Then I saw that “impact data” had made its way into the legal system, most directly in Minnesota in the form of “spark of life” testimony. I was used to this type of emotional, deeply personal testimony during the sentencing phase of trials, but here it was being used to “help” jurors determine whether the defendant was guilty beyond a reasonable doubt. Excuse me.
And then as 2021 slipped into “2020 too”, the disease that we unleashed struck at the heart of the CDC. Revised COVID isolation guidelines (or maybe they were quarantine guidelines, who can keep track) were issued apparently based on their “impact” on the economy.
Folks, what have we done? These honorable professions and the people within them are applying the same fundamentally flawed practices we use to determine performance level cut scores on large-scale state tests to life-and-death decisions.
And if network news, the legal system, and the CDC weren’t enough, we now see the scourge of “impact” data spreading to the most sacred of American institutions – voting for the NFL Most Valuable Player and the baseball Hall of Fame. (aside: Roger Clemens belongs in Cooperstown.)
O judgment! thou art fled to brutish beasts, and men have lost their reason.
Where do we go from here?
How do we keep the focus of standard setting panelists, jurors, or the CDC on the issues at hand once they have tasted the forbidden fruit of impact data? Once released, can the genie be returned to the bottle?
The trickiest part of counteracting this trend might be our society’s exponentially diminished capacity to handle nuance.
It’s not as simple as saying
- I want standard setting panelists to focus solely on content and performance level descriptors.
- I want the CDC to focus only on minimizing spread of the disease.
- I want the legal system to convict the person who committed the crime.
The questions posed to these groups are not that straightforward. They require expert judgment and consideration of external factors that might appear similar to impact data.
- Standard setting panelists must consider the context of the test and testing situation in making their judgment of the proficiency of student work. The quantity and quality of writing in an essay judged “proficient” will differ for an on-demand task with time only for a single draft versus an on-demand task that allows for revising a preliminary draft versus a take-home performance task that allows for research and the entire writing process. Similar differences in expectations would exist for responses to questions about the reasons for the Revolutionary War or a discussion of climate change.
- The CDC, likewise, will consider human behavior in establishing guidelines to reduce or minimize the spread of disease. A change in COVID guidance in spring 2021 to say that vaccinated people no longer needed to wear masks indoors was designed to incentivize more people to get vaccinated, a desirable outcome – even if vaccination plus masks was the ideal. At least a secondary reason for the recent reduction in the isolation period from 10 days to 5 days was the belief that people might be more likely to isolate for 5 than 10 days. Isolation for 5 days, although not ideal, is better than isolation for 0 days among people inclined to reject the 10-day guidance out of stubbornness or necessity.
- Contrary to what we have watched on crime shows, in almost every one of the high-profile trials over the past two years jurors were not asked to determine beyond a reasonable doubt whether the defendant committed the act of which they were accused (i.e., “the crime”). There was no dispute about whether the act had been committed and who had committed it; and more often than not, jurors could watch it all unfold on video. Rather, jurors were asked to decide whether the act fit the definition of a crime. “Witness” testimony often consisted of a parade of “experts” on each side explaining why the act did or did not fit that definition. Without the benefit of exemplars or training available to standard setting panelists, jurors were asked to compare the act committed to complexly worded lists of acts that constitute the crime and find a match. Jurors talk about parsing sentences and interpreting individual words like “intentional” or “consciously” contained in those sentences. Sound familiar, standard setting folks?
Cleanliness is next to godliness, and we are not gods. We’re people. No more, no less.
It would be nice if decisions made by standard setting panelists, jurors, the CDC, and even baseball Hall of Fame voters could be clean or, at a minimum, be the result of a clean decision-making process. Alas, life is messy. The decisions that we make will be messy and some of them will be wrong.
Thirty years of experience with standard setting, however, has convinced me that the indiscriminate use of impact data only makes life messier in the long run, even in situations where impact data is used primarily to produce palatable results in the short run.
This is especially true when impact data is used as an alternative to actually addressing deep, complex societal issues. No, we’re not gods, but we’re better than that.