Tell Me Why

Your current state assessment is not working.

You want to build a new one.

Tell me why.

Measurement Factors Influencing Choices in State Testing

In graduate school, sometime back in the previous century, we were taught that there are three primary measurement reasons for developing a new test: to increase validity, to increase efficiency, or to increase utility. It was a given that we would be replacing an old test – an existing test that we were going to make better. Dealing with a brand new construct for which there is no test is a measurement task, not a test development task.

Obviously, a Venn diagram would show considerable overlap among those three reasons – particularly if you are of a mind that utility is subsumed under validity – but you get my drift.

Validity covers a wide range of reasons for change that we can comfortably place under the heading “measuring better.”

Better might mean more accurately and/or more precisely. Or better might even mean less precisely if a particular application doesn’t require a high degree of precision. I’m picturing any one of my favorite “you must be at least this tall to go on the ride” dichotomous measuring instruments. And you can start to see the overlap with efficiency and utility.

Better might mean more comprehensively, which might also include more accurately, but not necessarily.

Better might mean more fair or more fairly or with a higher degree of fairness. “Fairer” just doesn’t feel right despite MS Word telling me that it’s more concise. I’ll leave it to the joint Standards committee to figure that out.

Efficiency is easier to understand. It means that the new test is either faster or cheaper, preferably both. Faster could apply to the time it takes to complete any or all aspects of the testing process, from development to administration to scoring to reporting.

Utility, at least on the surface, seems the most straightforward. You want the test to be more likely to be used or to be more useful – which are not necessarily the same thing.

In addition to the overlap among those three categories, there are interaction effects to consider; that is, improvements in one area might make things worse in one or more of the other areas.

Non-measurement Influences on Choices in State Testing

Even before leaving those ivory towers and ivy-covered walls behind for the “real world,” we learned of another key reason for developing a new test: to make money.

Money entered the picture a little earlier than might be expected because, at the time, universities and individual professors were a central part of the K-12 large-scale assessment infrastructure. A key reason to develop and market a new test is to make money.

Taking our diploma and bag of psychometric skills into the world of K-12 education and state testing, we learned that there were several non-measurement reasons for developing and administering a test. Among these were instantiating state content and achievement standards, modeling effective assessment practices, and collecting and communicating accurate data about student proficiency. I classify each of these as non-measurement reasons for administering the test because the primary goal is not to quantify some otherwise unknown, or unknowable, aspect of student achievement; in other words, the primary goal is not measurement.

Instantiating: representing an abstraction by concrete instance. I learned the concept of tests being used to instantiate state standards from George Madaus, one of many important ideas I picked up from George over the years. For whatever reasons, fatigue, politics, and establishing consensus probably chief among them, state standards are an abstraction. They fall short of placing before educators, in terms plain and firm, what it means to know, understand, apply, analyze, evaluate, synthesize, produce or create; let alone, problem solving, communicating, or collaborating; not to mention the perceived “need” to bridge the gap between end-of-grade standards.

In a (non)decision with staggering unintended negative consequences, states left it to the state assessment team to make that abstraction concrete.

Modeling effective assessment practices for teachers was a primary factor influencing the design of state tests in 1995 and one that re-emerged in 2010. I expect that given the natural ebb and flow of K-12 assessment combined with a new sense of urgency about what tests should and should not look like, we should be just in time for a renewed focus on modeling in 2025.

Collecting and Communicating accurate data about student proficiency. As noted in Multiple Choices, the recently published Bellwether report, “state summative assessments are an important tool to provide stakeholders with credible, comparable information about student learning,” information which they did not have access to in the 1990s and early 2000s.

The key takeaway from this section is that none of these three reasons for administering state tests is directly related to measurement. Yes, you can argue that using a state test to collect accurate data requires accurate measurement, but in doing so you are presupposing that the state test is the only way, or the best way, to collect accurate data. That conclusion might have been true at the turn of the century; it is much less likely to be true now.

You may also have fallen into the trap of believing that the summative state test is necessary to determine whether students are proficient or to determine their level of proficiency. That is a popular fallacy.  I’m willing to concede that the reliance on state testing for instantiating state standards, as described above, increased the importance of measurement for one or two state test administrations after new standards are introduced, but beyond that not so much.  As measurement and assessment specialists, we have been guilty of promulgating the measurement fallacy.

Anyone with even a basic familiarity with K-12 education understands that any teacher worth their salt is better positioned to make a determination of student proficiency than an on-demand summative state test, and that teachers know each student’s level of proficiency long before any state test results are returned and even before the student sits down to take the test.

Then why don’t we rely on teachers to provide student proficiency data directly to the state? Short answer, we should. Slightly longer answer, there is a plethora of reasons, well known to all, why we cannot simply trust (without verification) data collected from teachers and schools. One of those reasons, however, is not that the teacher doesn’t know their students’ level of proficiency.

Now, don’t waste our time with arguments about kids near the borderline. A student whose true level of proficiency is right on the borderline has at best a 50-50 chance of being classified as proficient on the test – a coin flip. A teacher would do at least as well. The correct answer, however, is that both the teacher and the test should report that the student is on the borderline.
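For anyone who wants to see the coin flip, here is a minimal sketch. It assumes that a student's observed score is their true score plus normally distributed, symmetric measurement error. The cut score of 500 and the standard error of measurement of 10 are hypothetical values chosen only for illustration; they do not come from any actual state test.

```python
from statistics import NormalDist


def prob_classified_proficient(true_score: float, cut_score: float, sem: float) -> float:
    """Probability that the observed score lands at or above the cut score.

    Assumption (not from any real test): the observed score is normally
    distributed around the true score with standard deviation `sem`.
    """
    observed = NormalDist(mu=true_score, sigma=sem)
    return 1.0 - observed.cdf(cut_score)


# Hypothetical cut score and standard error of measurement.
CUT, SEM = 500.0, 10.0

for true_score in (490.0, 500.0, 510.0):
    p = prob_classified_proficient(true_score, CUT, SEM)
    print(f"true score {true_score:.0f}: P(classified proficient) = {p:.2f}")
```

Under these assumptions, the student whose true score sits exactly at the cut is classified as proficient half the time, and students one standard error on either side of the cut are still misclassified roughly 16 percent of the time.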

Why You Need to Tell Me Why

Validity, Efficiency, Utility, Instantiating, Modeling, Collecting & Communicating, & Money

As we stumble headlong into a brave new AI-supported, culturally relevant, instructionally useful, through-year world of educational assessment, all of the reasons listed above will be in play each time we are asked to, or decide on our own to, develop, revise, or administer a new test.

As noted, the categories of reasons are not mutually exclusive, and they often do not work in concert with each other.

When you ask me to build a test, I have to consider all of the categories of reasons (including making money), but it’s critical that I know what you hope to accomplish with the test.

In 2010, Sec. Duncan made it clear that Collecting and Communicating accurate information was his major concern.

In 2015, I might have guessed that Instantiation and Modeling were of paramount concern.

In 2019, I would have told you that Efficiency and Utility were at the forefront.

After 2020, I think that I’m looking at an argument built around a new mix of Modeling and Validity with a twist of Instantiation.

But I might be wrong.

Given the overlap among the categories and our piss-poor record in defining even the technical terms, I might interpret what you want as validity when what you actually want is instantiation or modeling; or perhaps what you are really saying is that you want a new set of standards.

So, adapting the “Toyota five why” method, I’ll probably ask you repeatedly to tell me, in terms plain and firm, why you want a new test. What is it that you want, what do you really, really want? I’ll do this not to get to the “root cause of a problem” as much as to understand what problem it is that you are trying to solve.

When we reach the point together where we agree on the problem, we might mutually conclude that a new test is not the best way to get what you want; or more likely, we will conclude that a new test alone is insufficient to accomplish your goals. And together, we’ll figure out how to best navigate the tradeoffs among those various categories.

It won’t be easy because at any given time the loudest voices will be shouting at you to reduce testing time, or to make the tests more culturally relevant, or to make the assessment program more useful.

You will certainly want to pay attention to those voices, but you can’t attack any testing problem with blinders on, ignoring all of the other factors.

Because at some point, within a year or two, another problem and another group of people will rise to the forefront, and those people will ask you how in the world you did not see the gorilla.

Image by Gerd Altmann from Pixabay

Published by Charlie DePascale

Charlie DePascale is an educational consultant specializing in the area of large-scale educational assessment. When absolutely necessary, he is a psychometrician. The ideas expressed in these posts are his (at least at the time they were written), and are not intended to reflect the views of any organizations with which he is affiliated personally or professionally.