Measure Twice, Model Once

As states and their assessment contractors prepare for the possible resumption of state assessment programs in spring 2021, one critical question is how, or even whether, to link results from the spring 2021 tests to the last-administered spring 2019 tests; or, to put it more generally, what equating procedures should be used to place results from the spring 2021 tests onto the state’s existing reporting scale. Some have suggested recently that the most prudent approach might be to treat spring 2021 as a stand-alone year; that is, not to attempt to link this year’s tests to the state’s existing scale and achievement standards at all. However, among those who favor attempting to report results on states’ existing scales, the pandemic has renewed the age-old debate over the relative merits of post-equating and pre-equating – the psychometric equivalent of “tastes great, less filling.”

Although I spent my career firmly in the post-equating camp, I am convinced that the best approach for spring 2021 is pre-equating – better than post-equating, and much better than not equating at all. Why the change? Like the World Series, the election, the 2020-2021 school year, Thanksgiving, grocery shopping, and just about anything else you can think of in 2020, this will be a test administration like no other. Given that few, if any, of the assumptions about instruction and student achievement that inform post-equating discussions and decisions will apply this year, little good can come from having traditional post-equating data affect the reporting of spring 2021 test results.

Equating, Shhh!

Before we get too far into this discussion, we need to acknowledge a few of the inconvenient truths about equating.

First, although everyone understands that equating happens, it is one of those things that you just do not talk about in polite company (i.e., with policymakers). The first time that you scare the bejesus out of a policymaker by telling them that equating choices can have an impact on results, they make their way through the stages of equating acceptance:

1. Which method produces “better” results for my state?
2. Which method reveals the truth about student performance?
3. Which method is most appropriate for our program and can be defended?

When a policymaker has settled on an equating process (hopefully at stage 3), they don’t want to hear about equating again.

With the emergence of computer-adaptive testing, licensed test content, and a return to the glory days of tests built from pre-existing item banks, some assessment contractors are able to play along and avoid the topic of equating altogether: “You don’t need to worry about equating because all of the items are already on the same scale.” Don’t ask, don’t tell.

When people do want to talk about equating, it is more often than not a very bad sign. Policymakers, and other stakeholders, trust the equating process – until they don’t. Nothing erodes trust in equating more quickly than test results that do not make sense and cannot be explained (e.g., too high, too low, too different, too consistent, too unexpected). When test results don’t make sense, equating is often seen as the last opportunity to fix them before the results are reported. Regardless of whether equating is the cause of the problem (and let’s admit that it often is, at least indirectly), it is simply too late to make changes to the test design, administration procedures, or scoring, so we turn to equating.

Strangely, although I expect spring 2021 test results to be anything but normal, they will probably be neither unexpected nor unexplainable; so, it is not the reaction of policymakers and the public that concerns me about post-equating in spring 2021. Rather, I am concerned about two key aspects of the psychometrics and the equating process itself: the standard practices that have been established to drop items from anchor sets and the zealous psychometricians determined to ensure that data fits the model. Although neither of those things would be much of a concern under normal circumstances, this is school year 2020-2021.

My fear is that normal post-equating procedures applied in spring 2021 will

unwittingly and undetectably alter the state’s existing scale, and/or
trigger unnecessary and inappropriate equating discussions with policymakers.

Model Once – The Scale’s The Thing

Our goal in creating a scale for a state test is twofold: to be able to make inferences about student proficiency on the state content standards based on students’ performance on a set of test items, and to be able to make inferences about how students will perform on a particular type of item or task based on our estimate of their proficiency. Having created the scale, the last thing that we want to do is to change it unless there is a clear and compelling reason to do so.

We routinely conduct analyses as part of the post-equating process to identify individual items that are not performing as expected. Through this task we hope to uncover idiosyncrasies that affect the performance of students on an individual item such as:

construct-irrelevant complexity or vagueness
context effects related to the presentation of the item
exposure to the item prior to the test

When such an item is identified, we either remove it from the anchor set or, if necessary, remove it from the test.
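To make the screening step concrete, here is a minimal Python sketch of one common style of anchor-item check. It is an illustration under stated assumptions, not a description of any state's operational procedure: it assumes a Rasch-type model, uses a simple mean-mean shift to align the new difficulty estimates to the bank values, and flags items with a hypothetical 0.3-logit threshold. The function name, item ids, and values are all invented for the example.

```python
from statistics import mean


def flag_drifting_anchors(bank_b, new_b, threshold=0.3):
    """Flag anchor items whose difficulty moved after alignment.

    bank_b: item id -> difficulty on the existing scale (logits).
    new_b:  item id -> difficulty freely estimated from the new data.
    threshold: assumed rule-of-thumb cutoff in logits (hypothetical).
    """
    common = sorted(set(bank_b) & set(new_b))
    # Mean-mean alignment: one constant places new estimates on the bank scale.
    shift = mean(bank_b[i] for i in common) - mean(new_b[i] for i in common)
    # Displacement: how far each aligned estimate sits from its bank value.
    displacement = {i: (new_b[i] + shift) - bank_b[i] for i in common}
    flagged = [i for i in common if abs(displacement[i]) > threshold]
    return shift, displacement, flagged


# Invented example: item "D" became much harder relative to the other anchors.
bank = {"A": -1.0, "B": 0.0, "C": 1.0, "D": 0.5}
new = {"A": -0.8, "B": 0.2, "C": 1.2, "D": 1.5}
shift, displacement, flagged = flag_drifting_anchors(bank, new)
print(flagged)
```

Note what a check like this cannot tell you: whether an item was flagged because of an idiosyncrasy in the item or because instruction on an entire standard changed. If a whole cluster of anchors moves together, dropping them changes the alignment constant and, with it, where the new results land on the scale.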

Those analyses can also provide evidence that the scale is not functioning as expected and may need to be examined further. Within a particular year or across years, it may become apparent that student performance is different on a particular item type, on a particular standard, or on items measuring particular cognitive skills. Such information may lead to the decision that a different model, scoring approach, and scale are needed to accurately capture the change in student performance that is taking place. That is not the type of decision, however, that should be made during the heat of the equating process; and it is certainly not the type of decision that is made by psychometricians without the input of content experts and policymakers.

With all that has occurred with curriculum and instruction since March 2019, it is more likely than not that the spring 2021 tests will reveal that students are not performing in the same way as expected (i.e., the same way as students in previous years) on particular standards, types of items, or perhaps on items measuring higher-level cognitive skills. Why should they? That is not a reason to change the scale – particularly if we expect instruction and student performance to return to normal in the near future.

Even if we believe that a new scale will be needed moving forward, the spring 2021 test administration is neither the time nor the place to attempt to establish that scale. There’ll be time enough for creating new scales when the pandemic is done.

Measure Twice

The best that we can hope for in spring 2021 is to collect information about student performance on the state’s existing scale. I trust that policymakers, educators, and the public are smart enough to interpret that information appropriately with regard to the conditions within their state or community.

Many assessment specialists are recommending that states reuse a previously administered test form. Reusing a test form is certainly not a first (or even second) choice in normal years, but it is arguably the best option in spring 2021, if feasible. If a state does not have a complete test form available and needs to include some new items to round out the spring 2021 test, there are certainly ways to artfully apply the science of psychometrics to estimate the location of those items on the existing scale.
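One simple way to picture "estimating the location of a new item on the existing scale" is sketched below in Python. It is a deliberately simplified illustration, not a recommended operational method: it assumes a Rasch model, treats each student's ability as already known on the existing scale (for example, from the reused, previously calibrated items), and ignores the error in those ability estimates. The function names and the step cap are assumptions made for the example.

```python
import math


def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_new_item_difficulty(thetas, responses, b=0.0, tol=1e-8, max_iter=50):
    """Maximum-likelihood difficulty of one new item, abilities held fixed.

    thetas:    student abilities on the existing scale (treated as known).
    responses: 0/1 scores of the same students on the new item.
    """
    observed = sum(responses)
    if observed == 0 or observed == len(responses):
        raise ValueError("difficulty is not estimable when all responses are identical")
    for _ in range(max_iter):
        probs = [rasch_p(t, b) for t in thetas]
        gradient = sum(probs) - observed            # d logL / d b
        info = sum(p * (1.0 - p) for p in probs)    # -(d^2 logL / d b^2)
        step = max(-1.0, min(1.0, gradient / info))  # capped Newton step (safeguard)
        b += step
        if abs(step) < tol:
            break
    return b


# Invented example: four students at theta = 1.0, one of whom answers correctly.
b_hat = estimate_new_item_difficulty([1.0, 1.0, 1.0, 1.0], [1, 0, 0, 0])
print(round(b_hat, 3))
```

The point of the sketch is only that the existing scale stays put: the abilities are held fixed, and the new item is located relative to them rather than the scale being re-estimated around the new data.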

Remember, we only equate because we have to. If that’s one thing that we can take off the table for spring 2021, let’s do that. Policymakers, educators, and the public are not going to complain.

Image by Mahesh Patel from Pixabay


Published by Charlie DePascale