Choosing Our Battles Wisely

We appear to be in a lull in the storm that has been battering large-scale assessment, in general, and state assessment programs, in particular, for the last several years. With the next band of thunderstorms on the horizon, this may be a good time to assess our situation and determine what we can do to make it through the storm.

What are the highest priority fixes, what can we put off until later, and what is it just too late to do anything about?

1. Through-year assessment

Through-year assessment is inevitable. For more than a decade, it has been a question of when and how, not if. Its flaws (perceived and real) are vastly outweighed by its benefits (perceived and real). Educators want it and are tired of us explaining to them why the results of their interim assessments and state tests don’t line up. They want the results to line up and we had better figure out how to make that happen – sooner rather than later.

2. Testing Students Every Year

For better or worse, for interim assessments and growth scores, for continuing to point out where the rich are getting richer and the poor getting poorer, testing every student every year through middle school is here to stay. As much as it might pain the creative assessment design side of me to say it, I don’t see an alternative.

We built the student information systems, we created growth scores, we marketed the honesty gap. In other words, we created a demand for testing students every year.

Looking to the future, however, it doesn’t have to be 8-10 hours of on-demand, pull-out, external testing. It really doesn’t. But we have to start from clearly identifying the information that we need to collect from state testing and then focus on the best way to accomplish that. Never again start with a focus on the best way to meet federal regulations. Regulations are made to be challenged.

3. Subscores

For decades, we have adopted a bend but don’t break strategy with regard to reporting subscores on state tests. Report something to keep them happy, but make sure that it’s pretty much meaningless and useless, and caveat the hell out of the thing. At least I like to think that my partners in crime have been doing that intentionally.

We have to continue to fight the good fight with regard to reporting subscores on state tests. There are some lines you just cannot cross.

4. Standardization

Excuse me. What standardization?

Despite the calls from Steve Sireci for more reasoned flexibility and a concept he calls UNDERSTANDardization, which as he explains, “refers to ensuring sufficient flexibility in standardized testing conditions to yield the most accurate measurement of proficiency for each examinee,” the simple fact is that the standardization ship has sailed, the barn door was left open and that horse ain’t coming back again. So long, farewell, auf Wiedersehen, goodbye. Elvis has left the building.

Two of the three pillars of standardization, content and administration, crumbled in the last attempt by the state assessment community to choose our battles wisely (see accommodations, untimed tests, months-long testing windows, 2% tests, CAT).

The last pillar, scoring, remains upright for now, like the Leaning Tower of Pisa, primarily because interested stakeholders long ago found it a fairly simple matter to adjust test scores after scoring in the name of fairness, equity, etc. It would not surprise me, however, to see “conditional scoring” of student responses within the next 5 years. To borrow from Sireci, the current state of state testing may best be described as UNDER-standardization.

5. Automated Scoring

State testing needs automated scoring now in the same way that large-scale testing needed machine-scoring to thrive in the 20th century and needed scanned images and electronic scoring systems to support the introduction of constructed-response items and essays as we approached the 21st century. At this point, automated scoring might actually become “AI Scoring” as we have been calling it for a couple of decades.

Now that there are actual scoring engines and we have a pretty good sense of what they can and cannot do, what we need to focus on is a policy agenda and research agenda focused on what we want from scoring. So many of the current processes and procedures that we use in scoring grew out of necessity and the constraints of paper-based scoring systems. How would we want scoring to function in a world where we were starting without those constraints?

6. Measurement Community Concerns

This is as good a time as any to accept the fact that there are very few issues that interest measurement researchers that a) anybody involved in state testing cares about and b) are important enough to make a difference in the way that state tests are developed or administered, or affect the way that results from state tests are used. And there are few state testing issues that, at their core, would spark the interest of researchers focused on measurement.

One only needs to have spent a day attending NCME sessions on multidimensional IRT and other models at any point in the past 10 years to see the disconnect between research in measurement and state testing. That’s OK. Advances in measurement will find their way into state testing when the time is right.

The lone exception might be Fairness, if you want to include it as something that the measurement community is concerned about, although the community has not yet really reached a consensus on how Fairness, clearly a testing issue, manifests itself as a measurement issue distinct from validity.

7. Things that should be Measurement Concerns

Equating and Item Banks.

The shifts in state testing that began with the Race to the Top Assessment Program in 2010, and its implementation in 2014 and 2015, put tremendous strain on IRT and traditional methods that have been used for equating, placing items on a scale, and maintaining item banks. If anything, the collapse of PARCC, the streamlining of Smarter Balanced, and the disruptions caused by the pandemic have masked very real flaws in attempts to apply the same old methods to next-generation, mixed-format tests consisting of items snatched from an item bank built over multiple years with items from multiple sources.

8. Use of College Admissions Tests as High School State Tests

On paper, this was the best idea since sliced bread – at least for states. When the ruling came down from above that state content standards would herewith and henceforth be college-and-career-readiness standards, the use of the ACT and SAT as the state test at high schools was a no-brainer. The question of alignment to state standards is a non-issue and should be of no concern at all to the USED and their Peer Reviewers. As I have written, state tests and high school were never a good match.

Whether the use of college admissions tests as high school state tests was a good long-term business decision for ACT and the College Board remains to be seen.

9. School accountability

As Jim Popham screamed from the rooftops in the early 2000s, student assessment and school accountability are not the same thing. Student achievement and school effectiveness are not the same thing. We mucked things up for a while with that whole validity-use squabble, but we kind of get that now.

The next step is to treat school accountability systems like the “school assessment programs” that they are. They need their own technical reports. They need their own Technical Advisory Committees and technical experts – perhaps a role for the econometricians and statisticians to keep those folks away from student assessment. They need to adhere to standards. They need validity studies. They need federal Peer Review – oops, I went too far.

10. Growth

Can you believe that it’s been more than 15 years since we were all caught up in Margaret Spellings, bright line principles, and seeking approval for “growth models” under NCLB?

We’ve come a long way, baby, from simply figuring out clever ways to get additional kids into the numerator when computing percent proficient.

But there is still a long way to go. We have some solid growth statistics, or measures, or indicators, now, but we have not made a lot of progress answering, or even addressing, the questions posed in Dale Carlson’s four quadrants. As Dale wrote to state leaders at the end of his 2006 paper,

“States are obligated to think long and hard about the specific goals of their [accountability] systems and to be sure the approach selected matches their goals.”

And to researchers:

“New statistical approaches … [to measure growth] are becoming widespread. This work is currently in the context of individual students; the study of the applicability of these to the study of school effectiveness is urgent.”

11. Alternate Assessment

After a flurry of excitement and activity in the late 1990s and early 2000s brought about by IDEA 1997 and NCLB, just about everyone seems content ceding the alternate assessment space to DLM. That’s fine. There’s no need to poke that bear while it’s sleeping. And in their relatively low-stakes environment, the researchers with DLM just might come up with some innovations to advance state testing, in general.

12. English Learners

Unlike my “let sleeping dogs lie” approach with regard to alternate assessment – to mix animal metaphors – I think that it’s probably time to place as much effort into thinking through the assessment of English learners as has been placed into figuring out what to call these kids. Is “multilingual learners” the latest terminology or am I behind the curve?

If this were just a small percentage of students taking the ACCESS test, I might say let’s circle back and tackle this one at a later date. But it’s not a small percentage of students and it’s not simply the English language proficiency test where technical and policy solutions are going to be needed fairly soon.

13. Improving instruction and student learning

Stare directly in the mirror and ask yourself the following three questions:

1. Do I have any training or expertise in improving instruction and student learning?
2. Do I collaborate on a regular basis with people whose life and research are devoted to improving instruction and student learning? (Leading workshops to spread assessment literacy throughout the land doesn’t count.)
3. Have I stepped inside a school or district office in the past five years other than for a parent-teacher conference, extracurricular event, or craft fair?

If your answer to any of those questions is No, every time you feel the urge to use state testing to improve instruction and student learning, recite the following phrase while striking yourself smartly in the hand with a ruler.

It’s me, hi, I’m the problem, it’s me

14. Early childhood education

See above.

15. Performance Tasks

See two items above.

16. Justice, Equity, Inclusion, Diversity, Anti-racist, Culturally Sustaining/Revitalizing

Circling back to our storm analogy, the past has passed, the future is to come, but there are things that we can do right now.

We can address equity issues in policies related to the use of test scores pretty easily and quickly. Good, let’s do that.

We can continue to increase diversity and representation on committees. We have been doing it for a decade or more and we can do it better.

We can continue to increase access to the current tests, both through selection of content and enhanced accessibility features, while working on designing better tests.

We can get to work on increasing diversity in the field and in leadership positions within the field.

What might take a bit longer? When the pandemic hit, we had barely started to process Mislevy’s Sociocognitive Foundations of Educational Measurement and were struggling to unpack the implications of the Massachusetts experience with stereotype threat on the selection of passages for reading tests and problem scenarios for other tests. We don’t want to make changes to test items, test forms, and analytic procedures without at least a basic understanding of the issues or even the terminology. We want to learn lessons from the recent past and not get too far ahead of research or make changes that are not consistent with advances in pedagogy. We have tried leading reform from the top down through assessment and it has not worked well.

Just Aim Beyond the Clouds and Rise Above the Crowds

If we want to be able to look at state testing five years from now and say we made it through the rain, we have to take a deep breath, keep our heads about us, and keep them above water in the midst of this storm. We can lament the preparations and upgrades that we should have made in the past and think about improvements that we will make in the future, but right now we have to focus on the present and choose our battles wisely.
