Those Crazy, Hazy, Lazy Days of Testing

Lived through those crazy, hazy, lazy days of testing
Those days of W, Obama, and peers.
Thrived on those crazy, hazy, lazy days of testing
We thought state testing would always be here.

[Fast forward through 20 years or so with carefree, upbeat verses]
[End with one gloomier verse, maintain the upbeat tempo but a more earnest attitude]

But those seeds of doubt took root and now are blooming.
Large-scale state testing, your future’s unclear.

Who knew what those of us in state testing were in for when that final chad fell in December 2000?

It’s been a wild ride. There have been enough twists and turns, reckless careening, well-intentioned convening, and plunges into the depths and darkness of hell to make Mr. Toad’s head spin and to make Dante proud.

If you had to pick just one word to describe the past two decades of large-scale state testing, what would it be? Tough, right?

Well, I couldn’t settle on just one, so with a shoutout to Nat King Cole and Hans Friedrich August Carste, I selected three:

Crazy, Hazy, Lazy


Crazy needs the least amount of explanation. However, for those of you who didn’t experience it firsthand or have effectively blocked most of the past two decades from your memory, here’s a quick recap.

No Child Left Behind brought a hard stop to the 1990s and the dreams of those of us who held out hope for incorporating authentic, performance-based assessment into large-scale state testing.

Instead of testing at grades 4, 8, and 10, students would now be tested at grades 3 through 8 and at least once in high school.

  • More than doubling the number of required tests and students tested (two equally enormous, but very different challenges, by the way).
  • Leading to an increased need for coherence in results across tests. There’s a lot of water under the bridge that you can use to explain differences in test results between grades 4 and 8, and even between grades 8 and 10. Not to mention that in the vast majority of cases, that’s only one test per school. Not so with testing at grades 3 through 8.
  • Creating an interest in longitudinal data at a time when there were no systems in place for tracking students across years. There were also no systems in place for storing data across years, but that seemed like the easier problem to solve.

Tests that were low-stakes or medium-stakes for schools and teachers now had very high stakes and expectations for improvement associated with them. Stakes became even higher for teachers around 2010.

And then there were alternate assessments and tests of English language proficiency and alternate assessments for tests of English language proficiency.

The role of USED shifted from the cool uncle slipping you a 20 to have a good time on Friday night to the strict parent checking what you were wearing when you left the house and waiting up for you in the living room when you returned.

Tests and testing programs went full circle from commercial off-the-shelf to augmented commercial (remember those fun times?) to custom to custom-consortia and back to basically off-the-shelf.

The testing industry expanded then consolidated then contracted then consolidated again then expanded again in different directions.

  • Some of which was directly related to computer-based testing finally replacing paper-based testing across the country, nullifying companies without a platform.


Who am I? Why am I here?

Suddenly, everyone was paying attention to large-scale state testing, and not in the same way as Dan Koretz did in the 1990s. Eager to please, we let the list of things that we agreed large-scale state testing could be used for grow longer and longer:

  • Measuring student achievement of state content standards
  • Determining school effectiveness
  • Informing instruction of individual students
  • Serving as a model for classroom assessment
  • Measuring student and school growth
  • Measuring student and school improvement
  • Measuring the breadth and depth of state content standards
  • Determining where each and every student falls along the achievement or proficiency continuum
  • Providing “honest” information to parents
  • Determining teacher effectiveness. Or was it teaching effectiveness?

And please do it all good, fast, and cheap.

Sure, no problem. Because you like me, you really, really like me.

Of course, there were other factors contributing to the fog enveloping state testing as well.

State testing became conflated with state accountability – which only exacerbated our field’s inner struggles over validity.

Like a voice in the wilderness, Popham warned of the difference between student achievement in mathematics and ELA and school effectiveness. But was anybody there? Did anybody care?

Is that norm-referenced or criterion-referenced? Both? Does it matter? Does it change how we build or score the test?

Cross-sectional or longitudinal?

Growth. Improvement. Progress.

  • We documented at least five very different ways to define growth, and there were at least three different methods of calculating growth associated with each definition. A year’s worth of growth???

Standardization v. Flexibility.

  • Manuals listing page after page of standard and non-standard accommodations. Timed, untimed, and loosely timed tests. Months-long testing windows.

College AND Career Ready. On track to college-and-career ready. At third grade.

And Alignment. Ah, alignment.

Who am I? Why am I here?


When I think of all of the people, friend and foe alike, who gave everything they had to large-scale state testing over the past two decades, lazy is the last word that comes to mind.

But it is a different kind of lazy of which I speak. I’m talking about the kind of laziness that slowly overtakes a field.

There is intellectual laziness, the atrophy that comes from two decades of state testing programs being defined and constrained by federal rules and regulations.

Ours is not to question why. Ours is to manage effectively and efficiently.  And for the most part, we do.

Innovation. Disruption. Maintenance. All are very important and require highly skilled people, especially for systems as large and complex as state assessment programs. But not the same people.

More importantly, there is measurement laziness, which, when combined with the ever-present layer of psychometric arrogance, creates what I’ll label an infallibility mentality. What I mean by that is that we have become comfortable, much too comfortable in my opinion, with the practice of a score on the state test being the sole determiner of student proficiency.

There is no student proficiency without the state test. A student is not proficient until we test them, and our test declares them so.

To be clear, I am not trying to engage in an “I measure it; therefore, it is” philosophical argument about latent traits, hidden constructs, or whether there can be real measurement in the social sciences. My claim has nothing to do with nihilism, existentialism, positivism, or any other “-ism” associated with unlocking the mystery of the meaning of test scores.

No, I am just talking about laziness.

You can approach large-scale state testing and student proficiency from one of two perspectives:

    1. There are students in the state who are Proficient in mathematics and those who are not. My test must accurately distinguish between them.
    2. My test measures student proficiency in mathematics to determine which students are Proficient and which are not.

Those two perspectives might seem very similar, and the difference between them might even have little impact on how you design your state test.

They differ greatly, however, in the type of actions and amount of effort that you would be expected to take in validating your test scores.

In short, Perspective 1 requires a lot more work.

Across the past two decades, the field adopted Perspective 2.

To be fair, the field’s attraction to Perspective 2 is not the fault of any individual psychometrician or testing company.

At the top of the slippery inward-looking slope was the original claim:

 State tests are designed to measure state standards and the state has the legal authority to define those standards.

More than Messick. More than Kane or the joint Standards. That simple sentence right there legitimately, and very effectively by the way, influenced the validation of state tests. It eliminated the need for those in state testing to be concerned about a host of Validity Standards focused on external evidence. Everything we needed to worry about was contained within the test and the state standards.

When you start rolling down that hill, it’s hard to stop.

So, we didn’t (even when the claim changed to college-and-career ready).

We didn’t ask how well our tests did in identifying students who were Proficient or CCR or on track to CCR (whatever that means). Did anybody care?

Well, for some very good reasons, there has been a lot more interest in

a) examining the intended and unintended consequences of the uses of test scores and the utility of assessment programs than in

b) validating inferences about whether individual students, or subgroups of students, are Proficient in mathematics.

And the impact of the field’s eagerness to conflate the two in defining validity certainly cannot be discounted as a contributing factor.

So, again, it became easy to dismiss the healthy skepticism we should always possess about the ability of state test scores to do the one thing they were supposed to do – to correctly identify the students who had achieved proficiency on the state standards.

And recalling the opening line of this section, it’s not like people were just sitting around looking for more questions to ask and analyses to conduct.

Where do we go from here?

Great question, but this post is already at 1,500 words so the answer will have to wait. For now, take a breath, find a quiet place and take some time to think. Then grab a soda and pretzels and beer and head to the beach. Enjoy the summer. It’s been a wild ride.


Published by Charlie DePascale

Charlie DePascale is an educational consultant specializing in the area of large-scale educational assessment. When absolutely necessary, he is a psychometrician. The ideas expressed in these posts are his (at least at the time they were written), and are not intended to reflect the views of any organizations with which he is affiliated personally or professionally.
