assessment, accountability, and other important stuff

Remember the Alamo




Charlie DePascale

This spring I returned to San Antonio to attend the 2017 NCME conference.  The trip brought back memories of my many visits to the Harcourt office there as a member of the MCAS management team for the Massachusetts Department of Education. My last MCAS trip was in August 2002.  Some things in San Antonio were exactly as I remembered them from fifteen years ago, others had changed a little, and some are now merely memories.


Harcourt, of course, is gone.  Schilo’s still has the best root beer I’ve ever tasted.  The Alamo is still nestled in the midst of office buildings, souvenir shops, and tourist “museums”.  I still could have found the convention center with the street map I used in 2002 (no smart phone back then).  However, unlike the Alamo, the convention center had moved just a bit from where I remembered it being in 2002.

Like Harcourt, the original MCAS program is gone; finally replaced this spring by the next generation MCAS tests. But, what about assessment and accountability, in general?  How have they changed since the summer of 2002?  What is the same, what has changed a little, and what is now a memory?

Still crazy after all these years

Like the Alamo, much of the facade of assessment and accountability remains unchanged.

  • State content standards, achievement standards, annual testing, and accountability ratings are still the foundation of our work.  
  • Individual state assessments and achievement standards are still the norm despite the best efforts of a lot of earnest folks.
  • NAEP and its trends are still the gold standard. The nation’s report card appeared to be teetering on the brink of irrelevancy back in 2009 with the advent of the Common Core and state assessment consortia.  Like the cockroach and Twinkies, however, it appears that NAEP is impervious to any kind of attack.

What’s new

NCLB has come and gone, but its effects on state assessment and accountability remain. Technology has also had an impact on how we test, and to a lesser extent for the time being, what we test.

  • Annual testing of all students at grades 3 through 8 remains in place – the primary legacy of NCLB.  

We have not yet come up with a good reason for testing all students in English language arts and mathematics every year, but there is surprisingly strong support for doing so. No, telling parents the truth and being able to compute growth scores are not good reasons for administering state assessments to every student every year.

  • Growth as a complement to status and improvement – I place growth scores as a close second to the annual testing of all students when considering the legacy of NCLB; but second and not first only because annual testing was a sine qua non for growth scores.  

Despite all of the value they add to the interpretation of student and school performance, I remain skeptical about our use of growth scores. Much like the original ‘1% cap’ on alternate assessments, growth scores emerged as a means to include more students in the numerator when computing the “percent proficient” for NCLB.  Yes, there are sound reasons for giving schools credit for growth, but growth scores would have died a quick death if they could not have been used to help schools meet their AMO.

I am worried about what I have referred to as the slippery slope of growth; that is, the use of growth scores as the 21st century approach to accepting different expectations for different groups of students.

I am worried about the use of growth scores as another example of our predilection for treating the symptom rather than the disease.  People are not fit – take a pill to lose weight.  College debt is too high – improve student loans.  Students have no reasonable chance to master grade level standards within a school year – compute growth scores.

  • Online testing (née computer-based testing) replacing paper-and-pencil testing.  For years, it appeared that online testing was the proverbial carrot being dangled in front of the horse – always just a bit out of our reach.  Now, however, it appears that we have finally reached a tipping point.  Although there are still glitches, in states across the country the infrastructure is in place to support large-scale online testing.

Online testing itself, of course, is only one way in which technology has impacted large-scale assessment.  Figuring out the best way to make use of things like automated scoring, student information systems, adaptive testing, dynamic reporting, and the wealth of process data available from an online testing session should keep the next generation of testing and measurement specialists quite busy for years to come.

  • Communication across states – There is now constant communication among states, and that communication occurs at multiple levels (commissioners/deputy commissioners, assessment directors, content specialists, etc.).  Some might argue that there is often more and better communication within levels across states than across levels within a state, but let’s save that discussion for another day.

The increased communication across states can be attributed to the common requirements and pain of NCLB, technology, the assessment consortia, or all of the above.  Cross-state communication, however, deserves its own place in any list of things that are new since 2002.   The bottom line is that although it may not have resulted in common assessments to the extent expected, increased communication across states has changed how we think about and how we do state assessment and accountability.

Gone, but not forgotten (I hope)

Nothing lasts forever (except NAEP trend lines, see above), but there are a few things that faded away or simply disappeared over the last fifteen years that surprised me.

  • District accountability – It feels as if there is much less focus now on district accountability as something distinct from school accountability than there was in 2002, even after NCLB was first enacted.

District report cards are often simply aggregated school report cards, with districts evaluated on the same metrics and indicators as schools.  Although technology and laws/regulations have made it much easier for states to interact directly with schools, there is something critical in the hierarchy among states, districts, and schools that must be maintained.  Perhaps as the accountability pendulum swings slightly back from an exclusive focus on outcomes (i.e., test scores), the impact of district inputs, processes, and procedures on student performance will receive greater attention.

  • Standardization – So critical to large-scale assessment that it was actually part of the name (i.e., “standardized testing”), standardization is dead; and it will not be returning any time soon.  To some extent, standardization, in general, was a casualty of the backlash against traditional “fill in the bubble tests” and test-based accountability, but that is only part of the story.  

We (i.e., the assessment and measurement field) gave up standardization in administration conditions in stages.  Time limits were abandoned in the name of proficiency, “standards-based” education, and allowing kids to show what they could do (don’t ask me to explain how that made sense).  Standardized administration windows were abandoned to accommodate the use of technology.  Concerns about validity and the appropriateness of accommodations for students with disabilities and English language learners gave way to “accommodations” for all in the name of equity and fairness.

We were never truly invested in standardization of content for individual students, so we gave that up willingly as it allowed us to play with measurement toys like adaptive testing, matrix sampling, and “extreme equating” that is worthy of the name.

As for standardization of scoring, well, if we limit the concept of scoring to mean the scoring of an item, then as soon as we moved away from machine-scored, multiple-choice items, standardization of scoring took a hit.  The larger concept of scoring, however, is a bit too complicated to address adequately near the end of a post such as this.  For now, let’s just end our discussion of standardization of scoring with the question that has been posed in countless stories and songs: can you lose something that you never really had?

  • Time – I believe that the single most important change in large-scale assessment since 2002 may be the loss of time available to design, develop, and implement an assessment program.  

The RFP for the original MCAS program was issued in 1994; after some delay the contract was awarded in 1995; and the first operational MCAS tests were administered in spring 1998.  As new tests were added to the MCAS program, it was the norm for a test to be introduced via a full-scale pilot administration one year prior to the first operational administration.  

In contrast, the RFP for the next generation MCAS tests was issued in March 2016; the contract was awarded in August 2016; and the first operational MCAS tests were administered in spring 2017.

And Massachusetts is not alone.  In states across the country, it is becoming normal practice to go from RFP to operational test in less than a year.  This would not be as much of a problem if states were simply purchasing ready-made, commercial tests that had been carefully constructed and validated for use in the state’s particular context.  For most state assessments, however, that is not the case.

Four years from initial design to implementation may or may not be too long, but there is no question that 10 months is too short.  And I don’t want to hear “the perfect is the enemy of the good” or “building the plane while we are flying it” as arguments for minimizing the test development cycle.  First, those concepts don’t really apply to things like planes (and high-stakes assessments).  Second, they do not apply in one-sided relationships in which mistakes are met with fines, lawsuits, and loss of contracts. Third, what’s the rush?

Where do we go from here?

I identified the loss of time as the most significant change not so much because of its negative impact on the quality of state assessments, but rather because of what it signifies.  The requirement to design, develop, and implement custom assessments in less than a year is a clear indication that the measurement and assessment community may have lost what little control or influence we had over assessment policy.  Recent trends in the area of comparability are another example.  We are more likely to be asked after the fact “how to make things comparable” or to “show that they are comparable” than to be asked to offer input in advance on “whether an action will produce results that are comparable.”  To paraphrase an old expression, policymakers find it much easier to ask the measurement community for justification or a rationalization than permission.

In part, this may be a reflection of policymakers’ facing many constraints and limited options. In part, however, I believe that we have ourselves to blame; we are victims of our own success or at least of our own public relations machine.  Psychometricians can do anything with test scores!  Unfortunately, the field cannot survive as a scientific discipline by simply saying yes to every request that is made, regardless of how unrealistic or unreasonable that request may be.  (Note that I am making a distinction between testing companies surviving and the assessment/measurement field surviving.)

The fate of the field, however, may have already been sealed by the emergence of Big Data. Only time will tell.  

I will meet you at Schilo’s in 2032 and we can discuss it over a frosty mug of root beer. The first round is on me.



