assessment, accountability, and other important stuff


A rejoinder



Charlie DePascale

“Father, forgive them, for they know not what they do!”

In the spirit of the Easter season, that was my initial reaction when I read what Steve Sireci had to say in his final newsletter column as president of NCME. I support Steve as a born-again public policy advocate.  I agree wholeheartedly with Steve that NCME should be more proactive as an organization in protecting measurement principles, promoting the proper use of assessment, and particularly in supporting states in their attempts to use educational measurement to improve student learning.

For example, it would have been wonderful to have NCME weigh in when the two happy hoopsters from Chicago were running their 3-card Monte game (aka NCLB waivers) on desperate states. In essence, they offered states a waiver on a requirement that never actually existed (100% proficiency) for agreeing to a “new and improved” requirement that was mathematically equivalent to the requirement that did actually exist (reduce non-Proficient students by 10% annually) while forcing states to accept the horrific, ill-conceived teacher evaluation requirements that Steve decries.
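The "mathematically equivalent" point can be made concrete with a little arithmetic. A minimal sketch, assuming (this is my reading of the waiver terms, not something stated above) that the waiver's replacement target asked states to cut the percentage of non-proficient students in half within six years; the starting percentage is hypothetical:

```python
# Sketch of the arithmetic behind the "mathematically equivalent" claim.
# Assumption (not from the post): the waiver option asked states to cut the
# non-proficient share in half within six years.

def non_proficient_after(start_pct, years, annual_cut=0.10):
    """Non-proficient share after reducing it by `annual_cut` of its
    current value each year (the existing 10%-annual-reduction target)."""
    share = start_pct
    for _ in range(years):
        share *= (1 - annual_cut)
    return share

start = 40.0  # hypothetical school: 40% of students non-proficient
after_6 = non_proficient_after(start, 6)
print(f"After 6 years of 10% annual reductions: {after_6:.1f}% non-proficient")
print(f"That is {after_6 / start:.0%} of the starting share -- roughly half,")
print("so the two targets prescribe essentially the same trajectory.")
print("Note that the trajectory never actually reaches 100% proficiency.")
```

Six years of compounding 10% cuts leaves about 53% of the original non-proficient share, which is, for practical purposes, the same as "cut it in half in six years," and no amount of compounding ever reaches zero.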

Yet, I came away from the column with an uneasy feeling.  I was uncomfortable with the underlying message that if the states just knew better, if we just did a better job of informing them about the Standards and research, everything would have been better.  Although at a very high level I agree with the direction NCME is taking, the devil, as they say, is in the details. I decided that a rejoinder would be an appropriate NCME-type approach to point out a few areas of disagreement.

I am not going to comment on Student Growth Percentiles (SGP) in this post.  We can re-engage on the SGP debate at a future NCME conference.

I am also not going to spend much time on Steve’s comment about state assessments under NCLB:

“In my opinion, these early 21st century tests met the goal of demonstrating how well students had mastered specific aspects of curricular goals at a specific point in time.  But as illustrated above, that was not the only goal of NCLB.”

As a member of the MCAS team at the Massachusetts Department of Education, I held a similar view of the English language arts and mathematics tests our content team and contractor developed.

It was only after several lengthy discussions over three years with Jim Popham (sadly at AERA, not NCME, sessions) that I realized that the quality of the English language arts and mathematics tests and even the use of those tests to measure schools’ adequate yearly progress (AYP) toward 100% proficiency was largely an irrelevant question.  The critical issue was that student performance in English language arts and mathematics was being used as a proxy for the yet-to-be-defined construct of school quality. You can define school quality in terms of student performance in English language arts and mathematics, but you have to be prepared to live with the consequences (pun intended).

In this rejoinder, I do want to focus on two issues Steve raises in the column: technical reports and the role of TAC members during NCLB, the NCLB waivers, ESSA, and beyond.

Technical Reports

Let’s consider this statement regarding Technical Reports:

“Are the criticisms from teachers and others true—that tests have caused undue student anxiety, narrowed the curriculum, and led to even narrower curricula for historically marginalized groups?  Only specific empirical study of testing consequences can answer these questions.  If someone can show me such evidence in a technical manual for a statewide summative assessment, I will buy the authors of that manual a drink, for those authors have fulfilled the goals required in 21st-century validation.”

The content of state assessment Technical Manuals (i.e., what should be included and who is responsible for writing it) has been a hot topic for as long as I have been working with Technical Advisory Committees.  The two people whom I regard as my TAC mentors, George Madaus and Ron Hambleton, had strong feelings about what should be, but most often was not, included in a Technical Manual.

George’s pet peeve was related to validity and evidence that the test was meeting the lofty purposes laid out in the opening paragraphs of most technical manuals (e.g., improving instruction and student learning, reducing achievement gaps). His partial victory in this area was getting Massachusetts to rewrite and tone down its description of what the test could accomplish.  Ron, of course, remains concerned about score reporting: what was included on various reports, whether the information was understood by stakeholders, and how it was used.

The documents that we regard as state assessment Technical Manuals, for the most part, are written by the state’s assessment contractor as part of its contractual obligation to the state.  In various capacities, I have written, edited, and reviewed my fair share of them.  Over time, my view on the content of technical manuals has been somewhat fluid and amorphous, but has now congealed to the belief that what is needed is the following set of documents:

  • State Assessment Technical Manual (i.e., the current documents) – these documents are pretty much fine the way they are, with the exception of the aforementioned information about reporting and updates needed to reflect the shift to computer-based testing.  These documents provide evidence of the technical quality of the test that was developed, administered, and scored, and of the scores that were reported.
  • State Assessment Program Technical Manual – the manuals described above do not provide all of the information required to support a validity argument for the state’s testing program.  A separate manual containing additional information and evidence produced by the state and other external sources is needed to properly document the testing program.
  • State Accountability Systems Technical Manual – a state’s accountability system is separate and distinct from its testing program.  It requires its own theory of action, validity argument, and technical manual.  To the extent that an accountability system uses state test scores, it must produce evidence that they, along with all other data included in the system, are appropriate for use in the accountability system.

Without wading into a validity argument, it simply does no practical good to conflate tests, testing programs, and accountability systems.  Furthermore, the test-based information that is used in accountability systems is often what I have referred to as a “fruit loops” version of a test score – something that has been so heavily processed that it bears little resemblance to the original scale score from which it was derived.

Technical Advisory Committees

I feel compelled to address comments Steve made regarding Technical Advisory Committees and their actions, or lack thereof, during NCLB.  First, a disclosure/disclaimer:

  • The overlap between my TAC experiences and Steve’s is minimal.  We worked together for a brief time on two TACs, neither of which was a general assessment TAC.  Steve’s impressions may very accurately reflect his TAC experiences.

Regarding the era of NCLB and Adequate Yearly Progress (AYP), Steve wrote:

Over the past 17 years, fascinating explanation, discussion, and debate regarding AYP was published in our NCME journals Journal of Educational Measurement (3 articles) and Educational Measurement:  Issues and Practice (35 articles).  However, these discussions were largely absent from conversations involving education policy makers and state boards of education.

Test scores were being used for new accountability purposes, and there was little validity evidence to their use for the purposes of determining AYP.  Did such use put state departments of education in violation of the American Educational Research Association, American Council on Education, and National Council on Measurement in Education’s (1999, 2014) Standards for Educational and Psychological Testing?  If so, did we as responsible TAC members inform the states of such violations?

My impression of these times is we did issue warnings, but they were not strong enough.

Later, he added the following regarding the NCLB waiver era and the use of test scores and other indicators derived from test scores for high-stakes teacher accountability:

Specifically, we co-author standards that require evidence for the validity of test score use, and then we stand idly by, collecting our TAC honoraria, while teachers lose their jobs based on test score derivatives that lack validity evidence for teacher evaluation.  Clearly, our actions must change if we are to partner with the education community in using educational tests to help students learn.

Steve suggests that research contained in NCME publications and other peer-reviewed journals does not find its way into TAC discussions. Although state assessment teams and policy makers might be absent from NCME conferences, research does reach them in a variety of ways.

  • In the early NCLB era, TACs with which I worked included many highly regarded researchers and measurement specialists, six of whom are past presidents of NCME: Bob Linn, Dale Carlson, Barbara Plake, Laurie Wise, and the aforementioned George Madaus and Ron Hambleton.  These were not shy people.
  • Bob Linn published multiple papers on NCLB that were shared widely at TAC meetings including Accountability Systems: Implications of Requirements of the No Child Left Behind Act of 2001. Dale Carlson produced an elegant paper on school accountability that can be considered a seminal work in the use of both growth and status for school accountability.
  • Organizations such as CRESST and CCSSO produced documents and sponsored conferences that shared information directly with states. Between 2002 and 2007 new NCME Board member Ellen Forte and former Wisconsin state assessment and accountability administrator Bill Erpenbach worked with CCSSO to produce multiple detailed reports for states about accountability systems across the country.
  • Work that ultimately became NCME publications, such as the 2003 EM:IP article that I co-authored with Rich Hill on the Reliability of No Child Left Behind Accountability Designs, traveled a long road before being published by NCME, including presentations to state and federal officials at conferences such as the CCSSO Large-Scale Assessment conference, the CRESST conference, AERA, NCME, and foundation- and company-sponsored conferences.

If states were receiving all of this research, why wasn’t it used as Steve would hope?

TACs are convened to provide advice directly to the assessment team (or accountability team) in a department of education and indirectly to the commissioner. The state assessment team is charged with implementing programs to meet the state and federal laws, most often under very narrow constraints and tight timelines.  In the current context, there are two documents driving their actions.

Both documents are complex, are the result of compromise, rely heavily on imprecise and ill-defined terms, and contain contradictory statements or requirements.  One document is a federal law (i.e., NCLB, ESSA, IDEA) and the other is the Joint Standards (compare the introduction and standards in the validity chapter for an example). When state assessment teams are backed against the wall and faced with the option of “violating” one of the two documents, the Joint Standards lose 99 out of 100 times.

The one time when the Joint Standards might win is with regard to the high-stakes use of tests for students, such as promotion or graduation decisions, where there is a high likelihood of lawsuits.  In that case, the state moves forward, because not doing so is not an option, but it does so following the guidance offered by the AERA position statement on High-Stakes Testing in Pre-K – 12 Education. AERA doesn’t simply say don’t do it.  They say, if you have to do it, this is what must be done.

If TAC members were not advising states to simply adhere to the Joint Standards regarding AYP, what were they doing?  In most cases, TAC members were trying to help states make the best of a bad situation and do as little harm as possible.  In other words, they were saying, if you have to do it, this is what must be done.

In my work with TACs dealing with AYP and, later, teacher evaluation, I saw some of the most creative applications of confidence intervals that I have ever seen, all in the name of trying to help states do what had to be done. Also, remember that AYP was only one of many technical challenges that NCLB presented to states simultaneously, not the least of which was ensuring that all students were assessed fairly through the use of accommodations and alternate assessments.
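Those "creative applications" generally worked by building a wide confidence interval around a school's observed proficiency rate and counting the school as making AYP if the interval reached the annual target. A minimal sketch with made-up numbers (the school size, target, and confidence level are all hypothetical; states varied widely in the intervals they used):

```python
import math

def makes_ayp(n_students, n_proficient, target, z=2.326):
    """AYP-style decision using a normal-approximation confidence
    interval on the proficiency rate. A wide interval (here z ~ 2.33)
    lets a school 'meet' a target it falls short of, especially when
    n is small. All numbers are hypothetical."""
    p_hat = n_proficient / n_students
    se = math.sqrt(p_hat * (1 - p_hat) / n_students)
    upper = p_hat + z * se
    return upper >= target, p_hat, upper

met, p_hat, upper = makes_ayp(n_students=60, n_proficient=27, target=0.50)
print(f"Observed proficiency: {p_hat:.1%}, CI upper bound: {upper:.1%}")
print("Makes AYP:", met)  # below target, but the interval covers the target
```

The same 45% proficiency rate in a school ten times larger would fail, since the interval shrinks with sample size; choosing how wide to make the interval was precisely where the creativity came in.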

Learning from States

In closing, I would also like to note that some state leaders were quite proficient in dealing with the technical challenges presented by AYP and teacher evaluation.  One shining example was the late Mitch Chester. As accountability director in Ohio, he worked within NCLB to develop a balloon-payment accountability system that effectively deferred many problematic AYP decisions until a point when real improvement could be measured or the law would have been reauthorized.  Later, as commissioner in Massachusetts, again working within the law, his team effectively nullified the use of test scores for teacher accountability.

So, yes, NCME should be proactive in promoting good measurement, assessment, and accountability at the state and local levels.  As has been discussed recently with regard to classroom assessment, however, NCME cannot begin by assuming that states make the decisions they do simply because they don’t know any better.


Why is this time different?

The K-12 testing industry survived, even flourished, during past economic downturns. There are signs, however, that this time might be different.



Charlie DePascale

There have been two major economic downturns in the past twenty years: the bursting of the Dot-Com bubble in the early 2000s and the Great Recession of 2008.  Much like the proverbial cockroach in a nuclear winter, the K-12 standardized testing industry emerged from each not only unscathed, but in fact, strengthened.

In the early 2000s, the savior was No Child Left Behind (NCLB).  When Assistant Secretary of Education Susan Neuman declared, “Let them use multiple-choice tests,” it was off to the races.  Testing companies were the cock of the walk, strutting their stuff, spreading feathers that had been ruffled during the Psychometric Spring of the 1990s, and exulting in the good fortunes brought by the ‘No Psychometrician Left Behind Act’ or the ‘Psychometricians Full Employment Act’ that mandated annual testing of all students in grades 3 through 8 and once in high school.

In the latter part of that decade, it was President Obama to the rescue with NCLB waivers and the Race to the Top Assessment program.  Fears that the new administration would reduce required testing proved to be unfounded. Testing companies continued to feed at the federal trough and roll around happy as a pig in an internal alignment study. Like pigs being fattened for slaughter, however, they never saw what was coming.

The structure laid out in the Obama blueprint for reform contained heavy pillars that strained the sandy foundation of K-12 standardized testing.  To a greater degree than might be apparent, the success of standardized testing in schools relies on a longstanding social contract with local educators and communities.  As long as the tests do not take too much time and have no real consequences, everything is fine. NCLB might have pushed the boundaries of that contract, but the requirements that the new assessments measure college-and-career readiness standards and be used to measure teacher effectiveness tore the social contract to shreds.

The Every Student Succeeds Act (ESSA) removed the teacher evaluation requirements of the NCLB waivers, and states have made repeated concessions on the length of tests, but a social contract once violated is difficult to restore.

Accountability and Equity

K-12 standardized testing, of course, relies on more than a social contract with local educators.  The dual principles of accountability and equity have been the driving force behind federally mandated assessment since the early days of the original Elementary and Secondary Education Act (ESEA), passed in 1965.  Sen. Robert Kennedy pushed for the evaluation of Title I programs funded by ESEA as a tool for assuring parents that federal money was being spent wisely to improve student outcomes, a theme echoed by Secretary of Education Arne Duncan throughout the Obama administration.  From the very beginning, standardized tests became the tool of choice to provide that accountability.

The close link between accountability and equity resulted in standardized testing receiving bipartisan support. NCLB, with its assessment requirements, was championed by President George W. Bush and Sen. Edward Kennedy and was perhaps the last truly bipartisan bill passed by Congress.  There are clear signs, however, that standardized testing is no longer viewed as a tool to promote equity and, to the contrary, is perceived as a threat to equity.

Wearing the Black Hat

In response to a question posed at a candidates’ forum last December, presumptive Democratic nominee Joe Biden seemingly declared that he would end standardized testing in public schools.

“Given that standardized testing is rooted in a history of racism and eugenics,” the audience member asked, “if you are elected president, will you commit to ending the use of standardized testing in public schools?”

“Yes,” Biden responded. “As one of my friends and black pastors I spend a lot of time with . . . would say, you’re preaching to the choir, kid.”

The premise of the question may seem extreme, but it is becoming quite mainstream.

  • In her 2019 presidential address, AERA president Amy Stuart Wells described testing policies as the Jim Crow of education.
  • Last spring, an uproar over a passage and prompt on the Massachusetts 10th grade English language arts test led to the item being dropped and rules for meeting graduation requirements loosened in an “abundance of caution” over concerns for stereotype threat.
  • Late in 2019, a lawsuit was brought against the University of California system regarding the use of the SAT and ACT in college admissions, not necessarily because the tests are flawed technically, but because the use of the tests “privilege affluent families who can afford to send their children to tutoring,” and “illegally discriminates against applicants on the basis of their race, wealth and disability.”
  • This weekend, the National Council on Measurement in Education, faced with offering only one virtual session from the 2020 NCME Conference program, chose a session with the following description: “If learners vary in their cultural experiences, appreciations, characteristics, and needs, all aspects of culturally relevant pedagogy may need to reflect such variations. However, educational assessments have been most resistant.”

Standardized tests have been challenged for decades over technical concerns related to test bias.  The current challenges, however, go beyond technical issues that can be easily fixed through adjustments to the test development or scoring process.

Although the loss of trust among educators and concerns over equity may be serious threats to the continued viability of standardized testing, the greatest threat may come from within the field itself.

Necessary, maybe, but not sufficient

While the panelists in the NCME session described above did not spend much time addressing the cultural relevance of assessments, they did spend a great deal of time discussing the need for assessments to measure more and different things than the knowledge and skills that are currently being measured effectively through traditional on-demand, large-scale assessments.

The past five years have demonstrated clearly that the English language arts and mathematics college-and-career readiness standards that were a byproduct of state testing during NCLB require different kinds of assessments than traditional standardized tests.  An NRC committee on designing assessments for the Next Generation Science Standards concluded that large-scale standardized tests were only part of a comprehensive assessment solution.  A handful of states across the country, including standardized testing stalwarts such as Louisiana and Massachusetts, are exploring alternative methods of assessment under the Innovative Assessment Demonstration Authority provision of ESSA.

This Time May Be Different

Although a President Biden may not fully retreat from the testing policies he implemented as vice president, this time may be different.  As the COVID-19 crisis shrinks state budgets and schools refocus on what is really essential, a return to fully embracing a standardized testing system that was meeting few stakeholder needs would seem unlikely.  Standardized testing may remain part of the evaluation process for years to come, but it will likely be a much smaller part than it has been in the past; and traditional testing companies should not expect the boom time they experienced in 2002 and 2010.