“Father, forgive them, for they know not what they do!”
In the spirit of the Easter season, that was my initial reaction when I read what Steve Sireci had to say in his final newsletter column as president of NCME. I support Steve as a born-again public policy advocate. I agree wholeheartedly with Steve that NCME should be more proactive as an organization in protecting measurement principles, promoting the proper use of assessment, and particularly in supporting states in their attempts to use educational measurement to improve student learning.
For example, it would have been wonderful to have NCME weigh in when the two happy hoopsters from Chicago were running their three-card monte game (aka NCLB waivers) on desperate states. In essence, they offered states a waiver on a requirement that never actually existed (100% proficiency) in exchange for agreeing to a “new and improved” requirement that was mathematically equivalent to the requirement that did exist (reduce the percentage of non-Proficient students by 10% annually), all while forcing states to accept the horrific, ill-conceived teacher evaluation requirements that Steve decries.
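The arithmetic behind that equivalence is easy to check. The sketch below is purely illustrative (the 40% starting proficiency rate and the year counts are my own hypothetical numbers, not figures from the law): reducing the non-proficient share by 10% each year pushes a school toward 100% proficiency asymptotically but never actually gets it there.

```python
# Toy illustration of the "reduce non-Proficient students by 10% annually"
# rule. The 40% starting proficiency rate is a hypothetical value chosen
# only for illustration, not a figure from NCLB or the waivers.
def proficiency_after(years, start_proficient=0.40):
    non_proficient = 1.0 - start_proficient
    for _ in range(years):
        non_proficient *= 0.90  # cut the non-proficient share by 10%
    return 1.0 - non_proficient

for y in (1, 5, 10, 50):
    print(f"year {y:2d}: {proficiency_after(y):.1%} proficient")
```

Under this rule, the hypothetical school starting at 40% proficient reaches roughly 79% after ten years and about 99.7% after fifty: 100% proficiency is the limit of the trajectory, not a requirement any school was ever literally expected to hit.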
Yet, I came away from the column with an uneasy feeling. I was uncomfortable with the underlying message that if the states had just known better, if we had just done a better job of informing them about the Standards and the research, everything would have been better. Although, at a very high level, I agree with the direction NCME is taking, the devil, as they say, is in the details. I decided that a rejoinder would be an appropriate NCME-type approach to point out a few areas of disagreement.
I am not going to comment on Student Growth Percentiles (SGP) in this post. We can re-engage on the SGP debate at a future NCME conference.
I am also not going to spend much time on Steve’s comment about state assessments under NCLB:
“In my opinion, these early 21st century tests met the goal of demonstrating how well students had mastered specific aspects of curricular goals at a specific point in time. But as illustrated above, that was not the only goal of NCLB.”
As a member of the MCAS team at the Massachusetts Department of Education, I held a similar view of the English language arts and mathematics tests our content team and contractor developed.
It was only after several lengthy discussions over three years with Jim Popham (sadly at AERA, not NCME, sessions) that I realized that the quality of the English language arts and mathematics tests, and even the use of those tests to measure schools’ adequate yearly progress (AYP) toward 100% proficiency, was largely beside the point. The critical issue was that student performance in English language arts and mathematics was being used as a proxy for the yet-to-be-defined construct of school quality. You can define school quality in terms of student performance in English language arts and mathematics, but you have to be prepared to live with the consequences (pun intended).
In this rejoinder, I do want to focus on two issues Steve raises in the column: technical reports and the role of TAC members during NCLB, the NCLB waivers, ESSA, and beyond.
Let’s consider this statement regarding Technical Reports:
“Are the criticisms from teachers and others true—that tests have caused undue student anxiety, narrowed the curriculum, and led to even narrower curricula for historically marginalized groups? Only specific empirical study of testing consequences can answer these questions. If someone can show me such evidence in a technical manual for a statewide summative assessment, I will buy the authors of that manual a drink, for those authors have fulfilled the goals required in 21st-century validation.”
The content of state assessment Technical Manuals (i.e., what should be included and who is responsible for writing it) has been a hot topic as long as I have been working with Technical Advisory Committees. The two people whom I regard as my TAC mentors, George Madaus and Ron Hambleton, had strong feelings about what should be, but most often was not, included in a Technical Manual.
George’s pet peeve was related to validity and evidence that the test was meeting the lofty purposes laid out in the opening paragraphs of most technical manuals (e.g., improving instruction and student learning, reducing achievement gaps). His partial victory in this area was getting Massachusetts to rewrite and tone down its description of what the test could accomplish. Ron, of course, remains concerned about score reporting: what is included on various reports, whether the information is understood by stakeholders, and how it is used.
The documents that we regard as state assessment Technical Manuals, for the most part, are written by the state’s assessment contractor as part of its contractual obligation to the state. In various capacities, I have written, edited, and reviewed my fair share of them. Over time, my view on the content of technical manuals has been somewhat fluid and amorphous, but has now congealed to the belief that what is needed is the following set of documents:
- State Assessment Technical Manual (i.e., the current documents) – these documents are pretty much fine the way they are, with the exception of the aforementioned information about reporting and updates needed to reflect the shift to computer-based testing. These documents provide evidence of the technical quality of the test that was developed, administered, and scored, and of the scores that were reported.
- State Assessment Program Technical Manual – the manuals described above do not provide all of the information required to support a validity argument for the state’s testing program. A separate manual containing additional information and evidence produced by the state and other external sources is needed to properly document the testing program.
- State Accountability Systems Technical Manual – a state’s accountability system is separate and distinct from its testing program. It requires its own theory of action, validity argument, and technical manual. To the extent that an accountability system uses state test scores, it must produce evidence that those scores, along with all other data included in the system, are appropriate for use in the accountability system.
Without wading into a validity argument, it simply does no practical good to conflate tests, testing programs, and accountability systems. Furthermore, the test-based information that is used in accountability systems is often what I have referred to as a “fruit loops” version of a test score – something that has been so heavily processed that it bears little resemblance to the original scale score from which it was derived.
Technical Advisory Committees
I feel compelled to address comments Steve made regarding Technical Advisory Committees and their actions, or lack thereof, during NCLB. First, a disclosure/disclaimer:
- The overlap between my TAC experiences and Steve’s is minimal. We worked together for a brief time on two TACs, neither of which was a general assessment TAC. Steve’s impressions may very accurately reflect his TAC experiences.
Regarding the era of NCLB and Adequate Yearly Progress (AYP), Steve wrote:
“Over the past 17 years, fascinating explanation, discussion, and debate regarding AYP was published in our NCME journals Journal of Educational Measurement (3 articles) and Educational Measurement: Issues and Practice (35 articles). However, these discussions were largely absent from conversations involving education policy makers and state boards of education.
Test scores were being used for new accountability purposes, and there was little validity evidence to their use for the purposes of determining AYP. Did such use put state departments of education in violation of the American Educational Research Association, American Council on Education, and National Council on Measurement in Education’s (1999, 2014) Standards for Educational and Psychological Testing? If so, did we as responsible TAC members inform the states of such violations?
My impression of these times is we did issue warnings, but they were not strong enough.”
Later, he added the following regarding the NCLB waiver era and the use of test scores, and other indicators derived from test scores, for high-stakes teacher accountability:
“Specifically, we co-author standards that require evidence for the validity of test score use, and then we stand idly by, collecting our TAC honoraria, while teachers lose their jobs based on test score derivatives that lack validity evidence for teacher evaluation. Clearly, our actions must change if we are to partner with the education community in using educational tests to help students learn.”
Steve suggests that research contained in NCME publications and other peer-reviewed journals does not find its way into TAC discussions. Although state assessment teams and policy makers might be absent from NCME conferences, research does reach them in a variety of ways.
- In the early NCLB era, TACs with which I worked included many highly regarded researchers and measurement specialists, six of whom are past presidents of NCME: Bob Linn, Dale Carlson, Barbara Plake, Laurie Wise, and the aforementioned George Madaus and Ron Hambleton. These were not shy people.
- Bob Linn published multiple papers on NCLB that were shared widely at TAC meetings, including Accountability Systems: Implications of Requirements of the No Child Left Behind Act of 2001. Dale Carlson produced an elegant paper on school accountability that can be considered a seminal work in the use of both growth and status for school accountability.
- Organizations such as CRESST and CCSSO produced documents and sponsored conferences that shared information directly with states. Between 2002 and 2007 new NCME Board member Ellen Forte and former Wisconsin state assessment and accountability administrator Bill Erpenbach worked with CCSSO to produce multiple detailed reports for states about accountability systems across the country.
- Work that ultimately became NCME publications, such as the 2003 EM:IP article that I co-authored with Rich Hill on the Reliability of No Child Left Behind Accountability Designs, traveled a long road before being published by NCME, including presentations to state and federal officials at conferences such as the CCSSO Large-Scale Assessment conference, the CRESST conference, AERA, NCME, and foundation- and company-sponsored conferences.
If states were receiving all of this research, why wasn’t it used as Steve would hope?
TACs are convened to provide advice directly to the assessment team (or accountability team) in a department of education and indirectly to the commissioner. The state assessment team is charged with implementing programs to meet the state and federal laws, most often under very narrow constraints and tight timelines. In the current context, there are two documents driving their actions.
Both documents are complex, are the result of compromise, rely heavily on imprecise and ill-defined terms, and contain contradictory statements or requirements. One document is a federal law (i.e., NCLB, ESSA, IDEA) and the other is the Joint Standards (compare the introduction and the standards in the validity chapter for an example). When state assessment teams are backed against the wall and faced with the option of “violating” one of the two documents, the Joint Standards lose 99 times out of 100.
The one time the Joint Standards might win is with regard to the high-stakes use of tests for students, such as promotion or graduation decisions, where there is a high likelihood of lawsuits. In that case, the state moves forward, because not doing so is not an option, but it moves forward following the guidance offered by the AERA position statement on High-Stakes Testing in Pre-K–12 Education. AERA doesn’t simply say don’t do it. They say, if you have to do it, this is what must be done.
If TAC members were not advising states to simply adhere to the Joint Standards regarding AYP, what were they doing? In most cases, TAC members were trying to help states make the best of a bad situation and do as little harm as possible. In other words, they were saying, if you have to do it, this is what must be done.
In my work with TACs dealing with AYP and, later, teacher evaluation, I saw some of the most creative applications of confidence intervals that I have ever seen, all in the name of trying to help states do what had to be done. Also, remember that AYP was only one of many technical challenges that NCLB presented to states simultaneously, not the least of which was ensuring that all students were assessed fairly through the use of accommodations and alternate assessments.
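For readers who never sat through an AYP meeting, one common pattern went roughly like this: treat a school’s observed proficiency rate as an estimate, and credit the school with meeting its annual objective if a confidence interval around that rate reaches the target. The sketch below is a generic normal-approximation version of my own construction, not any particular state’s published procedure; the one-sided z value, the school sizes, and the 65% objective are illustrative assumptions.

```python
import math

# Generic sketch of an AYP-style confidence-interval decision rule.
# The z value (one-sided ~95%) and the decision rule itself are
# illustrative assumptions, not a specific state's procedure.
def meets_target(n_proficient, n_students, annual_objective, z=1.645):
    """True if the upper bound of a normal-approximation confidence
    interval around the observed proficiency rate reaches the target."""
    p = n_proficient / n_students
    se = math.sqrt(p * (1.0 - p) / n_students)
    return p + z * se >= annual_objective

# Same 55% observed rate, same 65% objective, different school sizes:
print(meets_target(22, 40, 0.65))    # small school: wide interval
print(meets_target(220, 400, 0.65))  # large school: narrow interval
```

In this hypothetical, the small school is credited with meeting the objective while the large school is not, despite identical observed rates, which is exactly the kind of lever that made confidence intervals such a creative (and contested) tool for softening AYP decisions.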
Learning from States
In closing, I would also like to note that some state leaders were quite proficient in dealing with the technical challenges presented by AYP and teacher evaluation. One shining example was the late Mitch Chester. As accountability director in Ohio, he worked within NCLB to develop a balloon-payment accountability system that effectively deferred many problematic AYP decisions until a point in time where real improvement could be measured, or the law would have been reauthorized. Later, as commissioner in Massachusetts, again working within the law, his team effectively nullified the use of test scores for teacher accountability.
So, yes, NCME should be proactive in promoting good measurement, assessment, and accountability at the state and local levels. As has been discussed recently with regard to classroom assessment, however, NCME cannot begin by assuming that states make the decisions they do simply because they don’t know any better.