assessment, accountability, and other important stuff

A Square Peg for a Round Hole


Photo courtesy of rosipaw

Charlie DePascale

We have reached a stalemate.

It has been nearly five years since ESSA and the assessment flexibility it offered to states, particularly at the high school level, became law.  Next week, we celebrate the tenth anniversary of the release and almost immediate and universal adoption of the Common Core State Standards (CCSS) – reflecting the shift in the focus of education reform from the vague notion of state-defined proficiency to nationally agreed upon standards for college- and career-readiness.  As we sit here today, however, states are still struggling to gain approval from the United States Department of Education (USED) for the use of the ACT or SAT as their high school assessment.

The main sticking point appears to be the alignment of those college admissions tests to the breadth and depth of the states’ academic content standards.  USED is not incorrect in its interpretation of the law.  States are not incorrect in either their desire to use the college admissions tests as their state assessment or their inference that the law encourages them to do so – regardless of whether one thinks that the use of the ACT and SAT as a state assessment is advisable.

Unfortunately, this stalemate is as unavoidable as it is unsolvable under the current rules of play. It is also only the latest example of the difficulty of trying to design high school state assessment programs that meet federal assessment requirements.  Like square pegs and round holes, state assessment and high school simply do not fit well together.

The American concept of the comprehensive high school has been structured around students pursuing a variety of pathways to diverse postsecondary destinations.  State assessment has been structured around the concept of all students traveling the same route at the same rate, arriving at a common destination at the same time.  It should not come as a surprise, therefore, that when the irresistible force of high school meets the immovable object that is USED, the result is nothing more than raised temperatures and a lot of wasted energy.

In a new white paper, State Assessment and High School, we review conflicts between the traditional roles of high school and state assessment, examine the current context of state assessment within the federal requirements of ESSA, and share a vision for state assessment in high school that meets the spirit, if not the letter, of the law.  The conclusion is that it is simply time to acknowledge that high school is fundamentally different from elementary and middle school and that the assessment requirements that make perfect sense and function well at grades 3 through 8 don’t work for high school.

Where do we go from here?

It would be easy to continue down the current path. I can attest that one can build a nice career helping states fit square pegs into round holes. If you are really good, you help states convince USED that the state’s round peg is actually the square peg that the federal government is looking for – or at least comparable to it.  And if I’m being totally honest, solving those types of problems for a state can be exhilarating.

But states have more important things to focus on than trying to shape high school assessment programs to meet ill-conceived applications of federal assessment rules and large-scale measurement principles.  The assessment community, too, has much more important matters to attend to if we hope to support innovation and improved student learning:

  • Collecting information on student performance throughout the year.
  • Supporting the use of curriculum-embedded performance assessments.
  • Measuring higher-level thinking, 3-dimensional learning standards, and 21st Century skills.
  • Understanding and measuring the learning process as well as outcomes.
  • Providing timely and useful information to support instruction and learning.

I promise that solving those challenges will be much more fulfilling and exhilarating than fitting a square peg into a round hole.

 

 

 


 

As we begin the 2020s, let’s take a moment to look back at some of the state assessment moments that defined the 2010s.

Charlie DePascale

Last week, the New England Patriots announced their all-decade team for the 2010s.  The Patriots release was quickly followed by local media offering their own selections of all-decade teams for the Celtics, Bruins, and Red Sox (there’s not much to talk about on sports radio these days).

That got me thinking, what tests would I include on my all-decade team of state assessments (seriously, I don’t have much going on these days).  After all, it was quite a decade in state assessment.  As my former colleague and frequent foil Chris Domaleski noted, “It’s hard to describe the enthusiasm and high expectations for assessment in 2010.”  That level of enthusiasm, of course, is impossible to sustain, and the result was a decade of state assessment that Chris aptly described as “an exuberant start, setbacks, and, finally, renewed optimism and clarity about the road ahead.”

So, what state assessments stood out during the roller coaster ride of the 2010s?  Which tests would I include on my all-decade team – state assessment version?

Criteria for Selection

The first step in creating a list like this is to determine a set of criteria for selection. Was there a minimum number of years the test had to be administered?  Did it have to meet the CCSSO Criteria for High Quality Assessments? Should I consider cost, testing time, number of subscores reported, best name or acronym?

What about “measurement-y” factors like validity, reliability, fairness, alignment, utility, and unintended consequences?

In the end, I went with the same simple criterion I use for everything that makes it into this blog: things that interest me and that I can use to make a point.

Other Tests Considered but Not Included

No list such as this would be complete without a brief discussion of other tests that were considered but didn’t quite make the cut.

I will start with the two most obvious omissions: PARCC and Smarter Balanced – the 800-pound gorillas of state assessment in the 2010s.  Although both programs appear in the list indirectly through various states, I decided to save discussion of the programs themselves for my forthcoming Netflix documentary Assessment King: Measurement, Mayhem, and Mystery.

I also decided not to include any of the new assessments being developed under IADA by states such as New Hampshire, Louisiana, and Georgia.  Similarly, I didn’t include the new next generation science assessments under development, such as the Cambium (formerly AIR) cluster-based science assessment, which is attempting assessment innovation on so many levels at once that it makes my head spin.  As the development of those testing programs plays out over the next several years, they can be considered for the 2020s All-Decade Team.

With that out of the way, on to the list.

All-Decade Team

(in chronological order)

 New York (2012)

When people list their biggest fears, things like public speaking, bugs, clowns, or certain politicians being re-elected top the list.  Worst nightmares include showing up naked at work or school, falling off a cliff, or drowning.

For those of us in state assessment, all of those fears rolled into one are realized when one of your assessment items ends up on the front page of the newspaper.  A former colleague still shudders whenever he hears a song from Evita, Cats, or Phantom pop up in his AirPods.  Personally, whenever I am asked to review a test item, the first thought that goes through my mind is  WWJD – what would Jennifer do?

For New York, the nightmare played out through The Pineapple and The Hare; and because everything is bigger on Broadway, the 2012 New York assessment became a national punchline.  Truth be told, if you can get beyond the cannibalism (spoiler alert: they ate the talking pineapple), the passage and items really aren’t bad at all.  As became even clearer as the decade progressed, however, you can lose control of a story in the blink of an eye.

Georgia (2013)

Georgia earned a spot on the team for its ranking on the 2013 NCES study mapping state proficiency standards onto NAEP scales. On each of the grade 4 and 8 reading and mathematics tests examined, Georgia’s proficiency standard was among the lowest of all states and in the Below Basic range on the NAEP scales.  The implication, of course, was that Georgia’s proficiency standards were too low; they have subsequently been raised.

The state of Georgia, however, has been in the news the last couple of weeks for again marching to the beat of its own drum on how to handle the COVID-19 pandemic.  Time will tell whether Georgia was right with respect to its approach to re-opening the state and with respect to its proficiency standards.

I don’t know how things will turn out, but I am pretty confident that there is not a single national approach that is right for handling COVID-19 and that there is not a single proficiency standard that applies equally well to all states and students across the country.

NECAP (2014)

Although the NECAP science tests continued for a few more years, the 2013-2014 administration of the New England Common Assessment Program tests in reading, writing, and mathematics, given in fall 2013, effectively marked the successful end of a grand decade-long experiment in collaboration among states to design, develop, and administer a common assessment.  Having served as an existence proof that a consortium of states could collaboratively manage a common assessment program under the right conditions, Maine, New Hampshire, and Vermont moved on to Smarter Balanced and Rhode Island to PARCC.

Nevada, Montana, and North Dakota (2015)

Along with Montana and North Dakota, Nevada chose to administer the Smarter Balanced assessment using the open source platform developed by Smarter Balanced as part of its Race to the Top Assessment grant.  Let’s just say that things did not go well.

The remainder of Smarter Balanced states opted to use their assessment contractor’s proprietary platform.  PARCC tests also were administered using its contractor’s proprietary platform.

We have a much better understanding now of the importance of having a stable test administration platform. The testing companies that have flourished in this transition to computer-based testing are the companies that have built such platforms.

Trying to design, develop, and implement a new assessment within the limited timeframe of the Race to the Top Assessment grant was a formidable challenge on its own without the additional burden of building an open source platform in that same window.

Massachusetts (2015, 2016)

While deciding whether to adopt the PARCC assessment or develop a new custom assessment program on their own, Massachusetts administered both its own MCAS tests and PARCC tests in spring 2015 and 2016 in what was dubbed a “two-year test drive”  – allowing districts the choice of tests.  For districts choosing PARCC, there was the additional choice of paper-based or computer-based testing.  The combinations of tests and formats across years posed interesting challenges for accomplishing tasks such as determining the comparability of scores across districts for accountability purposes and computing student growth scores.

 Connecticut (2016)

In March 2016, the governor of Connecticut announced that the state was dropping a performance component of the Smarter Balanced test in an effort to reduce testing time.  The first-of-its-kind decision by Connecticut raised questions about what types of changes to an assessment program can be made while continuing to treat it as the same test in terms of the reporting and interpretation of scaled scores and achievement levels, as well as for the use of those scores for accountability purposes.

The governor’s decision also demonstrated how strong the pushback against too much testing time had become. Other states and testing programs have grappled with similar decisions since 2016.

Louisiana (2017)

After administering a paper-based form of PARCC on its own in 2015 and a hybrid PARCC-Louisiana assessment in 2016, in spring 2017 Louisiana administered grade 3-8 English language arts and mathematics tests composed almost entirely of PARCC items – but not the items that PARCC was administering in 2017.  Louisiana was the first state to license PARCC items (and item parameters) so that it could report its results on the PARCC scale, apply PARCC achievement standards, and compare its performance to that of other states.

The Louisiana approach relies heavily on the properties of IRT, and the state’s experience has demonstrated just how complex the licensing process can be and how many decisions must be made and checks performed to ensure that the assumptions of IRT are being met when building comparable test forms from an item pool.
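For readers who want something concrete, here is a minimal sketch of one such check – my own illustration with made-up item parameters, not Louisiana’s actual procedure: if two forms assembled from the same calibrated pool are to be reported on a common scale, their test characteristic curves should be close across the ability range (and ideally their information functions as well).

```python
# A minimal sketch, not any state's actual procedure: compare the test
# characteristic curves (expected raw score as a function of ability) of two
# forms assembled from a pool of items with licensed 3PL parameters.
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected raw score at each theta.

    items: list of (a, b, c) tuples, one per item on the form.
    """
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)

# Hypothetical item parameters for two short forms drawn from the same pool.
theta = np.linspace(-3, 3, 61)
form_x = [(1.1, -0.5, 0.20), (0.9, 0.0, 0.15), (1.3, 0.7, 0.25)]
form_y = [(1.0, -0.4, 0.18), (1.2, 0.1, 0.22), (0.8, 0.6, 0.20)]

# If the forms are to be treated as interchangeable on the reported scale,
# the largest gap between their TCCs should be small across the theta range.
max_gap = float(np.max(np.abs(tcc(theta, form_x) - tcc(theta, form_y))))
print(f"Maximum TCC difference: {max_gap:.3f} raw score points")
```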

NAEP (2017 … 2018)

Although technically not a state assessment, it would be impossible (for me anyway) to compile this all-decade team without including the 2017 NAEP Reading and Mathematics tests.  The 2017 administration was the first computer-based administration of the reading and mathematics tests.  The processing of results stretched into April 2018 and presented considerable challenges for accounting for mode effects while maintaining trend lines – particularly when it cannot be assumed that the mode effect is the same across the entire population of test takers.

North Dakota (2018)

In 2018, North Dakota became the first state to obtain United States Department of Education approval to allow some districts to administer a locally-selected, nationally-recognized high school assessment (e.g., the ACT) in place of the high school state assessment.  As we all know, the requirements in the fine print accompanying the “flexibility” offered in ESSA have proved challenging for USED to interpret and for states to meet.

In addition to North Dakota’s use of the locally-selected, nationally-recognized test option, other states have adopted the ACT or SAT as their high school state assessment.  The use of college admissions tests as a high school state assessment raises long-unasked and unanswered questions, not only about the purpose and use of state assessment in high school, but also about the meaning and purpose of high school itself.

Massachusetts (2019)

I began this list with a controversy surrounding a reading passage and I will end it with a controversy surrounding a reading passage.  In this case, a passage from the Pulitzer Prize-winning book The Underground Railroad and the essay question accompanying it caused an uproar during the 2019 administration of the tenth grade English language arts test – a test used as a high school graduation requirement.  The question was dropped from the test, and after an external analysis, adjustments to policies were made in an attempt to account for effects on the performance of individual students.

We all know that there are topics that, although fair game for the classroom, should not be addressed on state assessments.  Content and sensitivity committees are given guidelines on topics to avoid, such as abortion and contraception, abuse of people or animals, the occult and ghosts, and suicide.  Highly charged topics such as gun control, climate change, and other political issues are also avoided.

However, as we demand that assessments measure students’ critical thinking, problem solving, and other 21st Century skills, and as we strive for instruction to be more culturally relevant and to make the student a more active participant in the education process, I can only conclude that it will become increasingly difficult to assess the knowledge and skills that we want and need to measure through large-scale, on-demand state assessments.

Takeaways

For me, the clear message sent by the assessments highlighted here is that the 2010s were a decade of transition for state assessment.  Meeting the challenges raised by the assessments in this list will provide tremendous opportunities for those charged with developing and administering state assessments in the decade ahead.  If folks are willing to think outside of the box (and the answer bubble) and allow for the messiness of experimentation and innovation, my bet is that we can end the 2020s with excitement for all that has been accomplished.

That leads to my final and most important takeaway from the 2010s.  I think that the biggest mistake of the decade was the Race to the Top Assessment program’s requirement that the consortia design, develop, and implement a fully operational assessment program by the 2014-2015 school year.  That mistake is being repeated now with the ESSA Innovative Assessment Demonstration Authority.

Innovation in assessment needs space, time, and resources for research and for things not to work correctly (never mind perfectly) the first time around – or even the second time around.  The federal government can and must be a critical partner in creating and supporting the environment that will yield the next generation of assessments that we would all like to see.

 

 

A rejoinder


 

Charlie DePascale

“Father, forgive them, for they know not what they do!”

In the spirit of the Easter season, that was my initial reaction when I read what Steve Sireci had to say in his final newsletter column as president of NCME. I support Steve as a born-again public policy advocate.   I agree wholeheartedly with Steve that NCME should be more proactive as an organization in protecting measurement principles,  promoting the proper use of assessment, and particularly in supporting states in their attempts to use educational measurement to improve student learning.

For example, it would have been wonderful to have NCME weigh in when the two happy hoopsters from Chicago were running their 3-card Monte game (aka NCLB waivers) on desperate states. In essence, they offered states a waiver on a requirement that never actually existed (100% proficiency) in exchange for agreeing to a “new and improved” requirement that was mathematically equivalent to the requirement that did actually exist (reduce the percentage of non-Proficient students by 10% annually), all while forcing states to accept the horrific, ill-conceived teacher evaluation requirements that Steve decries.
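For anyone who wants to see the arithmetic behind that “mathematically equivalent” claim, here is a back-of-the-envelope check. It rests on my recollection – an assumption on my part, not a citation – that the waiver option asked states to set targets that cut the percentage of non-proficient students in half over six years, which lands a school almost exactly where NCLB’s existing safe harbor provision already did.

```python
# Back-of-the-envelope check, assuming (my recollection, not a citation) that
# the waiver AMO option asked states to halve the percentage of non-proficient
# students within six years.
start = 40.0                        # hypothetical percent non-proficient
safe_harbor = start * (0.90 ** 6)   # NCLB safe harbor: cut 10% per year -> ~21.3%
waiver_goal = start * 0.5           # waiver option: halve within six years -> 20.0%
print(round(safe_harbor, 1), waiver_goal)
```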

Yet, I came away from the column with an uneasy feeling.  I was uncomfortable with the underlying message that if the states just knew better – if we just did a better job of informing them about the Standards and the research – everything would have been better.  Although at a very high level I agree with the direction NCME is taking, the devil, as they say, is in the details. I decided that a rejoinder would be an appropriate NCME-type approach to point out a few areas of disagreement.

I am not going to comment on Student Growth Percentiles (SGP) in this post.  We can re-engage on the SGP debate at a future NCME conference.

I am also not going to spend much time on Steve’s comment about state assessments under NCLB:

“In my opinion, these early 21st century tests met the goal of demonstrating how well students had mastered specific aspects of curricular goals at a specific point in time.  But as illustrated above, that was not the only goal of NCLB.”

As a member of the MCAS team at the Massachusetts Department of Education, I held a similar view of the English language arts and mathematics tests our content team and contractor developed.

It was only after several lengthy discussions over three years with Jim Popham (sadly at AERA, not NCME, sessions) that I realized that questions about the quality of the English language arts and mathematics tests, and even about the use of those tests to measure schools’ adequate yearly progress (AYP) toward 100% proficiency, were largely irrelevant.  The critical issue was that student performance in English language arts and mathematics was being used as a proxy for the yet-to-be-defined construct of school quality. You can define school quality in terms of student performance in English language arts and mathematics, but you have to be prepared to live with the consequences (pun intended).

In this rejoinder, I do want to focus on two issues Steve raises in the column: technical reports and the role of TAC members during NCLB, the NCLB waivers, ESSA, and beyond.

Technical Reports

Let’s consider this statement regarding Technical Reports:

“Are the criticisms from teachers and others true—that tests have caused undue student anxiety, narrowed the curriculum, and led to even narrower curricula for historically marginalized groups?  Only specific empirical study of testing consequences can answer these questions.  If someone can show me such evidence in a technical manual for a statewide summative assessment, I will buy the authors of that manual a drink, for those authors have fulfilled the goals required in 21st-century validation.”

The content of state assessment Technical Manuals (i.e., what should be included and who is responsible for writing it) has been a hot topic for as long as I have been working with Technical Advisory Committees.  The two people whom I regard as my TAC mentors, George Madaus and Ron Hambleton, had strong feelings about what should be, but most often was not, included in a Technical Manual.

George’s pet peeve was related to validity and evidence that the test was meeting the lofty purposes laid out in the opening paragraphs of most technical manuals (e.g., improving instruction and student learning, reducing achievement gaps). His partial victory in this area was getting Massachusetts to rewrite and tone down its description of what the test could accomplish.  Ron, of course, remains concerned about score reporting: what was included on various reports, whether the information was understood by stakeholders, and how it was used.

The documents that we regard as state assessment Technical Manuals, for the most part, are written by the state’s assessment contractor as part of its contractual obligation to the state.  In various capacities, I have written, edited, and reviewed my fair share of them.  Over time, my view on the content of technical manuals has been somewhat fluid and amorphous, but it has now congealed into the belief that what is needed is the following set of documents:

  • State Assessment Technical Manual (i.e., the current documents) – these documents are pretty much fine the way they are, with the exception of the aforementioned information about reporting and updates needed to reflect the shift to computer-based testing. These documents provide evidence of the technical quality of the test that was developed, administered, and scored, and of the scores that were reported.
  • State Assessment Program Technical Manual – the manuals described above do not provide all of the information required to support a validity argument for the state’s testing program. A separate manual containing additional information and evidence produced by the state and other external sources is needed to properly document the testing program.
  • State Accountability System Technical Manual – a state’s accountability system is separate and distinct from its testing program. It requires its own theory of action, validity argument, and technical manual.  To the extent that an accountability system uses state test scores, it must produce evidence that they, along with all other data included in the system, are appropriate for use in the accountability system.

Without wading into a validity argument, it simply does no practical good to conflate tests, testing programs, and accountability systems.  Furthermore, the test-based information that is used in accountability systems is often what I have referred to as a “fruit loops” version of a test score – something that has been so heavily processed that it bears little resemblance to the original scale score from which it was derived.

Technical Advisory Committees

I feel compelled to address comments Steve made regarding Technical Advisory Committees and their actions, or lack thereof, during NCLB.  First, a disclosure/disclaimer:

  • The overlap between my TAC experiences and Steve’s is minimal. We worked together for a brief time on two TACs, neither of which was a general assessment TAC.  Steve’s impressions may very accurately reflect his TAC experiences.

Regarding the era of NCLB and Adequate Yearly Progress (AYP), Steve wrote:

Over the past 17 years, fascinating explanation, discussion, and debate regarding AYP was published in our NCME journals Journal of Educational Measurement (3 articles) and Educational Measurement:  Issues and Practice (35 articles).  However, these discussions were largely absent from conversations involving education policy makers and state boards of education.

 Test scores were being used for new accountability purposes, and there was little validity evidence to their use for the purposes of determining AYP.  Did such use put state departments of education in violation of the American Educational Research Association, American Council on Education, and National Council on Measurement in Education’s (1999, 2014) Standards for Educational and Psychological Testing?  If so, did we as responsible TAC members inform the states of such violations?

 My impression of these times is we did issue warnings, but they were not strong enough. 

 Later, he added the following regarding the NCLB waiver era and the use of test scores and other indicators derived from test scores for high-stakes teacher accountability.

Specifically, we co-author standards that require evidence for the validity of test score use, and then we stand idly by, collecting our TAC honoraria, while teachers lose their jobs based on test score derivatives that lack validity evidence for teacher evaluation.  Clearly, our actions must change if we are to partner with the education community in using educational tests to help students learn. 

 Steve suggests that research contained in NCME publications and other peer-reviewed journals does not find its way into TAC discussions. Although state assessment teams and policy makers might be absent from NCME conferences, research does reach them in a variety of ways.

  • In the early NCLB era, TACs with which I worked included many highly regarded researchers and measurement specialists, six of whom are past presidents of NCME: Bob Linn, Dale Carlson, Barbara Plake, Laurie Wise, and the aforementioned George Madaus and Ron Hambleton.  These were not shy people.
  • Bob Linn published multiple papers on NCLB that were shared widely at TAC meetings including Accountability Systems: Implications of Requirements of the No Child Left Behind Act of 2001. Dale Carlson produced an elegant paper on school accountability that can be considered a seminal work in the use of both growth and status for school accountability.
  • Organizations such as CRESST and CCSSO produced documents and sponsored conferences that shared information directly with states. Between 2002 and 2007 new NCME Board member Ellen Forte and former Wisconsin state assessment and accountability administrator Bill Erpenbach worked with CCSSO to produce multiple detailed reports for states about accountability systems across the country.
  • Work that ultimately became NCME publications, such as the 2003 EM:IP article that I co-authored with Rich Hill on the Reliability of No Child Left Behind Accountability Designs, traveled a long road before being published by NCME, including presentations to state and federal officials at conferences such as the CCSSO Large-Scale Assessment conference, the CRESST conference, AERA, NCME, and foundation- and company-sponsored conferences.

If states were receiving all of this research, why wasn’t it used as Steve would hope?

TACs are convened to provide advice directly to the assessment team (or accountability team) in a department of education and indirectly to the commissioner. The state assessment team is charged with implementing programs to meet the state and federal laws, most often under very narrow constraints and tight timelines.  In the current context, there are two documents driving their actions.

Both documents are complex, are the result of compromise, rely heavily on imprecise and ill-defined terms, and contain contradictory statements or requirements.  One document is a federal law (e.g., NCLB, ESSA, IDEA) and the other is the Joint Standards (compare the introduction and the standards in the validity chapter for an example). When state assessment teams are backed against the wall and faced with the option of “violating” one of the two documents, the Joint Standards lose 99 times out of 100.

The one time the Joint Standards might win is with regard to the high-stakes use of tests for students, such as promotion or graduation decisions, where there is a high likelihood of lawsuits.  In that case, the state moves forward, because not doing so is not an option, but it moves forward following the guidance offered by the AERA position statement on High-Stakes Testing in Pre-K – 12 Education. AERA doesn’t simply say don’t do it.  They say, if you have to do it, this is what must be done.

If TAC members were not advising states to simply adhere to the Joint Standards regarding AYP, what were they doing?  In most cases, TAC members were trying to help states make the best of a bad situation and do as little harm as possible.  In other words, they were saying, if you have to do it, this is what must be done.

In my work with TACs dealing with AYP and, later, teacher evaluation, I saw some of the most creative applications of confidence intervals that I have ever seen, all in the name of trying to help states do what had to be done. Also, remember that AYP was only one of many technical challenges that NCLB presented to states simultaneously, not the least of which was ensuring that all students were assessed fairly through the use of accommodations and alternate assessments.
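For readers who never sat through an AYP appeals season, here is a minimal sketch of the most common of those devices – a generic illustration of mine, not any particular state’s rule, with made-up counts and targets: apply a confidence interval to a school’s or subgroup’s observed percent proficient before comparing it to the annual target, so that small groups are not failed on sampling noise alone.

```python
# A minimal sketch of a confidence-interval adjustment to an AYP decision.
# Generic illustration only; states varied in the interval width they used.
import math

def meets_target_with_ci(n_proficient, n_tested, target, z=1.96):
    """Return True if the AYP target falls at or below the upper bound of a
    normal-approximation confidence interval around the observed proportion.

    n_proficient, n_tested: counts for the school or subgroup.
    target: required proportion proficient (e.g., 0.60 for a 60% target).
    z: 1.96 for a 95% interval; some states used wider intervals.
    """
    p = n_proficient / n_tested
    se = math.sqrt(p * (1.0 - p) / n_tested)
    return p + z * se >= target

# Example: 52 of 100 students proficient against a 60% target.
# The point estimate misses the target, but the interval reaches it.
print(meets_target_with_ci(52, 100, 0.60))   # True
```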

Learning from States

In closing, I would also like to note that some state leaders were quite proficient in dealing with the technical challenges presented by AYP and teacher evaluation.  One shining example was the late Mitch Chester. As accountability director in Ohio, he worked within NCLB to develop a balloon-payment accountability system that effectively deferred many problematic AYP decisions until a point in time when real improvement could be measured or the law would have been reauthorized.  Later, as commissioner in Massachusetts, again working within the law, his team effectively nullified the use of test scores for teacher accountability.

So, yes, NCME should be proactive in promoting good measurement, assessment, and accountability at the state and local levels.  As has been discussed recently with regard to classroom assessment, however, NCME cannot begin by assuming that states make the decisions they do simply because they don’t know any better.

 

The K-12 testing industry survived, even flourished, during past economic downturns. There are signs, however, that this time might be different.


 

 Charlie DePascale

There have been two major economic downturns in the past twenty years: the bursting of the Dot-Com bubble in the early 2000s and the Great Recession of 2008.  Much like the proverbial cockroach in a nuclear winter, the K-12 standardized testing industry emerged from each not only unscathed, but in fact, strengthened.

In the early 2000s, the savior was No Child Left Behind (NCLB).  When Assistant Secretary of Education Susan Neuman declared, Let them use multiple-choice tests, it was off to the races.  Testing companies were the cock of the walk, strutting their stuff, spreading feathers that had been ruffled during the Psychometric Spring of the 1990s, and exulting in the good fortune brought by the ‘No Psychometrician Left Behind Act’ or the ‘Psychometricians Full Employment Act’ that mandated annual testing of all students in grades 3 through 8 and once in high school.

In the latter part of that decade, it was President Obama to the rescue with NCLB waivers and the Race to the Top Assessment program.  Fears that the new administration would reduce required testing proved to be unfounded. Testing companies continued to feed at the federal trough and roll around happy as a pig in an internal alignment study. Like pigs being fattened for slaughter, however, they never saw what was coming.

The structure laid out in the Obama blueprint for reform contained heavy pillars that strained the sandy foundation of K-12 standardized testing.  To a greater degree than might be apparent, the success of standardized testing in schools relies on a longstanding social contract with local educators and communities.  As long as the tests do not take too much time and have no real consequences, everything is fine. NCLB might have pushed the boundaries of that contract, but the requirements that the new assessments measure college-and-career readiness standards and be used to measure teacher effectiveness tore the social contract to shreds.

The Every Student Succeeds Act (ESSA) removed the teacher evaluation requirements of the NCLB waivers, and states have made repeated concessions on the length of tests, but a social contract once violated is difficult to restore.

Accountability and Equity

K-12 standardized testing, of course, relies on more than a social contract with local educators.  The dual principles of accountability and equity have been the driving force behind federally mandated assessment since the early days of the original Elementary and Secondary Education Act (ESEA) passed in 1965.  Sen. Robert Kennedy pushed for the evaluation of Title I programs funded by ESEA as a tool for providing parents with evidence that federal money was being spent wisely to improve student outcomes – a theme echoed by Secretary of Education Arne Duncan throughout the Obama administration.  From the very beginning, standardized tests became the tool of choice to provide that accountability.

The close link between accountability and equity resulted in standardized testing receiving bipartisan support. NCLB, with its assessment requirements, was championed by President George W. Bush and Sen. Edward Kennedy and was perhaps the last truly bipartisan bill passed by Congress.  There are clear signs, however, that standardized testing is no longer viewed as a tool to promote equity and, to the contrary, is perceived as a threat to equity.

Wearing the Black Hat

In response to a question posed at a candidates’ forum last December, presumptive Democratic nominee Joe Biden seemingly declared that he would end standardized testing in public schools.

“Given that standardized testing is rooted in a history of racism and eugenics,” the audience member asked, “if you are elected president, will you commit to ending the use of standardized testing in public schools?”

“Yes,” Biden responded. “As one of my friends and black pastors I spend a lot of time with . . . would say, you’re preaching to the choir, kid.”

The premise of the question may seem extreme, but it is becoming quite mainstream.

  • In her 2019 presidential address, AERA president Amy Stuart Wells described testing policies as the Jim Crow of education.
  • Last spring, an uproar over a passage and prompt on the Massachusetts 10th grade English language arts test led to the item being dropped and rules for meeting graduation requirements loosened in an “abundance of caution” over concerns for stereotype threat.
  • Late in 2019, a lawsuit was brought against the University of California system regarding the use of the SAT and ACT in college admissions, not necessarily because the tests are flawed technically, but because the use of the tests “privilege affluent families who can afford to send their children to tutoring,” and “illegally discriminates against applicants on the basis of their race, wealth and disability.”
  • This weekend, the National Council on Measurement in Education, faced with offering only one virtual session from the 2020 NCME Conference program, chose a session with the following description: “If learners vary in their cultural experiences, appreciations, characteristics, and needs, all aspects of culturally relevant pedagogy may need to reflect such variations. However, educational assessments have been most resistant.”

Standardized tests have been challenged for decades over technical concerns related to test bias.  The current challenges, however, go beyond technical issues that can be easily fixed through adjustments to the test development or scoring process.

Although the loss of trust among educators and concerns over equity may be serious threats to the continued viability of standardized testing, the greatest threat may come from within the field itself.

Necessary, maybe, but not sufficient

While the panelists in the NCME session described above did not spend much time addressing the cultural relevance of assessments, they did spend a great deal of time discussing the need for assessments to measure more and different things than the knowledge and skills that are currently being measured effectively through traditional on-demand, large-scale assessments.

The past five years have demonstrated clearly that the English language arts and mathematics college-and-career readiness standards that were a byproduct of state testing during NCLB require different kinds of assessments than traditional standardized tests.  An NRC committee on designing assessments for the Next Generation Science Standards concluded that large-scale standardized tests were only part of a comprehensive assessment solution.  A handful of states across the country, including standardized testing stalwarts such as Louisiana and Massachusetts, are exploring alternative methods of assessment under the Innovative Assessment Demonstration Authority provision of ESSA.

This Time May Be Different

Although a President Biden may not fully retreat from the testing policies he implemented as vice-president, this time may be different.  As the COVID-19 crisis shrinks state budgets and schools refocus on what is really essential, a return to fully embracing a standardized testing system that was meeting few stakeholder needs would seem unlikely.  Standardized testing may remain part of the evaluation process for years to come, but it will likely be a much smaller part than it has been in the past; and traditional testing companies should not expect the boom time they experienced in 2002 and 2010.

Trees

The loss of state assessment results in the wake of COVID-19 does not have to mean a loss of information about student proficiency

Charlie DePascale

Given that the COVID-19 pandemic is affecting nearly all aspects of our lives, it is not a surprise that it has brought a critical nationwide, federally mandated data collection effort to a halt.  I am not referring to Census 2020, which was forced to suspend all of its field operations.  Nor am I referring to the IRS and Tax Day, which has been moved from April 15 to July 15.  No, the nationwide data collection effort to which I am referring is the annual administration of state assessments to millions of public school students in grades 3 through 8 and high school.

With school closures affecting more than 55 million students across the country and nearly all states obtaining testing waivers, it is nearly a certainty that there will not be and should not be state testing this spring.  The cancellation of testing causes a significant hardship for assessment contractors and, more importantly, deprives states of information used to inform policy – a condition that, if we believe in the reasons for testing, ultimately is harmful to students.

The “good news” is that data that would have been collected through state testing is not lost.  Like the data that is collected through the census or tax filings, data on student proficiency on state standards in the 2019-2020 school year is still there waiting to be collected.  We may just have to adjust our thinking on what data we are collecting and how we are collecting it.

State Assessment is a Data Collection Effort

First and foremost, we have to recognize that state assessment is at its core a data collection effort.  Because the current solution includes an assessment, we have fallen into the trap of viewing the task of collecting data on student proficiency from a measurement perspective and treating all challenges to the process as measurement problems.  The fundamental task, however, similar to the census, is to produce an accurate count of the number of students in the state who are meeting state achievement standards.  The task is not to measure student proficiency.

It can certainly be argued that at one time the most accurate and efficient way to collect the desired information was through an assessment administered to students statewide.  That solution, however, became less desirable over time as state content standards became more complex, state achievement standards became more rigorous, requirements to include all students became more rigid, and the consequences associated with the results of the assessment increased (see Campbell’s Law).

The current model of state assessment is fast becoming an anachronism – perhaps not as much of an anachronism as annual tax filings, but more of one than the census, if for no other reason than the annual frequency of state assessment.

States have known since at least the 1990s that an on-demand test composed primarily of selected-response items was insufficient to fully measure student proficiency, but for 25 years that remained the most feasible and efficient solution.  The PARCC assessment, however, was likely the field’s gallant last gasp at developing an on-demand state assessment to measure college-and-career readiness standards.

Moving forward, state assessment will still be at least a component of the best available solution to compile accurate information about student proficiency, but assessment is not the only solution.

There are Proficient Students Even if there Is No Assessment

There may be doubt about whether a tree falling in the forest makes a sound if no one is around to hear it, but there is no such doubt about student proficiency.

After accepting that the task is to count, not measure, we must recognize that students are proficient (or not) in English language arts, mathematics, science, and a host of other areas regardless of whether we administer an assessment.

Over time, the belief became ingrained that we need a state assessment to determine whether a student is proficient.  The state assessment and its items defined the meaning of loosely worded state content standards.  Achievement level descriptors were most often developed for the assessment rather than the content standards; and were used in conjunction with the unfortunately named process of standard setting to define the state’s achievement standards. In short, the state assessment system and student proficiency became a closed system.

Federal policy that decreed performance on state assessment as the gold standard for student proficiency and elevated alignment to state content standards as the most important evidence in the validation of state assessment programs only helped to keep the system closed.

The fact remains, however, that students acquire proficiency in English language arts and mathematics through curriculum and instruction aligned to state content and achievement standards.  That proficiency builds over the course of the school year and resides within the student, not within the test, when she or he sits down in the spring to take the state assessment.

Our actions as assessment professionals, policy makers, and educators suggest that we have forgotten the principle that the purpose of assessment is not to define a construct such as proficiency in English language arts and mathematics, but rather to provide us with information that helps us accurately and consistently distinguish among students at various places along the proficiency continuum.

Teachers Should Be the Best Judges of Student Proficiency

If we accept that proficiency exists outside of the assessment, then it follows logically that the best judge of a student’s proficiency should be the teacher who a) has deep knowledge of the state content and achievement standards and b) has been instructing that student for seven months with curriculum, instruction, and formative assessment practices aligned to those standards.  Setting aside for now debate about the extent to which those two conditions are met in classrooms across the country, nobody is in a better position than the student’s teacher to make an informed judgment about student proficiency.

There are, of course, many reasons why states do not and should not rely on teachers’ judgments alone when collecting information about student proficiency for school accountability.  Concerns about self-reporting of results for accountability purposes are real, as is the fact that one of the primary things that we are measuring or evaluating through school accountability systems is the extent to which there is alignment between the state’s and local educators’ understanding of the state content and achievement standards.

The current situation, however, presents an opportunity to collect those teacher judgments with minimal risk.  First, accountability waivers will eliminate high-stakes uses that might bias judgments.  Second, most states have school- and student-level data from previous years against which to monitor these judgments.

The next critical question is whether enough instruction has taken place to enable teachers to make the necessary judgments.  The answer to that question is unequivocally yes.  If state testing had already started or was about to start within the next month, teachers have sufficient evidence to make an informed judgment of student proficiency.  I would argue that teacher judgments about student proficiency at the time of school closures are a more accurate reflection of the level of proficiency a student acquired during the 2019-2020 school year than an assessment administered when school resumes either this year or next year.  There will be other reasons for measuring student performance at that time.

So, if teachers have the data that states need, is there a feasible way for the state to collect it?

Collecting Data from Teachers on 2019-2020 Student Proficiency

With relatively minor adjustments, it should be possible to use the same infrastructure already in place to administer state assessments to collect teacher judgments of student proficiency.  Given that testing was about to begin, we can assume that student registration lists had already been prepared to sign students into computer-based tests and that procedures were in place to provide access to teacher test administrators as well.  States or assessment contractors may not have access to information needed to assign individual students to specific teachers, but that is a minor inconvenience.

Preparing online resources, instructions, and a form for teachers to enter ratings of student proficiency would not be a heavy lift, certainly not in comparison to scoring, processing, and equating tests.  States and their contractors can decide what judgment they would like teachers to make.

Using the NAEP achievement level categories of Below Basic, Basic, Proficient, and Advanced as an example, a state might ask teachers to assign students to one of the four achievement levels or simply to indicate whether the student’s level of proficiency was at the Proficient level or above (i.e., Proficient or Advanced).  In activities conducted in conjunction with standard setting for a state assessment, we have asked teachers to designate students’ proficiency as Low, Medium, or High within one of the four achievement levels (a total of 12 possible classifications).  My personal preference, however, is to allow teachers to use borderline categories, as laid out in the sketch below, for a total of 7 possible classifications: Below Basic, Borderline Below Basic/Basic, Basic, Borderline Basic/Proficient, Proficient, Borderline Proficient/Advanced, Advanced.
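To make that concrete, here is a minimal sketch of how a state might structure the collection and a first summary of those ratings. The category labels follow the scheme above; everything else (the function name, the decision to count Borderline Proficient/Advanced as Proficient or above) is my own illustration, and that last point is itself a policy call a state would have to make.

```python
# A minimal sketch of collecting and tallying teacher judgments using the
# seven-category scheme described above. Illustration only; the handling of
# borderline categories is a policy decision, not a technical one.
from collections import Counter

CATEGORIES = [
    "Below Basic",
    "Borderline Below Basic/Basic",
    "Basic",
    "Borderline Basic/Proficient",
    "Proficient",
    "Borderline Proficient/Advanced",
    "Advanced",
]
# Hypothetical policy choice: count the top borderline category as Proficient.
PROFICIENT_OR_ABOVE = {"Proficient", "Borderline Proficient/Advanced", "Advanced"}

def percent_proficient(ratings):
    """ratings: iterable of category labels, one teacher judgment per student."""
    counts = Counter(ratings)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"Unexpected categories: {unknown}")
    total = sum(counts.values())
    return 100.0 * sum(counts[c] for c in PROFICIENT_OR_ABOVE) / total

# Example with made-up ratings for a small class.
ratings = ["Basic", "Proficient", "Borderline Proficient/Advanced", "Advanced",
           "Borderline Basic/Proficient", "Proficient"]
print(f"{percent_proficient(ratings):.0f}% judged Proficient or above")
```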

Will the results of the teacher judgment process be totally accurate, complete, or interchangeable with assessment results?  Probably not, but that’s OK.  They can still become useful information to support the school improvement process.

 More Than A Stopgap

If I viewed the collection of teacher judgments only as a one-time stopgap to make the best of the 2019-2020 school year, I might hesitate to suggest it.  It is a fact, however, that if we have any hope for education reform and school improvement efforts to be successful, we need teachers to understand what proficiency is and to be able to classify student performance along the proficiency continuum.

One of the big unanswered questions when state assessment results are released each year is whether those results are consistent with the way that local administrators and teachers perceive their students’ performance.

It is also a fact that we are not going to be able to continue to use on-demand, large-scale assessments to measure the complex knowledge, skills, and abilities that we want students to acquire. It is inevitable and desirable that in the near future states are going to have to rely on teacher judgment of student performance as a key part of the information they collect from schools each year.

Given the conditions in a particular state, it might be foolish for state assessment leaders to consider any type of data collection in the coming weeks or months.  However, if a state is seeking a way to recover data lost from cancelling testing in 2019-2020, we have a unique opportunity to begin to take the first step toward collecting that information.  We might as well use it.

A Useless Test Bias Argument

Charlie DePascale


Photo by Suzy Hazelwood on Pexels.com

 

“Criticizing test results for reflecting these inequities is like blaming a thermometer for global warming.”

That was the viral moment from the recent NCME statement on admissions testing. The line clearly was intended to go viral and it did go viral; well, as viral as any technical defense of standardized testing can go – quoted and retweeted tens of times.

I like a glib “test as thermometer” quip as much as the next psychometrician and I have enjoyed the various versions of this one that have been used in the context of college admissions testing.  There was something about the line and the statement, in general, however, that just didn’t feel right.

NCME framed the statement as “highlighting the critical distinctions between group score differences and test bias.” Along with an obligatory quote from the Standards and an academic reference to correlation and causality, the test-as-thermometer equivalence appears to be drawing a clear distinction between test scores and test use.  Test scores, it appears, can reflect real differences between groups without the tests being biased.  This separation of test scores from test use is something that we have not seen in the organization’s arguments on validity.  As NCME president Steve Sireci has written, “To ignore test use in defining validity is tantamount to defining validity for ‘useless’ tests.”  Does the same argument apply to test bias?

When the tests in question are college admissions tests, their primary intended use is fairly explicit. One can assume that a claim that the tests are biased refers at least as much to their use in college admissions as to a technical claim about the accuracy of the scores.  To dismiss this claim with a technical lesson on misconceptions about test scores comes across as defensive at best, tone deaf, and somewhat self-serving.

NCME could have chosen to focus their response on this portion of their quote from the Standards: “group differences in testing outcomes should trigger heightened scrutiny for possible sources of test bias…”

  • They could have discussed whether the construct being assessed by the college admissions tests is academic achievement (in English language arts and mathematics) or college readiness. If the former, then we are back to the question of whether the focus on the accuracy of the group differences is tantamount to defining bias for useless tests.
  • They could have discussed differential validity and the importance of establishing that the relationship between English language arts and mathematics achievement and college readiness (or success in college) is the same for students whose low performance is “caused by disparities in educational opportunities” as it is for other students (a rough sketch of that kind of differential prediction check follows this list).
  • They could have discussed the role that test scores play in the “proper use and interpretation of all data associated with college readiness” and explained how limited or extensive that role should be given what the field knows about college admissions tests and test scores – particularly with respect to the performance of the subgroups of students in question.
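On the second point, here is a minimal sketch of what such a differential prediction check might look like – my generic illustration, not NCME’s or any testing program’s analysis. The variable names (an admissions test score, a first-year outcome, a group indicator) are placeholders for whatever data a program actually has.

```python
# A minimal sketch of a differential prediction check: fit a regression of a
# college outcome on an admissions test score, allowing the intercept and
# slope to differ by group, and inspect the group and interaction terms.
# Generic illustration; variable names are placeholders.
import numpy as np

def differential_prediction(score, outcome, group):
    """Fit outcome ~ score + group + score*group by ordinary least squares.

    score, outcome: 1-D numeric arrays; group: 1-D array of 0/1 indicators.
    Returns the coefficients [intercept, b_score, b_group, b_interaction].
    A sizable group coefficient suggests systematic over- or under-prediction
    for one group; a sizable interaction suggests the score-outcome slope
    differs by group. Either pattern is the kind of evidence the Standards
    ask us to examine, beyond simple mean score differences.
    """
    score = np.asarray(score, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    group = np.asarray(group, dtype=float)
    X = np.column_stack([np.ones_like(score), score, group, score * group])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta
```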

Instead, NCME chose to offer a heavily nuanced defense of college admissions tests and test scores.  I have to wonder who they see as the primary audience for this statement.

 

I’m With The Band

 

 

Harvard University Band

 

Charlie DePascale ’81

This weekend the Harvard University Band celebrates its 100th anniversary.  Along with meeting my wife, my time in the band remains one of the two happiest memories of my four years at Harvard. Actually, my memories of the band begin with the end of my junior year of high school.

It was the summer of 1976, the Bicentennial Year, and a high school classmate told me about this Summer Pops band at Harvard: anybody can join, they rehearse one night a week, and there are two concerts at the end of the summer – one in Harvard Yard and one at the Hatch Shell, where the Boston Pops play. OK, sign me up.

During that summer, on stage at Sanders Theater with a couple of hundred people of all ages and musical ability, I had my first interactions with Tom Everett, director of Harvard bands.  Until that summer, I had no intention of applying to Harvard.  Harvard was for other people.  But during that summer with Tom, I remember thinking, hey, if this is what Harvard people are like, I could spend four years here.  So, I applied, was accepted, joined the band and the wind ensemble, and quickly learned that there were no other people at Harvard like Tom.

Despite that, my decision to attend Harvard was a net positive (did I mention meeting my wife), and my experience with the band was definitely positive. In my short time with the band, I enjoyed performing at the Kennedy Center, traveling to New York City, Washington, DC, and Montreal, performing a song conducted by the legendary Arthur Fiedler, playing for Jackie Onassis, and on one magical December night witnessing the beginning of a major collegiate point-shaving scandal and fulfilling my childhood dream of playing Amen in the Holy Cross basketball band.  Dare to dream!

And then there are the lessons learned that extended well beyond my years in Cambridge.

First, there are a few practical takeaways:

  • A wool jacket can absorb several times its weight in rainwater and still be fine the following week – a clarinet, not so much.
  • If you’re tired enough, you can sleep anywhere – on the cement floor of a game room in Ithaca, in an end zone at Princeton, sharing a sofa bed with a virtual stranger in an apartment in Montreal, or on a bandmate’s shoulder during a long, late-night bus ride.
  • At least one time in their life, everyone should experience walking through a dark tunnel into a sunlit stadium to hear and literally feel the roar of 60,000 cheering people.

And then there are the larger life lessons that have served me well throughout my career.

  • Illegitimum non carborundum – Enough said.
  • Lines (1) – Sometimes when the gun sounds and you are jumping, or scrambling, from one formation to the next, you end up on the wrong 45-yard line (they all look alike, you know). When that happens, just fall into line with the trumpet section, play the song, and rejoin the clarinets for the next formation.
  • Lines (2) – Everything and everyone is fair game for the halftime humor of Harvard Band – even the band itself. However, there are times when you know you are crossing a line that shouldn’t be crossed – for me, it was the formation that paired Ted Kennedy with a popular Bee Gees song. Don’t shy away from the line, but try to stay on the right side.
  • “The Game” Syndrome – Every week, the halftime show had to fit into a tight window. When our time was up, we were off the field – this wasn’t American Pie (reference is to the 1971 song; we can discuss the resemblance of the HUB to the early 2000s movie franchise at another time). That limit was a good match for our practice of rehearsing the show for the first time the morning of the game. The Harvard-Yale game, however, had a longer halftime, which provided a few extra minutes for an extended halftime show. Of course, the temptation to turn our show into a Super Bowl-worthy extravaganza was too great to resist – often with the same result as recent Super Bowl halftime shows.  Forty years later, there are still nights when my dreams are haunted by giant royal stick figures trying to “walk” across the field.  Dream big, but know your limitations.
  • A Dedicated Core – Every volunteer organization, whether it is a college band, a town Democratic committee, a regional educational research organization, or a national professional association, cannot function without a dedicated core of passionate people who are willing to devote way too much of their own time to doing all of the big and little things that must be done so that everything runs smoothly when the rest of us just show up. Treasure those people.
  • Leader of the Band – With the right person leading them, a group of 200 community members, or 150 Harvard students looking to have fun, or 50 student musicians grateful for one more opportunity to keep playing can each make such beautiful music. It takes a special person to know how to pick the right music, create the right environment, and effectively structure a limited amount of rehearsal time to get the most out of each of those groups and individuals; teaching and gently moving them in the right direction with humor, skill, grace, and a wealth of knowledge, skills, and experience.  Thanks, Tom.

So yes, I’m with the band and the band will forever be a part of me.

Happy Anniversary HUB!  Here’s to the next 100 years.