assessment, accountability, and other important stuff


This Is My Fight Song

Arizona, Connecticut, and a fuchsia wristband

Charlie DePascale

 

Last weekend I attended a concert in Boston with my daughter, an opportunity that has become rarer and more appreciated since she left for college four years ago.  We arrived early and waited in line to hear Rachel Platten perform her breakout hit Fight Song in a pre-show soundcheck.  As we waited, a gentleman emerged from the nightclub and made his way down the line, attaching fuchsia bands to the right wrists of everyone 21 years or older.  He proceeded past the group of high school girls ahead of us without a pause or a word, walked up to me, and began attaching the wristband.  As he was leaving, my daughter, who looks much younger than her age of 22, held out her driver’s license.  He examined the ID intently for a good 30 seconds before returning it to her and attaching her wristband.

Why share this story?  How does it relate to Arizona, Connecticut, and the general subject of this blog?

At its core, this is a story about testing, and more specifically, testing for a particular purpose.

The man with the wristbands did not bother to test anyone who was too young (i.e., the high school girls ahead of us in line).  He also did not test anyone who had clearly been older than 21 for a very long time (i.e., me).  He did, however, spend a significant amount of time examining my daughter’s ID.  His test had a particular purpose – to identify people 21 years or older.  He also had clear directions on which type of error was more important.  He had little concern about not giving a wristband to someone who was actually 21 or older.  He was very concerned about giving a wristband to anyone under 21.
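For readers who prefer to see the logic spelled out, here is a minimal sketch in Python of the decision rule at work in that line.  It is purely illustrative; the thresholds, the function name, and the numbers are mine, not anything the man with the wristbands ever articulated.  The point is the asymmetry: skip the test when the answer is obvious in either direction, require external verification in the ambiguous middle, and treat any doubt as a reason to withhold the wristband.

    # Illustrative sketch only; thresholds and names are hypothetical.
    LEGAL_AGE = 21

    def wristband_decision(apparent_age, verified_age=None):
        """Return True only when a wristband should be attached."""
        if apparent_age < LEGAL_AGE - 3:        # clearly too young: no test needed
            return False
        if apparent_age >= LEGAL_AGE + 10:      # clearly old enough: no test needed
            return True
        # Ambiguous cases require external verification (a state-issued ID);
        # a self-report or a missing ID is treated as "under 21."
        return verified_age is not None and verified_age >= LEGAL_AGE

    print(wristband_decision(17))                    # False -- never checked
    print(wristband_decision(50))                    # True  -- never checked
    print(wristband_decision(22, verified_age=22))   # True  -- only after the ID check

The last line of the function carries the policy: no wristband is ever attached on a self-report alone, because the costly error runs in only one direction.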

He did not check the ID of everyone standing in the line.  He did not give out different colored wristbands to those of us well above 21.  He did not ask those of us over 21 whether we intended to purchase alcohol during the show; he simply attached the wristband.  He also did not tell us anything that we did not already know about our own ages.  He was collecting data from us, not providing data to us.   However, he did not simply accept a self-report of how old we were; he required external verification (i.e., a state-issued photo ID).  When asked, he made it clear that he did not have any information about when the doors would open, when Rachel Platten would actually take the stage, or when the show would end.  He had a specific job to perform and he performed it efficiently.

Nobody in the line expected anything more or anything less from the man with the wristbands.  Why is it so difficult for us to view large-scale state assessment in the same way?

This is My Fight Song!

As I have addressed in several posts over the last year and in papers and presentations over the last decade, state assessments are best designed to serve a particular and limited purpose – to provide information to the state about the percentage of students meeting the state’s performance standards or perhaps to indicate whether an individual student has met those standards.  Well-designed assessments can perform those functions accurately and efficiently – in terms of both time and cost.  State assessments begin to break down, however, when we ask them to do more than they are designed to do.  Complaints that tests take too much time, cost too much, or return no useful information to inform instruction and improve teaching begin when we ask state assessments to do more than they are best suited to do.

For the last two decades, however, at every opportunity we have asked state assessments to do more.

  • To serve as a signal or model for good assessment and instruction in the classroom
  • To enable the measurement of student growth across years
  • To accurately and precisely measure the full range of student performance within a grade level
  • To measure the full depth and breadth of increasingly complex state content standards
  • To serve as an indicator of effective teaching

However, as Rachel said last Saturday while introducing Fight Song, “I hope that you understand, it is never too late to make something happen.”  So, this blog is my fight song.  It is my platform to continue to describe my vision of The Ideal Role of Large-Scale Testing in a Comprehensive Assessment System, as I have done for the last fifteen years, because as the song says,

And I don’t really care if nobody else believes.

 ‘Cause I’ve still got a lot of fight left in me.

And perhaps there is a glimmer of hope.  Recently, Arizona and Connecticut announced changes to increase the efficiency of their state assessment programs, suggesting that there are at least two states moving state assessment closer to our image of the man with the fuchsia wristbands.  A closer look reveals, however, an important distinction between the two states’ views of the role and purpose of state assessment.

Arizona and Connecticut – so much the same, yet so different

In the last couple of weeks, state assessment programs in Connecticut and Arizona were in the news because of changes designed to reduce testing time, eliminate redundant or duplicative testing, and reduce costs.  In Connecticut, Governor Malloy announced that the state would no longer administer the performance task portion of the Smarter Balanced English Language Arts/Literacy assessment.  In Arizona, Governor Ducey signed legislation allowing school districts to replace the state assessment with another assessment selected from a menu of state-approved alternatives.  Both states cited a similar set of reasons for and benefits of the changes: reducing testing time, saving money, eliminating redundant or duplicative testing, and, my personal favorite, allowing more instructional time for teachers and students.

On the surface, it may appear that Arizona has made the much more radical change.  Connecticut simply eliminated one portion of their English language arts assessment.  Arizona is allowing districts to completely abandon their state assessment in favor of other assessments.  Although it is certainly bold, the Arizona move to a menu of assessments may be much more in line with the fundamental purpose of state assessment than Connecticut’s decision to shorten its test.

In announcing the change in Connecticut, Governor Malloy emphasized that the essay portion of the test was duplicative of in-class work.  It did not provide teachers with information that they did not already have and therefore could be eliminated.  In this rationale, the focus of the state assessment is on providing information to schools and teachers to improve teaching quality.

In contrast, the change in Arizona requires that any alternative test “adopted in Arizona will have to maintain or exceed the rigor in Arizona’s College and Career Ready Standards,” and it appears that the alternatives will still have to be external tests.  There is discussion of high school students who have already clearly demonstrated the desired level of proficiency through their performance on recognized tests such as the ACT, PSAT, or the Cambridge exams.  In this scenario, the focus of the state assessment remains on providing an external check, an audit, of the information that is generated within the school.

I would argue that the intent of the change to state assessment in Arizona is much more consistent with the ideal role of state assessment.  Yes, it will be a challenge to ensure that the alternative tests are aligned with the state’s content standards and maintain the rigor of the state’s performance standards.  Yes, it will be more complicated to compare performance across school districts with multiple assessments than it would be with a common state assessment.  Yes, there will be opportunities for inadvertent errors or intentional attempts to game the system with less rigorous assessments.  Addressing all of those challenges is worth the effort, however, because Arizona is attempting to find a more efficient way to provide the state and schools with the information that should be provided by the state assessment – a credible, external indicator of student achievement.

The move by Connecticut to eliminate the performance task from the Smarter Balanced English Language Arts/Literacy test presents fewer obvious challenges, but comes with much greater risk of perpetuating one of the biggest flaws of traditional standardized assessment programs – sacrificing validity in the name of reliability and increased efficiency.  It is not written in stone that changing the Smarter Balanced test blueprint will produce scores that are less valid, useful, or predictive of future performance.  It is not a foregone conclusion that eliminating the authentic writing portion of the assessment will have a negative impact on instruction in the classroom as teachers see what is valued and what is not valued on the state assessment.  A long history, however, tells us that simply making the high-stakes accountability test shorter and more efficient is not the answer.  I trust that policymakers in Connecticut intend to ensure that the full range of the state’s academic standards continues to be taught well in schools across the state.  Like their colleagues in Arizona, however, they will have to realize that announcing this change is just the first step in a process that is filled with risks and challenges to be overcome.

Like a small boat
On the ocean
Sending big waves
Into motion
Like how a single word
Can make a heart open
I might only have one match
But I can make an explosion

It’s Déjà vu All Over Again

Charlie DePascale

 

The definition of insanity is doing the same thing over and over again, but expecting different results. (source unknown)

 Those who cannot remember the past are condemned to repeat it. (Santayana)

Newman!  (Seinfeld)

 

This is one of those times when there are so many quotes that describe the situation so well that it is impossible to select just one.

We are at a crossroads in educational assessment, where the decisions states make in 2016 and 2017 likely will shape assessment for the next ten to fifteen years.  There is a new federal law in place with assessment and accountability requirements that states are trying to interpret.  There are forces pushing for high-quality assessments that require students to demonstrate college- and career-readiness; assessments that emphasize writing, critical thinking, problem-solving, and research skills.  At the same time there are forces citing feasibility and practicality in calling for assessments that take less time, are less expensive, and produce immediate results.

If all of the above feels oddly familiar, it should.  In too many ways, 2015-2017 is beginning to look like a replay of the period between 2002 and 2004; and that is where our quotes begin.

 

Newman!  – or in this case, Neuman!

In 2002-2003, states were trying to figure out how they would implement the new state assessment requirements of No Child Left Behind (NCLB) that would take effect in 2006.  Instead of testing students once at the elementary, middle, and high school levels, NCLB required annual testing of students at grades 3 through 8, plus an additional test at high school.  NCLB also required those state assessments to be high-quality assessments aligned with the state’s challenging academic achievement standards, and to involve multiple up-to-date measures of student academic achievement, including measures that assess higher-order thinking skills and understanding.  Additionally, the accountability requirements of NCLB required the results of those assessments to be processed quickly enough for states to issue school and district accountability reports in time for provisions such as “school choice” to take effect prior to the beginning of the new school year.

States did not know whether it was possible or whether they had the capacity to test students at seven grades instead of three, process the results of those assessments quickly enough to meet the accountability requirements, and also administer the high-quality assessments described in the law.  Something had to give, and Susan Neuman, Assistant Secretary for Elementary and Secondary Education in the U.S. Department of Education, stepped up to do the giving.  At a keynote luncheon during the annual Large-Scale Assessment Conference sponsored by the Council of Chief State School Officers (now the National Conference on Student Assessment), Neuman announced that the NCLB assessment requirements could be satisfied by tests consisting solely of multiple-choice items.  While we ate cake, the die was cast on the design of state assessment for the next decade.

Those who cannot remember the past are condemned to repeat it.

It is critical to the current situation to remember the state of large-scale assessment in the period leading up to NCLB and the USED acceptance of multiple-choice tests.  In many ways, the 1990s was a golden age of innovation in large-scale state assessment.  The modern era of state assessment began in the mid-1980s, shortly before the backlash against traditional, multiple-choice, norm-referenced standardized testing reached a peak with the uproar over the Lake Wobegon Effect.  Many state assessment programs, such as the New Jersey Assessment of Knowledge and Skills (NJ ASK) and the Massachusetts Comprehensive Assessment System (MCAS), were able to incorporate a direct writing assessment and a variety of item types into an otherwise traditional test format.  Other states attempted to push the assessment envelope well beyond the traditional end-of-year bubble test.  A few examples of assessment programs that explored innovations such as portfolios and performance tasks, along with constructed-response items and direct writing assessments, included the following:

  • New Standards Project
  • Vermont Portfolio Assessment Program
  • Maryland School Performance Assessment Program (MSPAP)
  • Kentucky Instructional Results Information System (KIRIS)
  • Maine Educational Assessment (MEA) and Local Assessment Systems (LAS)
  • California Learning Assessment System (CLAS)
  • Rhode Island Distinguished Merit Testing Program
  • Nebraska Student-based Teacher-led Assessment and Reporting System (STARS)

In some fundamental way, each of those programs attempted to push the assessment envelope well beyond traditional multiple-choice items.  Some attempts were more successful than others.  Careers were made pointing out issues of technical quality associated with programs such as KIRIS or the Vermont Portfolio Assessment Program.  All of those programs, however, reflected a spirit of innovation and improvement – a sense that it was possible to re-imagine large-scale state assessment.  A statement by Lauren Resnick, co-founder of the New Standards Project, conveyed the feeling behind most of the programs listed above: “When you do something as expansive as what we did, you have to have a belief you can make almost anything happen.”

Two decades later, many of the practical and technical problems associated with the innovative assessments of the 1990s remain unsolved and unstudied, or potential solutions remain untested.  That might be acceptable if we could argue that 20 years of research had resulted in little progress in solving those problems.  The reality, however, is that much of that research never occurred.  The impetus for innovative assessment, and for solving those problems, stopped in the 1990s.  When push came to shove at the end of the 1990s, innovation gave way to expediency.  The pressure to meet the increased testing and accountability demands of NCLB simply outweighed any countervailing forces to produce high-quality, innovative, performance-based assessments.

How likely is it that the experience of the late 1990s will repeat itself in 2016, 2017, or 2018?  What will this next generation of state assessments look like after two or three test administration cycles?  Is the recent decision by the state of Connecticut to eliminate the performance task from the Smarter Balanced English language arts assessment an anomaly or a tipping point?  How far will assessments such as PARCC go in an attempt to rightsize themselves – what design features will be compromised?  Should we be excited or nervous that words like rightsize are being used in discussions about assessment?

 

The definition of insanity is doing the same thing over and over again, but expecting different results.

A key feature of each of the innovative assessment programs listed in the previous section is that their developers were willing to think outside of the traditional, end-of-year, summative assessment box.  They did not limit themselves to on-demand assessments that could be administered neatly to individual students in one, two, or three sessions during a tightly defined administration window.  Most of those programs reflected the belief that the authentic assessment of higher-order cognitive skills such as critical thinking and problem solving required a different kind of assessment experience.  Assessment of those skills requires students to engage with problems and performance tasks over an extended period of time, and it requires assessments embedded in curriculum and instruction.  In short, assessment of the skills that we consider critical is messy.

As states built the new Common Core-aligned assessments such as PARCC and Smarter Balanced, however, they stayed firmly within the traditional assessment box.  PARCC may have tried to build a bigger box, and Smarter Balanced used adaptive technology to try to measure more precisely the types of things that you can assess well in that box.  However, neither program attempted anything close to the innovative assessment programs of the 1990s.  That outcome was not a surprise in the current era of accountability, but it does evoke a sense of doing the same thing over and over again while expecting different results.

End-of-year assessments such as PARCC and Smarter Balanced can and will do a better job of assessing skills such as problem solving and critical thinking than most state assessments administered in the NCLB era.  There is, however, a ceiling to what such assessments can do.  As reflected in the 2014 National Research Council report Developing Assessments for the Next Generation Science Standards, the effective assessment of complex content standards such as the Common Core State Standards and the Next Generation Science Standards requires a fundamental change in our thinking about assessment, and requires a new kind of assessment program.

ESSA offers states some opportunity for flexibility and innovation in the design of their testing programs.  States may explore ways to incorporate results from interim assessment programs into an end-of-year performance level classification for students.  States may also propose innovative designs for the assessment systems that support their accountability systems.  Such efforts, however, will be limited by the constraints of accountability as well as the practical and technical challenges faced by the innovative assessment programs of the 1990s.  Without a true commitment to fundamental change, to accepting the messiness, and to providing the time and resources needed to build comprehensive, cohesive assessment systems across the school, district, and state levels, it is likely that the once-in-a-lifetime window of opportunity for assessment reform that Secretary Duncan described in 2009 and 2010 will close with only incremental improvement in the state of state assessment.

To end on a positive note, all hope is not lost.  Acknowledging the limitations of end-of-year, on-demand assessment, realizing that it has become unmanageable, and admitting that we are powerless over our addiction to it is the first step toward the solution.  Pieces of a comprehensive system are being built through solid work on formative assessment practices, the use of interim assessments, and efforts to build and sustain educators’ assessment literacy.  The NRC Committee on Developing Assessments of Science Proficiency has provided a solid framework for moving forward.  The recently published NCME volume, Meeting the Challenges to Measurement in an Era of Accountability, addresses many of the technical challenges laid out in the 1990s and provides examples of performance-based assessment programs in a variety of content areas.  The pieces are there.  Will we pick them up and stop doing the same thing over and over again, or are we condemned to repeat the errors of the past?