assessment, accountability, and other important stuff

Charlie DePascale

As Smarter Balanced, PARCC, and the other new college- and career-ready assessments proceed through their first operational administration, one might assume that all of the difficult design and development decisions are well behind them.  In reality, we are entering the final, and often most harrowing, phase of the test design process.  Despite all of the careful planning and field testing that lead up to the operational administration of a new assessment, it is not unusual for there to be significant design changes following that first administration.  Some of these changes may stem from content or measurement issues, but most often it is factors such as testing time, cost, or administration issues that lead to post-implementation design changes.

The English Language Arts (ELA) portion of the Massachusetts Comprehensive Assessment System (MCAS) serves as a prime example of the type and extent of the changes that can occur after the initial administration of a new assessment.  The MCAS ELA test consists of separate Reading and Writing tests, from which scores are combined into a composite English Language Arts score.  On the initial administration of the MCAS Writing test in 1998, each student produced two essays, called the Long Composition and the Short Composition.  By 1999, the Short Composition had been eliminated.  The Reading test included a mix of multiple-choice items, worth 1 point each, and constructed-response items, scored on a 0-4 scale.  In 1998, student scores in reading were based on 32 multiple-choice items and 8 constructed-response items.  In 1999, half of the constructed-response items were eliminated and student scores were based on 36 multiple-choice items and 4 constructed-response items.  The table below shows the breakdown of items on the 1998 and 1999 MCAS ELA assessments.

Changes in Item Distribution Across the First Two Administrations of the MCAS English Language Arts Assessment

Item Type               1998    1999
Multiple-choice         32      36
Constructed-Response    8       4
Long Composition        1       1
Short Composition       1       0
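To see what the Reading item counts mean in score points, here is a minimal sketch of the arithmetic.  It assumes each multiple-choice item is worth 1 point and each constructed-response item is worth up to 4 points, as the 0-4 scale implies, and it ignores any weighting that may have been applied in operational scoring.  The function and variable names are illustrative only.

```python
# Maximum Reading score points by item type, using the item counts reported above.
# Assumption: 1 point per multiple-choice item, up to 4 points per
# constructed-response item, and no additional weighting.
MC_POINTS = 1
CR_POINTS = 4

READING_ITEMS = {
    1998: {"multiple_choice": 32, "constructed_response": 8},
    1999: {"multiple_choice": 36, "constructed_response": 4},
}


def reading_point_breakdown(counts):
    """Return maximum points by item type and the constructed-response share."""
    mc = counts["multiple_choice"] * MC_POINTS
    cr = counts["constructed_response"] * CR_POINTS
    total = mc + cr
    return {"mc_points": mc, "cr_points": cr, "total": total, "cr_share": cr / total}


for year, counts in READING_ITEMS.items():
    b = reading_point_breakdown(counts)
    print(f"{year}: {b['total']} points, {b['cr_share']:.0%} from constructed response")
```

Under those assumptions, the maximum Reading score drops from 64 points to 52, and the share of points coming from constructed-response items falls from one half to a little under one third.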

The changes in item distribution shown above were accompanied by administration changes including the number, order, and length of test sessions as well as the number of different content area tests administered to a student.

 

No, no, you can’t take that away from me.

Virtually all large-scale state assessments, and particularly assessments of the scope and nature of PARCC and Smarter Balanced, have been carefully designed, often over a period of two, three, or four years.  Many of the new assessment programs are applying the principles of evidence-centered design to construct test blueprints and specifications intended to support a specific set of claims about student achievement.  The resulting designs reflect the best attempt of knowledgeable and dedicated professionals to develop an assessment aligned to a specific set of standards and supporting a specific set of claims.  We can also assume that those people were well aware of constraints such as testing time, cost, and ease of administration as they designed the assessment.

Test design decisions are never reached easily.  In a consortium, in particular, it is certain that countless hours and blood, sweat, and tears were involved in the discussions that led to the operational design of the new assessments.  Undoubtedly, there have already been tough compromises in the design along the way.  For example, changes to the design of the PARCC ELA/Literacy assessments at the early grade levels following the spring 2014 field test were well-documented in the fall of 2014.

Now, perhaps even before the results of the initial administration are reported, additional design changes may be required; and those changes may be out of the control of the content or measurement specialists.

You like tomato, I like tomahto …

When changes to the test design occur after the initial administration of an assessment, it is almost always the case that cut scores for achievement levels have already been established under the original design.  Because nobody wants to change newly established achievement standards, a delicate dance begins to identify changes that will not require a new round of standard setting.  More accurately, it is often a dance to provide a rationale explaining why the proposed changes will not require standards to be reset.  In other words, the goal is to explain why the changes were not really changes.

The argument often includes statements such as:

  • No, eliminating the short composition and half of the constructed-response items will not change the construct being measured.
  • The distribution of items across content standards (or strands or domain clusters) has remained the same.
  • We did an analysis to score this year’s students on the shorter form and the correlation between their scores on the original and shorter forms was quite high.

In the much simpler past, there was a lot more wiggle room to make such arguments about post hoc changes to the test design.  In this age of evidence-centered design, however, I would expect it will be much more difficult to make and explain changes to the test design while maintaining the original set of claims.
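The correlation analysis described in the third bullet above can be sketched as follows.  The student data are entirely hypothetical, and the choice to keep the first four constructed-response items is an assumption made only for illustration.  Note that because the shorter form is scored from a subset of the same responses, some of the correlation is built in, which is one reason a high value by itself may not say much about whether the construct has changed.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical students: (multiple-choice points, eight constructed-response
# item scores on a 0-4 scale). These are made-up values, not MCAS data.
students = [
    (30, [4, 3, 4, 3, 4, 4, 3, 4]),
    (25, [3, 2, 3, 1, 2, 3, 2, 2]),
    (28, [2, 4, 1, 3, 4, 2, 3, 1]),
    (18, [1, 1, 2, 0, 1, 2, 1, 0]),
    (22, [3, 0, 2, 4, 1, 3, 0, 2]),
    (12, [0, 1, 0, 1, 0, 1, 1, 0]),
]

# Original form: all eight constructed-response items count.
full_scores = [mc + sum(cr) for mc, cr in students]
# Shorter form: rescore the same students using only the first four items.
short_scores = [mc + sum(cr[:4]) for mc, cr in students]

r = pearson(full_scores, short_scores)
print(f"Correlation between original and shorter form scores: {r:.3f}")
```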

Let’s call the whole thing off

The pressure to change an assessment is always intense, whether that pressure is to make the assessment shorter, less expensive, easier to administer, less difficult, or perhaps more rigorous.  The new assessment programs also face the additional pressure of states having the option to choose another assessment program.  It would have been very difficult for a state to walk away from its own custom assessment in the heyday of NCLB.  Over the last two years, however, it seems to have become commonplace for states to say “let’s call the whole thing off” to one assessment program and move in a different direction.

The ability of states to walk away adds a new dimension to this test design two-step.  On one hand, there is pressure to make changes to prevent a state from walking away.  On the other hand, an assessment program can choose to hold firm on its design and allow an individual state to walk away.  It takes two to tango.

Ideally, the new assessment programs will make it through this process intact, with their claims still fully supported; and years from now they will be able to look back and say …

The odds were a hundred to one against me
The world thought the heights were too high to climb
But people from Missouri never incensed me
Oh, I wasn’t a bit concerned
For from history I had learned
How many, many times the worm had turned…

They all laughed at Christopher Columbus
When he said the world was round
They all laughed when Edison recorded sound
They all laughed at Wilbur and his brother
When they said that man could fly
They told Marconi
Wireless was a phony
It’s the same old cry
They laughed at me wanting you
Said I was reaching for the moon
But oh, you came through
Now they’ll have to change their tune
They all said we never could be happy
They laughed at us and how!
But ho, ho, ho!
Who’s got the last laugh now?

(They All Laughed, George and Ira Gershwin, 1937)
