It is said that the road to Hell is paved with good intentions. I fear that those building the road to next generation, high-quality assessments may be using the same construction crew.
At the beginning of this month, Secretary King released promised guidance to states as a follow-up to President Obama’s Testing Action Plan. Last week, the Fordham Institute and HumRRO released reports containing the results of their evaluations of how well the ACT Aspire, MCAS, PARCC, and Smarter Balanced assessments match up against the CCSSO Criteria for High-Quality Assessments. Both the administration and CCSSO set lofty requirements for what qualifies as high-quality assessment.
From the President’s plan, high-quality assessment results in “actionable, objective information about student knowledge and skills” and as appropriate:
- Covers the full range of the relevant state standards
- Elicits complex student demonstrations or applications of knowledge
- Provides an accurate measure of student achievement
- Provides an accurate measure of student growth.
All of that sounds nice, but the devil, of course, is in the details. What do they mean by actionable, objective information? Who determines when it is appropriate for an assessment to cover the full range of the relevant state standards? More important, when is it appropriate for an assessment to not cover the full range of relevant state standards? What constitutes accurate measures of student achievement or student growth?
The Fordham and HumRRO reports provide a glimpse at just how high the bar for high quality assessment has been set. In 2010, when Secretary Duncan announced the winners of the $300 million Race to the Top Assessment competition in his famous Beyond the Bubble Test speech, he made clear that the key feature that would separate these next generation tests from traditional, fill-in-the-bubble tests was depth; that is, they would be tests of critical thinking skills and complex learning, provide students with complex performance tasks, and measure the higher-order thinking skills so vital to success in the 21st century and the future of American prosperity. However, when Fordham and HumRRO evaluated PARCC and Smarter Balanced, their ratings on categories related to Depth of Coverage left much room for improvement.
- On their grade 5 and 8 English Language Arts assessments, PARCC and Smarter Balanced combined received an ‘Excellent Match’ on only 4 of the 10 ratings on depth.
- On the Mathematics assessments at grades 5 and 8, the assessments received an ‘Excellent Match’ on only 3 of 9 ratings.
- The results were slightly better on the high school assessments, but many states are abandoning those tests in favor of the SAT or ACT (a subject for another day).
One would think that in four years it would have been possible to build assessments that assess students’ knowledge and skills at a depth that reflects the demands of College and Career Readiness. We were able to put a man on the moon and return him safely to earth in under nine years. The Manhattan Project took three years. Hell, it only took the Denver Broncos two years to rebuild their defense and win a championship after that debacle at the end of the 2014 season – and with a salary cap roughly equal to the Race to the Top Assessment grant.
The goal here is not to criticize PARCC and Smarter Balanced, but rather to discuss these well-intended definitions of high quality assessment that very soon could become codified in USED Peer Review rubrics (recall the reification of Norman Webb’s alignment criteria if you are not sure what I mean). My fear is that these definitions of high quality assessment, which are so expansive, are at the same time much too narrow. They describe a worthy set of goals for assessment, in general, but do not provide enough information for me to determine what a high quality assessment or actionable information looks like in a particular context or when being used for a particular purpose.
We all have our favorite measurement instruments. OK, some of us have a favorite measurement instrument. If not a favorite, we all have a most dreaded measurement instrument, such as the thermometer that read nine degrees below zero last Sunday morning, the bathroom scale, or, even worse, the unforgiving scale at the doctor’s office. They all provide actionable, objective information of one sort or another. The thermometer told me to stay inside and eat comfort food all day. The bathroom scale told me not to do that. (Darn those multiple measures!)
My personal favorite measurement instruments are the devices that are used to measure height at amusement parks. Some are as simple as a line painted on a wall or an L-shaped piece of PVC pipe. Others are more elaborate like the outstretched arm of a character or the system at Hershey Park that assigns candy bar names to various height categories (e.g., under 36” = Hershey miniatures). Whether simple or elaborate, each efficiently performs a straightforward, single measurement function and immediately returns actionable, objective information: you are too tall/short for this particular ride or you can go on this ride. They rarely measure the full range of height. They don’t provide specific information about how tall an individual child is. They certainly don’t provide direct information about how much a child has grown from one year to the next. For my money, however, they are high quality assessments. To paraphrase Bill Belichick, they do their job!
Like the devices used to measure a child’s height at amusement parks, high quality K-12 assessments must adhere to the maxim Form Follows Function. A test must be designed to fit its intended use and the context in which it is being administered. This applies equally to state assessments, district assessments, and assessments used on a regular basis in the classroom.
Perhaps it is not desirable or feasible to attempt to design an on-demand, summative state assessment that fully assesses the depth and breadth of college- and career-readiness standards as complex as the Common Core State Standards. As we discussed in a blog post last spring, Shall We Dance, virtually all assessment programs have to make tradeoffs to fit alignment to the standards within the constraints of on-demand testing, such as testing time, cost, and test security. PARCC, for example, made some adjustments prior to the administration of the tests in spring 2015, but determined after spring 2015 testing that it was necessary to further shorten and simplify its test administration.
The issue of the extent to which a single state assessment can address the depth and breadth of complex content standards is being tackled head-on with the Next Generation Science Standards (NGSS) and the conclusion presented in the NRC Report, Developing Assessments for the Next Generation Science Standards:
CONCLUSION 6-1 A coherently designed multilevel assessment system is necessary to assess science learning as envisioned in the framework and the Next Generation Science Standards and provide useful and usable information to multiple audiences. An assessment system intended to serve accountability purposes and also support learning will need to include multiple components: (1) assessments designed for use in the classroom as part of day-to-day instruction, (2) assessments designed for monitoring purposes that include both on-demand and classroom-embedded components, and (3) a set of indicators designed to monitor the quality of instruction to ensure that students have the opportunity to learn science as envisioned in the framework.
When taking such a systems approach from the beginning, it becomes much easier to discuss the role of the end-of-year state assessment as one component of that system. Consequently, it becomes much easier to set realistic expectations for not only the depth of coverage of the state assessment, but the type of actionable, objective information it can provide and to identify the intended audiences for that information. When states began generating student results from all state assessments, parents became a primary audience for those results. The objective information provided was an external indication of their child’s proficiency in reading, mathematics, and science as well as information needed to compare performance in their school district with performance in other school districts across the state. The intended action was to generate necessary conversations between parents and teachers.
Moving from state assessments to district- and classroom-level assessment, we can address similar issues about defining high quality assessment in terms of purpose, format, form, and function. Among standardized assessments, the most commonly administered assessments at this time are so-called interim assessments, administered two to four times per year, and assessments administered as part of Response to Intervention (RTI) programs, administered on a more frequent basis throughout the year. The roots of both interim and RTI assessments can be traced back to curriculum-based measurement (CBM) developed in the late 1970s and early 1980s at the University of Minnesota. In the case of CBM, the actionable, objective information produced is generally not detailed content-based feedback, but rather an indication that the individual student (or group of students) is not progressing in a manner that will lead to successful attainment of the goal or in a manner consistent with other students.
The original development of CBM procedures 35 years ago serves as an exemplar for thinking about high quality assessment today. As described by Stan Deno in a 2003 article:
“The essential purpose of CBM has always been to aid teachers in evaluating the effectiveness of the instruction they are providing to individual students.”
“Three key questions were addressed in developing the CBM procedures: (a) What are the outcome tasks on which performance should be measured? (‘What to measure’); (b) How must the measurement activities be structured to produce technically adequate data? (‘How to measure’); and (c) Can the data be used to improve educational programs? (‘How to use’).” The questions were answered through systematic examination of three key issues relevant to each – the technical adequacy of the measures, the treatment validity or utility of the measures, and the logistical feasibility of the measures.
What to measure – How to Measure – How to Use
The focus on utility and logistical feasibility, together with the attention to technical adequacy and the identification of key outcomes, is a feature of CBM that must be incorporated into any definition of high quality assessment and any criteria for evaluating individual assessments or assessment systems.
In the same article, Deno discusses the need for teachers to have a “vital sign” indication not only that student learning or effective teaching has occurred in a particular lesson, but also that what has been taught and learned in that lesson contributes to the “overall proficiency for which the curriculum is designed.” It is this concept of assessments serving as tools to provide vital sign indicators that is missing from current conversations about high quality assessment. When we arrive at the emergency room with chest pains, protocol is not to do a complete physical workup, but to administer key tests that are vital indicators of whether there is imminent danger of a life-threatening event. In that context, a high quality assessment is one that addresses only the critical content in the most efficient manner possible.
Current definitions of high quality assessment, such as the requirements contained in the CCSSO Criteria or those key elements laid out by the administration, do not necessarily preclude the use of efficient, targeted assessment for a particular purpose. There is wiggle room to embrace such tools in the way that one defines “actionable, objective information” or in the way that one interprets the phrase “as appropriate” in the President’s plan. I will be much more confident that there will be room at the table for such assessments, however, when they are being discussed openly in conversations about high quality assessment.