As we begin the 2020s, let’s take a moment to look back at some of the state assessment moments that defined the 2010s.
Last week, the New England Patriots announced their all-decade team for the 2010s. The Patriots’ release was quickly followed by local media offering their own selections of all-decade teams for the Celtics, Bruins, and Red Sox (there’s not much to talk about on sports radio these days).
That got me thinking: what tests would I include on my all-decade team of state assessments? (Seriously, I don’t have much going on these days.) After all, it was quite a decade in state assessment. As my former colleague and frequent foil Chris Domaleski noted, “It’s hard to describe the enthusiasm and high expectations for assessment in 2010.” That level of enthusiasm, of course, is impossible to sustain, and the result was a decade of state assessment which Chris described aptly as “an exuberant start, setbacks, and, finally, renewed optimism and clarity about the road ahead.”
So, what state assessments stood out during the roller coaster ride of the 2010s? Which tests would I include on my all-decade team – state assessment version?
Criteria for Selection
The first step in creating a list like this is to determine a set of criteria for selection. Was there a minimum number of years the test had to be administered? Did it have to meet the CCSSO Criteria for High Quality Assessments? Should I consider cost, testing time, number of subscores reported, best name or acronym?
What about “measurement-y” factors like validity, reliability, fairness, alignment, utility, and unintended consequences?
In the end, I went with the same simple criterion I use for everything that makes it into this blog: things that interest me and that I can use to make a point.
Other Tests Considered but Not Included
No list such as this would be complete without a brief discussion of other tests that were considered but didn’t quite make the cut.
I will start with the two most obvious omissions: PARCC and Smarter Balanced – the 800-pound gorillas of state assessment in the 2010s. Although both programs appear in the list indirectly through various states, I decided to save discussion of the programs themselves for my forthcoming Netflix documentary Assessment King: Measurement, Mayhem, and Mystery.
I also decided not to include any of the new assessments being developed under IADA by states such as New Hampshire, Louisiana, and Georgia. Similarly, I didn’t include the next-generation science assessments under development, such as the Cambium (formerly AIR) cluster-based science assessment, which is attempting assessment innovation on so many levels at the same time that it makes my head spin. As the development of those testing programs plays out over the next several years, they can be considered for the 2020s All-Decade Team.
With that out of the way, on to the list.
(in chronological order)
New York (2012)
When people list their biggest fears, things like public speaking, bugs, clowns, or certain politicians being re-elected top the list. Worst nightmares include showing up naked at work or school, falling off a cliff, or drowning.
For those of us in state assessment, all of those fears rolled into one are realized when one of your assessment items ends up on the front page of the newspaper. A former colleague still shudders whenever he hears a song from Evita, Cats, or Phantom pop up in his AirPods. Personally, whenever I am asked to review a test item, the first thought that goes through my mind is WWJD – what would Jennifer do?
For New York, the nightmare played out through The Pineapple and The Hare; and because everything is bigger on Broadway, the 2012 New York assessment became a national punchline. Truth be told, if you can get beyond the cannibalism (spoiler alert: they ate the talking pineapple), the passage and items really aren’t bad at all. As became even clearer as the decade progressed, however, you can lose control of a story in the blink of an eye.
Georgia (2013)
Georgia earned a spot on the team for its ranking on the 2013 NCES study mapping state proficiency standards onto NAEP scales. On each of the grade 4 and 8 reading and mathematics tests examined, Georgia’s proficiency standard was among the lowest of all states and in the Below Basic range on the NAEP scales. The implication, of course, was that Georgia’s proficiency standards were too low; and subsequently they have been raised.
The state of Georgia, however, has been in the news the last couple of weeks for again marching to the beat of its own drum on how to handle the COVID-19 pandemic. Time will tell whether Georgia was right with respect to its approach to re-opening the state and with respect to its proficiency standards.
I don’t know how things will turn out, but I am pretty confident that there is not a single national approach that is right for handling COVID-19 and that there is not a single proficiency standard that applies equally well to all states and students across the country.
NECAP (2013)
Although the NECAP science tests continued for a few more years, the 2013-2014 administration of the New England Common Assessment Program tests in reading, writing, and mathematics in fall 2013 effectively marked the successful end of a grand decade-long experiment in collaboration among states to design, develop, and administer a common assessment. Having served as an existence proof that a consortium of states could collaboratively manage a common assessment program under the right conditions, Maine, New Hampshire, and Vermont moved on to Smarter Balanced and Rhode Island to PARCC.
Nevada, Montana, and North Dakota (2015)
Along with Montana and North Dakota, Nevada chose to administer the Smarter Balanced assessment using the open-source platform developed by Smarter Balanced as part of its Race to the Top Assessment grant. Let’s just say that things did not go well.
The remainder of Smarter Balanced states opted to use their assessment contractor’s proprietary platform. PARCC tests also were administered using its contractor’s proprietary platform.
We have a much better understanding now of the importance of having a stable test administration platform. The testing companies that have flourished in this transition to computer-based testing are the companies that have built such platforms.
Trying to design, develop, and implement a new assessment within the limited timeframe of the Race to the Top Assessment grant was a formidable challenge on its own, without the additional burden of building an open-source platform in that same window.
Massachusetts (2015-2016)
While deciding whether to adopt the PARCC assessment or develop a new custom assessment program of its own, Massachusetts administered both its own MCAS tests and PARCC tests in spring 2015 and 2016 in what was dubbed a “two-year test drive” – allowing districts the choice of tests. For districts choosing PARCC, there was the additional choice of paper-based or computer-based testing. The combinations of tests and formats across years posed interesting challenges for accomplishing tasks such as determining the comparability of scores across districts for accountability purposes and computing student growth scores.
Connecticut (2016)
In March 2016, the governor of Connecticut announced that the state was dropping a performance component of the Smarter Balanced test in an effort to reduce testing time. The first-of-its-kind decision by Connecticut raised questions about what types of changes can be made to an assessment program while continuing to treat it as the same test, both in the reporting and interpretation of scaled scores and achievement levels and in the use of those scores for accountability purposes.
The governor’s decision also demonstrated how strong the pushback against too much testing time had become. Other states and testing programs have grappled with similar decisions since 2016.
Louisiana (2017)
After administering a paper-based form of PARCC on its own in 2015 and a hybrid PARCC-Louisiana assessment in 2016, in spring 2017 Louisiana administered grade 3-8 English language arts and mathematics tests composed almost entirely of PARCC items – but not the items that PARCC was administering in 2017. Louisiana was the first state to license PARCC items (and item parameters) so that it could report its results on the PARCC scale, apply PARCC achievement standards, and compare its performance to other states.
While relying heavily on the properties of IRT, the Louisiana experience has demonstrated just how complex the licensing process can be and the myriad of decisions that must be made and checks that must be performed to ensure that the assumptions of IRT are being met when trying to build comparable test forms by selecting items from an item pool.
NAEP (2017 … 2018)
Although technically not a state assessment, it would be impossible (for me anyway) to compile this all-decade team without including the 2017 NAEP Reading and Mathematics tests. The 2017 administration was the first computer-based administration of the reading and mathematics tests. The processing of results stretched into April 2018 and presented considerable challenges for accounting for mode effects while maintaining trend lines – particularly when it cannot be assumed that the mode effect is the same across the entire population of test takers.
North Dakota (2018)
In 2018, North Dakota became the first state to obtain United States Department of Education approval to allow some districts to administer a locally selected, nationally recognized high school assessment (e.g., the ACT) in place of the high school state assessment. As we all know, the requirements in the fine print accompanying the “flexibility” offered in ESSA have proved challenging for USED to interpret and states to meet.
In addition to North Dakota’s use of the locally selected, nationally recognized test option, other states have adopted the ACT or SAT as their high school state assessment. The use of college admissions tests as a high school state assessment raises long-unasked and unanswered questions not only about the purpose and use of state assessment in high school, but also about the meaning and purpose of high school.
Massachusetts (2019)
I began this list with a controversy surrounding a reading passage and I will end it with a controversy surrounding a reading passage. In this case, a passage from the Pulitzer Prize-winning book, The Underground Railroad, and the essay question accompanying it caused the uproar during the 2019 administration of the tenth grade English language arts test – used as a high school graduation requirement. The question was dropped from the test and, after an external analysis, adjustments to policies were made in an attempt to account for effects on the performance of individual students.
We all know that there are topics that, although fair game for the classroom, should not be addressed on state assessments. Content committees and sensitivity committees are provided guidelines on topics to avoid, such as abortion and contraception, abuse of people or animals, the occult and ghosts, and suicide. Highly charged topics such as gun control, climate change, and other political issues are also avoided.
However, as we demand that assessments measure students’ critical thinking, problem solving, and other 21st-century skills, and as we strive for instruction to be more culturally relevant and make the student a more active participant in the education process, I can only conclude that it will become increasingly difficult to assess the knowledge and skills that we want and need to measure through large-scale on-demand state assessments.
For me, the clear message sent by the assessments highlighted here is that the 2010s were a decade of transition for state assessment. Meeting the challenges raised by the assessments in this list will provide tremendous opportunities for those charged with developing and administering state assessments in the decade ahead. If folks are willing to think outside of the box (and the answer bubble) and allow for the messiness of experimentation and innovation, my bet is that we can end the 2020s with excitement for all that has been accomplished.
That leads to my final and most important takeaway from the 2010s. I think that the biggest mistake of the decade was the requirement of the Race to the Top Assessment program for the consortia to design, develop, and implement a fully operational assessment program by the 2014-2015 school year. That mistake is being repeated now with the ESSA Innovative Assessment Demonstration Authority.
Innovation in assessment needs space, time, and resources for research and for things not to work correctly (never mind perfectly) the first time around – or even the second time around. The federal government can and must be a critical partner in creating and supporting the environment that will yield the next generation of assessments that we would all like to see.