I first met Greg Cizek in May 2007 at the Contemporary Issues in High Stakes Testing Conference – a gathering in Lincoln, Nebraska to honor the career of Barbara Plake. From the audience, Greg engaged another participant (who will not be named) in a vigorous, thoughtful, and thought-provoking debate on the role of consequences in test validity. (Don’t you miss the days when a healthy exchange of ideas might break out at a conference?)
As we all know, the role of consequences in validity, the meaning of validity, and the actual practice of validation have not become any less contemporary as issues in high stakes testing over the intervening 13 years. In 2013, Newton & Shaw concluded that the level of confusion and disagreement was so great that the best approach was to abandon the concept of validity altogether and start over from scratch. Earlier this year, the outgoing NCME president lamented the time that the measurement community has spent fiddling with arguments about the meaning of validity over the past two decades while the testing world burned around us. When I retired in 2019, I feared that it might be generations before the field would be able to clean up our Messick.
Now, however, I see a light at the end of the tunnel. To borrow from Thomas Jefferson, with his new book Validity – An Integrated Approach to Test Score Meaning and Use, Greg Cizek has “placed before mankind the common sense of the subject, in terms so plain and firm as to command their assent.”
- “What do these scores mean?”
- “Should these test scores be used for Purpose X?” (where X is a specific test use)
Those are the two direct, distinct questions that must be answered for every test and for each and every proposed use of that test. These are not new questions or a new idea. We have known this at least since Messick (1989) and Kane (2006) tried valiantly to make the concept clearer. In addressing what these two questions mean and how to approach answering them, however, Cizek has made what I regard as the definitive and most accessible statement on validity in the past 30 years.
I am not naïve enough, of course, to believe that this masterpiece will, in fact, be any more successful than the Declaration of Independence in commanding the assent of those whose hearts and minds have already been hardened. Assuming, however, that it will be required reading in graduate courses and among testing professionals for years to come, I am incredibly optimistic. Plus, for the next couple of weeks, you can pick it up at 20% off during Routledge’s mid-year sale.
In closing, I will share five thoughts about validity that either emerged or became much clearer to me while reading the book. Please do not hold Greg responsible for my creative interpretation or misinterpretation of the brilliant argument that he has made in Validity.
- Words matter – Cizek devotes much of the introduction to defining the foundational measurement concepts underlying validity such as test, assessment, construct, inference, etc. Thirty years ago, the question of the difference (if any) between “test” and “assessment” led my dissertation committee into a 15-minute debate among themselves while I sat and watched. That was a good thing at that time and in that setting, but now it is critical that we understand that a test is not a testing program and that a testing program is not an accountability system.
- The Standards – The AERA, APA, NCME Standards for Educational and Psychological Testing are about, as the title suggests, testing. I was in graduate school when the 1985 edition changed the last word of the title from tests to testing and noted that “the major new sections in the 1985 Standards relate to test use.” The focus of the current Standards is appropriately on testing and the use of tests, but they are rooted in a history and tradition of psychometrics and the technical quality of tests. To selectively quote from the introduction to the 1985 edition, “The Standards does not attempt to provide psychometric answers to policy questions.” Even with the 200+ pages of the most recent edition (twice the length of the 1985 edition and more than 3X the size of the 1974 edition), it is not possible in a single set of standards to adequately address both the development of tests (including the meaning of test scores) and the use of those tests. In attempting to serve as the definitive word on both, they have served neither as well as they should have or needed to.
- Constructs – By placing constructs at the center of a unified theory of validity and our testing universe, we have aimed a spotlight at the fundamental and biggest challenge in our chosen field. From a measurement perspective, that makes perfect sense. We are trying to measure things that cannot be seen and that, in general, are defined through other observed behaviors. Those behaviors are almost always context dependent, often must themselves be measured by a test, and are related to, but different from, the construct of interest. We can evaluate the accuracy of predictions. We can judge the quality of content. With constructs, however, most of us had our first exposure to the terms “operational definition” and “operationalize” but often did not get much further than that. The upshot is that we have become more and more comfortable with relying on the test to define the construct – which is a very bad thing. To paraphrase a line from Hamilton, “Prediction is easy, Son, Measurement’s harder.”
- A Theory of Everything – I am not sure what motivated Messick and the other giants of our field in the 1970s and 1980s, but they were clearly enamored of the notion of validity as a grand unified theory of everything. And they came so close, if not for that 2×2 matrix. Perhaps they were driven by the controversies and lawsuits enveloping test use at the time. Perhaps their motivation was the belief that consequences and the justification of test use wouldn’t be considered as important if not included as a component of validity. Perhaps it was simply the hubris that drives scientists to seek great discoveries, but ultimately leads them to fly too close to the sun with the retribution of nemesis that follows (i.e., 30 years of confusion and controversy). Or as I have chosen to believe (because we’re allowed to do that now), perhaps they saw the handwriting on the wall and wanted to make one last grand attempt to stake a claim for testing as measurement before handing standardized testing over to IRT and the modelers.
- Measurement and Evaluation – I entered the doctoral program in Measurement & Evaluation at the University of Minnesota in the mid-1980s – on the cusp of the IRT era of standardized testing. The program focused as much, if not more, on the Program Evaluation standards as it did on the Standards. After reading Cizek’s Validity, I realize that was the perfect foundation for a 30-year career in assessment and education policy. Thank you, Minnesota. Ski-U-Mah!