Pondering validity on the occasion of Pope Francis’ visit to the United States
From Kane back to Ebel, there are religious overtones, sometimes thinly veiled, to discussions of validity as the alpha and the omega of educational measurement. Without validity, there is no measurement. Validity was in the beginning, is now, and ever shall be. Amen.
To a lifelong practicing Catholic, this spiritual framing of validity has a familiar and comfortable feel. Begin with the idea that although there is but one true validity, it comprises multiple (often three) validities, none of which is greater than construct validity. Consider the mystery that, unlike virtually everything else in measurement, validity itself cannot be measured. There are statistics that support a validity argument, but there is no overall validity statistic per se. Validity is all around us, and our effort to find and compile evidence of validity is never ending. Most of the important things that our field is trying to measure cannot be seen or touched, but their existence can be demonstrated through words and deeds. These are not foreign concepts to me.
The manner in which threats to validity are regarded also fits well within my upbringing as an American Catholic – everything in moderation and everything in context. Test preparation and teaching to the test in moderation may be fine, but taken to the extreme they are likely threats to validity. Collaboration with peers, support from teachers, or the use of tools such as calculators or dictionaries may be fine under some conditions, but not others. Assessment developers strive to produce high-quality items and tests, but repeated use of the same items or tests will eventually render them useless. In short, there are few absolutes.
Above all that, there is a strange comfort in knowing that the meaning of validity is a mystery even to the most learned scholars in the field. In the 1961 article, Must All Tests be Valid?, Ebel quotes Cronbach, Gulliksen, and other test specialists of the day on validity and concludes, “[i]t would be difficult to state in words a core of meaning common to all the various definitions of test validity…” Subsequent treatises by Messick (1989) and Kane (2006) undeniably have furthered the discussion of validity, but arguably have done little to establish a core meaning of validity and validation. Newton and Shaw (2015) describe the current state of arriving at a core meaning of validity “as a standoff between scholars (and their followers) who advocate radically different usages.”
Within the Catholic Church, there are, of course, issues far more contentious and important than the role of consequences in validity. On a day-to-day basis, however, we do not let such issues paralyze us. We seek our best answer to the question What Would Jesus Do?, and we act upon it. Later, we reflect on our actions, we ask forgiveness when wrong, and we always try to find a better answer and do better the next time. All of us engaged in assessment today must adopt a similar approach to validity and validation. We must move forward and vow to design our testing programs with care, implement them with fidelity, report results accurately, make claims cautiously, gather evidence of their effectiveness, and do better the next time.
The final section of the 12-page introduction to the Validity chapter of the 2014 Joint Standards includes the following statement:
[A] test interpretation for a given use rests on evidence for a set of propositions making up the validity argument, and at some point validation evidence allows for a summary judgment of the intended interpretation that is well supported and defensible. At some point the effort to provide sufficient validity evidence to support a given test interpretation for a specific use does end (at least provisionally, pending the emergence of a strong basis for questioning that judgment).
If assessment specialists, educators, and policy makers move forward with a focus on presenting, challenging, and refining our set of propositions, determining what evidence is necessary (not simply what evidence is readily available) to evaluate those propositions, gathering that evidence, and making a summary judgment that is well supported and defensible, I believe that an epiphany will occur. In a moment of divine inspiration and clarity, the veil will be lifted. We will move from absolute darkness to glimmering light, and thence to the bright and clear vision that issues directly related to the test are often nothing more than a mote of dust in the validation and evaluation of our test-based programs and policies.
- The validation and acceptance of test-based graduation policies rests little on the question of whether the test is providing an accurate measure of students’ proficiency in mathematics or English language arts. It is much more often concerned with the fairness and appropriateness of holding all students to a minimum level of proficiency as a prerequisite for earning a diploma.
- The use of student test scores in teacher evaluation systems is never a question of whether a mathematics or English language arts test is a measure of teacher or teaching effectiveness. It is not. The test, we claim, is a measure of student achievement in mathematics or English language arts; and we can gather evidence to validate that claim. With regard to teacher evaluation, however, it is our set of propositions about the way in which student achievement and teacher effectiveness are interrelated that must be validated to make a well supported and defensible summary judgment about the effectiveness of an individual teacher.
- Test scores in mathematics and English language arts do not provide a sufficient basis on which to evaluate the overall quality of a school. However, since its inception in 1965, a primary purpose of Title I has been to close achievement gaps in reading, writing, and mathematics between students living in low-income households (particularly in high concentrations in urban and rural areas) and their more affluent peers. Therefore, student achievement in mathematics and English language arts is certainly an important outcome to consider in evaluating the effectiveness of Title I programs or the use of Title I funds.
The Consequences of Misunderstanding Validity
In closing, we return to the Joint Standards and the final section of the Introduction to the chapter on validity. In contrast to the statement discussed above, the final paragraph of the Introduction begins with the statement:
Ultimately, the validity of an intended interpretation of test scores relies on all the available evidence relevant to the technical quality of a testing system.
As practitioners in the field, we can continue gnashing our teeth and wailing over what does and does not merit consideration as evidence relevant to the technical quality of a testing system. In doing so, however, we cannot let others lose sight of the fact that the technical quality of a testing system and the validation of a particular interpretation of a test together constitute just one of the propositions on which their program or policy has been built. If they fail to validate and evaluate the rest of their propositions, they are like the foolish man who built his house on sand.