Bring back valid tests


Charlie DePascale

Those of us of a certain age can recall when it was acceptable to talk about valid tests.  One could make the claim that a test was valid or, more often as a precocious graduate student, question whether a test was valid. Some version of the phrase “a test is valid to the extent that it measures what it purports to measure” rolled off our tongues and made its way into literally every paper we wrote, presentation we made, or late-night conversation we had about tests and testing.

But then things changed.

It was no longer acceptable to talk about valid tests.  Validity was not a property of the test.  Validity was “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (Messick, 1989, p. 13, emphasis in original).

This change, of course, did not happen overnight.  It was the culmination of decades of debate about the meaning of validity and validation.  As with any transition in thinking, the field tolerated references to “valid tests” for a while, long after the focus of validation had shifted to interpretation, inferences, and actions.  At some point, however, the use of the term became socially unacceptable.

Validity – New and Improved

In general, there are three primary reasons to change the definition of, or the way we talk about, a concept as central to a field as validity is to educational measurement and assessment.

  • There is new information or understanding that renders the existing definition obsolete.
  • There is a desire or need to clarify the existing definition so that the concept is better understood by theorists and practitioners within the field.
  • There is a desire or need to clarify the existing definition so that it results in better understanding and actions by the broader public.

With regard to validity and educational measurement, one can make a stronger case for the second reason (clarity within the field) than for the first (new information).  Beyond the appeal of a unified theory of validity within the field, however, the third reason, promoting better test use, was the driving force behind the repackaging of validity.  As Shepard (2016) clearly describes, “the reshaping of the definition of validity, beginning in the 1970s and continuing through to the 1985 and 1999 versions of the Standards, came about in the context of the civil rights movement because tests were being used to sort and segregate in harmful ways that could not be defended as valid.”

Shepard (2016) and Cronbach (1988) explain that validity and validation are concepts that must be understood and applied appropriately by the vast body of test users to their vast array of test uses.  Cronbach argues that validators of the appropriateness of tests meet their responsibility “through activities that clarify for a relevant community what a measurement means, and the limitations of each interpretation.  The key word is ‘clarify.’”

Validity – Clarified?

We are now in at least our sixth decade of clarifying the concept of validity.  Over the last 30 years, successive versions of the Joint Standards have been consistent in both the importance of the concept of validity and in its definition:

Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. (1985)

Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of the tests. Validity, therefore, is the most fundamental consideration in developing and evaluating tests. (1999)

Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests.  Validity is, therefore, the most fundamental consideration in developing and evaluating tests.  (2014)

The 1999 and 2014 Joint Standards further state,

“The process of validation involves accumulating relevant evidence to provide a sound scientific basis for proposed test score interpretations. It is the interpretations of test scores for proposed uses that are evaluated, not the test itself.”

As scientists, researchers, evaluators, or interested observers of educational measurement and test use, we must ask the question: has this definition of validity, largely unchanged for 30 years, clarified the concept of validity and resulted in better and more appropriate test use by policy makers and by the general public?  My answer to that question is simply to evoke all of the validity issues associated with the phrase Test-based Accountability.

So, if the current definition of validity has not been as successful as we would have liked with end users of assessment, has the focus on a unified theory of validity, validity arguments and “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” clarified the concepts of validity and validation within the field?

Well, not so much.  If this is what clarity looks like, then I shudder to think of how things would look with a lack of clarity.

  • Yes, there is consensus on the importance of collecting evidence to support proposed test score interpretations and uses. What constitutes evidence, how much of it is necessary, how it should be evaluated, and who is responsible for collecting, evaluating, and presenting that evidence is another story.
  • Technical concepts such as constructs, construct under-representation, and construct-irrelevant variance are crossed with commonly used words such as alignment and comparability to create tables of test evaluation criteria that have no meaning to the field or the general test user.
  • The shift away from external criteria and predictions has had the unintended consequence of increasing acceptance of the test score as the construct. There can be no give-and-take between theory and evidence from the assessment if the test score is the only evidence available or considered.
  • Although Cronbach wrote of the responsibility of validators, it is no longer clear who those validators are. If everyone is responsible for validation, nobody takes responsibility for validation.  If validation is an ongoing (i.e., unending) process, we can too easily put off some of those validation studies for another year, or two, or three until we have time to gather good data about how the test scores are being interpreted and used.
  • And even if there is consensus that validity is a unitary concept, the evidence needed to develop a validity argument is still being produced largely by specialists responsible for one distinct aspect of the testing process. Who within the field as it is currently being molded is being equipped with the knowledge and skills necessary for crafting and carrying out a comprehensive and coherent validity plan?

Back to the drawing board

As shown above, there was virtually no change in the definition of validity presented in the 1999 and 2014 Joint Standards and little change in the 30+ years since the 1985 version of the Joint Standards.  If our current definition of validity has not produced desirable outcomes, we have a professional responsibility to continue to review and refine it.

Just for fun, let’s return to our old friend:

Validity is the extent to which a test measures what it purports to measure.

What were the problems with that definition of validity?

A core criticism, of course, was that the definition did not address test use explicitly.  Outside of the thin air and ivy-covered walls (or perhaps some other green plant-covered walls) of academia, however, is that really a problem?  As the examples provided by Shepard (2016) demonstrate, with little help from the measurement community, the courts have certainly interpreted the definition of validity broadly enough to include test use since the 1970s.

Can we expect educators and policy makers to interpret “what it purports to measure” as “what you intend to measure” or “what you purport it measures”?  I would like to think that is the case, but if not, that is a relatively simple fix to the definition.

A second criticism is that the traditional definition is too narrow or too simple.  It does not reflect the complexity of the concept of validity and does not accurately reflect the concept of educational testing (i.e., that the test is being administered for a particular purpose).  That is a fair concern.  On the other hand, we can all agree that concepts of validity and validation are far too complex to be captured by any reasonable expansion of the definition.  What do we lose by trying to make a definition capture more than can possibly be captured in a simple definition?

There is also scientific elegance in the phrase “validity is the extent to which a test measures what it purports to measure.” In particular, the word purport seems uniquely suitable to describe the concept of validity or the process of validation.  There is an air of skepticism associated with the word purport.  Purport refers to claims made without proof or supporting evidence.  What better way to define validity and validation than to suggest that one must assemble evidence to prove that a test measures what it claims to measure – or what you intend to measure with it?

Would a reasonable group of experts and test users assigned the task of describing how to prove that a test measures what it purports to measure arrive at the same set of validity standards contained in the Joint Standards: clusters of standards related to establishing the intended uses and interpretations of the test, issues regarding samples and settings used in validation, and specific forms of evidence?

A Brave New World

Somebody somewhere is probably already working on the validity/validation chapter for the fifth edition of Educational Measurement (1951, 1971, 1989, 2006).  Unlike the Joint Standards, which by their very nature must lag a generation or two, the validity/validation chapter in Educational Measurement has the opportunity to move the field forward. I hope the author takes advantage of that opportunity.

We are on the cusp of a brave new world in education, assessment, and educational measurement.  It is a world defined by technology, personalization, and an emphasis on individual growth.  It is a world without standardization and fixed-form, census, large-scale assessment. It is a world in which educational measurement (and its principles of validity, reliability, and fairness) is threatened by statistical modeling. (Yes, there is a difference between measurement and modeling.)

For educational measurement to have any hope of survival, we need to begin by being able to clearly convey what we mean by validity and validation and why they are important.  For my money, I will start with a test that measures what it purports to measure.

Published by Charlie DePascale

Charlie DePascale is an educational consultant specializing in the area of large-scale educational assessment. When absolutely necessary, he is a psychometrician. The ideas expressed in these posts are his (at least at the time they were written), and are not intended to reflect the views of any organizations with which he is affiliated personally or professionally.
