Even Validity Has Unintended Consequences

While the educational measurement community has devoted an uncomfortable amount of energy to debating the proper role of consequences in validity, I would argue that we have paid far too little attention to the unintended negative consequences of the Validity Standards on the validation of K-12 large-scale state assessments and state assessment programs. We seem to have forgotten that the validation efforts for those state assessments are subject to the same laws that apply to their development and use, namely, Goodhart’s, Campbell’s, and of course, Murphy’s.

As the sources of evidence, or indicators, of validity have themselves become targets to be attained, they have become less useful as sources of evidence.

Sowing the Seeds of Discontent

Recent editions of the Standards for Educational and Psychological Testing have outlined “various sources of evidence that might be used in evaluating the validity of a proposed interpretation of test scores for a particular use” (AERA, APA, NCME, 2014, p. 13):

Evidence based on test content
Evidence based on response processes
Evidence based on internal structure
Evidence based on relations to other variables
- Evidence regarding relationships with conceptually related constructs
- Evidence regarding relationships with criteria
Evidence based on consequences of tests.

The Standards make it clear that “each type of evidence … is not required in all settings,” but “rather support is needed for each proposition that underlies a proposed test interpretation for a specified use.” In our case of tests administered as part of state assessment programs developed to meet federal requirements, the United States Department of Education (USED) has identified the first four sources of validity evidence as “critical elements” to support the technical quality of the test (USED, 2018, p. 29).

And therein, the seeds of unintended consequences are sown. The assignment of high stakes to the test validity standards in the form of states’ seeking to gain USED approval through the peer review process corrupts (in a nonjudgmental use of the word) the validation process and limits the usefulness of common sources of validity evidence.

Reaping What We Have Sown

For better or worse, state content standards are the starting point when gathering validity evidence for state tests. At least through 2009, the primary interpretation attached to test scores was student proficiency on the standards. For the purpose of this post, I will set aside discussion of an operational definition of proficiency, except to say that we know that proficiency does not refer to a determination of student mastery of the individual content standards.

Evidence Based on Test Content and Evidence Based on Response Processes

Validity evidence in these two categories can be summed up in one word: alignment. Alignment of the test to standards became the assessment buzzword of the 2000s. When the focus of the field and policymakers turned toward alignment, alternative methods for evaluating alignment emerged, including procedures developed by Norman Webb, Achieve, and Porter & Smithson’s Surveys of Enacted Curriculum (as described in CCSSO, 2017). Eventually, the lay use of the term in federal law and its technical definition fused into widespread use of the set of metrics proposed by Webb.

There is no question that the increased focus on alignment has resulted in better item development and test design. I can state with confidence, however, that the attention devoted to meeting alignment criteria also has diverted attention from bigger questions about the alignment of an assessment form to the standards, such as how much alignment is enough and how the proficiency that we are measuring and reporting is affected by changes in alignment.

We are quite adept at rationalizing the decisions made in the name of meeting alignment criteria (e.g., let’s align to the content domains rather than standards, let’s designate assessment targets, let’s collapse three years of forms). After all, we are the same people who came up with, “If they are going to teach to the test, we’ll give them a test worth teaching to.” And I will defend almost all of those decisions as being sound from a measurement and a policy perspective. The disproportionate amount of resources devoted to chasing alignment scores, however, has distracted us from determining how evidence of aligned items and tasks supports our validity argument for the intended interpretation and use of test scores; that is, proficiency.

Evidence Based on Internal Structure and Evidence Based on Relations to Other Variables

Moving beyond content, response processes, and alignment, the quest to provide required validity evidence based on internal structure and relations to other variables truly becomes surreal. We pull out all of the stops to demonstrate that the test we are using to measure the unidimensional construct of proficiency is sufficiently unidimensional, despite multidimensionality introduced by factors such as the attempt to place reading and writing on a single scale and the use of multiple (often new) item types, while also demonstrating that the test comprises distinct factors reflecting the content domains identified in the state standards, allowing states to describe student performance on selected subsets of items.

Accomplishing this task requires a great deal of psychometric creativity (now there’s an oxymoron for you) and just enough motivated perception on the part of those reviewing the evidence.

Of course, we are able to capitalize on the fact that, for any number of plausible reasons, there is a strong positive correlation between virtually all cognitive constructs we measure within and across grades K-12. When you combine the little variation that does exist in student performance across content areas with the noise in our measures and the noise in non-test measures such as teacher judgments and student grades, a good psychometrician can find sufficient evidence to support pretty much any desired reporting structure.
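
One way to see how measurement noise creates this kind of flexibility is the classical attenuation formula, under which the correlation between two observed scores equals the correlation between the underlying true scores multiplied by the square root of the product of the two reliabilities. The sketch below is illustrative only: the function names are made up for this example, and the values (a true correlation of 0.90 between two content domains and subscore reliabilities of 0.70) are hypothetical, not drawn from any actual state assessment.

```python
import math


def observed_correlation(true_r: float, rel_x: float, rel_y: float) -> float:
    """Classical attenuation: expected correlation between two observed
    scores, given the true-score correlation and each score's reliability."""
    return true_r * math.sqrt(rel_x * rel_y)


def disattenuated_correlation(obs_r: float, rel_x: float, rel_y: float) -> float:
    """Inverse of the formula above: estimate the true-score correlation
    from an observed correlation and the two reliabilities."""
    return obs_r / math.sqrt(rel_x * rel_y)


# Hypothetical values, chosen only to illustrate the argument.
true_r = 0.90                # two content domains that are nearly the same construct
subscore_reliability = 0.70  # short subscores are typically noisy

obs_r = observed_correlation(true_r, subscore_reliability, subscore_reliability)
recovered_r = disattenuated_correlation(obs_r, subscore_reliability, subscore_reliability)

print(f"observed subscore correlation: {obs_r:.2f}")       # 0.63
print(f"disattenuated correlation:     {recovered_r:.2f}")  # 0.90
```

Under these assumed values, the same data can be read two ways: the observed correlation of 0.63 might be offered as support for reporting distinct domain subscores, while the disattenuated correlation of 0.90 might be offered as support for a single proficiency scale.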

If the discussion in this section came across as just a tad too cynical, then I have accomplished my goal. My bottom line is that evidence compiled in these two categories adds little, if any, value to the primary interpretation and use of state assessment test scores as a measure of student proficiency. That is not to suggest that perfunctory analyses examining dimensionality, relations to other variables, etc. should not be carried out as part of the testing process to ensure that the test is functioning as expected. Rather, I am suggesting that as we move forward into a new era of state assessment, we need to rethink the types of evidence needed to support the inferences about student performance that are necessary for the intended uses of state assessments.

The Road Ahead Will Be Long

In this post I discussed the unintended consequences that resulted from applying the Validity Standards to large-scale K-12 state tests in a relatively high-stakes situation. Unfortunately, the test forms, scores, interpretations/inferences, and test use addressed in this post (i.e., measuring student proficiency on state standards) should have been relatively easy to validate. As I will address in my next two posts, the validation task only becomes more challenging as we expand the interpretation and inferences associated with test scores, measure more complex skills with more complicated test designs, and consider the various ways that test scores from state assessment programs are used.

Published by Charlie DePascale