Whose Job Is It, Anyway?

Whenever I come across the story “Whose Job Is It, Anyway?”, I cannot help but think of validity and K-12 large-scale assessment. Who among us has not sat through countless TAC meetings where the mere mention of the lack of validity evidence in the technical report results in blank stares and hand-wringing, or, in recent years, listened to a state describe the scavenger hunt that is its effort to assemble documentation for USED Peer Review? We all agree wholeheartedly that validity is critical, but we are all busy people, you know, trying to manage assessment programs and improve student learning. Besides, whose job is it, anyway?

I have worn many hats since wandering into the large-scale state assessment arena in 1989: data analyst and supervisor for a testing company, principal (principled?) psychometrician and member of the assessment management team at a state department of education, project manager for a multi-state assessment consortium, assessment policy consultant, technical advisor, and for good measure, president of two regional educational research organizations. Validity was a prime concern in each of those roles, but I never regarded documenting validity evidence as primarily my responsibility in any of them.

With the hindsight that comes from experience and detachment, I realize now that it was my responsibility in each of those positions to support part of the validity argument – but only a small, well-defined part of the argument. Supporting a validity argument for K-12 state tests is simply beyond the scope of any one individual or organization, particularly a testing company or a state department of education. Testing companies, states, and educational researchers, however, each play a pivotal role in supporting that validity argument.

I am convinced that supporting a validity argument for K-12 state tests is a shared responsibility; as they say, it takes a village.

It Takes A Village

State assessment programs and the tests they comprise are complex systems with many interdependent parts. All such systems require a) an overarching plan and specifications for how the parts will fit and work together toward the overall goal and b) detailed specifications for each of the component parts within the system. Note also that the state assessment program itself is likely only one small component within a larger education reform system. When we begin to craft a validity argument, an evaluation plan, or a theory of action, as appropriate, for the overall system and components (e.g., tests) within the system, we must be mindful of the role that each component plays within the system.

With that caveat in mind, I offer the following simplified framework with regard to responsibility for validity and K-12 state tests.

From Theory to Actions

An essential, but too often overlooked, first step in developing a validity argument for a state test is to move from why the state is establishing an assessment program (theory) to determining how the desired interpretations and inferences regarding student performance will be realized through the test (actions). This should not be confused with the process of establishing a theory of action for an assessment program (or accountability system), which my longtime friends and colleagues at the Center for Assessment have described in great detail. Rather, this can be considered as the initial step in a principled assessment design process.

Responsibility for both developing the theory of action and creating detailed specifications for the test rests with the state. Although state assessment staff may be included in both activities, there are likely to be, and should be, differences in the key participants in the two activities. The theory of action requires the input of state policymakers. The development of test specifications requires the involvement of content specialists – ideally, the same people who have been involved in the development and/or adoption of the state content standards. One could argue that this process should be included as part of the development of content standards, but we will save that discussion for another day.

Although the state may seek the assistance of an assessment contractor to complete this process, the test specifications described here should be completed prior to the state selecting an off-the-shelf test or selecting a contractor to assist in the development of a custom test. It’s nice to know what you are building before hiring the builder.

A well-designed set of test specifications will serve as the foundation for the state’s validity argument.

The Test, The Whole Test, and Nothing but the Test

Producing and documenting validity evidence related to the technical quality of the test, in general and in relation to the test specifications, is the primary responsibility of the state’s assessment contractor. This evidence includes all of the information that we expect to see in a traditional state assessment technical manual related to test design, administration, scoring, psychometric analyses, and the production of score reports. Documentation of analyses establishing the reliability of the test and activities such as alignment studies and standard setting would also fall under the purview of the assessment contractor.

The state’s role in collecting this type of evidence will be dependent upon its capacity and the level of its involvement in the development and administration of its tests.

The influence of educational researchers in this phase of the process primarily will be through the options made available to testing companies as they are making choices and/or recommendations to the state related to test design, psychometric models, and equating procedures.

Assessment Toolbox

Most states are not in the business of conducting basic, or even applied, research in educational measurement. A small number of states may be able to fund such research or form research partnerships with institutions of higher education, but for the most part, states are users of research. It is critical, therefore, that states have a toolbox of research-based tools with which to build their assessment systems. Rather than producing validity evidence from scratch, a state should be able to show that it has selected tools that have been validated to support interpretations and uses consistent with its needs in key areas such as, but not limited to:

Accommodations and Universal Supports
Item and Test Security
Item Types
Derived and Composite Scores
Score Reporting

Although there may be some testing companies that can conduct this type of research independently, it is more likely that this research will be conducted by research centers at institutions of higher education and similar organizations.

Research and professional associations can play a critical role in facilitating conversations between states, testing companies, and researchers so that all parties understand the needs of the field, are aware of the research that has been conducted, and can anticipate the research that must be conducted. Those professional associations must also work with policymakers and funding agencies to ensure that relevant research is disseminated, and that needed research is funded and conducted.

Intended Uses of Test Scores

After administering a test, many states desire to use those test scores for some purpose in addition to providing information to stakeholders (e.g., policymakers, educators, parents, students) regarding student achievement of the state’s content standards in English language arts, mathematics, science, or some other content area. Several common uses over the years have included:

Curriculum and Program Evaluation
Informing Instruction and Modeling Assessment Best Practices
Course Placement
Promotion and High School Graduation
Teacher Compensation Programs – Incentives and Bonuses

If a state intends to use test scores for such purposes, it is incumbent upon the state to determine and document that the interpretations and inferences made based on the test scores and any ancillary resources (e.g., test items, scoring tools, interpretation guides) are appropriate for that particular use. Unless the state’s assessment contractor is directly involved in supporting one of these uses, however, it generally would not be the contractor’s responsibility to provide validity evidence to support that use. Further, I would not expect to see this type of use and evidence addressed directly in technical documentation prepared by the assessment contractor.

I have limited discussion in this section to uses and decisions directly based on test scores. In this post, I intentionally have not addressed programs such as school accountability systems and teacher evaluation systems. Such systems may use test scores as one piece of input data (perhaps even the most significant piece), but they differ from the uses listed above in that the interpretations that they are making are at least one step removed from the interpretation of the test score as an indicator of student proficiency. Those systems, obviously, require their own validity argument and, in many cases, their own evaluation plan, which will include documented evidence to support the use of test scores in systems designed to provide information on school or teaching effectiveness. In my next post, I will address these systems and the problems associated with conflating the validity argument and evidence for a test with those for an accountability system.

Also, I have not addressed unintended uses of test scores in this post. Of course, testing companies should inform states and states should caution users regarding likely unintended uses and interpretations of test scores.

Each Must Play A Part

In the simplified framework presented in this post, I have assigned primary responsibility for validity evidence related to various aspects of large-scale K-12 state tests and test scores to the state, the testing company, or educational researchers. I don’t expect the responsibilities assigned to the testing company and the state will surprise many people. The separation of responsibilities between testing company and state is well-established, even if not explicitly acknowledged.

The responsibility assigned to the measurement community and educational researchers may be the piece of the framework that seems new and perhaps misplaced to some readers. However, if we are to make solid, significant improvements in large-scale assessment, it is time for researchers to step up. Educational researchers played a significant role in pushing the field toward performance assessment in the 1990s, but the measurement community ran and hid when technical challenges were raised.

There has been good support and progress with alternate assessments and English language proficiency assessments, but we are on the cusp of a new era in large-scale assessment. The active participation of researchers and the measurement community is needed to determine how to support valid interpretations and uses of curriculum-embedded performance assessments, through-course assessments, and personalized continuous assessment. We are beyond the point where simply pointing out what doesn’t work is sufficient or acceptable.