Ninety-five Theses on State Testing

“Out of love for the truth and from desire to elucidate it, the Reverend Father Martin Luther, Master of Arts and Sacred Theology, and ordinary lecturer therein at Wittenberg, intends to defend the following statements and to dispute on them in that place. Therefore he asks that those who cannot be present and dispute with him orally shall do so in their absence by letter. In the name of our Lord Jesus Christ, Amen.”

Although lacking the significance, depth, coherence, and virtually all other qualities of Luther’s Ninety-five Theses, the opinions expressed to date through the 95 posts in this blog have come from love for the truth, from the desire to elucidate, and with the hope of stimulating thought and spurring debate. The theses, or statements, below are a compilation of the key thoughts on state testing expressed throughout the blog.

Student proficiency exists without state tests.
State testing is fundamentally a data collection activity.
The primary purpose of state tests should be to provide state policymakers with comparable data about student achievement of the state’s content standards from districts and schools across the state for a specified use.
Secondary purposes for administering state tests may complement but must not interfere with the primary purpose.
A primary responsibility of those managing state assessment programs is to establish and ensure favorable conditions for the administration of state tests to support the appropriate interpretation and use of state test results.
State accountability programs, state academic support systems, and other systems that might use state test results should not be confounded with the state test and the state assessment program. Each of those programs requires its own validation evidence and an evaluation of its utility.

Actionable Information for Teachers, Schools, and Districts

Ideally, state tests should not provide teachers with information about their students’ achievement that they do not already know.
State tests should be most useful to districts and schools when the state has introduced new content or achievement standards.
Planned obsolescence of the utility of state tests for districts and schools should be built into the design of every new state testing program.
Districts and schools should expect to see diminishing returns from state tests over time.
If districts and schools are relying on state tests for important information about student performance five years after new content standards or a new assessment program were introduced, they are doing something wrong or there are greater problems to solve.
There is a need for more external validation of state test results.
Discrepancies between a school’s own estimates of its students’ performance and state test results are likely a signal of a bigger problem than simply low test results.
A goal of every state’s education reform initiative should be that there is no discrepancy between school-based aggregate estimates of student proficiency and results on the state test.
Teacher judgments of student proficiency (not predictions of student performance on the test) should be collected routinely as part of the state testing program.
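
As a rough illustration of the comparison described in the last few theses, here is a minimal sketch in Python. It assumes a school has one teacher judgment and one state test result per student, each reduced to a proficient/not proficient flag. The record layout and function names are hypothetical, made up for this example, and not drawn from any state's reporting system.

```python
from dataclasses import dataclass


@dataclass
class StudentRecord:
    # Hypothetical fields, invented for this sketch.
    teacher_judged_proficient: bool  # teacher's judgment of the student's proficiency
    test_proficient: bool            # state test result at or above the Proficient cut


def percent_true(flags):
    """Percentage of True values in an iterable of booleans."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags)


def school_discrepancy(records):
    """Compare school-based and test-based aggregate estimates of proficiency."""
    teacher_pct = percent_true(r.teacher_judged_proficient for r in records)
    test_pct = percent_true(r.test_proficient for r in records)
    # Student-level agreement can be low even when the two aggregates match.
    agreement_pct = percent_true(
        r.teacher_judged_proficient == r.test_proficient for r in records
    )
    return {
        "teacher_pct": teacher_pct,
        "test_pct": test_pct,
        "gap": teacher_pct - test_pct,
        "student_agreement_pct": agreement_pct,
    }


# Four made-up students: (teacher judgment, state test result).
example = [
    StudentRecord(True, True),
    StudentRecord(True, False),
    StudentRecord(False, False),
    StudentRecord(True, True),
]
summary = school_discrepancy(example)
```

The sketch reports both the aggregate gap and the student-level agreement because the theses above speak to aggregate estimates; a school could show no aggregate gap while still disagreeing with the test about many individual students.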

State Content and Achievement Standards

State content standards are the foundation for the design and development of state tests.
The state must be explicit about which, if any, state content standards will not be assessed fully, or assessed at all, on the state test.
State tests are one instantiation of the state content standards, but must not be the sole or primary method of clarifying the meaning of the standards, or the expectations of the authors of the standards, for local educators and other stakeholders.
The authors of state content standards must include specifications for instruction and assessment along with a variety of instructional and assessment examples appropriate for different purposes and stakeholders.
State tests will play too great a role in defining state standards if a) no other assessment examples are provided through the standards themselves or b) the stakes associated with the use of state test scores are disproportionately high for particular schools or students.
Achievement standards should be developed in conjunction with the process of developing the content standards or during the initial validation of the content standards.
Achievement standards should make explicit the connection between achievement in the classroom and achievement on state tests.
The quality of student performance required by a particular content standard is context dependent. There will be different expectations for an essay produced as a single draft on an on-demand test, a response to a constructed-response item on an on-demand test, a report and presentation prepared for an extended in-school performance task, or a written response to a daily homework question.
Items developed for state tests will be subject to the constraints of state tests and, therefore, will serve as inadequate examples of the type of student performance expected in the classroom to meet the expectations of a particular standard or set of standards.
Examples of proficient performance should include performance on an entire test and portfolios of student work. Single items developed for state tests or single tasks developed for the classroom will serve as an insufficient example of the type of student performance expected of a student who is proficient on the state standards.

Standard Setting

Standard Setting for state tests is more procedurally sound or valid but, arguably, more fundamentally flawed now than when that critical label was used to describe the efforts to establish performance standards for NAEP tests in the 1990s.
In one sense, the term “Standard Setting” is an anachronism, a relic from a previous era in which tests used as state tests were not aligned to a particular set of content standards and the panels that were convened after testing were asked to establish achievement standards and determine achievement level thresholds on a test.
In theory, the process that we currently refer to as “Standard Setting” asks panelists to determine achievement level thresholds on a test that best match achievement standards that have already been established and specified through achievement level descriptors (ALD).
In practice, the confounding of content-based and policy-based concerns in the standard setting panelists’ recommendations makes it difficult to develop content-based interpretations of student performance based on the achievement level descriptors.
The confounding of content-based and policy-based concerns in the standard setting panelists’ recommendations makes it difficult for state policymakers and other stakeholders to unpack the relative influence of content-related and other factors on the standard setting panelists’ recommendations.
External validation of the achievement level descriptors prior to convening the standard setting panels would improve the stability and validity of the overall process.
Current practice in standard setting, from the development of the achievement level descriptors through to the choice of standard setting methodology, favors item-based approaches to standard setting and consideration of student performance one item at a time.
Historically, item-based standard setting methods have produced higher achievement level cut scores (i.e., higher standards) than test-based standard setting methods applied to the same test.
More research is needed to understand the impact of the interaction of choice of standard setting methodology, the design of the state test, and the achievement standards.
More external, empirical validation is needed at all stages of the standard setting process.

Test and Item Development and Scoring

The design and development of state tests should be informed by the achievement standards as well as the content standards.
The decision to develop items to fill in the gaps between end-of-grade or end-of-grade span standards must be consistent with the achievement standards and the way the state defines proficiency.
As the concept of alignment continues to evolve and moves away from mapping individual items to a single content standard, item and test development and review procedures must be adapted.
Traditional assumptions regarding bias and sensitivity must be revisited and renewed.
Methods and procedures for preventing, detecting, and understanding the real impacts of issues related to item and test bias must be updated, particularly for smaller subgroups of students.
Historical or traditional item analyses must be evaluated for their appropriateness and the continued relevance of their underlying assumptions.
Without abandoning item analyses, much greater emphasis must be placed on analyses focused on the test as a whole.
The development of new item types cannot continue to take place in operational test environments and on the same schedule as operational test development. Dedicated research and development on new item types must be supported.
The use of psychometrics to inform test development has increased significantly over the past decade and must continue to increase, but must also be accompanied by breaking down the silos between psychometric and content specialists involved in test development.
As computer-adaptive testing (CAT) becomes more prevalent, additional research is necessary to evaluate the assumptions made regarding comparability when developing item selection algorithms.
As matrix-sampling of items across students becomes more prevalent, assumptions that have driven test development procedures for the past quarter century will have to be revisited.
The evaluation of scoring quality (automated and human) must focus on the impact on total test score as well as individual item scores.
The design and development of automated scoring systems must focus on what was desired from scoring under ideal conditions, not solely on how scoring systems have developed under the constraints of large-scale state testing.
All scoring systems must continue to include safeguards to account for atypical responses that provide the information requested by the item.
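
To make the point about evaluating scoring at the total-score level concrete, here is a small sketch in Python. It assumes two sets of scores for the same responses (say, human and automated), stored as one list of item scores per student. The function names and the numbers are invented for illustration only.

```python
def item_exact_agreement(human, machine):
    """Share of (student, item) score pairs on which the two sources agree exactly."""
    pairs = [
        (h, m)
        for human_scores, machine_scores in zip(human, machine)
        for h, m in zip(human_scores, machine_scores)
    ]
    return sum(h == m for h, m in pairs) / len(pairs)


def total_score_differences(human, machine):
    """Machine total minus human total, one value per student."""
    return [
        sum(machine_scores) - sum(human_scores)
        for human_scores, machine_scores in zip(human, machine)
    ]


# Made-up scores for two students on three constructed-response items.
human_scores = [[2, 3, 1], [4, 2, 2]]
machine_scores = [[3, 2, 1], [4, 3, 3]]

agreement = item_exact_agreement(human_scores, machine_scores)
differences = total_score_differences(human_scores, machine_scores)
```

In this made-up data the two sources agree exactly on only a third of the item scores. For the first student the disagreements cancel and the total is unchanged; for the second they accumulate into a two-point difference. Item-level agreement alone would not distinguish those two cases.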

Technical Quality of State Tests and Testing Programs

More applied research is needed on the concept of linking, in general, and equating, in particular, including research on when tests should not be linked.
The basic assumptions about when and why tests should and should not be linked are not understood well enough by those who need to make decisions about linking tests.
More research is needed on the effectiveness of equating procedures when there is differential growth across item types, components, standards, content domains or strands, or across subgroups of students, particularly small subgroups of students.
More research is needed to provide evidence that test-based equating procedures perform as expected for anchor sets of items selected from an item bank that have never been administered together.
More studies are needed to determine the effects of combining item parameters generated from different samples and/or in different years for items in an anchor set.
When the difficulty of the test is significantly different from the ability (i.e., achievement) of the population of students being tested, one cannot assume that a random sample of students is the most appropriate sample on which to generate item parameters.
More research is needed on alternative methods to combine scores from different components of a state test to produce a total score.
More thought and discussion are needed on whether different components of a state test should be combined into a single overall score.
More discussion is needed on the concept of parameter drift, its definition, its causes, its relevance in the use of IRT to create a measurement scale v. the use of IRT to score students against an established benchmark, and the appropriate actions when it is discovered.
More discussion is needed on the appropriate treatment in future years of achievement level thresholds established during standard setting, in particular, whether they are to be treated as fixed, true points with no error.
As advances in technology make more data available, decisions will have to be made on what data should be used to improve the estimate of student ability and what data should be used to assist in the interpretation of student performance.
There is no acceptable rationale for the desired cut score (e.g., Proficient, Meeting Expectations, Mastery) on a standards-based state test to be set at 50% – 60% of the possible raw score points. More research is necessary on what is being measured on such tests.
Research is needed to determine the psychometric and psychological impact of a state test with a difficulty significantly different than the ability of students in the state.

Reporting Results of State Tests

Results from state tests should be reported as information useful to the stakeholders for which the reports are designed and not simply as data.
Historically, when commercial norm-referenced tests were used as state tests, results were reported in terms of metrics intended to provide useful information such as percentile ranks, stanines, and grade equivalent scores. [How much useful information they actually conveyed is open to debate, of course.]
Underlying scale scores, usually on a vertical scale, were used to derive scores reported on norm-referenced tests, but conveyed little information on their own and were rarely reported or used by policymakers or educators.
With the transition to custom state assessments, scale scores took on a greater significance in reporting, with reporting scales often centered on the state mean in the initial year of testing.
The performance level (aka achievement level) or passing score became the primary, and often only, information-based metric used to report state test results on custom standards-based tests.
If we continue to draw standard error bars around observed scores on Individual Student Reports, the field needs to agree on language that accurately describes what the bar represents.
Scale scores, error bars, mean scale scores, standard deviations, effect sizes, etc. still have no inherent meaning that is useful to policymakers and educators or supports content-based interpretations.
Although aggregate school scores are a primary focus of state testing, the vast majority of documentation and evidence of the technical quality of state tests is focused on student-level scores.
After fifteen years of testing all students at grades 3 through 8 and once in high school, longitudinal data remains severely underutilized in the computation and reporting of results from state tests, both at the individual student and aggregate levels.
We need to refocus our efforts on reporting state test results in terms of metrics that provide useful information to policymakers, educators, and the public.
Advances in the reporting of state test results must acknowledge the limits to the information that can be gained from state tests and not promote the reporting and use of scores that cannot be supported.
Uncertainty is not an indication of a flaw or weakness of a particular test or of state testing, in general.
All reporting of state test results must focus more on conveying the degree of certainty and uncertainty in state test results.
Content-based anchoring of scale scores and achievement levels (e.g., as used on NAEP and with interim assessments) can be a useful tool, but must be better conveyed in terms of the probabilities that underlie the information.
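
One way to think about conveying certainty and uncertainty is sketched below in Python. It is a deliberate simplification, not a recommended reporting method: it assumes measurement error is normally distributed around the observed score with a known standard error of measurement (SEM), and it ignores any prior information about the student. The scale values and function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def score_band(observed, sem, width=1.0):
    """Error bar of +/- width SEMs around an observed scale score."""
    return (observed - width * sem, observed + width * sem)


def prob_at_or_above_cut(observed, sem, cut):
    """Approximate chance that the true score is at or above the cut score.

    Assumes normal measurement error with a known SEM and no prior
    information about the student (a simplification).
    """
    return normal_cdf((observed - cut) / sem)


# Hypothetical scale: Proficient cut at 500, SEM of 10 scale score points.
band = score_band(510, 10)
at_cut = prob_at_or_above_cut(500, 10, 500)
one_sem_above = prob_at_or_above_cut(510, 10, 500)
```

Under these assumptions, a student scoring exactly at the cut is as likely as not to have a true score below it, and a student one SEM above the cut still has roughly a 16% chance of a true score below it. Whether a report should say that in probabilities, in words, or with a bar is exactly the kind of language the field has yet to agree on.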

Alternate Assessment

There is no single population of students eligible for the alternate assessment on which it makes sense to base item parameters and similar statistics.
The variance in the achievement of the 1% of students eligible for the alternate assessment may be greater than the variance of the remaining 99% of students.
There is no single achievement standard to which all students eligible for the alternate assessment within a state should be held.
An alternate assessment that merely establishes a lower achievement standard not met by the vast majority of students taking the test year-after-year serves no purpose.
Ideally, the format of the alternate assessment would be consistent with the individualized instruction students are receiving, making curriculum-embedded portfolios a strong choice.
Barriers to the implementation of alternate assessment portfolios (beyond those related to federal requirements) should be approached primarily from an instructional perspective and secondarily from assessment and accountability perspectives.
As policy developed, “alternate assessment” almost exclusively has referred to Alternate Assessments aligned with Alternate Academic Achievement Standards (AA-AAAS). There are still students, however, who are unable to access the general state test even with accommodations who will benefit from the opportunity to participate in an alternate assessment of grade-level academic achievement standards.
The imposition of a hard 1% cap on participation in the alternate assessment or on the use of alternate assessment results for accountability is a misguided policy decision.

Accessibility to State Tests

Increased accessibility to state tests, particularly for students with disabilities, has been one of the most significant advances in state testing since the 1990s.
The assumption that all students with disabilities, if they receive appropriate instruction, should be able to attain grade level standards and the policies based on that assumption require additional thought and discussion, ideally leading to additional research and evidence to inform future policy.
States should not be required to produce additional validity evidence to support the use of accommodations that have been validated for use in similar testing situations.
States should devote additional resources to understanding the use of accommodations, including evaluating the implementation of accommodations on state tests and examining the impact of the use (and non-use) of accommodations on student performance.
Technological advances have made offering tests and test accommodations in languages other than English more practicable, but states still have limited capability in this area.
Greater clarity is needed on policies related to offering tests in languages other than English.
Greater clarity is needed on the expected relationship between grade-level English Language Proficiency, performance on the grade-level state test, and student performance in the classroom.
Research on the comparability of alternative forms of assessment to increase accessibility for all students has stagnated and must be refreshed and renewed.

Maintaining Focus and Flexibility as we Move Forward

Current state testing practices emerged over time to meet specific needs. As needs change, the design of state tests and testing programs must continue to adapt to meet those needs.
State tests are administered to meet a particular need. At one point in time it was determined that state tests were the most efficient and effective method to meet that need. That may not always be the case.
Moving forward, we must keep our focus on the purpose of state testing and on finding the most effective and efficient methods to fulfill the purpose of state testing and not simply on building better state tests.
