Nagging Issues That Can Affect the Utility of Assessments
Starting with the direct assessment of writing, the inclusion of items requiring students to produce written responses may be the most significant development in large-scale assessment in the past three decades. We now stand on the cusp of a new wave of advances with automated scoring supporting locally administered curriculum-embedded performance tasks that measure 21st century skills. As we move forward, however, we need to acknowledge and reflect on key aspects of assessing student writing that we have not quite figured out.
In this post, I offer examples of unresolved issues at four critical points in the assessment process. Most importantly, each of these issues affects the ability of teachers and students to interpret and use the results of assessments to support instruction and improve student learning.
- Designing assessments that mirror authentic writing
- Understanding the accuracy and precision of scoring
- Assumptions in the scaling and calibration of writing items
- Equating and linking assessments containing writing tasks
Designing Assessments that Mirror Authentic Writing
One of the strongest arguments for moving from traditional selected-response writing tests to direct writing assessment was the perceived importance of authentic assessment: measuring students’ ability to actually produce effective writing rather than simply to recognize the elements of good writing. A similar argument about authentic assessment was made regarding the need to shift as quickly as possible from paper-based to computer-based assessment of writing: students (and everyone else) write on computers, not paper.
Despite our best efforts, however, we must acknowledge that the writing students produce for on-demand large-scale assessments in response to prompts or stimuli that they have never seen is far from authentic writing. It is simply a good sample of the writing that students are able to produce under a particular set of conditions and it has been found to be a fairly solid indicator of students’ general writing achievement.
At the very least, we must clearly communicate that there should be a difference between the quality of writing considered “proficient” on an on-demand large-scale assessment and the product that would be considered proficient in the classroom after the full application of the writing process, including the use of resources not available in an on-demand assessment setting.
Although it may be beneficial to use a single criterion-based rubric to classify student writing on the assessment and in the classroom, there must be consideration of the context in which the writing is produced. For example, if a scoring rubric calls for “effective selection and explanation of evidence and/or details,” the presentation of evidence and details considered effective for a student response produced on demand in a setting where the student may not have access to primary sources should be different than the presentation of evidence and details expected in response to the same task administered in a classroom setting where the student has access to other materials as well as the opportunity to review and revise their work. Although the criterion-based rubric may remain constant, the quality of work required to meet certain criteria will vary based on context.
It is a disservice to teachers and students to perpetuate the notion that the exemplars reflecting writing at various points along a scoring rubric are context free. The same considerations of context apply within the classroom setting as teachers evaluate student responses produced on a classroom assessment, for a two-day homework assignment, or for an extended research project.
Understanding the Accuracy and Precision of Scoring
There is little doubt that we are now moving rapidly from human to automated scoring of students’ written responses. When moving to a new and improved system, we must consider that there may be reasons why things are done the way they are, and those reasons may not be obvious to us 10, 15, or 20 years down the line. For example, I am concerned that in building automated scoring models we might forget the thinking that shaped current scoring processes. Although one might assume that the desired outcome when scoring essays is exact agreement between two raters of the same piece of writing, that might not always be true.
When holistic scoring of student essays gained popularity in large-scale assessment in the late 1980s and early 1990s, it was common practice for each student essay to be scored by two raters. If the rubric contained six score categories, each rater assigned the essay a score of 1-6. The student score was the sum of the two raters’ scores (i.e., 2-12 for a 6-point rubric). Using an adjacent agreement scoring model, students received an even-numbered score if the two raters assigned the essay the same score (1-1, 2-2, 3-3, 4-4, 5-5, 6-6) and an odd-numbered score if the raters assigned adjacent scores (i.e., 1-2, 2-3, 4-3, 4-5, 6-5).
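The arithmetic of the adjacent agreement model can be sketched in a few lines of Python. This is a minimal illustration, not any program's actual scoring code: the function name `combine_rater_scores` is hypothetical, and because the handling of ratings that differ by more than one point is not described here, the sketch simply returns `None` for those cases.

```python
from typing import Optional


def combine_rater_scores(first: int, second: int, max_score: int = 6) -> Optional[int]:
    """Sum two holistic ratings under an adjacent agreement rule.

    Identical ratings yield an even total and adjacent ratings yield an
    odd total. Ratings more than one point apart return None, because
    how such discrepancies are resolved is an assumption left open here.
    """
    for rating in (first, second):
        if not 1 <= rating <= max_score:
            raise ValueError(f"rating {rating} is outside 1-{max_score}")
    if abs(first - second) > 1:
        return None  # non-adjacent ratings: resolution rule not specified
    return first + second


print(combine_rater_scores(3, 3))  # exact agreement, even total
print(combine_rater_scores(4, 3))  # adjacent ratings, odd total
print(combine_rater_scores(1, 4))  # non-adjacent ratings
```

On a 6-point rubric the totals run from 2 to 12, with the odd totals reserved for papers on which the two raters landed in neighboring categories.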
A major focus of blind double-scoring has always been reliability and inter-rater agreement. Therefore, when other approaches to increasing reliability became more practical, such as administering multiple writing tasks per student and having different raters score each item on a student’s test, the practice of double scoring all student responses became less common. As the field moves to automated scoring, all student scores will be based on single scoring.
Monitoring inter-rater agreement, however, was not the only reason for double-scoring, and expedience was not the only reason for accepting adjacent scores. Analyses of scored student essays conducted over several years and assessment programs have shown a discernible difference between the quality of writing in student responses assigned an odd-numbered score (e.g., 4-3, 3-4) and responses with the next lower and higher even scores (3-3 and 4-4). Responses at the borderline between two broad score categories were different from responses in the middle of either category. Like achievement levels, each score category contains a continuum of student performance. At least at the aggregate level, allowing adjacent scores from two raters added to the accuracy and precision of scoring.
This approach to human scoring could be quite useful in building automated models. It can be applied directly by training the model on student responses scored by two raters or indirectly by training the model on single-scored responses but assigning points to students using a decision rule that accounts for responses that the model identifies as borderline. For example, clear 1’s receive 2 points, clear 2’s receive 4 points, but borderline 1-2 papers receive 3 points. Such an approach might not only maintain the advantages of the human scoring model but extend them to the individual level as well.
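One way such a decision rule might look is sketched below. It rests on assumptions that go beyond this post: that the automated model produces a continuous score estimate on the rubric scale (for example, 1.4 on a 1-6 rubric), and that a response counts as borderline when its estimate falls within a chosen `margin` of the midpoint between two categories. The function name and the default margin of 0.25 are hypothetical.

```python
import math


def points_from_model_estimate(estimate: float, margin: float = 0.25,
                               max_score: int = 6) -> int:
    """Map a continuous model estimate to the 2..(2 * max_score) point scale.

    Clear papers receive an even total (twice the category) and borderline
    papers receive the odd total between two categories. Both the
    continuous estimate and the margin are illustrative assumptions.
    """
    if not 1 <= estimate <= max_score:
        raise ValueError(f"estimate {estimate} is outside 1-{max_score}")
    lower = math.floor(estimate)
    if lower == max_score:
        return 2 * max_score  # estimate sits exactly at the top category
    boundary = lower + 0.5  # midpoint between two adjacent categories
    if abs(estimate - boundary) <= margin:
        return 2 * lower + 1  # borderline paper: odd total
    if estimate < boundary:
        return 2 * lower  # clearly in the lower category
    return 2 * (lower + 1)  # clearly in the upper category


print(points_from_model_estimate(1.1))  # clear 1
print(points_from_model_estimate(1.5))  # borderline 1-2
print(points_from_model_estimate(1.9))  # clear 2
```

The margin controls how many papers are treated as borderline. Choosing it would presumably require comparing the resulting score distribution with the one produced by double-scored human ratings.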
Assumptions in the Scaling and Calibration of Writing Items
Whether within a standalone writing assessment or an English language arts assessment that combines reading and writing, the standard operating practice has been to apply a unidimensional IRT model that treats each writing task as an individual item. The problem with this approach is that any differences in performance across tasks are attributed wholly to differences in item difficulty. There is no room within a unidimensional model for the concept that a student of a particular ability level may be a better writer in one genre than another.
For example, if students at a particular grade level are less effective at responding to prompts that require them to generate persuasive essays than they are at responding to prompts that require them to produce narratives or informative/explanatory essays, then the persuasive prompts are classified as more difficult. In practice, this plays out as students and teachers seeing differences in ratings of student writing against the criterion-based writing rubric and no differences in the scaled scores or achievement level results.
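A toy numerical illustration may make the effect concrete. The sketch below is not an IRT calibration. It is a deliberately simplified additive model in which a prompt's "difficulty" is just how far the cohort's average on that genre falls below its overall average. The cohort scores and the function name are invented for illustration.

```python
from statistics import mean

# Hypothetical rubric scores (1-6) for a small cohort, grouped by genre.
cohort_scores = {
    "narrative": [4, 5, 3, 4, 4],
    "persuasive": [3, 4, 2, 3, 3],
}

overall_mean = mean(s for scores in cohort_scores.values() for s in scores)

# Toy "difficulty": how far below the overall mean the cohort scores on a genre.
difficulty = {
    genre: overall_mean - mean(scores)
    for genre, scores in cohort_scores.items()
}


def adjusted_score(genre: str, rubric_score: int) -> float:
    """Rubric score after crediting the student for the prompt's difficulty."""
    return rubric_score + difficulty[genre]


# A student rated 4 on the narrative prompt and 3 on the persuasive prompt
# sees two different rubric scores but ends up with the same adjusted score.
print(adjusted_score("narrative", 4))
print(adjusted_score("persuasive", 3))
```

Because the whole cohort is weaker on persuasive prompts, the one-point gap in rubric ratings is absorbed into prompt difficulty, and the adjusted scores carry no trace of the genre difference that the teacher and student can see in the ratings.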
In reality, there undoubtedly are differences in difficulty among prompts within a genre and also differences in students’ effectiveness in writing across genres. However, by applying scoring models that conflate the two and mask cross-genre differences in students’ writing ability, we are not providing useful information to support instruction in writing. Such an approach might be “fair” at the aggregate level from an accountability or measurement perspective but does not account for individual differences within genre and lacks utility in improving instruction and student learning.
Equating and Linking Assessments Containing Writing Tasks
Nearly all of the unexpected and inexplicable issues I have encountered in equating state assessments across 30 years have been related to equating English language arts tests that include reading items and a writing task. It’s not that the process never works, but when it appears that it hasn’t worked there is no good way to figure out why it is not working or to fix it.
I have spent far too many summer Sundays (yes, these problems always come to a head on a Sunday) in a testing company office or on a conference call with state assessment staff and psychometricians trying to figure out what to do so that the state assessment results make sense: do we fix the writing item parameters to a particular value, drop writing from the equating, or “freeze” the writing results with some version of equipercentile linking? And then how do we document the decision?
And even when the equating process appears to work, you cannot really be certain that it has actually done what you designed the assessment to do. I cannot count the number of times that equating an English language arts test went smoothly because the writing prompt was dropped from the anchor set based on pre-determined equating decision rules.
There are several good, technically sound reasons for dropping the writing task from the anchor set or even not including it in the equating process by design. When we drop the writing task from the equating, however, we have to ask whether the results are an accurate reflection of student performance. If student performance from one year to the next improves more in writing than in reading due to a new focus on writing and improved instruction, will that improvement be captured if the writing task is not included in the equating process? By not including writing in equating, might we mask real differences between reading and writing performance or perhaps real differences in writing performance across genres in the same way discussed in the previous section on item calibration?
We have tried to force writing to fit nicely into a unidimensional box but every so often it likes to stick its head out and tell us that it really doesn’t fit. Reading and writing are related, but separate constructs. There are other non-IRT based approaches to combining reading and writing performance into a composite English language arts score, but those pose their own challenges such as the need for adequate field testing to understand differences in difficulty among writing tasks. As stated above, there are reasons why things are done the way they are now, and those reasons must be understood before we try to make “improvements” to the current process.
Moving Assessment Forward
The inclusion of items requiring students to produce written responses had an immediate and profound impact on instruction (although there is still room for the use of well-designed selected-response and technology-enhanced items in the assessment of writing). Demonstrating that the large-scale assessment of writing was practical as well as possible from the perspective of technical quality opened the door for the use of a variety of constructed-response items across content areas.
Advances in administration and scoring technology are now making it feasible to expand the use of tasks that require written responses in large-scale, standardized assessments in ways that were unthinkable ten or even five years ago. These advances will also make it possible to expand our concept of large-scale standardized assessment to include curriculum-embedded performance assessment tasks that are administered locally at different times of the year based on the curriculum and individual student achievement. It is critical, however, that we devote at least as much time to considering how the results of those new assessment tasks will be interpreted and used to support instruction as we devote to the technical challenges of administering and scoring them.