You pick up a lot as you journey through life, and you never know what will be valuable and what will be worthless. Never in a million years would I have guessed when I signed up for an introductory philosophy course my final semester in college that I would encounter a thought experiment that would be repeated over and over again throughout my career in large-scale testing. Seriously, I enrolled in the course to spend more time with a new girlfriend. I was a music major. I didn’t need another humanities course. As it turned out, she dropped the course after the first week. I stayed. Come for the girlfriend, stay for Descartes, Pascal, and Nietzsche.
[Digression – I did eventually marry my girlfriend from the philosophy course, but that’s another story.]
The first week of the course, the professor presented us with the ancient thought experiment, the Ship of Theseus, that centers on the question: is it still the same boat? The experiment begins with an innocuous question such as, “If the sail on a sailboat is torn and replaced, is it the same boat?” Most easily answer yes, of course. The answer becomes less clear as the questioning and replacements continue until every part of the boat has been replaced, one after another.
There were multiple variations on the original thought experiment. Of course, all of this led to the variation in which the focus shifted from an inanimate object to a person. Is it the same person if a leg is replaced? The heart? If part of the brain is injured or removed? Today the questioning would likely include a discussion of Alzheimer’s. What makes a person who he, she, or they is?
Is It Still the Same Item?
Fast forward ten years to the beginning of my career in large-scale testing. The rule was simple: Don’t make changes to equating items. But equating items were a precious commodity.
Each summer, as test results were reviewed and new item development began, hardly a day would go by without a content specialist approaching, test booklet or item sheet in hand, to ask,
“Can I make this change to an equating item?”
(anchor item, if you’re inclined to cling to the boat metaphor)
The automatic response in this pas de deux was
“Why do you want to make the change?”
Then the thought experiment unfolded. The content specialists quickly learned that if their response implied in any way that the change was intended to affect student performance, then my answer would be an easy no; that is, if you make the change to improve student performance, then we cannot use that item as an equating item next spring. Sometimes, after a little thought, the answer to their request was yes.
- We want to change the name of the girl in the item from Ann to Beth because the new reading passage already has a girl named Ann.
Many of their requested changes required more thought about what makes an item an item. What makes the item difficult? What changes will affect performance?
- We want to change the keyed answer from ‘b’ to ‘c’ to balance the answer key.
- We want to swap out the graphic of the park because we lost the license to the original graphic.
Sometimes the requests were not about changes to the items themselves and brought to mind the apocryphal (we hoped) example offered regularly at TAC meetings by George Madaus, the tale of the time when changing the color of the staples on a test booklet had a huge impact on performance.
- The new contractor numbers items with white numbers inscribed in black circles rather than black numbers inscribed in white circles.
- The printing vendor cannot match the shade of orange we used on the cover of the fifth-grade test booklets last year.
- The logo of the assessment program changed.
And so it played out, item after item, day after day, year after year. Yes, we could perform analyses after testing to get some idea whether the change affected item performance “too much” relative to other items. Yes, we could rely on experience, history, and professional judgment. Like the original boat, however, there was no way to declare definitively that the item was the same item.
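That kind of post-hoc screening often amounts to a simple drift check: compare each anchor item’s difficulty across the two administrations and flag any item whose shift is an outlier relative to the others. A minimal sketch in Python, assuming classical p-values as the difficulty measure and a robust-z flagging rule (the 3.0 cutoff and the logit conversion are illustrative choices here, not a prescribed procedure):

```python
import math

def logit(p):
    """Convert a proportion-correct p-value to a logit difficulty scale."""
    return math.log(p / (1 - p))

def flag_drifting_anchors(p_old, p_new, z_cut=3.0):
    """Flag anchor items whose difficulty shifted 'too much' relative to the rest.

    p_old, p_new: dicts mapping item id -> proportion correct in each year.
    Each item's logit shift is compared with the median shift, scaled by a
    robust spread estimate (0.74 * interquartile range approximates an SD).
    """
    diffs = {item: logit(p_new[item]) - logit(p_old[item]) for item in p_old}
    shifts = sorted(diffs.values())
    n = len(shifts)
    median = shifts[n // 2] if n % 2 else (shifts[n // 2 - 1] + shifts[n // 2]) / 2
    q1, q3 = shifts[n // 4], shifts[(3 * n) // 4]
    spread = 0.74 * (q3 - q1) or 1e-9  # guard against zero spread
    return [item for item, d in diffs.items() if abs((d - median) / spread) > z_cut]
```

Even then, a flagged item only tells us the change mattered “too much” relative to its peers; an unflagged item never proves the item is the same item.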
Is It Still the Same Test?
Of course, our little thought experiment was not limited to individual test items. As I described in one of my first blog posts six years ago, it is quite common for state assessment programs, particularly new state assessment programs, to undergo significant changes from one year to the next. On the MCAS test described in my post, the number of constructed-response items was cut in half and a short composition was eliminated between the first and second year of the program. In other programs, the changes accumulate over time as with our boat. The list of proposed or requested changes to state tests goes on and on and on.
- Can we change the order in which subject area tests are administered?
- Can we increase/reduce the amount of additional time available?
- Can we do a, b, c, d, …, x, y, z related to test accommodations?
- Can we drop the performance task from the English Language Arts/Literacy test?
- Can we switch to automated scoring of essays and constructed-response items?
- Can we allow students to use their own devices to take the test?
- Can we administer only two of the three writing tasks?
- Can we move or extend the testing window?
- Can we administer a shorter test in spring 2021?
And so, the thought experiment begins again. What makes the test the same test? What will it take to convince ourselves that the proposed change will have no effect on the construct being measured, the test scale, or the achievement standard cut points on that scale? There are questions we can ask ourselves.
- Has the balance of representation been maintained?
- Does the revised test assess the same cognitive skills?
- Are the results consistent across the original and revised forms?
- Are the original claims and inferences still supported?
- Will there be longer term consequences on student performance or instruction?
And as with individual test items, we can conduct empirical analyses to help determine whether we still have the same test. Sometimes those analyses will tell us clearly that the answer is no. Analyses alone, however, can rarely deliver a clear yes. A yes answer almost always requires professional judgment to determine whether there is enough evidence to support the argument that the tests are sufficiently “the same” for their intended purposes, interpretations, and uses. The art and science of test equating.
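The simplest instance of that art is linear equating: if we are willing to assume the two forms’ score distributions differ only in mean and spread, a new-form score can be placed on the old scale by matching those two moments. A toy sketch in Python (the numbers in the example are invented, and real programs would rely on stronger equating models):

```python
def linear_equate(x, mean_new, sd_new, mean_old, sd_old):
    """Classical linear (mean-sigma) equating: map a raw score x from the
    new form onto the old form's scale so that the transformed scores have
    the old form's mean and standard deviation."""
    return sd_old / sd_new * (x - mean_new) + mean_old

# A score of 35 on a new form averaging 30 (SD 5) lands at 60 on an
# old scale averaging 50 (SD 10).
```

The judgment call, of course, is whether that assumption, or any richer equating model, is defensible for the forms in hand.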
Are We in the Same Boat?
My original plan was to end this post with the preceding section. Current discussions regarding large-scale testing, however, make it clear that we are going to have to begin a new thought experiment. It appears that ‘yes’ may no longer be the desirable answer to the question “Is it the same test?” for all students. If the goal is to produce more personalized large-scale tests (an oxymoron?), what is the new question, or questions, driving our thought experiment?
- Can we make the same inferences about student performance regardless of which personalized, alternative test form they completed?
- Do we want to? Need to?
- Is there a common, operationally defined construct?
- Can we make comparable judgments about common student skills within a personalized knowledge context? (And yes, I have an idea what that means, but no I am not really clear yet on what it looks like at the elementary level.)
- Has the pendulum swung from one end of the knowledge-skills continuum to the other or are there critical knowledge/skills combinations that are still of interest?
Like most issues that arise in educational research and educational measurement, these questions are not new. They have been asked, researched, and even answered previously.
In the past, however, we could be satisfied when the answer to “Can we?” questions was “No, not really.” We no longer have that luxury when the questions shift from “Can we?” to “How can we?”.
Are We All in the Same Boat?
Whether the task is confronting and eliminating racist practices from the design and development of tests, taking steps to ensure fairness and equity in the interpretation and use of test results, or a more mundane task that was on the front burner at the beginning of 2020, like figuring out how to include curriculum-embedded performance tasks as part of a large-scale state assessment program, it would appear that all of us involved in large-scale testing are in the same boat. I know of no individuals, institutions, or organizations that have established open water between themselves and the rest of us in the race to find solutions in any of these critical areas.
I am not convinced, however, that we are all rowing in the same direction. Listening to discussions and reading posts and articles over the past few months, I see and hear common terms such as anti-racism, equity, fairness, and asset-based mindset. I fear, however, that even though we are using the same terms, we may be talking past each other with very different ideas about what constitutes a desirable outcome – a problem that we know all too well.
In upcoming posts, I will attempt to untangle the rhetoric and identify the outcomes I think are being proposed with regard to the future of large-scale testing, state assessment programs, and state accountability programs.