We are all familiar with the Three Rs (reading, writing, and ‘rithmetic), which have been the focus of federally mandated assessment and accountability systems for this first quarter of the 21st century. Those actually involved in education in the 21st century undoubtedly also have intimate knowledge of the Four Cs (critical thinking, communication, collaboration, creativity), recently expanded to the Six Cs with the addition of content and confidence. Measuring the Three Rs while accounting for the Four Cs has been the challenge for state testing over the past twenty years – a challenge that has been accepted on occasion, but one not yet met with a high degree of satisfaction.
At the 2019 MARC conference at the University of Maryland, I introduced the Three Ps of state testing in an effort to reframe the challenge to help us consider state testing in a way that might lead us closer to the desired solution. Without further introduction or explanation, I give you the Three Ps.
Now that I have your attention.
The connection I am making between state testing and pornography stems from the 1960s and attempts by the United States Supreme Court to define obscenity. No, I am not declaring that state testing is obscene. I am sure that you can easily find other blogs working that side of the street. Rather, I am referring to the Court’s attempts to determine whether particular examples of pornography were obscene. Justice Potter Stewart stated that the Court was “faced with the task of trying to define what may be indefinable,” but declared confidently, “I know it when I see it.”
I assert that we are in the same place with state assessment. To this point, we have been unsuccessful in defining the construct we want to measure on state tests and only a little more successful in communicating their purpose.
However, there has been progress. As state tests shifted from traditional norm-referenced tests to standards-based instruments, Norman Webb helped us to consider depth of knowledge as we gazed at pools of items that were miles wide and an inch deep. The Common Core consortia advanced the field through their application of the principles of evidence-centered design to the design of their items and tasks. Recent work by Ellen Forte and her colleagues on the Alignment Evaluation Framework moves us closer through consideration of the whole test and the meaning of scores. We are getting there, step by step.
Unfortunately, we are not quite there yet. We cannot quite put into words what it is we want a state test to do or point to a good example of a state test that checks all of the boxes. Like Justice Stewart, however, I am confident that we will know it when we see it.
And when we do see it, what will that elusive state test be measuring?
In 2021, we routinely use achievement level labels like Proficient to describe student performance on state tests. In addition, we develop sets of achievement level descriptors (ALDs) that attempt to describe, in terms of the state standards, the knowledge, skills, and abilities of a student performing at the Proficient level – a student whose performance is proficient, or a proficient student, if you will. Our laser-like focus on state standards, along with the use of ALDs and the reporting of subscores, drives the interpretation of overall test scores inward, in the direction of a student’s mastery of individual standards, although we know full well that state tests are not designed to measure mastery of individual state standards.
We can design tests to measure student mastery of individual state standards and we can design tests to measure student proficiency on a set of state standards, but we cannot design a single test to do both.
Having looked at so many state tests for so many years, I am now convinced that rather than looking within the state standards for the meaning of state test scores, we should be looking “beyond the standards” – in the same sense that reading beyond the lines means moving beyond the literal text to construct meaning.
The overall score on a state test has a meaning that is greater than the meaning of student performance on the individual items or tasks that the test comprises. The whole is greater than the sum of its parts.
Rather than thinking of the state test as a direct measure of student performance on individual state standards, we need to think of the state test as an indirect measure of student proficiency on the state standards.
- Proficient is a label. Proficiency is a construct.
- Proficient is focused on a description of student performance on individual items. Proficiency is focused on the combination of student performances across the set of items.
- Proficiency is time dependent; that is, answering the same set of items correctly will not yield the same level of proficiency at the end of the third, fifth, and eighth grades. We know this, but we try to frame it within the interpretation of test scores on a grade-level scale, or worse, on a vertical scale.
We will have a better chance of getting the state test that we want to see if we think of reporting and the interpretation and meaning of test scores in terms of inferences about student proficiency on the state standards.
And when we get there, we have to be a lot smarter about reporting test results.
The third of our three Ps is perhaps the most important. We must start thinking of, and conveying information about, student performance on state tests in terms of probability rather than absolutes. Again, we know this; our modeling is based on probabilities.
Performance on a state test supports a particular inference with only some degree of certainty; this is particularly true with regard to the performance of an individual student on a single test.
This is not about measurement error. Even if the state test were perfectly reliable, the inferences that we want to make about student performance based on state test scores exist outside the test and come with some uncertainty. Instead of running from that uncertainty in our reporting of test scores, we must own it and embrace it.
And no, that doesn’t mean that we begin reporting that a score of ‘x’ means that there is a .63 probability of ‘y’. We are better than that. Our tests can be better than that. Our reporting must be better than that.