assessment, accountability, and other important stuff

If there are times when you feel you may not be smarter than a 5th grader, take heart; that may not be as bad as it sounds. As states begin to release results from the spring 2015 administration of the Smarter Balanced tests, we are learning that the score needed to reach Level 3 on the Grade 11 English Language Arts/Literacy test, the very score that is “meant to suggest conditional evidence of readiness for entry-level, transferable, credit-bearing college courses,” the very score that is the goal of the entire college- and career-readiness reform movement, is being reached by fifth graders across the land. From Idaho (18%) to Connecticut (25%), top fifth grade students are earning the college-ready benchmark score.

How are we supposed to interpret this phenomenon?

  • Are we supposed to conclude that those 5th graders are college-ready? NO.
  • Are we supposed to conclude that those 5th graders can demonstrate the same content knowledge and skills as the 11th graders with the same score?  NO. (Although as a colleague with recent experience in Connecticut schools pointed out, “It’s reasonable, they really do teach a lot in elementary school.”)
  • Is this an indication that there is an error in the Smarter Balanced scale? NO.

To paraphrase Shakespeare, the fault, dear friends, is not in the Smarter Balanced scale, but in ourselves.

We, policymakers and educators, want more meaning from test scores than they can possibly provide.

We, assessment experts and measurement specialists, promise more meaning from test scores, even when our science cannot quite deliver on that promise.

In short, we promise and deliver a vertical scale: a single reporting scale on which we report scores from all tests from elementary school through high school.  It is the vertical scale that informs us that virtually the same score is needed to reach Level 3, the college-ready benchmark, on the high school test (2583) and Level 4 on the fifth grade test (2582). It is the vertical scale that lets us know that it is possible to earn the college-ready score of 2583 on each and every Smarter Balanced ELA/Literacy test from grade 3 through high school.

This phenomenon, however, is not unique to Smarter Balanced and its reporting scale.  Whenever we attempt to report K-12 achievement across grade levels on a single scale (i.e., a vertical scale), two characteristics are sure to emerge:

  • there will be a considerable overlap in scaled scores across grade levels, and
  • the gap in scores between grade levels will narrow considerably as we reach middle school grades.

Those two characteristics are found on virtually all vertical scales built for K-12 achievement tests. They will be found whether we examine the vertical scales of traditional standardized tests such as the Iowa Tests of Basic Skills (ITBS) or custom-built state assessments such as the CMT in Connecticut or the FCAT in Florida.  Interim assessments such as the NWEA MAP reveal the same characteristics.  So, although this discussion focuses on the Smarter Balanced scale as the most recent example, the comments apply to any and all such scales.

[Figure and table: achievement level thresholds (cut scores) for the Smarter Balanced ELA/Literacy tests, grade 3 through high school]

The graph and table above display the achievement level thresholds, or cut scores, for each of the seven Smarter Balanced ELA/Literacy tests from grade 3 through high school.  From the figure, it is easy to see that a score of 2583 is the Level 3 threshold on the High School test and 2582 is the Level 4 threshold on the Grade 5 test.  It is also easy to see the following:

  • A score of 2583 will be classified as Level 3 performance on the Grade 6, 7, and 8 tests as well as on the High School test. Level 3 is considered the target performance on the Smarter Balanced tests. (Although not shown on this graph, it is likely that more than 40% of students completing the Smarter Balanced sixth through eighth grade tests in 2015 will perform at Level 3 or higher.)
  • The number of points needed to grow from Level 3 at one grade to Level 3 at the next grade decreases across grades: 41 points between grades 3 and 4; 29 points between grades 5 and 6; 15 points between grades 7 and 8; and only 16 points across the three years between grade 8 and the High School test at grade 11.
  • In contrast, the number of points needed to move from Level 2 to Level 3 within the same grade a) is larger than the number of points needed to move from Level 3 to Level 3 across grades, and b) increases as we move across grades: from approximately 60 points at the lower grades to 80-90 points at the upper grades.
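
The overlap described above can be illustrated with a minimal sketch. Only the two cut scores quoted explicitly in this post are used (2582 for Level 4 at grade 5, 2583 for Level 3 on the High School test); the names and data structure are illustrative, not Smarter Balanced's actual reporting logic.

```python
# Minimal sketch: on a vertical scale, the same scaled score reaches
# different achievement levels on different grade-level tests.
# Only the two cut scores quoted in this post are included; a real
# table would hold every level at every grade.
cuts = {
    "Grade 5": ("Level 4", 2582),      # Level 4 threshold, per the post
    "High School": ("Level 3", 2583),  # Level 3 threshold, per the post
}

def level_reached(score, cut_scores):
    """Return, for each test, the quoted level that `score` meets or exceeds."""
    return {test: level for test, (level, cut) in cut_scores.items()
            if score >= cut}

for test, level in level_reached(2583, cuts).items():
    print(f"A score of 2583 reaches {level} on the {test} test.")
```

One score, two different level classifications, depending entirely on which test produced it.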

It is outcomes such as those, and the need to understand all of those numbers to use the scale appropriately, that raise my doubts about the value of the vertical scale.

What does a score of 2583 mean?

One of the primary charges to those developing the next generation, beyond the bubble, college- and career-ready assessments was to provide actionable information to inform instruction.  A key component of that actionable information was supposed to be the ability to derive meaning from reported scaled scores.  That is, a goal was for educators to be able to interpret what a student with a score of 2583 knows and is able to do in relation to content standards such as the Common Core State Standards (CCSS).  How is that possible if students at Level 3 in grades 6, 7, and 8 can each earn a score of 2583?  We know that the content of those tests is not the same; so how does an educator interpret the score?  Has the student earning a score of 2583 on three consecutive tests at grades 6, 7, and 8 acquired any new knowledge and skills?  Has she or he lost any knowledge or skills?

It appears that the answer is that any content-based interpretation must be tied to the particular grade-level test.  OK, that makes sense, but it also raises the question, what is gained from a vertical scale?

Of course, the answer must be the information that the vertical scale provides about student growth!

Vertical Scales and Growth

A key advantage of vertical scales is supposed to be that they make it easy to interpret student growth across grade levels.  As explained by Darling-Hammond, Haertel, and Pellegrino in their paper on using and interpreting Smarter Balanced Scores:

A “vertical scale” is one designed to support inferences concerning growth across years.  In other words, with a vertical scale it should be possible, for example, to subtract a student’s score on a third grade test from that same student’s score on a fourth-grade test the following year to measure growth.  (p. 11)

OK, I can subtract the two test scores, but what does that difference tell me?  As with the meaning of the individual test scores, the meaning of the difference between two scores and any inference a teacher might make from it is largely dependent on knowing on which two tests the scores were produced.  Consider a difference of 25 points on the Smarter Balanced scale:

  • A change of 25 points would mean that a student has exceeded the number of points necessary to move from Level 3 at grade 7 to Level 3 at grade 8; but is well short of the 41 points needed to move from Level 3 at grade 3 to Level 3 at grade 4.
  • Regardless of the grade level, 25 points would mean that the student has fallen well short of moving from a low Level 2 at the beginning of the year to a low Level 3 at the end of the year.
  • Similarly, for a higher-achieving student, 25 points would be more than the 19 points needed to move from Level 4 at grade 7 to Level 4 at grade 8, but only one-fourth of the approximately 100 points needed to move from Level 3 to Level 4 within grade 7 or grade 8.
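
The context-dependence in the bullets above can be sketched directly. The point gaps below are the ones quoted in this post; everything else (names, structure) is illustrative.

```python
# Sketch: the same score gain checked against different transitions.
# Gap values are taken from this post; the 100-point figure is the
# post's approximation for Level 3 -> Level 4 within grades 7-8.
TRANSITIONS = {
    "Level 3 at grade 7 -> Level 3 at grade 8": 15,
    "Level 4 at grade 7 -> Level 4 at grade 8": 19,
    "Level 3 at grade 3 -> Level 3 at grade 4": 41,
    "Level 3 -> Level 4 within grade 7 or 8": 100,  # approximate
}

def interpret_gain(gain, transitions):
    """Label each transition as met (True) or not met (False) by `gain` points."""
    return {name: gain >= needed for name, needed in transitions.items()}

for name, met in interpret_gain(25, TRANSITIONS).items():
    print(f"25 points {'meets' if met else 'falls short of'}: {name}")
```

The same 25-point gain is "enough" or "not nearly enough" depending solely on which grade levels and which thresholds frame the question.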

In short, with regard to making inferences about student growth, the point differences on the vertical scale are largely meaningless outside of the context of the grade levels involved and the achievement level thresholds for those grade levels.

Again, how much is gained from reporting results on a vertical scale?

Closing Thoughts on Vertical Scales

In closing, I return to Shakespeare and Julius Caesar to state what should be obvious by now,

“Friends, clients, policy makers, lend me your ears; I come to bury vertical scales, not to praise them.”

OK, I admit that burying vertical scales is a bit extreme.

Vertical scales are a really neat concept.  The idea of giving one big test and reporting everyone’s results on the same scale is appealing to everyone from H.D. Hoover to Gordon, the man who ran the local copy shop where I had reams of grade-level test forms reproduced each year.  Vertical scales appear to provide just that, and we have the ability to construct them.  The basic psychometric theory behind vertical scales is relatively straightforward, and we have been building them forever.  Vertical scales were the foundation of virtually all norm-referenced standardized tests, although we hardly ever reported or used those scaled scores directly; instead, we converted them into context-based scores such as percentile ranks, grade equivalents, or NCEs.  Hmmm.  (dramatic pause to let that sink in)

One of the first things they teach us as budding psychometricians is that the first requirement and primary consideration in creating a reporting scale is to make test scores more interpretable and useful for the intended audience.  If you have not accomplished that, why bother taking the time and effort to create the scale in the first place?

In the case of vertical scales, I question whether we are meeting that first requirement.

Vertical scales serve some useful purposes.  One of the primary benefits of vertical scales is that they facilitate out-of-level testing for students, whether through the use of computer-adaptive testing or by administering fixed forms based on the student’s instructional level.  Vertical scales are also very useful for describing the current state of K-12 achievement; for example, there is ample evidence that achievement gaps between low and high performing students do increase across grade levels as reflected in vertical scales.  However, my fear is that the practice of reporting individual student test scores as scaled scores on vertical scales is more likely to add confusion, not clarity, to the lives of those attempting two of today’s primary intended uses of test scores:

  • making content-based interpretations of student performance, and
  • making inferences about student growth.