assessment, accountability, and other important stuff

Archive for August, 2015

Are You Smarter Than a 5th Grader?

If there are times when you feel you may not be smarter than a 5th grader, take heart; that may not be as bad as it sounds.  As states begin to release results from the spring 2015 administration of the Smarter Balanced tests, we are learning that the score needed to reach Level 3 on the Grade 11 English Language Arts/Literacy test, the very score that is “meant to suggest conditional evidence of readiness for entry-level, transferable, credit-bearing college courses,” the very score that is the goal of the entire college- and career-readiness reform movement, is being reached by fifth graders across the land. From Idaho (18%) to Connecticut (25%), top fifth grade students are earning the college-ready benchmark score.

How are we supposed to interpret this phenomenon?

  • Are we supposed to conclude that those 5th graders are college-ready? NO.
  • Are we supposed to conclude that those 5th graders can demonstrate the same content knowledge and skills as the 11th graders with the same score?  NO. (Although as a colleague with recent experience in Connecticut schools pointed out, “It’s reasonable, they really do teach a lot in elementary school.”)
  • Is this an indication that there is an error in the Smarter Balanced scale? NO.

To paraphrase Shakespeare, the fault, dear friends, is not in the Smarter Balanced scale, but in ourselves.

We, policymakers and educators, want more meaning from test scores than they can possibly provide.

We, assessment experts and measurement specialists, promise more meaning from test scores, even when our science cannot quite deliver on that promise.

In short, we promise and deliver a vertical scale: a single reporting scale on which we report scores from all tests, elementary school through high school.  It is the vertical scale that informs us that virtually the same score is needed to reach Level 3, the college-ready benchmark, on the high school test (2583) and Level 4 on the fifth grade test (2582). It is the vertical scale that lets us know that it is possible to earn the college-ready score of 2583 on each and every Smarter Balanced ELA/Literacy test from grade 3 through high school.

This phenomenon, however, is not unique to Smarter Balanced and its reporting scale.  Whenever we attempt to report K-12 achievement across grade levels on a single scale (i.e., a vertical scale), two characteristics are sure to emerge:

  • there will be a considerable overlap in scaled scores across grade levels, and
  • the gap in scores between grade levels will narrow considerably as we reach middle school grades.

Those two characteristics are found on virtually all vertical scales built for K-12 achievement tests. They will be found whether we examine the vertical scales of traditional standardized tests such as the Iowa Tests of Basic Skills (ITBS) or custom-built state assessments such as the CMT in Connecticut or the FCAT in Florida.  Interim assessments such as the NWEA MAP reveal the same characteristics.  So, although this discussion focuses on the Smarter Balanced scale as the most recent example, the comments apply to any and all such scales.

[Graph and table: Smarter Balanced ELA/Literacy achievement level thresholds (cut scores), grade 3 through high school]
The graph and table above display the achievement level thresholds, or cut scores, for each of the seven Smarter Balanced ELA/Literacy tests from grade 3 through high school.  From the figure, it is easy to see that a score of 2583 is the Level 3 threshold on the High School test and 2582 is the Level 4 threshold on the Grade 5 test.  It is also easy to see the following:

  • A score of 2583 will be classified as Level 3 performance on the Grade 6, 7, and 8 tests as well as on the High School test. Level 3 is considered the target performance on the Smarter Balanced tests. (Although not shown on this graph, it is likely that more than 40% of students completing the Smarter Balanced sixth through eighth grade tests in 2015 will perform at Level 3 or higher.)
  • The number of points needed to grow from Level 3 at one grade to Level 3 at the next grade decreases across grades: 41 points between grades 3 and 4; 29 points between grades 5 and 6; 15 points between grades 7 and 8; and only 16 points across the three years between grade 8 and the High School test at grade 11.
  • In contrast, the number of points needed to move from Level 2 to Level 3 within the same grade a) is larger than the number of points needed to move from Level 3 to Level 3 across grades, and b) increases as we move across grades, from approximately 60 points at the lower grades to 80-90 points at the upper grades.
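The grade-to-grade gaps in the list above can be checked with a few lines of Python. The cut scores below are the 2014 Smarter Balanced ELA/Literacy Level 3 thresholds as I understand them; treat them as illustrative data, since the point of the sketch is the calculation, not the exact values.

```python
# Smarter Balanced ELA/Literacy Level 3 cut scores (published 2014 values;
# illustrative data for this sketch, not an authoritative source).
level3_cuts = {3: 2432, 4: 2473, 5: 2502, 6: 2531, 7: 2552, 8: 2567, 11: 2583}

grades = sorted(level3_cuts)
for lo, hi in zip(grades, grades[1:]):
    gap = level3_cuts[hi] - level3_cuts[lo]
    print(f"Grade {lo} -> {hi}: {gap} points to stay at Level 3")
```

Notice how little the dictionary itself tells you: the meaning of any gap depends entirely on which pair of grades you feed in.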

It is outcomes such as those, and the need to understand all of those numbers to use the scale appropriately, that raise my doubts about the value of the vertical scale.

What does a score of 2583 mean?

One of the primary charges to those developing the next generation, beyond the bubble, college- and career-ready assessments was to provide actionable information to inform instruction.  A key component of that actionable information was supposed to be the ability to derive meaning from reported scaled scores.  That is, a goal was for educators to be able to interpret what a student with a score of 2583 knows and is able to do in relation to content standards such as the Common Core State Standards (CCSS).  How is that possible if students at Level 3 in grades 6, 7, and 8 can each earn a score of 2583?  We know that the content of those tests is not the same; so how does an educator interpret the score?  Has the student earning a score of 2583 on three consecutive tests at grades 6, 7, and 8 acquired any new knowledge and skills?  Has she or he lost any knowledge or skills?

It appears that the answer is that any content-based interpretation must be tied to the particular grade-level test.  OK, that makes sense, but it also raises the question, what is gained from a vertical scale?

Of course, the answer must be the information that the vertical scale provides about student growth!

Vertical Scales and Growth

A key advantage of vertical scales is supposed to be that they make it easy to interpret student growth across grade levels.  As explained by Darling-Hammond, Haertel, and Pellegrino in their paper on using and interpreting Smarter Balanced Scores:

A “vertical scale” is one designed to support inferences concerning growth across years.  In other words, with a vertical scale it should be possible, for example, to subtract a student’s score on a third grade test from that same student’s score on a fourth-grade test the following year to measure growth.  (p. 11)

OK, I can subtract the two test scores, but what does that difference tell me?  As with the meaning of the individual test scores, the meaning of the difference between two scores and any inference a teacher might make from it is largely dependent on knowing on which two tests the scores were produced.  Consider a difference of 25 points on the Smarter Balanced scale:

  • A change of 25 points would mean that a student has exceeded the number of points necessary to move from Level 3 at grade 7 to Level 3 at grade 8; but is well short of the 41 points needed to move from Level 3 at grade 3 to Level 3 at grade 4.
  • Regardless of the grade level, 25 points would mean that the student has fallen well short of moving from a low Level 2 at the beginning of the year to a low Level 3 at the end of the year.
  • Similarly, for a higher-achieving student, 25 points would be more than the 19 points needed to move from Level 4 at grade 7 to Level 4 at grade 8, but only one-fourth of the approximately 100 points needed to move from Level 3 to Level 4 within grade 7 or grade 8.
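The point of the bullets above is that the same 25-point gain means very different things at different grades. A hypothetical helper (not part of any Smarter Balanced tool) makes that dependence explicit; the cut scores are again the published 2014 Level 3 values, used here as assumptions.

```python
# Level 3 cut scores for the grade pairs discussed above
# (published 2014 Smarter Balanced ELA/Literacy values, assumed here).
LEVEL3 = {3: 2432, 4: 2473, 7: 2552, 8: 2567}

def gain_keeps_level3(gain, from_grade, to_grade):
    """Return True if a student sitting exactly at the Level 3 cut in
    from_grade who gains `gain` points reaches the Level 3 cut in to_grade."""
    return gain >= LEVEL3[to_grade] - LEVEL3[from_grade]

# The identical 25-point gain "keeps pace" from grade 7 to 8 (a 15-point gap)
# but falls well short from grade 3 to 4 (a 41-point gap).
print(gain_keeps_level3(25, 7, 8))  # True
print(gain_keeps_level3(25, 3, 4))  # False
```

The function's signature is the argument in miniature: you cannot interpret `gain` without also supplying `from_grade` and `to_grade`.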

In short, with regard to making inferences about student growth, the point differences on the vertical scale are largely meaningless outside of the context of the grade levels involved and the achievement level thresholds for those grade levels.

Again, how much is gained from reporting results on a vertical scale?

Closing Thoughts on Vertical Scales

In closing, I return to Shakespeare and Julius Caesar to state what should be obvious by now,

“Friends, clients, policy makers, lend me your ears; I come to bury vertical scales, not to praise them.”

OK, I admit that burying vertical scales is a bit extreme.

Vertical scales are a really neat concept.  The idea of giving one big test and reporting everyone’s results on the same scale is appealing to everyone from H.D. Hoover to Gordon, the man who ran the local copy shop where I had reams of grade-level test forms reproduced each year.  Vertical scales appear to provide exactly that, and we have the ability to construct them.  The basic psychometric theory behind vertical scales is relatively straightforward, and we have been building them forever.  Vertical scales were the foundation of virtually all norm-referenced standardized tests – although we hardly ever reported or used those scaled scores directly; instead, we converted them into context-based scores such as percentile ranks, grade equivalents, or NCEs.  Hmmm.  (dramatic pause to let that sink in)

One of the first things they teach us as budding psychometricians is that the first requirement and primary consideration in creating a reporting scale is to make test scores more interpretable and useful for the intended audience.  If you have not accomplished that, why bother taking the time and effort to create the scale in the first place?

In the case of vertical scales, I question whether we are meeting that first requirement.

Vertical scales serve some useful purposes.  One of the primary benefits of vertical scales is that they facilitate out-of-level testing, whether through computer-adaptive testing or by administering fixed forms based on the student’s instructional level.  Vertical scales are also very useful for describing the current state of K-12 achievement; for example, there is ample evidence, as reflected in vertical scales, that achievement gaps between low- and high-performing students increase across grade levels.  However, my fear is that the practice of reporting individual student test scores as scaled scores on vertical scales is more likely to add confusion, not clarity, to the lives of those attempting two of today’s primary intended uses of test scores:

  • making content-based interpretations of student performance, and
  • making inferences about student growth.

One in a Million, A Million to One

Interpreting Individual Student Performance on a Large-scale Assessment

This is the third, and final, installment in a series of three posts based on a workshop presented in April 2015 at the annual conference of the New England Educational Research Organization.

Across the land, there is a call for state assessments to provide more, better, and actionable information to inform instruction.  The Criteria for Procuring and Evaluating High-Quality Assessments published by the Council of Chief State School Officers (CCSSO) in March 2014 includes the criterion that assessments should

Yield Valuable Reports on Student Progress and Performance – Providing timely data that inform instruction (D.2).

Reports are instructionally valuable, easy to understand by all audiences, and delivered in time to provide useful, actionable data to students, parents, and teachers.

In April 2015, Achieve, Inc. sponsored a webinar and released a set of materials focused on Communicating Assessment Results to Families and Educators.  Among the materials disseminated, Achieve produced sample assessment reports, including a model Mathematics Family Report. The two-page report provides the family with information about an individual student’s performance on the fifth grade state assessment in mathematics.  The first page of the report provides basic information about the assessment, explains what the results mean, and provides the student’s Overall Score (i.e., scaled score) and Performance Level classification.  So far, so good.  At the bottom of the first page, however, the family is instructed to “[t]urn to the second page to learn more about David’s knowledge and skills in mathematics.”  That is where the trouble starts.

At the top of the second page there is a display providing information on the student’s level of mastery in each of five Mathematics Scoring Categories.  As discussed in my previous post, What’s In A Name?, the capacity for a state summative assessment to support such claims of mastery is shaky, at best.


Achieve, however, does not stop there.  The model report also includes a section which contains a detailed description of the student’s individual strengths and weaknesses.

[Table: David’s Strengths & Areas For Improvement, from the model Family Report]
There is no doubt that information like this would be extremely valuable for David, his parents, and David’s former or new teacher (assuming the report arrives during the summer).  Unfortunately, we cannot expect to receive this level of information from a state summative assessment and should not raise expectations that this level of detailed information about individual student performance is an appropriate target for states and assessment contractors.  A state assessment report cannot provide the same type or level of information as a standards-based report card because a state assessment cannot produce that level of detail about an individual student’s performance.

The implication in the table of David’s Strengths & Areas For Improvement is that the information provided has been compiled specifically for David.  In reality, the list of strengths and weaknesses is likely to be similar to the lists of knowledge and skills provided in performance level descriptions.  That is, if that type of information is reported at all, it will reflect probable strengths and weaknesses for students with an overall score and/or pattern of performance similar to David’s. However, if David’s performance is anywhere near the middle of the score range, there are literally millions, if not trillions, of ways that students could arrive at the same score as David; and repeated analyses over the last fifteen years have shown us that on any given test there will be nearly as many unique response patterns as there are students taking the test.  The likelihood that we can identify and describe David’s specific strengths and weaknesses at the level of detail shown in the model report is a million-to-one long shot.

It would be even worse, however, if that list of strengths and areas for improvements were tied directly to David and based only on his performance on the one or two items on the test that measure multiplication and division of fractions, understanding of the concept of volume, or the use of line plot displays.  A student’s performance on a small number of items on a single test may suggest a particular strength or weakness, but by itself is certainly not sufficient to support the claims implied by the statements on the Family Report.

The inability to describe an individual student’s strengths and areas for improvement on the basis of performance on a single test is a limitation of state assessments, but it is not a flaw that we can correct. Yes, we can do a better job of communicating state assessment results to students, families, and teachers; but no, we cannot obtain accurate descriptions of an individual student’s strengths and areas for improvement from a single comprehensive assessment.  No, computer-adaptive tests are not the solution.  Computer-adaptive tests are designed to provide more accurate estimates of a student’s location on the overall proficiency continuum but, in general, are not designed to provide more detailed information about an individual student’s strengths.

The call to provide families with the type of information included in Achieve’s model Family Report is admirable and appropriate.  The problem, however, was in positioning the Family Report as the report of results from a single assessment.  The Family Report, like a report card, can be based on a synthesis of information from multiple sources.  State assessment programs like PARCC and Smarter Balanced are offering tools that can be used throughout the year to collect more detailed information about student performance.  We should also be able to expect local districts, schools, and teachers to be able to provide families with accurate information on a student’s strengths and areas for improvement.  If local educators cannot provide that information, that is the reporting problem that we need to solve.  After all, even the model report directs families to contact David’s teachers for information on how best to support David’s learning.

What’s In A Name?

Aligning achievement levels and assessments

This is the second of three posts based on a workshop presented in April 2015 at the annual conference of the New England Educational Research Organization.

Proficient.  The passage of NCLB made it a national goal that 100% of students would be Proficient by 2014, and the law made it clear that proficiency would be determined by performance on a state assessment.  What was not clear from the law was exactly what was meant by the term Proficient.  We know that NCLB left it up to individual states to adopt their own content standards and to set their own criteria for Proficient performance on those standards.  We also know that there was wide variation among the states in the level of student performance classified as Proficient; but that variation, and the claims of honesty gaps and truth telling that flow from it, are not the focus of this discussion.

The focus of this piece is on the more basic question of what Proficient means within an individual state or consortium; specifically, the extent to which there is consistency among the following:

  1. the meaning and interpretation of proficient performance among educators, parents, and the general public,
  2. the state’s definition of proficient performance on its state assessment, and
  3. the design of the state assessment and the methods used to classify student performance as proficient.

When asked to describe the performance of a Proficient student, responses from those not directly involved in the development of state assessments often describe performance that is consistent across standards (and over time) as well as performance that reflects a certain degree of mastery of those standards.  Consistency and mastery appear to be two critical elements to common interpretations of Proficient performance.

As shown in the examples below, consistency and mastery are often found as key components of state policy-level descriptions of student performance at the achievement level corresponding to Proficient.


  • Students at this level demonstrate a solid understanding of challenging subject matter and solve a wide variety of problems. (Massachusetts)
  • This category represents a solid performance. Students demonstrate a competent and adequate understanding of the knowledge and skills measured by this assessment, at this grade, in this content area. (California)
  • Students at the proficient level demonstrate solid academic performance and mastery of the content area knowledge and skills required for success at the next grade. Students who perform at this level are well prepared to begin work on even more challenging material that is required at the next grade. (Mississippi)
  • Students performing at this level consistently demonstrate mastery of grade level subject matter and course subject matter and skills and are well prepared for the next grade or course level work. (North Carolina)
  • Student demonstrates thorough knowledge and mastery of skills that allows him or her to function independently on all major concepts related to his or her current educational level. (Idaho)


When we dig a little deeper, state content area descriptions often go into great detail about the knowledge and skills of students performing at the Proficient level.  The following example describes the performance of students proficient in grade 6 mathematics.

Students in grade six at the proficient level have a good understanding of the concepts that underlie grade six mathematics, including integers, percentages, and proportions. They solve problems involving the addition of negative and positive integers, compare and order integers using visual representation, calculate percentages, and set up proportions from concrete situations. Their skills in algebra and geometry include solving one-step equations, writing expressions from word problems, solving problems involving rate, solving for the missing angle in a triangle or a supplementary angle, and identifying types of triangles. Proficient students also understand the basic concepts of probability and measures of central tendency. (California Department of Education)

The problem is that in most cases there is a disconnect between the specific claims made in those descriptions of Proficient performance and the design of the state assessment.  In general, state assessments are comprehensive surveys of the range of content standards within a grade level.  They usually contain no more than a few questions measuring a particular skill such as solving for the missing angle in a triangle or setting up proportions from concrete situations; certainly not enough items to support a claim of consistency or mastery of the skill.  The table below shows the percentage and number of points assigned to the four broad clusters of mathematics standards on the 2011-2012 NECAP sixth grade mathematics test.


Content Cluster                     Number of Points   Percentage of Points
Numbers and Operations                     27                  41%
Geometry and Measurement                   16                  24%
Functions and Algebra                      13                  20%
Data, Statistics, and Probability          10                  15%
Total                                      66                 100%

The Numbers and Operations category contains a sufficient number of points to support a claim of mastery or consistent performance within the cluster, but it is still not possible to determine student mastery of a particular skill. The 27 points in the cluster are distributed across multiple standards, each of which may include a variety of skills, as shown in standard M(N&O)–5–2:

M(N&O)–5–2 Demonstrates understanding of the relative magnitude of numbers by ordering, comparing, or identifying equivalent positive fractional numbers, decimals, or benchmark percents within number formats (fractions to fractions, decimals to decimals, or percents to percents); or integers in context using models or number lines.

How many different skills are contained in that single standard, and how many items would be needed to determine that a student had mastered each of them?

Most important, to be classified as Proficient on a state assessment, a student may only have to earn 65%-70% of the points on the test.  It is difficult to make claims of mastery and consistency when a student has earned 70% of the points on a test. Let’s consider two students, Sue and Kevon, who each earned 70% of the points on the test.  We do not know whether Sue has mastered calculating percentages, adding negative integers, and understanding measures of central tendency.  We do not know whether Kevon has performed consistently across all of the standards or clusters of standards.  We do not know which items Sue and Kevon answered correctly, or perhaps more important, which items each answered incorrectly to earn their scores of 70%.  We do know that there are more ways to earn a score of 70% on the sixth grade mathematics test than there are students in the state; this topic will be considered in more detail in the final installment of this 3-part series of posts: One in a Million, A Million to One.
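The claim that there are more ways to earn 70% than there are students in the state is easy to check with a little counting. As a deliberately simplified sketch, pretend the 66-point test were made of 66 independent one-point items; the real NECAP test mixes item types, so the true count differs, but the order of magnitude holds.

```python
import math

# Count the distinct right/wrong response patterns that all produce the same
# raw score. With 66 one-point items (a simplifying assumption), a raw score
# of 46 points (roughly 70% of 66) can be reached in "66 choose 46" ways --
# an astronomically large number, dwarfing any state's student population.
TOTAL_POINTS = 66
RAW_SCORE = 46  # about 70% of 66 points

patterns = math.comb(TOTAL_POINTS, RAW_SCORE)
print(f"{patterns:,} distinct ways to earn {RAW_SCORE} of {TOTAL_POINTS} points")
```

Two students with identical scaled scores can therefore have almost nothing in common at the item level, which is exactly why the score alone cannot support claims about Sue’s or Kevon’s specific skills.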

So what do we know when a student has earned 70% of the points on the sixth grade mathematics test?  We  know that the student earned 70% of the points on a test that contains items assessing a representative sample of grade level standards, and that the state has determined that level of performance to be Proficient.  We expect Proficient students to be prepared to succeed at the next grade level, but based solely on the state assessment we cannot know what specific skills an individual student has mastered.

CCR is the new Proficient

Now we have moved on from Proficient to College- and Career-Ready.  That sounds like progress.  College- and Career-Ready feels much more grounded in reality, and more tangible, than Proficient.  There can be no race to the bottom with college- and career-readiness. We should be able to easily verify that students classified as ready for college or a career were, in fact, ready. If the majority of students labeled college-ready on the state assessment must enroll in non-credit-bearing courses in college, then it is safe to conclude that something is not right with the CCR score on the state assessment.  Perhaps the cut score was set too low.  Perhaps the test measured the wrong content knowledge and skills.  Or maybe it is something we hadn’t thought of before!  Maybe college readiness doesn’t come from a score.  Maybe college readiness… perhaps… means a little bit more.  (Acknowledgements and apologies to David Conley and Dr. Seuss)

Whatever the reason, we can determine how well performance on a mathematics and English language arts test predicts whether students are ready for college – even if they never plan to take a formal mathematics or English course in college.  There is much less room for misinterpretation with college-ready than there was with Proficient.  A college-ready student is a student who is ready for college.  Sure, we can worry about things like what kind of college and how ready, but why quibble over details?

By focusing on the predictive power of the test rather than its content, however, what we may have to give up are claims of a tight link between a student’s test score and the specific knowledge and skills that she has mastered.

As they say, however, can you really miss something you never had to begin with?