Ten Numbers That Shook My World

50%

>50%

99.9%

60K

3.2

500

Each in their own way, the 10 numbers or statistics listed above shook my world as an assessment specialist involved in state testing and the larger world of PK-12 education reform.

A few of them come from the technical side of testing, but the rest are related more broadly to the context in which testing occurs, interpreting test scores, and generally trying to use tests to support education reform.

A few of them were numbers I never really thought about much before I heard them in the contexts described here, and then I couldn’t stop thinking about them.

A couple of the numbers represent facts or definitions that I knew and routinely would recite from memory but had never really thought about deeply.

None of the numbers or what they represent is profound, but each had a profound impact on me and needs to be considered as part of any serious effort to reform education, support instruction, and improve student learning.

50%

In my head, I knew. I had read enough textbooks and seen enough Item Characteristic Curves to know that item difficulty is defined as the level of ability (θ) at which a student has a 50% chance of responding correctly to an item. (Set aside the 3PL model for now.)

Still, it was a shock when during a meeting Ron Hambleton replied to a question by stating that it was basically a coin flip whether a student would respond correctly when their ability matched an item’s difficulty. Intuitively, it didn’t feel right. Kids should be able to respond correctly to items matched to their ability more often than not, right? Well, sure, but no.

What are the implications for test design and standard setting? If we load a test with items at or near the achievement standard, what will the cut score be in terms of raw score or percent correct? Do the math. Even if the IRT “science” is sound, communication with stakeholders like students, teachers, and parents is tricky, especially in an arena that expects transparency and is comfortable with mileposts on a 0-100 scale.
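For anyone who wants to do the math, here is a minimal sketch in Python. It assumes the standard logistic form of the item response function. The function names, the cut score of θ = 0, and the five item difficulties are hypothetical, chosen only for illustration. The optional guessing parameter c is included to show why the 3PL model was set aside above.

```python
import math

def p_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a logistic IRT model.

    theta: student ability; b: item difficulty; a: discrimination;
    c: lower asymptote ("guessing"), which is 0 in the 1PL and 2PL models.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def expected_percent_correct(theta, difficulties):
    """Expected percent correct for one student on a set of items (no guessing)."""
    probs = [p_correct(theta, b) for b in difficulties]
    return 100.0 * sum(probs) / len(probs)

# Hypothetical cut score at theta = 0, with items clustered around it.
cut = 0.0
items = [-0.5, -0.25, 0.0, 0.25, 0.5]

print(p_correct(cut, cut))                   # ability equals difficulty: the coin flip
print(p_correct(cut, cut, c=0.2))            # same item with a guessing parameter (3PL)
print(expected_percent_correct(cut, items))  # student sitting exactly at the cut
```

With no guessing, a student sitting exactly at the cut is expected to answer about half of these items correctly, so the cut score lands near 50 percent correct. With a guessing parameter of 0.2, the probability at θ = b rises to about 0.6, which is why the tidy 50% definition only holds once the 3PL model is set aside.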

>50%

During a presentation at a CCSSO NCSA conference around 2010, a presenter from the DOE in a western state made an offhand comment that the majority of their English learners entering kindergarten were born in the United States. Wondering whether this was a west coast or southwest phenomenon, I checked with my colleagues in New England and found similar results. In 2015-16, the percentage of K-5 English learners born in the United States was 69.9% in Massachusetts and 82.3% nationwide. The percentages remain as high or higher today.

Obviously, understanding this context is critical to designing effective programs to support English learners gaining proficiency in English in elementary school and in early childhood education programs (assuming that proficiency in English remains a goal).

As a side note, this statistic is also relevant when considering the relationship between the level of English proficiency EL students need to be reclassified and the level of English proficiency needed to meet expectations on the state assessment in Reading or English language arts. Those two concepts have been conflated/confounded since college readiness tests aligned to the Common Core State Standards were introduced and ESSA replaced NCLB. How do we interpret the readiness of these native-born EL students to participate successfully in the classroom when the top performing states in the country have only 40% of students Proficient on the NAEP Grade 4 Reading test?

≈60,000

While sitting with MCAS data in Massachusetts, I discovered that the number of unique scored response strings (e.g., 1’s and 0’s) on each test was almost equal to the 60,000 students tested. That is, a student’s scored response string almost served as a unique identifier. (Note that the actual response strings would have introduced even more uniqueness.)

At first, the outcome seemed implausible given the expectation of patterns and predictability from student responses. On further reflection, however, as discussed in detail in Fundamentals & Flaws, it’s a bit shocking that aside from 0’s and perfect scores, there are any repeated response strings at all given the sheer number of possible scored response strings on a 50-item test.
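A rough sense of that scale can be sketched in Python. The sketch assumes every item is scored 0 or 1, and it uses a deliberately unrealistic baseline in which every scored string is equally likely. Real response strings are far from uniform, which is exactly why some repeats do occur. The function names are hypothetical.

```python
def possible_strings(n_items):
    """Number of distinct scored strings for n dichotomous (0/1) items."""
    return 2 ** n_items

def expected_matching_pairs(n_students, n_items):
    """Expected number of student pairs sharing a scored string if every
    string were equally likely (an unrealistic baseline, used only for scale)."""
    pairs = n_students * (n_students - 1) // 2
    return pairs / possible_strings(n_items)

print(possible_strings(50))                 # 1125899906842624
print(expected_matching_pairs(60_000, 50))  # far below one matching pair
```

Roughly 1.1 quadrillion possible strings for 60,000 students: under the uniform baseline, not even one matching pair would be expected.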

My takeaway was that any search for interpretable patterns of responses required a higher level of aggregation than the individual student and item.

99.9%

When Massachusetts was first planning to issue student identification numbers and create a student information database, the IT manager told us that they could promise 99.9% accuracy. The figure sounded pretty darn good to me. Then the project manager on our side of the table replied that with 1,000,000 students in the state 99.9% accuracy meant that the Department would have to deal with 1,000 errors, each one likely resulting in multiple phone calls, letters, and faxes (in the pre-email days).

The Department didn’t have the staff to handle that volume. “Three nines” was not good enough.
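The project manager’s arithmetic is easy to reproduce. The short Python sketch below uses a hypothetical helper name and assumes errors are spread evenly, one per 10^n records at n nines of accuracy.

```python
def expected_errors(n_records, nines):
    """Expected number of bad records at an accuracy of `nines` nines,
    e.g. nines=3 means 99.9% accurate, or one error per 1,000 records."""
    return n_records / 10 ** nines

for nines in (3, 4, 5):
    print(nines, expected_errors(1_000_000, nines))
# 3 1000.0
# 4 100.0
# 5 10.0
```

Each additional nine cuts the error count by a factor of ten.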

And what about the state assessment?

As it was, we were only testing grades 4, 8, and 10 and I was already on a first name basis and swapping Christmas cards with some district assessment coordinators who called regularly to resolve test administration and reporting issues, some of them our fault, some of them theirs, some because c’est la vie.

What would happen when we began testing 7 grade levels under NCLB? We would need “four nines” or “five nines” accuracy plus an external call center trained to interact with schools and districts. (I was comfortably ensconced within the Center for Assessment before those call center days kicked in.)

2%

With NCLB holding districts and schools accountable for 100% proficiency – all means all – there was a scramble to figure out fair ways to include more students in the numerator, as we say (i.e., count them as Proficient). Growth was one solution. The adjustable 1% cap from the Alternate Assessment (AA-AAS) for those students with significant cognitive disabilities was another solution.

Then USED added an additional 2% of students and a new test to the mix, the Alternate Assessment with Modified Achievement Standards (AA-MAS). The AA-MAS was intended for students with disabilities who were being instructed in grade-level content but would be unable to achieve a high enough score or even show enough growth to be counted as proficient on the general assessment.

Who were the 2%?

Were they the 2% of kids immediately “above” the 1% taking the AA-AAS? Were they the “bubble kids” falling just short on the general assessment? Somewhere between those two groups?

Multi-million dollar grants were awarded, analyses conducted, articles written, presentations made, hands wrung, and blood pressure raised, all in the name of answering that question. Not to mention the additional millions of dollars spent on developing, administering, and scoring the tests.

In the end, what discussions and analyses surrounding the 2% test accomplished was to make clear, in no uncertain terms, just how incredibly long the tail is for students all assigned to the same grade level.

10

Even before chronic absenteeism became the in thing that it is today, we thought it would be engaging to conduct a simple analysis showing that there was a relationship between school attendance and performance on the state test. As expected, students who were chronically absent (i.e., missed 18 or more days) performed poorly on the test. Also, students who performed poorly on the test tended to be absent quite a bit. No surprises there.

The surprise was that students who performed at the Proficient level or higher were absent 10 days per year, on average. Half as much as the chronically absent, low-performing kids, but still 2 weeks. I don’t think that I was absent two weeks during my four years of high school. (Elementary school with flus, ear infections, chicken pox, etc. was a different story.)

We have the capacity now to investigate how many kids are absent from a particular class each day. My guess is that the disruption to the flow of instruction and student learning caused by missing students is significant.

7%

When we compared numbers of graduates with fall enrollment figures across multiple years, 7% was the percentage of students enrolled in the twelfth grade each fall (i.e., assigned to a 12th grade homeroom) who did not graduate and receive a diploma the following spring. This figure was double the percentage of students who dropped out in the twelfth grade. With no graduation test in play, these students were a mix of students who failed courses during the 12th grade and those who had not accumulated nearly enough credits in grades 9 through 11 to graduate by the end of the school year.

Sure, we all knew some of the handful of kids who received an empty folder or certificate of attendance when they walked across stage at graduation. And maybe a handful in a typical graduating class of 175-200 does translate to that additional 3.5%.
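As a quick check on that “handful,” here is the arithmetic in Python. The helper name is hypothetical, and the class sizes are the two ends of the range mentioned above.

```python
def students_at_rate(class_size, percent):
    """Number of students represented by a given percentage of a class."""
    return class_size * percent / 100

for class_size in (175, 200):
    print(class_size, students_at_rate(class_size, 3.5))
# 175 6.125
# 200 7.0
```

Six or seven students per graduating class, which is about what a handful looks like.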

But that percentage, and the status quo it reflected, dwarfed any issues related to the graduation test we were trying to implement. That 7% figure is likely lower today now that graduation rates have increased, but I’m not sure that there have been corresponding increases in twelfth graders’ college and career readiness.

500 or 5 (a package deal)

Both numbers refer to the equating of state test forms after a single spring test administration. The first, 500, is the number of forms one state department psychometrician proudly boasted that he equated that spring [redundancy intentional]. The second, 5, is the number of days an assessment contractor scheduled for the equating of test forms from the spring administration of a state assessment program.

What both numbers told me was that contrary to what I had been taught in graduate school and early in my career, test equating was not an art. Equating had been reduced to a routine, mechanized set of analyses conducted within a tight window between scoring and reporting. And “pre-equated” tests of various sorts threatened to make equating as we knew it obsolete.

I’ll concede that one can argue art is still present in test development and in the technical choices that make equating hundreds of test forms in a week possible. And without a doubt, there is artfulness present in the chaos that ensues when equating doesn’t work as expected.

Bottom line, I’m looking for better balance between faith in and reliance on IRT methodology and human judgment.

3.2

Somewhere along the way, the average tenure for chief state school officers in the United States shrank to 3.2 years, with a median of 3 years. Perhaps I was lulled into a false sense of their job security by working throughout the bulk of my career with the ultimate outlier, Peter McWalters in RI (17 years), as well as Dave Driscoll and Mitchell Chester in MA (9 years each), and the commissioners in NH, VT, and ME (approx. 7 years).

You cannot design, implement, and execute effective education reform if you are changing the person at the top every 3 years. Nope. Won’t happen. The average tenure for CEOs at Fortune 500 companies is 7-8 years, although the median is a bit lower. That tenure feels more reasonable.

Add in the average tenure for district leaders (4-5 years), the high turnover in teachers in their first five years, and for the hell of it, the frequent switches in state tests and it’s not a picture of the kind of environment that will nurture and sustain improvement.

1

After 10 years, when the dust settled…

One (1) was the number of students who would have been denied a diploma in 2014 if the RI legislature had not enacted a moratorium that effectively ended a decade of work attempting to implement secondary school regulations and proficiency-based graduation requirements. Never desiring to follow Massachusetts down the graduation test route, Rhode Island attempted to craft a policy that delicately balanced course grades, scores on the grade 10 state tests, and a capstone project or portfolio demonstrating performance in a student’s self-identified area of specialized interest. (If I never see the phrase “33-33-33” again it will be too soon.)

Prior to the legislative action, the state test portion of the graduation policy had already been effectively neutered by the combination of a feisty student-level protest supported by former department of education staff and districts’ exuberant overuse of a waiver intended for the handful of students statewide in a given year who might be unable to demonstrate their true proficiency in Reading or Mathematics on the state test.

But one district held firm. Until the legislature stepped in.

It’s a long and winding road from creating a portrait or vision of a high school graduate to devising graduation requirements true to that vision and applying those requirements to all students.

Graduation policies will always be a tricky balancing act between not holding students responsible for the shortcomings of the system v. misleading them by handing them a diploma without the knowledge, skills, competencies, etc. that it represents.

And there will always be poster children for arguments on both sides of proficiency-based requirements for awarding diplomas. While always being mindful of and making allowances for the exceptions and the exceptional, I’m not sure that we can or should center policies around one student.

In the end, the linchpin that holds graduation requirements together is not the commitment to attain them (although that’s critical) but rather the belief among all key stakeholders that they can be attained. We had that in Massachusetts but didn’t in Rhode Island. And I believe that made all the difference.

Image by Gerd Altmann from Pixabay


Published by Charlie DePascale