assessment, accountability, and other important stuff

Archive for April, 2018

If I Did It

Confessions of a Psychometrician

By OJ Simpsons Paradox with Charlie DePascale

Charlie – As we waited six long months for the release of the 2017 NAEP results, some wondered whether we would ever know the whole story: what really happened that February when NAEP reading and math went digital. Now that those results have been released and the NAEP trend line preserved, what do we really know?

This week, we are pleased to welcome OJ Simpsons Paradox, a statistician and part-time psychometrician, usually locked deep within the bowels of the government where he has the ear of top education policy makers.  Today, he is here to offer his hypothetical account of how a broken trend line could be and should be “fixed” without anyone suspecting a thing.

OJ:  It all starts with NAEP.  The one constant through all the years, Ray, has been NAEP. America has rolled by like an army of steamrollers. It’s been erased like a blackboard, rebuilt, and erased again. But NAEP has marked the time.  This assessment, this trend line, is a part of our past, Ray.  It reminds us of all that once was good, and that could be great again. People want the trend line, Ray.  People definitely want the trend line.

Charlie: OK. You can call me Ray.  But aren’t people skeptical?

OJ: Ray Ray, you just tell them what they want to hear hear hear hear hear.  You need to tell em tell em tell em What they wanna hear.

Charlie: Sure, people will hear what they want to hear, believe what they want to believe; but this is psychometrics, measurement, facts…

OJ: It’s statistics, son.  Facts are stubborn things, but statistics are pliable.

Charlie: Pliable, yes. But, if the trend line were broken, how could you fix it?  You tell us that in the national sample students taking the test on paper performed 4 percentage points better on each item than those taking the test on computer.  That sounds like a big difference.

How does that compare to the p-value difference normally found between a top-performing state like Massachusetts and the national average or with states near the bottom of the list?

OJ: Right, in State A there is a 5-point scale score difference …

Charlie: Wait.  Sorry to interrupt.  No, I am asking about the national p-value difference.

OJ: Mindset.  You start with the mindset that the trend has been preserved and that you need incontrovertible evidence to prove that it has been broken.  The rest is just statistics.

You tell me that there is a 5 point difference between a state’s performance on paper and computer.  You think, “Damn, five points on NAEP is huge!”  NAEP can go 30 years without changing by five points.

But, could a difference that large happen by chance?  Maybe not too often, but 5/100 times, 1/100 times, 1/1000 times – you see where I am going with this?

Charlie: But what about Power?  With a paper sample of only 500 students…

OJ: Power!  We can take a year to report results and nobody bats an eye.  We can post cute little Twitter surveys while people are waiting and people ‘like’ them.

We can take the time we need to prepare the message. When I worked for a state we were taken to court and lost when we wanted to take two days to prepare a memo before releasing results.

We can bury you with videos, charts, graphs, data tools when we release the results.

That’s the only power I need.

Charlie:  People will want to know what happened to the trend line.

OJ:  We are reporting that nothing happened to the trend line, Don.  Reports that something hasn’t happened are always interesting to me, because as we know there are known knowns; there are things we know we know.  We also know there are known unknowns; that is to say we know there are some things we do not know.  But there are also unknown unknowns – the ones we don’t know we don’t know.

Charlie:  What does that mean?

OJ: Exactly!

Charlie: The trend line.  Was it broken?

OJ: Son, we live in a world that has trend lines; and those trend lines need to be maintained.  Who’s gonna do it?  You?  You have the luxury of not knowing what I know – that misrepresenting performance of an individual state, while tragic, probably saved lives; and my existence, while grotesque and incomprehensible to you, saves lives.

You want me maintaining that trend line!


Charlie: Well, thank you OJ.  That’s all the time we have today.  We are all looking forward to the release of the 2018 NAEP results later this fall.

OJ: We’ll see.

It’s about time

Charlie DePascale

We have all asked the question, “Where did the time go?”

As troubling as that question can be, more recently I find myself pondering an even more vexing question: “Where did time go?”

Every day, it seems as though time has been removed as a dimension of some part of our lives in which it was always central.

Television, of course, is a prime example.  I grew up with “same bat time, same bat channel,” and Sunday nights at 8 with the family in front of the television (could I stay awake long enough to see Topo Gigio?).  Later there was 11:30 on Saturday nights and “must see TV” on Thursday.  Appointment television!

Now, I can watch a show whenever, wherever, and however I want – on demand.  I can still watch any of those shows referenced above as easily as a show that aired last night.  And not just whole shows.  I can pull up a clip of my favorite moments, like Sheldon erasing time as he makes a basic mistake while explaining the time parameter to Penny on The Big Bang Theory.

Not only can we pull up television or movie clips; clips of our own lives are now also neatly stored and readily available on demand.  We are supposed to forget certain things over time and to be able to process, shape, and reshape our memories. However, as Taylor Swift wrote recently,

“This is the first generation that will be able to look back on their entire life story documented in pictures on the internet, and together we will all discover the after-effects of that.”

Will it become more difficult for time to heal all wounds if we remove the passing of time; if every day, or at any time, moments in our lives are replayed for us in full color, with video and even sound?

Our Brief History of Time

Educational measurement, of course, has not been immune to this loss of time.  In previous posts, I have discussed our loss of the time needed to design, develop, and evaluate assessment programs before making them operational. There is also the apparent lack of any understanding or consideration of time and the foundational formula D = RT when setting accountability goals for individual students, schools, or states.  The loss of time that I want to discuss today, however, is more fundamental to educational assessment.

Not so long ago, time was central to the design and administration of tests and also to the reporting and interpretation of test scores.  In the heyday of norm-referenced testing, test scores were based directly on the interpretation of a student’s performance at a particular point in time.  Grade Equivalent scores described student performance in terms of what was typical (or expected) at a given point in time within a school year.  Those scores, as well as percentile ranks and stanines, were based on the particular point in time at which the test was administered, with separate norm tables developed for each week within a defined test administration window.  As we moved to the NCLB era and more criterion-referenced achievement levels, student performance was still evaluated and interpreted in comparison to expectations at a fixed point in time (i.e., at the end of a particular grade level).  Time was still in play as recently as 2010 with the advent of the Common Core State Standards, when we spoke of student proficiency in grades 3-8 in terms of being “on track for college-and-career readiness” by the end of high school.  Referring again to our old friend, D = RT, the use of the term ‘on track’ implies that we have a fairly thorough understanding of distance, rate, and of course, time.
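The ‘on track’ arithmetic can be made concrete with a small sketch.  All of the numbers below are invented for illustration, not real NAEP or CCSS values: if a student sits D scale-score points below the college-and-career-ready benchmark with T years until the end of high school, D = RT implies a required growth rate of R = D/T.

```python
# Hypothetical illustration of the D = RT "on track" arithmetic.
# All numbers are invented for the example, not real assessment values.

def required_rate(distance_to_benchmark, years_remaining):
    """Scale-score growth per year implied by D = R * T."""
    return distance_to_benchmark / years_remaining

def on_track(current_rate, distance_to_benchmark, years_remaining):
    """A student is 'on track' if their observed growth rate meets
    or exceeds the rate implied by the remaining distance and time."""
    return current_rate >= required_rate(distance_to_benchmark, years_remaining)

# A grade 3 student 90 points below the end-of-high-school benchmark,
# with 9 years to go, must average 10 points of growth per year.
print(required_rate(90, 9))    # 10.0
print(on_track(12.0, 90, 9))   # True: growing faster than required
print(on_track(8.0, 90, 9))    # False
```

The claim of ‘on track,’ in other words, is only as good as our knowledge of all three quantities at once.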

Losing Track of Time

Somewhere over the last five years, however, the assessment/measurement community lost track of time.  Ironically, in part, our loss of an appreciation for time can be attributed to pressures directly related to time – too much testing time, too long to report results, and the well-intentioned yet poorly conceived backlash against “seat time” in favor of competencies to be defined later.

But those reasons can only partially explain our complete abandonment of time. Perhaps we have simply succumbed to the pressures of an on-demand world.  Perhaps we started to believe our own rhetoric about vertical scales, invariance, and the wonders of IRT. Perhaps the assessment industry is simply trying to adapt to technology and the “lean startup” concept – get the product in the hands of the customer faster.

With almost reckless faith in psychometric theory we are willing to boldly go where no assessment person has gone before.  We will administer items anytime, anywhere, in any combination, and apply item parameters generated across wide swaths of time (it all averages out in the end) to produce a theta estimate for a student.
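As a rough, hypothetical sketch of what ‘produce a theta estimate’ involves, here is a toy two-parameter logistic (2PL) IRT model with invented item parameters (a = discrimination, b = difficulty), a grid-search maximum-likelihood theta, and a made-up linear conversion to a reporting scale.  Operational estimation is far more elaborate than this, but the shape of the calculation is the same.

```python
import math

# Toy 2PL IRT theta estimation; item parameters and responses are
# invented for illustration only.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 0.7)]   # (a, b) per item
responses = [1, 1, 0]                            # 1 = correct, 0 = incorrect

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta):
    """Log-likelihood of the response pattern at a given theta."""
    ll = 0.0
    for (a, b), x in zip(items, responses):
        p = p_correct(theta, a, b)
        ll += math.log(p) if x == 1 else math.log(1.0 - p)
    return ll

# Maximum-likelihood theta via a coarse grid search over a plausible range.
grid = [i / 100.0 for i in range(-400, 401)]
theta_hat = max(grid, key=log_likelihood)

# Converting theta to a reporting scale is typically a linear transform;
# the constants here (center 250, spread 40) are invented for illustration.
scale_score = 250 + 40 * theta_hat
print(round(theta_hat, 2), round(scale_score, 1))
```

The final line is the seductive part: the theta-to-scale arithmetic is trivial even when the construct underneath it is not.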

And what do we do with that theta estimate?  That’s where things get tricky.  Our “time-based” tools for reporting and interpreting test scores have not caught up with this new “time-free” approach to assessment.  We convert the theta estimate to a scale score – even a vertical scale score.  And then …

Time is all we have

And then we are face-to-face with the reality that educational assessment cannot exist without time.  Without slipping into the philosophical argument over whether any type of psychological measurement, including educational measurement, is “real measurement,” we have to acknowledge that virtually all of our IRT-based assessment lacks the underpinning of a theory-based scale.  At our best, we assemble an agreed-upon collection of items and collect data on student performance on those items at a particular point in time.  We cannot interpret student performance on our large-scale assessments without a consideration of time and both the expected and relative performance of students at that point in time.  We can make awkward attempts to couch test scores in criterion-referenced terms, but as the quote often attributed to Bob Linn says, “scratch a criterion and you’ll find a norm.”

But if we have the serenity to accept the ways in which we cannot change our dependence on time, perhaps we will have the courage to change the things that we can change, and the wisdom to know the difference.

At this time, we are embarking on one of our field’s greatest adventures and challenges – the development of assessments to measure attainment of the Next Generation Science Standards.  It is a task that challenges everything we know and hold dear about alignment, item construction, test construction, scoring, reporting, reliability, and of course, validity.  With nothing more than a meager notion of a construct, we are developing and implementing NGSS assessments.  Perhaps these NGSS assessments will be an example of the old principles of test construction meeting the new principles of the lean startup strategy – iterating with the client to understand the construct and build the product that is needed.  The NGSS assessments and construct will form and re-form each other over time.  If that’s the mindset of the assessment developers, clients, and policy makers, that’s not necessarily a bad approach.

Only time will tell.