assessment, accountability, and other important stuff

Charlie DePascale

When I think about educational measurement the first thing that comes to mind is a high-fructose corn syrup commercial from about 10 years ago.


On one side there is the man who holds, but cannot articulate, the widespread, but ill-defined, perception (misperception?) that high-fructose corn syrup is inherently bad.  On the other side is the woman with the tempting treat who provides a couple of carefully selected facts and makes the claim that high-fructose corn syrup is fine in moderation.  Man takes the treat from Woman and all is right in their 30-second commercial world. It is truly an Adam and Eve moment – although that’s probably not the allusion the sponsors of the commercial were after.

In 2018, as a field and as an industry, educational measurement finds itself in much the same place as high-fructose corn syrup.  We developed an appealing, inexpensive product (i.e., large-scale standardized tests), exulted in its success, and then could do little when we lost control of the product that defines us.   For much of this century, we have taken the role of the woman in the HFCS commercial.   We assure people that there is nothing harmful or evil about educational measurement used properly and in moderation, all the while watching test use soar beyond anything that can be called moderation. Assuming that it will be impossible to produce a cute 30-second video on the benefits of educational measurement as effective as the counterargument that John Oliver has already produced, where do we go from here?

I think that the only solution is to engage aggressively in rebranding. Educational Measurement is ripe for a makeover or perhaps even a complete do-over.  Now is the time to change not only the surface image of educational measurement, but what we actually mean when we talk about educational measurement.

Two assessment industry icons have already started this rebranding.  College Board began by changing the name of the SAT (much like KFC), conducted a major overhaul of its flagship instrument, and then created a new suite of products and services aimed at a new market.  ACT has gone even further as it redefines itself as a learning company rather than an assessment company.  As described in an EdSurge article earlier this year, ACT CEO Marten Roorda “wants the ACT to become more involved in the learning process, and provide more analytics solutions to teachers and students.”

Reforms to the ACT and SAT assessments, of course, are just the tip of the iceberg. Learning analytics, big data, personalized instruction, and adaptive learning are trending topics in education which are already impacting the measurement community.  At the ITC 2018 conference in July, John Hattie and Alina von Davier delivered keynote addresses on visible learning and computational psychometrics, respectively, which forced those listening to reconsider how we think about and do educational measurement.  As Kathleen Scalise explained at the 2018 NCME conference in April, it is not a question of if or when big data and learning analytics will impact educational measurement; they are already here, and they already have.

Back to our roots

We must begin any attempt to rebrand, redefine, or refocus educational measurement by revisiting our roots.  And the best place to reconnect with those roots is the first edition of the so-called bible of our field, Educational Measurement, published in 1951.  We choose this as a starting point because as explained by E.F. Lindquist (editor), “prior to its publication … no book had yet been published that would even begin to fill an urgent need…for a comprehensive handbook and textbook on the theory and technique of educational measurement.”

It can also be argued that among the four editions of Educational Measurement (1951, 1971, 1989, 2006), the initial edition made the best attempt to ask and answer the Why question; that is, to define the purpose of educational measurement.  It is from understanding the purpose of educational measurement that we are able to glean the core values and guiding principles of our field, which is the first step in any rebranding effort.

In Part 1, The Functions of Measurement in Education, the 1951 edition begins with four chapters that address fundamental issues related to the primary functions of measurement in education at that time:

  • The Functions of Measurement in the Facilitation of Learning
  • The Functions of Measurement in Improving Instruction
  • The Functions of Measurement in Counseling
  • The Functions of Measurement in Educational Placement

Part 3, Measurement Theory, begins with a chapter on The Fundamental Nature of Measurement that ends with a section titled, Explanation as the End of Measurement, and the following admonition:

The primary concern of measurement, however, should be for an understanding of the entire field of knowledge rather than with statistical or mathematical manipulations upon observations.

Knowledge will be advanced by recognizing what the empirical methods of measurement ignore…The aim of measurement must ever be the explanation of, or the meaning for, observed phenomena.

A practical application of those statements is provided by Ralph Tyler in describing the organization of his chapter on the functions of measurement in improving instruction:

Since the purpose of this chapter is to outline the ways in which educational measurement, that is, achievement testing, can serve to improve instruction, we shall consider first what steps are involved in an effective program of instruction and then indicate the contributions that achievement testing can make to each of these steps.  In this connection it will be noted that educational measurement is conceived, not as a process quite apart from instruction, but rather as an integral part of it.

Tyler then goes on to describe four sequential phases of instruction:

  1. To decide what ends to seek; that is, what changes in student behavior to try to bring about
  2. To determine the content and learning experiences likely to attain those ends
  3. To determine an effective organization of those learning experiences to bring about the desired ends effectively and efficiently
  4. To appraise the effects of the learning experiences to determine whether they have brought about the desired ends or changes in student behavior.

He argued in 1951 that educational measurement as a field had become stuck on Step 4 (documenting the effects of instruction), not focusing enough on how educational measurement can and should inform and support the other three steps of instruction.

This problem has only been exacerbated in the last 60+ years as our field has become more technical, more specialized, and more separated from instruction.  While we may pay lip service to the notion that a key purpose of educational measurement is to facilitate learning and improve instruction, we do little to understand and support that function.

Educational measurement must find a way to support all aspects of the instructional process, toward the ultimate goal of improving student learning.  And having taken on that task, we must find a way to convey the message that measurement is more than a test of student outcomes.

Rebuilding and Rebranding

Obviously, rebuilding and rebranding educational measurement will not be simple.  It will require more than a quick fix like a 30-second commercial, a catchy new slogan, or a name change. However, although not sufficient, I do think that a name change is necessary.  The term ‘educational measurement’ is too closely associated with achievement testing to continue to serve a useful purpose.  Additionally, it does not accurately reflect either what we have been doing as a field for the last 60 years or the new directions in which the field is moving (e.g., with a focus on personalization and computational psychometrics).

My suggestion for a starting point is to replace the term ‘measurement’ with ‘modeling’ – Educational Modeling.  What is the case for modeling? Just for starters …

  1. With a few notable exceptions, modeling is a much more accurate description of what we do as a field than measurement. (Yes, I see you out there, Rasch folks.)
  2. By its very nature, the term modeling conveys a sense of concern with an entire process or an entire system and the interactions among the components of that system.
  3. Measurement, not just educational measurement, is an outdated 20th century concept. The 21st century world is just much too complex to measure. Our field, and psychology in general, latched on to the term measurement last century because it was cool and gave the field credibility. Modeling is the new measurement.
  4. Finally, through the Common Core State Standards (and its offspring) we have invested nearly a decade in spreading the word to K-12 educators, students, and the general public of the importance of modeling and its central role in all that we do as intelligent human beings.  Let’s take advantage of that and create coherence between what we say and what we do in education.

The Common Core defines modeling as “the process of choosing and using appropriate mathematics and statistics to analyze empirical situations, to understand them better, and to improve decisions.”  That sounds like what we are doing (or should be doing) in educational measurement.  In further describing modeling, the Common Core states, “Real-world situations are not organized and labeled for analysis; formulating tractable models, representing such models, and analyzing them is appropriately a creative process. Like every such process, this depends on acquired expertise as well as creativity.”

Again, isn’t that what we are supposed to be doing in educational measurement?

Let the games begin

To get this ball rolling, I call on NCME, consistent with their vision to be the recognized authority in measurement in education, to take the first step by changing their name to the National Council on Modeling in Education.  They won’t even have to change their logo, URL, or Twitter handle.

The next step would be for NCME and co-editors Linda Cook and Mary Pitoniak to make the upcoming 5th edition of our bible, Educational Measurement, a New Testament for our field.  Educational Modeling has a nice ring to it as a title.

We have to start somewhere to restore the reputation of educational measurement.

Are you ready for it?


Give Me A Lever

Charlie DePascale


I realized very early in my career that the law of the lever, as captured in some variation of Archimedes’ famous boast that with a lever long enough and a place to stand he could move the world, was critical to my success. In short, there was little that I could do on my own as an assessment specialist, or psychometrician, to improve education; but working in concert with the right lever, we could move the world. My task, therefore, was to identify and associate with those levers.

In large-scale assessment that lever most often is a state policymaker; that is, a deputy commissioner, commissioner, board member, or governor. I left state assessment directors off of this list, because in many ways they are in the same position as I am. Without a policymaker as a lever, there is little that an assessment director can do. And there is no need to explain why federal education officials are never the appropriate lever.

[Aside: From my perspective as an assessment specialist, I see policymakers as my lever.  From their perspective, I may be their lever.  That is not really important.  What matters is the understanding that we need each other.  I may need them more than they need me, but we do need each other.]

Like operating a simple lever, the process of working with a policymaker is quite straightforward.  Begin with a policymaker who has a clear vision of what she or he wants to accomplish and a sense that assessment can help.  From that starting point, identify ways in which assessment can be used to support or advance the policymaker’s goals. Working together, determine what type of assessment is needed and how best to convey information from the assessment.  Understand what the policymaker would like to say (or needs to say) and work together to figure out a way to help her or him say it in their own voice.  Equally important, help them understand what the assessment cannot do and what they should not say.

When it all comes together just right, it’s a beautiful thing. Over the course of my career, I have been fortunate to be associated with several policymakers and assessment programs where things did come together quite well.

Of course, the pieces do not always come together exactly as you hoped.  Perhaps there are too many constraints (cost, time, capacity) to design and develop the assessment that is needed.  Perhaps the education leadership, governor, and legislature are not on the same page.  Perhaps other goals are higher on the policymaker’s priority list.  Or perhaps things came together for one brief, shining moment, but could not be sustained.

It is in such less than ideal situations that working with the right levers becomes even more important. With the right partners, you are often able to adapt, work through the issues, and make the best of the situation; sometimes just treading water until the context changes.  Without the right partners in place, however, oh you’ve got trouble. There is little that an assessment specialist can do – the assessment program flounders and the state moves on to another assessment program.

What about assessment in the classroom?

The importance of having the right partners is easy to see with regard to large-scale assessment.  Ideally, the assessment specialist and policymaker are working side-by-side to implement and maintain the assessment program.  That type of direct relationship rarely exists with regard to assessment in the classroom.  However, partners matter just as much to an assessment specialist when considering assessment at the classroom level.

My lever in the classroom is the teacher rather than the policymaker.  As an assessment specialist, however, I am likely to be much farther removed from a teacher than I was from a policymaker. I will seldom be in a position to interact directly with teachers as they make assessment decisions and use assessment information. The basic equation, however, remains the same.  For my work to make a difference in the classroom there must be an appropriate partner ready and willing to use it.

My task becomes providing tools that will help put the right information at the right time into the hands of a teacher who can use it to inform and improve instruction – in support of the ultimate goal of improved student learning.

That task is complicated by the fact that there are some teachers who are not prepared to be good levers and others who may be in situations that do not allow them to be good levers.  I might provide the same information, in the same way, at the same time to two teachers in the same school and see very different results.  Without a teacher prepared to use it, any information that I can provide will be much less effective.

How does this impact what I do and how I do it?

First, I have to acknowledge and understand the role of the teacher as my partner.  Although I will not be working side-by-side with individual teachers during implementation, I cannot work in isolation from teachers during the design and development process.

Second, I have to make the assumption that there will be a good partner in place at the classroom level. I have to design tools and resources that will be useful to an effective teacher.

Third, I have to realize that there will be many cases where the second assumption above is false.  I am convinced, however, that the solution is not to try to develop tools and resources that can be used by any teacher, regardless of the knowledge and skills they bring to the table. That is a fool’s errand.

Rather, the solution is to identify and work with other partners to make the second assumption above more likely to be true.  Those partners will include state-level policymakers, district and school administrators, developers of curriculum and instructional support materials, and teacher educators. Throughout the entire process of preparing, certifying, and providing in-service support to teachers there must be a concerted effort to ensure that teachers are equipped to effectively use assessment in the classroom.

In short, …

The solution is not to try to make assessment teacher proof.

The solution is not add-on programs and materials designed to make teachers “assessment literate”.

The solution is to work with partners on multiple levels to better provide useful information to teachers who are prepared to make use of it.

 

If I Did It

Confessions of a Psychometrician

By OJ Simpsons Paradox with Charlie DePascale

Charlie: As we waited six long months for the release of the 2017 NAEP results, some wondered whether we would ever know the whole story: what really happened that February when NAEP reading and math went digital. Now that those results have been released and the NAEP trend line preserved, what do we really know?

This week, we are pleased to welcome OJ Simpsons Paradox, a statistician and part-time psychometrician, usually locked deep within the bowels of the government, where he has the ear of top education policy makers.  Today, he is here to offer his hypothetical account of how a broken trend line could be and should be “fixed” without anyone suspecting a thing.

OJ:  It all starts with NAEP.  The one constant through all the years, Ray, has been NAEP. America has rolled by like an army of steamrollers. It’s been erased like a blackboard, rebuilt, and erased again. But NAEP has marked the time.  This assessment, this trend line, is a part of our past, Ray.  It reminds us of all that once was good, and that could be great again. People want the trend line, Ray.  People definitely want the trend line.

Charlie: OK. You can call me Ray.  But aren’t people skeptical?

OJ: Ray Ray, you just tell them what they want to hear hear hear hear hear.  You need to tell em tell em tell em What they wanna hear.

Charlie: Sure, people will hear what they want to hear, believe what they want to believe; but this is psychometrics, measurement, facts…

OJ: It’s statistics, son.  Facts are stubborn things, but statistics are pliable.

Charlie: Pliable, yes. But, if the trend line were broken, how could you fix it?  You tell us that in the national sample students taking the test on paper performed 4 percentage points better on each item than those taking the test on computer.  That sounds like a big difference.

How does that compare to the p-value difference normally found between a top-performing state like Massachusetts and the national average or with states near the bottom of the list?

OJ: Right, in State A there is a 5-point scale score difference …

Charlie: Wait.  Sorry to interrupt.  No, I am asking about the national p-value difference.

OJ: Mindset.  You start with the mindset that the trend has been preserved and that you need incontrovertible evidence to prove that it has been broken.  The rest is just statistics.

You tell me that there is a 5 point difference between a state’s performance on paper and computer.  You think, “Damn, five points on NAEP is huge!”  NAEP can go 30 years without changing by five points.

But, could a difference that large happen by chance?  Maybe not too often, but 5/100 times, 1/100 times, 1/1000 times – you see where I am going with this?

Charlie: But what about Power?  With a paper sample of only 500 students…

OJ: Power!  We can take a year to report results and nobody bats an eye.  We can post cute little Twitter surveys while people are waiting and people ‘like’ them.

We can take the time we need to prepare the message. When I worked for a state we were taken to court and lost when we wanted to take two days to prepare a memo before releasing results.

We can bury you with videos, charts, graphs, data tools when we release the results.

That’s the only power I need.

Charlie:  People will want to know what happened to the trend line.

OJ:  We are reporting that nothing happened to the trend line, Don.  Reports that something hasn’t happened are always interesting to me, because as we know there are known knowns; there are things we know we know.  We also know there are known unknowns; that is to say we know there are some things we do not know.  But there are also unknown unknowns – the ones we don’t know we don’t know.

Charlie:  What does that mean?

OJ: Exactly!

Charlie: The trend line.  Was it broken?

OJ: Son, we live in a world that has trend lines; and those trend lines need to be maintained.  Who’s gonna do it?  You?  You have the luxury of not knowing what I know – that misrepresenting performance of an individual state, while tragic, probably saved lives; and my existence, while grotesque and incomprehensible to you, saves lives.

You want me maintaining that trend line!

YOU NEED ME MAINTAINING THAT TREND LINE!

Charlie: Well, thank you OJ.  That’s all the time we have today.  We are all looking forward to the release of the 2018 NAEP results later this fall.

OJ: We’ll see.
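Satire aside, OJ’s “could it happen by chance?” point is plain sampling arithmetic. Here is a minimal simulation sketch, using entirely hypothetical numbers (a true item p-value of .60, identical for both modes, and the 500-student paper sample mentioned in the interview):

```python
import random

random.seed(42)

TRUE_P = 0.60      # hypothetical true item p-value, identical for both modes
N_PAPER = 500      # the paper sample size mentioned in the interview
TRIALS = 10_000

# Count how often a sample of 500 drifts 4+ points from the truth
# through sampling variation alone -- with no mode effect at all.
big_gaps = 0
for _ in range(TRIALS):
    correct = sum(random.random() < TRUE_P for _ in range(N_PAPER))
    if abs(correct / N_PAPER - TRUE_P) >= 0.04:
        big_gaps += 1

print(f"4+ point gap by chance alone: {big_gaps / TRIALS:.1%}")
```

With a standard error of roughly two percentage points at this sample size, a gap that looks “huge” on a NAEP time scale shows up in a nontrivial share of replications, which is exactly the pliability OJ is counting on.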

It’s about time

Charlie DePascale

We have all asked the question, “Where did the time go?”

As troubling as that question can be, more recently I find myself pondering an even more vexing question: Where did time go?

Every day, it seems as though time has been removed as a dimension or component of some part of our lives in which it was always really important.

Television, of course, is a prime example.  I grew up with “same bat time, same bat channel,” and Sunday nights at 8 with the family in front of the television (could I stay awake long enough to see Topo Gigio?).  Later there was 11:30 on Saturday nights and “must see TV” on Thursday.  Appointment television!

Now, I can watch a show whenever, wherever, and however I want – on demand.  I can still watch any of those shows referenced above as easily as a show that aired last night.  And not just whole shows.  I can pull up a clip of my favorite moments; like Sheldon erasing time as he makes a basic mistake while explaining the time parameter to Penny on The Big Bang Theory.

Not only can we pull up television or movie clips, but clips of our own lives are now also neatly stored and readily available on demand.  We are supposed to forget certain things over time and to be able to process, shape, and reshape our memories. However, as Taylor Swift wrote recently,

“This is the first generation that will be able to look back on their entire life story documented in pictures on the internet, and together we will all discover the after-effects of that.”

Will it become more difficult for time to heal all wounds if we remove the passing of time; if every day, or at any time, moments in our lives are replayed for us in full color, with video and even sound?

Our Brief History of Time

Educational measurement, of course, has not been immune to this loss of time.  In previous posts, I have discussed our loss of the time needed to design, develop and evaluate assessment programs before making them operational. There is also the apparent lack of any understanding or consideration of time and the foundational formula D = RT when setting accountability goals for individual students, schools, or states.  The loss of time that I want to discuss today, however, is more fundamental to educational assessment.

Not so long ago, time was central to the design and administration of tests and also to the reporting and interpretation of test scores.  In the heyday of norm-referenced testing, test scores were based directly on the interpretation of a student’s performance at a particular point in time.  Grade Equivalent scores described student performance in terms of what was typical (or expected) at a given point of time within a school year.  Those scores, as well as percentile ranks and stanines, were based on the particular point in time at which the test was administered, with separate norm tables developed for each week within a defined test administration window.  As we moved to the NCLB era and more criterion-referenced achievement levels, student performance was still evaluated and interpreted in comparison to expectations at a fixed point in time (i.e., at the end of a particular grade level).  Time was still in play as recently as 2010 with the advent of the Common Core State Standards, when we spoke of student proficiency in grades 3-8 in terms of being “on track for college-and-career readiness” by the end of high school.  Referring again to our old friend, D = RT, the use of the term ‘on track’ implies that we have a fairly thorough understanding of distance, rate, and of course, time.
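Our old friend D = RT can be made concrete. A minimal sketch, with hypothetical scale scores and an invented college-and-career-ready benchmark, of the arithmetic an “on track” claim quietly assumes:

```python
# "On track" in D = RT terms: distance = target score minus current score,
# time = years remaining, rate = growth the student must average each year.
# All numbers below are hypothetical, for illustration only.

def required_rate(current_score: float, target_score: float,
                  current_grade: int, target_grade: int) -> float:
    """Solve D = R * T for R: the average yearly growth needed."""
    distance = target_score - current_score
    time = target_grade - current_grade
    return distance / time

# A hypothetical grade 3 student scoring 440, aiming at a 560
# "college-and-career ready" benchmark by the end of grade 11:
rate = required_rate(440, 560, 3, 11)
print(f"Needs {rate:.1f} scale-score points of growth per year")  # → 15.0
```

Claiming a student is “on track,” in other words, asserts that we know all three terms, not just the one the test measured.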

Losing Track of Time

Somewhere over the last five years, however, the assessment/measurement community lost track of time.  Ironically, in part, our loss of an appreciation for time can be attributed to pressures directly related to time – too much testing time, too long to report results, and the well-intentioned yet poorly conceived backlash against “seat time” in favor of competencies to be defined later.

But those reasons can only partially explain our complete abandonment of time. Perhaps we simply have succumbed to the pressures of an on-demand world.  Perhaps we started to believe our own rhetoric about vertical scales, invariance, and the wonders of IRT. Perhaps the assessment industry is simply trying to adapt to technology and the “lean startup” concept – get the product in the hands of the customer faster.

With almost reckless faith in psychometric theory we are willing to boldly go where no assessment person has gone before.  We will administer items anytime, anywhere, in any combination, and apply item parameters generated across wide swaths of time (it all averages out in the end) to produce a theta estimate for a student.
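The pipeline just described can be sketched in a few lines. This is a toy Rasch model with hypothetical item difficulties and an arbitrary reporting scale, not any operational program’s method:

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """P(correct) under the Rasch model for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta, difficulties, responses):
    """Log-likelihood of a 0/1 response vector at a given theta."""
    ll = 0.0
    for b, x in zip(difficulties, responses):
        p = rasch_prob(theta, b)
        ll += math.log(p) if x else math.log(1.0 - p)
    return ll

def estimate_theta(difficulties, responses):
    """Crude maximum-likelihood estimate via grid search over theta."""
    grid = [i / 100 for i in range(-400, 401)]   # -4.00 to 4.00
    return max(grid, key=lambda t: log_likelihood(t, difficulties, responses))

difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical item bank slice
responses    = [1, 1, 1, 0, 0]               # one student's answers

theta = estimate_theta(difficulties, responses)
scale_score = 500 + 50 * theta               # arbitrary reporting scale
print(f"theta = {theta:.2f}, scale score = {scale_score:.0f}")
```

The leap of faith is in the first list: treating item difficulties estimated across wide swaths of time as if they hold for this student, on this occasion.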

And what do we do with that theta estimate?  That’s where things get tricky.  Our “time-based” tools for reporting and interpreting test scores have not caught up with this new “time-free” approach to assessment.  We convert the theta estimate to a scale score – even a vertical scale score.  And then …

Time is all we have

And then we are face-to-face with the reality that educational assessment cannot exist without time.  Without slipping into the philosophical argument over whether any type of psychological measurement, including educational measurement, is “real measurement,” we have to acknowledge that virtually all of our IRT-based assessment lacks the underpinning of a theory-based scale.  At our best, we assemble an agreed-upon collection of items and collect data on student performance on those items at a particular point in time.  We cannot interpret student performance on our large-scale assessments without a consideration of time and both the expected and relative performance of students at that point in time.  We can make awkward attempts to couch test scores in criterion-referenced terms, but as the quote often attributed to Bob Linn says, “scratch a criterion and you’ll find a norm.”

But if we have the serenity to accept the ways in which we cannot change our dependence on time, perhaps we will have the courage to change the things that we can change, and the wisdom to know the difference.

At this time, we are embarking on one of our field’s greatest adventures and challenges – the development of assessments to measure attainment of the Next Generation Science Standards.  It is a task that challenges everything we know and hold dear about alignment, item construction, test construction, scoring, reporting, reliability, and of course, validity.  With nothing more than a meager notion of a construct, we are developing and implementing NGSS assessments.  Perhaps these NGSS assessments will be an example of the old principles of test construction meeting the new principles of the lean startup strategy – iterating with the client to understand the construct and build the product that is needed.  The NGSS assessments and construct will form and re-form each other over time.  If that’s the mindset of the assessment developers, clients, and policy makers that’s not necessarily a bad approach.

Only time will tell.

Bring back valid tests


Charlie DePascale

Those of us of a certain age can recall when it was acceptable to talk about valid tests.  One could make the claim that a test was valid; or more often as a precocious graduate student, question whether a test was valid. Some version of the phrase a test is valid to the extent that it measures what it purports to measure rolled off our tongues and made its way into literally every paper we wrote, presentation we made, or late-night conversation we had about tests and testing.

But then things changed.

It was no longer acceptable to talk about valid tests.  Validity was not a property of the test.  Validity was “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (Messick, 1989, p. 13, emphasis in original).

This change, of course, did not happen overnight.  It was the culmination of decades of debate about the meaning of validity and validation.  As with any transition in thinking, the field tolerated references to “valid tests” for a while, long after the focus of validation had shifted to interpretation, inferences, and actions.  At some point, however, the use of the term became socially unacceptable.

Validity – New and Improved

In general, there are three primary reasons why you would change the definition of, or the way we talk about, a concept as central to a field as validity is to educational measurement and assessment.

  • There is new information or understanding that renders the existing definition obsolete.
  • There is a desire or need to clarify the existing definition so that the concept is better understood by theorists and practitioners within the field.
  • There is a desire or need to clarify the existing definition so that it results in better understanding and actions by the broader public.

With regard to validity and educational measurement, one can make a stronger case for the second argument (clarity within the field) than for the first (new information).  Beyond the appeal of a unified theory of validity within the field, however, the third reason, promoting better test use, was the driving force behind the repackaging of validity.  As Shepard (2016) clearly describes, “the reshaping of the definition of validity, beginning in the 1970s and continuing through to the 1985 and 1999 versions of the Standards, came about in the context of the civil rights movement because tests were being used to sort and segregate in harmful ways that could not be defended as valid.”

Shepard (2016) and Cronbach (1988) explain that validity and validation are concepts that must be understood and applied appropriately by the vast body of test users to their vast array of test uses.  Cronbach argues that validators of the appropriateness of tests meet their responsibility “through activities that clarify for a relevant community what a measurement means, and the limitations of each interpretation.  The key word is ‘clarify.’”

Validity – Clarified?

We are now in at least our sixth decade of clarifying the concept of validity.  Over the last 30 years, successive versions of the Joint Standards have been consistent both in the importance they assign to the concept of validity and in its definition:

Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. (1985)

Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of the tests. Validity, therefore, is the most fundamental consideration in developing and evaluating tests. (1999)

Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests.  Validity is, therefore, the most fundamental consideration in developing and evaluating tests.  (2014)

The 1999 and 2014 Joint Standards further state,

“The process of validation involves accumulating relevant evidence to provide a sound scientific basis for proposed test score interpretations. It is the interpretations of test scores for proposed uses that are evaluated, not the test itself.”

As scientists, researchers, evaluators, or interested observers of educational measurement and test use, we must ask the question: has this definition of validity, largely unchanged for 30 years, clarified the concept of validity and resulted in better and more appropriate test use by policy makers and by the general public?  My answer to that question is simply to invoke all of the validity issues associated with the phrase Test-based Accountability.

So, if the current definition of validity has not been as successful as we would have liked with end users of assessment, has the focus on a unified theory of validity, validity arguments and “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” clarified the concepts of validity and validation within the field?

Well, not so much.  If this is what clarity looks like, then I shudder to think of how things would look with a lack of clarity.

  • Yes, there is consensus on the importance of collecting evidence to support proposed test score interpretations and uses. What constitutes evidence, how much of it is necessary, how it should be evaluated, and who is responsible for collecting, evaluating, and presenting that evidence is another story.
  • Technical concepts such as constructs, construct under-representation, and construct-irrelevant variance are crossed with commonly used words such as alignment and comparability to create tables of test evaluation criteria that have no meaning to the field or the general test user.
  • The shift away from external criteria and predictions has had the unintended consequence of increasing acceptance of the test score as the construct. There can be no give-and-take between theory and evidence from the assessment if the test score is the only evidence available or considered.
  • Although Cronbach wrote of the responsibility of validators, it is no longer clear who those validators are. If everyone is responsible for validation, nobody takes responsibility for validation.  If validation is an ongoing (i.e., unending) process, we can too easily put off some of those validation studies for another year, or two, or three until we have time to gather good data about how the test scores are being interpreted and used.
  • And even if there is consensus that validity is a unitary concept, the evidence needed to develop a validity argument is still being produced largely by specialists responsible for one distinct aspect of the testing process. Who within the field as it is currently being molded is being equipped with the knowledge and skills necessary for crafting and carrying out a comprehensive and coherent validity plan?

Back to the drawing board

As shown above, there was virtually no change in the definition of validity presented in the 1999 and 2014 Joint Standards and little change in the 30+ years since the 1985 version of the Joint Standards.  If our current definition of validity has not produced desirable outcomes, we have a professional responsibility to continue to review and refine it.

Just for fun, let’s return to our old friend:

Validity is the extent to which a test measures what it purports to measure.

What were the problems with that definition of validity?

A core criticism, of course, was that the definition did not address test use explicitly.  Outside of the thin air and ivy-covered walls (or perhaps some other green plant-covered walls) of academia, however, is that really a problem?  As the examples provided by Shepard (2016) demonstrate, with little help from the measurement community, the courts have certainly interpreted the definition of validity broadly enough to include test use since the 1970s.

Can we expect educators and policy makers to interpret “what it purports to measure” as “what you intend to measure” or “what you purport it measures”?  I would like to think that is the case, but if not, that is a relatively simple fix to the definition.

A second criticism is that the traditional definition is too narrow or too simple.  It does not reflect the complexity of the concept of validity and does not accurately reflect the context of educational testing (i.e., that a test is administered for a particular purpose).  That is a fair concern.  On the other hand, we can all agree that the concepts of validity and validation are far too complex to be captured by any reasonable expansion of the definition.  What do we lose by trying to make a definition capture more than can possibly be captured in a simple definition?

There is also scientific elegance in the phrase “validity is the extent to which a test measures what it purports to measure.” In particular, the word purport seems uniquely suitable to describe the concept of validity or the process of validation.  There is an air of skepticism associated with the word purport.  Purport refers to claims made without proof or supporting evidence.  What better way to define validity and validation than to suggest that one must assemble evidence to prove that a test measures what it claims to measure – or what you intend to measure with it?

Would a reasonable group of experts and test users tasked with describing how to prove that a test measures what it purports to measure arrive at the same set of validity standards contained in the Joint Standards: clusters of standards related to establishing the intended uses and interpretations of the test, issues regarding samples and settings used in validation, and specific forms of evidence?

A Brave New World

Somebody somewhere is probably already working on the validity/validation chapter for the fifth edition of Educational Measurement (1951, 1971, 1989, 2006).  Unlike the Joint Standards, which by their very nature must lag a generation or two, the validity/validation chapter in Educational Measurement has the opportunity to move the field forward. I hope the author takes advantage of that opportunity.

We are on the cusp of a brave new world in education, assessment, and educational measurement.  It is a world defined by technology, personalization, and an emphasis on individual growth.  It is a world without standardized, fixed-form, census, large-scale assessment. It is a world in which educational measurement (and its principles of validity, reliability, and fairness) is threatened by statistical modeling. (Yes, there is a difference between measurement and modeling.)

For educational measurement to have any hope of survival, we need to begin by being able to clearly convey what we mean by validity and validation and why they are important.  For my money, I will start with a test that measures what it purports to measure.

Implausible Values

Equating in the early part of the 21st century

Charlie DePascale

Equating

Our field is facing a crisis brought on by implausible values. The values which threaten us, however, are not the assessment results questioned above.  Those are only the byproduct of the values which our field and society have adopted with regard to K-12 large-scale assessment.

That is, the values which lead us to wait more than a year for the results of the 2017 NAEP Reading and Mathematics tests while expecting nearly instantaneous, real-time results from state assessments like Smarter Balanced, PARCC, and the custom assessments administered by states across the country.

NAEP, the “nation’s report card,” is an assessment program that does not report results at the individual student, school, or even district level (in most cases). NAEP results have no direct consequences for students, teachers, or school administrators.

State assessment results for individual students are sent home to parents.  School and district results are reported and analyzed by local media and may have very real consequences for teachers, administrators, and central office staff.

It is a paradox that equating for annual state assessment programs with a dozen tests and multiple forms is often carried out within a week while results for the four NAEP tests administered every other year can be delayed indefinitely with the explanation that

“Extensive research, with careful, detailed, sophisticated analyses by national experts about how the digital transition affects comparisons to previous results is ongoing and has not yet concluded.”

Of course, it is precisely because NAEP results have no real consequences or use that we are willing to wait patiently, or uninterestedly, until they are released.  Can anyone imagine a state department of education posting a Twitter poll such as this?

[Image: NAEP 2017 Twitter poll]

The reality, however, is that the NAEP model comes much closer to the time and care that should be devoted to equating (or linking) large-scale state assessment results across forms and years than do current practices with state assessments.

Everything you wanted to know about equating but were afraid to ask

To a large extent, equating is still one of the black boxes of large-scale assessment.  It is that thing that the psycho-magicians do so that we can claim with confidence that results are comparable from one year to the next – not to mention valid, reliable, and fair.

Well, let’s take a quick peek inside the black box.

There are two distinct parts to equating – the technical part and the conceptual/theoretical part.

In reality, the technical part is pretty straightforward; at most it is a DOK Level 2 task.  There are pre-determined procedures to follow, most of which can be automated. It’s so simple that even a music major from a liberal arts college can pick it up pretty quickly (self-reference). That’s what makes it possible to “equate” dozens of test forms in a week; or made it possible for a former state department psychometrician to boast that he conducted 500 equatings per year.
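To make concrete just how routine those pre-determined procedures can be, here is a minimal sketch of one of the simplest of them, mean-sigma linear equating, which maps scores from a new form onto the scale of a reference form by matching means and standard deviations. The summary statistics below are invented for illustration; operational equating involves far more elaborate designs and checks.

```python
def linear_equate(score_x, mean_x, sd_x, mean_y, sd_y):
    """Map a score from Form X onto the scale of Form Y by matching
    the first two moments (mean-sigma linear equating)."""
    return (sd_y / sd_x) * (score_x - mean_x) + mean_y

# Hypothetical summary statistics for two forms of the same test
mean_x, sd_x = 500.0, 100.0   # Form X (new form)
mean_y, sd_y = 510.0, 95.0    # Form Y (reference form)

# A score of 600 on Form X (one SD above its mean) maps to one SD
# above the Form Y mean: 510 + 95 = 605
equated = linear_equate(600.0, mean_x, sd_x, mean_y, sd_y)
```

The formula itself is trivial, which is the point: the hard questions are whether the two groups of examinees are comparable and whether the forms measure the same thing, not the arithmetic.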

Unfortunately, the technical part leaves you few options when the results just don’t make any sense.

That brings us to the conceptual and theoretical part of equating, which involves few, if any, complicated equations, but is the much more complex part of equating.

As a starting point, it’s important that we don’t confuse the concepts and theory behind the technical aspects of equating with the theoretical part of equating.  That’s a rookie mistake or a veteran diversion.

The concepts and theories that should concern us are those related to how students will perform on two different test forms or sets of items; or on the same test form taken on two different devices; or on a test form administered with accommodations; or on a test form translated into another language or adapted into plain English; or on test forms administered under different testing conditions; or on test forms administered at different times of the year or at different points in the instructional cycle.  The list goes on and on.

Unfortunately, we know a lot less about each and every one of those conditions than we do about the technical aspects of equating.

In the past, our go-to solution was to try to develop test forms that required as little equating as possible. That approach, sadly, is no longer viable.  We have now moved beyond equating test forms to applying procedures to add new items to a large item pool; that is, to place the items on a “common scale” with the other items in the pool.

It was also tremendously helpful that in the past we didn’t really expect any change in performance at the state level from one year to the next.  That is, we had a known target, or a fixed point, against which to compare the results of our equating.  If the state average moved more than a slight wobble, we went back to find the problem in our analyses.  It was a simpler time.

Where we go from here

We cannot return to that simpler time, but neither can we abandon some of its basic principles.

When developing new technology, as we are doing now with large-scale and personalized assessment, it is important to have a known target against which to evaluate our results.  When MIT professor Harold ‘Doc’ Edgerton was testing underwater sonar systems, he is quoted as saying that one of the advantages of testing the systems in Boston Harbor was that the tunnels submerged in the harbor didn’t move.  He knew where they were and they were always in the same place.

We need the education equivalent of a harbor tunnel against which to evaluate our beliefs, theories, procedures, and results.  We are now in a situation where the amount of change in student performance that has occurred from one year to the next is determined solely by equating.  There is no way to verify (or falsify) equating results outside of the system.  That is not a good position from which to operate a high-stakes assessment program, particularly at a time when so many key components of the system are in transition.

Finding such a fixed target is not impossible, but it is not something that can be done on the fly. We cannot continue to move from operational test to operational test.

Our current model of test development for state assessment programs rarely includes any opportunity for pilot testing.  That has to change.

We need to rely less on the technical aspects of equating and invest more in understanding the concept of equating.

We need a better understanding of student learning, student performance, and how student performance changes over time before we build our assessments and equating models.

We need to be humble and acknowledge our limitations.  A certain degree of uncertainty is not a bad thing, if its presence is understood.

Finally, we need to move beyond the point where whenever I think about equating, this scene from Apollo 13 immediately comes to mind.

My 12 Memories of Christmas

Charlie DePascale

 

As time goes by, certain memories of Christmases past become stronger than others.  Most are filled with family, food, music, and fun; but a few other things manage to creep in as well.  On Christmas 2017, here is a list of my 12 memories that mean Christmas to me.

  1. The First Noel 

My sister, two years younger than me, was at a stage where she recognized letters but could not yet read.  One Sunday afternoon before Christmas, my father was at the sink washing dishes and my sister was in the pantry looking at the box of Christmas cookies with the word NOEL written across the package.  She asked my father, “What does N-O-E-L spell?”  She heard his response, “Noel,” as “No L” and shot back, “Yes, there is an L!”  After five minutes of my father trying to explain, my sister becoming increasingly frustrated, and me laughing hysterically, I knew that Christmas was always going to be one of my favorite holidays.

  2. The Meaning of Christmas –

Beginning with Rudolph (1964) and Charlie Brown (1965), followed by the Grinch (1966) and Santa Claus is Coming to Town (1970), I learned the “specials” meaning of Christmas  –

There’s always tomorrow for dreams to come true, believe in your dreams come what may

Maybe Christmas, he thought… doesn’t come from a store.  Maybe Christmas, perhaps… means a little bit more!

Christmas Day is in our grasp! So long as we have hands to clasp!

You put one foot in front of the other, And soon you’ll be walking ‘cross the floor. You put one foot in front of the other, And soon you’ll be walking out the door

 

And, of course.

 

 

And just in time in 1974, along came The Year Without A Santa Claus with the Heat Miser, Snow Miser, and the advice to “just believe in Santa Claus like you believe in love.”

 

  3. Smile and say Santa

It wouldn’t be Christmas, of course, without the photo card of me and my sister (and later the dog).  In the days before Shutterfly, digital cameras, and even 1-hour photo, that meant family “photo shoots” beginning in late summer or early fall to ensure a good picture.  Most years that involved bringing a few Christmas decorations up from the basement and dressing us in nice clothes.  One year, my Mom decided that an outside shot would be nice.  So, in August we were hanging tinsel and decorations on the evergreen tree in my aunt’s front yard and donning our winter coats.

 


 

  4. 1968

In 1967, I was aware of the space program and the Red Sox.  In 1968, the rest of the outside world came crashing through the door.  From the Pueblo to the Tet Offensive to Martin Luther King to Bobby Kennedy to the Democratic Convention to the protest at the Olympics to the election of Richard Nixon everything seemed to be spinning out of control.  And then came Apollo 8 and its Christmas Eve broadcast while orbiting the moon.

 

  5. Grand Funk Railroad – We’re an American Band

Throughout my childhood, the annual Smyth family Christmas party brought together three generations of my mother’s side of the family. Grandparents, aunts and uncles, and most of our gaggle of 17 cousins gathered for an afternoon of food, games with Christmas-themed trinkets as prizes, Christmas music, etc. In the mid-1970s, one of my older teenage cousins decided to replace the Christmas album on the record player with a pretty yellow album and music I had never heard before.  A few minutes later, parents entered from the kitchen, the album was flying across the room toward the wall, and our annual Christmas parties were no more.  I still use We’re an American Band as the song on my alarm clock.


 

  6. Here We Come a Caroling

From 1971 through 1977, I attended Boston Latin School – the oldest public school in the country and a place steeped in tradition.  One of the more recent and beloved informal traditions was members of the senior class singing Christmas carols on the balcony overhanging the cafeteria during lunches on the day before Christmas vacation.  Nothing, however, not even tradition, lasts forever. In 1972, the school became coed.  Sometime in that period, they took away the one minute of silent meditation and reflection at the beginning of the day.  In my senior year, they told us Christmas carols were no longer allowed in school, and administrators and faculty did everything they could to thwart the singing.  Growing up in Boston, I had not realized that Christmas was controversial.

  7. Our 2nd Christmas Together

My wife, Lisa, and I were married in September 1984, and our first two years of married life were spent in Minnesota while I completed coursework for my PhD program.  This meant leaving our apartment in Minneapolis in mid-December to return “home” to Boston for Christmas. The first year, we decided not to decorate the apartment because we would not be there for Christmas.  A bad idea. The second year, around Thanksgiving, we headed to the Goodwill store up the street and picked up a silver aluminum tree and all of the boxes of old ornaments that we could get for $25 (and could fit into our Sentra).  We had our first Christmas tree and have had one every year since then (although never silver again).  Upon returning to Minneapolis after Christmas break, we donated the tree and ornaments back to Goodwill.

  8. The Brown Box

Lisa is one of those lovely people with the tremendously annoying ability to pick up a wrapped present and know immediately what it is.  One year, my parents were determined to stymie her.  They bought her a sewing box from Italy made of beautiful brown wood.  It looked like they had her.  As she removed the wrapping paper and stared at the brown cardboard box with no clue what it contained, she reported what she saw – a brown box.  My Dad, mistakenly thinking that she had guessed it was the “brown sewing box,” was beside himself and gave away the surprise; hilarity ensued, and an annual Christmas story was born.

  9. Christmas Eve (part 1)

Everywhere else, December 24 was Christmas Eve.  For our family, however, it was first and foremost my grandfather’s birthday.  That meant a wonderful Italian dinner followed by an evening with his entire family (see #5 above) filling the two-family home we shared with them.  After lots of food, laughter, and probably just a little drinking, the evening invariably ended in our living room with my sister at the piano and my uncles leading the singing of Christmas carols and old standards from the 1930s, 40s, and 50s.

  10. Christmas Eve (part 2)

After my grandfather’s passing, Christmas Eve became a night to celebrate with my wife’s family.  That became a bit more complicated when her sister married a “nice Jewish boy from Long Island” and they were raising their children in that faith.  In the spirit of family peace and harmony, the compromise was “Chanukah presents,” wrapped in a limited variety of blue Chanukah paper (buy as many rolls as you can when you see it) and placed carefully under the Christmas tree.  Nothing confusing there for a child, right?  One year, when my niece was around 4 years old, I asked her what all the presents were for.  She looked up and replied matter-of-factly, “After dinner.”

  11. Dashing Through The Snow

In 2002, our Christmas Day visit to my parents’ house was threatened by an impending snowstorm.  The storm was expected to start by late morning and dump about a foot of snow on us in Southern Maine.  Christmas, it seemed, was canceled.  Not so fast: in the spirit of Rudolph, we loaded up the sleigh (car with presents) right after breakfast and started out on the 90-minute drive to my parents.  We called my parents from the car as we were approaching their house (during which call my Mom told us to hold on, someone was pulling into their driveway).  They got to give Christmas presents to their granddaughter in a whirlwind visit (which was not quite as much of a whirlwind as it should have been), and we made it almost all the way back home to Maine before the roads were covered with snow.

  12. Our child is born

 


In December 1993, we were preparing for the birth of our first child – who was due in late January.  Mary had other plans and was born on December 15.  Starting out at just a bit under 4 pounds, she spent the first ten days of her life in the hospital in an isolette (or baby incubator).  When we arrived at the hospital on Christmas morning, however, she was out of the isolette for the first time and dressed in a bright green Christmas onesie.

 

We spent that day holding our Christmas miracle and starting a whole new set of Christmas memories with her.  To my staples of Rudolph, Charlie Brown, etc. were added Elmo Saves Christmas, Arthur’s Perfect Christmas, Elf, and The Polar Express.

So, as the song says ….

 

Have yourself a merry little Christmas
Let your heart be light
From now on, our troubles will be out of sight
Have yourself a merry little Christmas
Make the Yuletide gay
From now on, our troubles will be miles away
Here we are as in olden days
Happy golden days of yore
Faithful friends who are dear to us
Gather near to us once more
Through the years we all will be together
If the fates allow
So hang a shining star upon the highest bough
And have yourself a merry little Christmas now