assessment, accountability, and other important stuff


Shall we dance? The final steps in implementing a new assessment program

Charlie DePascale

As Smarter Balanced, PARCC, and the other new college- and career-ready assessments proceed through their first operational administration, one might assume that all of the difficult design and development decisions are well behind them.  In reality, we are entering the final, and often most harrowing, phase of the test design process.  Despite all of the careful planning and field testing that lead up to the operational administration of a new assessment, it is not unusual for significant design changes to follow that first administration.  Some of these changes may stem from content or measurement issues, but most often it is factors such as testing time, cost, or ease of administration that lead to post-implementation design changes.

The English Language Arts (ELA) portion of the Massachusetts Comprehensive Assessment System (MCAS) serves as a prime example of the type and extent of the changes that can occur after the initial administration of a new assessment.  The MCAS ELA test consists of separate Reading and Writing tests, whose scores are combined into a composite English Language Arts score.  On the initial administration of the MCAS Writing test in 1998, each student produced two essays, called the Long Composition and the Short Composition.  By 1999, the Short Composition had been eliminated.  The Reading test included a mix of multiple-choice items, each worth 1 point, and constructed-response items, each scored on a 0–4 scale.  In 1998, student scores in reading were based on 32 multiple-choice items and 8 constructed-response items.  In 1999, half of the constructed-response items were eliminated and student scores were based on 36 multiple-choice items and 4 constructed-response items.  The table below shows the breakdown of items on the 1998 and 1999 MCAS ELA assessments.

Changes in Item Distribution Across the First Two Administrations
of the MCAS English Language Arts Assessment

  Item Type               1998    1999
  Multiple-choice           32      36
  Constructed-Response       8       4
  Long Composition           1       1
  Short Composition          1       0

The changes in item distribution shown above were accompanied by administration changes, including the number, order, and length of test sessions, as well as the number of different content area tests administered to a student.

 

No, no, you can’t take that away from me.

Virtually all large-scale state assessments, and particularly assessments of the scope and nature of PARCC and Smarter Balanced, have been carefully designed, often over a period of two, three, or four years.  Many of the new assessment programs are applying the principles of evidence-centered design to construct test blueprints and specifications intended to support a specific set of claims about student achievement.  The resulting designs reflect the best attempt of knowledgeable and dedicated professionals to develop an assessment aligned to a specific set of standards and supporting a specific set of claims.  We can also assume that those people were well aware of constraints such as testing time, cost, and ease of administration as they designed the assessment.

Test design decisions are never reached easily.  In a consortium, in particular, it is certain that countless hours and blood, sweat, and tears were involved in the discussions that led to the operational design of the new assessments.  Undoubtedly, there have already been tough compromises in the design along the way.  For example, changes to the design of the PARCC ELA/Literacy assessments at the early grade levels following the spring 2014 field test were well-documented in the fall of 2014.

Now, perhaps even before the results of the initial administration are reported, additional design changes may be required; and those changes may be out of the control of the content or measurement specialists.

You like tomato, I like tomahto …

When changes to the test design occur after the initial administration of an assessment, it is almost always the case that cut scores for achievement levels have already been established under the original design.  Because nobody wants to change newly established achievement standards, a delicate dance begins to identify changes that will not require a new round of standard setting.  More accurately, it is often a dance to provide a rationale explaining why the proposed changes will not require standards to be reset.  In other words, the goal is to explain why the changes were not really changes.

The argument often includes statements such as:

  • No, eliminating the short composition and half of the constructed-response items will not change the construct being measured.
  • The distribution of items across content standards (or strands or domain clusters) has remained the same.
  • We did an analysis to score this year’s students on the shorter form, and the correlation between their scores on the original and shorter forms was quite high (a sketch of that type of analysis follows below).

In the much simpler past, there was a lot more wiggle room to make such arguments about post hoc changes to the test design.  In this age of evidence-centered design, however, I would expect it will be much more difficult to make and explain changes to the test design while maintaining the original set of claims.
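
To make the correlation argument in that last bullet concrete, here is a minimal sketch of the kind of analysis being described: rescore each examinee on only the items retained on the shorter form and correlate those totals with the full-form totals.  The function name, its arguments, and the use of simple summed scores are illustrative assumptions on my part, not a description of any actual MCAS or consortium analysis.

    import numpy as np

    def form_correlation(responses: np.ndarray, kept_items: list[int]) -> float:
        """Correlation between total scores on the full form and on a shortened
        form built from a subset of the same items.

        responses:  (examinees x items) matrix of item scores
        kept_items: column indices of the items retained on the shorter form
        The names and the use of simple summed scores are illustrative assumptions.
        """
        original_total = responses.sum(axis=1)                  # full-form totals
        shorter_total = responses[:, kept_items].sum(axis=1)    # shorter-form totals
        return float(np.corrcoef(original_total, shorter_total)[0, 1])

A high correlation from a within-sample rescoring like this is exactly the kind of evidence offered to argue that the shorter form measures the same thing, even though it cannot, by itself, establish that the original claims still hold.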

Let’s call the whole thing off

The pressure to change an assessment is always intense, whether that pressure is to make the assessment shorter, less expensive, easier to administer, less difficult, or perhaps more rigorous. The new assessment programs also face the additional pressure of states having the option to choose another assessment program.  It would have been very difficult for a state to walk away from its own custom assessment in the heyday of NCLB.  Over the last two years, however, it seems to have become commonplace for states to say “let’s call the whole thing off” to one assessment program and move in a different direction.

The ability of states to walk away adds a new dimension to this test design two-step.  On one hand, there is pressure to make changes to prevent a state from walking away.  On the other hand, an assessment program can choose to hold firm on its design and allow an individual state to walk away.  It takes two to tango.

Ideally, the new assessment programs will make it through this process intact, with their claims still fully supported; and years from now they will be able to look back and say …

The odds were a hundred to one against me
The world thought the heights were too high to climb
But people from Missouri never incensed me
Oh, I wasn’t a bit concerned
For from history I had learned
How many, many times the worm had turned…

They all laughed at Christopher Columbus
When he said the world was round
They all laughed when Edison recorded sound
They all laughed at Wilbur and his brother
When they said that man could fly
They told Marconi
Wireless was a phony
It’s the same old cry
They laughed at me wanting you
Said I was reaching for the moon
But oh, you came through
Now they’ll have to change their tune
They all said we never could be happy
They laughed at us and how!
But ho, ho, ho!
Who’s got the last laugh now?

(They All Laughed, George and Ira Gershwin, 1937)

Psychometrician, Do No Harm

Charlie DePascale

(Prepared for presentation on April 18, 2015 at the NCME annual conference in Chicago, IL)

Last fall, I was asked to participate in a panel discussion, responding to questions from teachers on the broad topic of making use of assessments and data from assessments in the classroom. Over the course of the winter and spring, as is often the case, the plans for the panel and my role in it morphed until finally coalescing into the daunting task described in the conference program:

Charlie DePascale will talk about psychometricians’ roles in ensuring high-quality measures, competing measurement priorities for, and barriers to, providing educators with more useful information.

One reason that the task was intimidating was that I have never considered myself a psychometrician. I only became a psychometrician when one of my previous employers changed the name of my position to principal psychometrician. What is a psychometrician and what is psychometrics? I was unsure how to answer those questions, so I decided to check the Psychometric Society website. As it turns out, they are also a little unclear on the answers. To address the question, they asked four noted psychometricians to offer a definition of psychometrics. The following carefully selected portions of their definitions made me feel better about calling myself a psychometrician:

Because many of the questions that psychometricians study transcend disciplinary boundaries, and concern general issues of measurement and data analysis, the boundaries of the discipline are fuzzy…

Because measurement in psychology is often done with tests and questionnaires, it is rather imprecise and subject to error. Consequently, statistics plays a major role in psychometrics…

Today, psychometrics covers virtually all statistical methods that are useful for the behavioral and social sciences…

Feeling reassured that I can speak as a psychometrician, I will address the charge to this panel in the context of four big picture issues in which I am involved and on which we would benefit from a stronger connection between psychometricians and educators: large-scale assessment, interim assessments, teacher evaluation and SLO, and fundamental concepts of assessment, measurement, and data literacy.

Large-scale Assessment

In educational testing, psychometricians are most closely associated with large-scale assessments; that is, external, standardized assessments such as

  • custom state assessments,
  • norm-referenced achievement tests,
  • national interim assessment programs, and
  • college admissions exams.

Among those, state assessments have taken on added importance and a seemingly ever-increasing presence in the lives of K-12 educators since the full implementation of the assessment and accountability requirements of NCLB in 2006. With regard to state assessments, the most important message that I, as a psychometrician, can deliver to educators is that the state assessment should not provide you with any information that you didn’t already know. Of course, you already knew that. However, in the midst of the hoopla around data-driven instruction and the hubbub about assessment results informing instruction, and with the new and improved next generation assessments growing in every conceivable dimension, perhaps you were beginning to doubt yourself. Rest easy. I am here to assure you that in a well-functioning, cohesive, local system of curriculum, instruction, and assessments aligned to the state content and achievement standards, state assessment results should confirm what you already knew about the performance of your students, schools, teachers, programs, etc.

What information does the state assessment provide?

You are familiar with the expression no news is good news and have probably heard the term multiple measures. Ideally, when a system is working well, the state assessment will serve as an additional measure confirming what you already knew; an external audit that lets you know that you are in sync with other districts and the state in your interpretation of the state content and achievement standards. The state assessment also provides a common metric that can be used to compare results across students, schools, districts, and over time.

With certain tests, it is also possible to compare performance across states. The catch to all of the above, of course, is the requirement for a well-functioning and aligned local system. If the local system is not well-designed or has been implemented poorly, the state assessment may, in fact, provide information that is discrepant with your local information. In that case, the response should be to determine how the results are different and to ask why. Figuring out why the results are different and what to do about it will involve some analysis of the test results, but will primarily focus on an examination of local materials, practices, and most important, student work in relation to the state standards.

Am I saying that the state assessment results are always right and local practices must always be adjusted when there is a discrepancy? No! However, there must always be an understanding of why there is a discrepancy.

Now one may be tempted to ask: if the state assessment is intended to serve the confirmatory purpose that you describe, does it need to be so long, do we need to test all students every year, and do we need to place so much emphasis on the results? Those are good questions to be discussed on another day.

Interim Assessments

Interim assessments such as the MAP tests offered by NWEA and Renaissance Learning’s STAR Reading and Math exams have become quite popular with school districts across the country. They are relatively easy to administer, return results immediately, and provide a variety of informative and useful reports about student achievement in the content area as a whole and on specific skills. What’s not to like?

As the assessment reports provided with the interim assessments continue to improve and the level of information provided continues to increase, it is critical for educators to have a working understanding of how the reported information was derived. How can the test produce such detailed diagnostic and prescriptive information on the basis of such a short administration?

To some extent, the detailed or diagnostic information provided in those reports is based on statistical relationships gleaned from vast amounts of data. In a sense, the results are predictions or descriptions of patterns of typical performance based on the available data. What they are not, in most cases, is a certification that an individual student has mastered a specific standard or set of standards based on assessing the student directly on those standards with a sufficient number of items to make such a determination of mastery. And that’s OK, as long as teachers and administrators understand what the scores mean and how to use them appropriately.
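
As a purely illustrative sketch of what predictions based on statistical relationships can mean (and not a description of how NWEA, Renaissance Learning, or any other vendor actually builds its reports), imagine projecting a student’s likely success on a specific skill from the overall scale score, using a relationship estimated from large amounts of prior data. Every name and coefficient below is an assumption made up for illustration.

    import math

    def projected_skill_success(overall_score: float,
                                intercept: float = -12.0,
                                slope: float = 0.06) -> float:
        """Probability of success on a specific skill, projected from the overall
        scale score through a logistic relationship whose coefficients would, in
        practice, be estimated from large reference samples. The coefficient
        values here are made-up placeholders, not real parameters."""
        z = intercept + slope * overall_score
        return 1.0 / (1.0 + math.exp(-z))

    # A student with an overall score of 220 is projected at about 0.77 --
    # a projection from typical patterns in prior data, not a certification of
    # mastery based on items that directly measure the skill.
    print(round(projected_skill_success(220), 2))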

One important thing to keep in mind at this particular point in time is that the usefulness of those scores depends on the stability or consistency of the relationship between student performance on the test items administered and students’ overall performance outside of the test. In the midst of the implementation of new college-ready standards, curriculum, and instruction, there is a high likelihood that those relationships will become unstable (at least temporarily) and may change permanently. The testing companies will make the necessary adjustments as the new relationships stabilize, but in the meantime, it may be prudent to use additional caution in interpreting and using those results.

Educator Evaluation and SLO

In the last few years, I have become involved in the design of educator evaluation systems for states, particularly in the design and use of Student Learning Objectives (SLO). On the surface, the basic concept behind SLO is pretty straightforward: at the beginning of the year a teacher defines the knowledge and skills students are expected to acquire during the year; the teacher provides appropriate instruction and monitors student progress throughout the year; at the end of the year the teacher determines the extent to which students have attained the desired knowledge and skills. To a psychometrician, that sounds like teaching. However, I have been told by multiple K-12 educators and teacher educators that this way of thinking represents a paradigm shift. Clearly, a great topic for additional discussion between teachers and psychometricians. Of course, with the implementation of SLO, the devil is in the details; or perhaps some might argue, how the devil is using the details to classify teacher effectiveness. Again, a topic for additional discussion between educators and psychometricians.

Two important points to remember about SLO:

  1. A wide variety of programs that differ in critical aspects are being implemented under the label SLO.
  2. An SLO is a process that includes assessment, but an SLO is not an assessment.

Fundamental Assessment, Measurement, and Data Literacy

For educators to use assessment well in each of the three contexts described above, as well as in the classroom on a regular basis, there is a fundamental level of assessment, measurement, and data literacy required. The first step to acquiring that literacy is to understand that those are three interrelated, but different, concepts.

Assessment literacy refers to the understanding of practices and procedures related to the development and use of assessments in the classroom.

  • Developing or selecting the appropriate test for a particular purpose
  • Determining whether an assessment is accessible and free from bias
  • Understanding the ways in which the format of a test item impacts the information that it provides

Measurement literacy refers to the understanding of some fundamental measurement principles, particularly those related to validity and the uncertainty of measurement.

  • Understanding that all test scores are imprecise (contain error), and the impact of that imprecision on setting targets for pre-post gain scores (a worked illustration follows this list)
  • Understanding what is gained and what is lost from allowing a student to retake a test
  • Awareness of the interrelationship between achievement levels, the distribution of student scores, and student performance on a test
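
As a minimal worked illustration of that first point (the reliability value and score scale below are assumed for the example, not taken from any particular test), classical test theory gives the standard error of measurement and a rough 95% band around an observed score as

    \[
      SEM = SD_X \sqrt{1 - r_{XX'}}, \qquad X_{obs} \pm 1.96 \times SEM
    \]

    \[
      \text{e.g., } SD_X = 15,\; r_{XX'} = 0.91 \;\Rightarrow\; SEM = 15\sqrt{0.09} = 4.5,
      \text{ so a band of roughly } X_{obs} \pm 9 \text{ points.}
    \]

A pre-post gain target smaller than that band is difficult to distinguish from measurement error alone, which is exactly the kind of consideration measurement literacy is meant to surface.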

Data literacy refers to the skills needed to organize and manipulate data so that it can be analyzed, interpreted, and used to support instruction.

  • Knowing how to combine data across multiple assessments
  • Working knowledge of the various tools used to organize, analyze, and present data
  • Having the skills to be a wise consumer and producer of data, and to know how to protect and to share data

Takeaways

This is hard, complicated, and messy.

In 2009, I attended a seminar titled Measuring 21st Century Skills: New Tools for a New Era. The keynote speaker for the session was Elena Silva, then a Senior Policy Analyst with Education Sector. During the course of her presentation, she happily reported that she had met with psychometricians who told her that if she could define the construct, they could measure it. What the psychometricians failed to tell her, however, was that in education it is virtually impossible to define the construct in such a way that we can actually measure it. The truth is that education is complicated and messy. There are too many factors and too many complex interactions and too much human involvement to stand a chance at true measurement.

Trust your intuition (sometimes), but verify.

In 2005, Phi Delta Kappan published an article by Braun and Mislevy warning of the dangers of policy based on what they referred to as Intuitive Test Theory. They made the case against assessment policy based on commonly held misconceptions or intuitions about assessment and measurement without input from assessment and measurement specialists. In that article, however, they also made the case for intuition in the use of assessment at the classroom level; and the need to be able to trust teachers’ and administrators’ intuition about assessment and assessment results is even more critical today in this data-driven world. Given the basic literacy described above, the imprecision of our measurements, and the role that we expect teachers and administrators to play in interpreting and using data, we have to be able to trust their intuitions. If something seems wrong with the data, there probably is something wrong with the data. For their part, we also need teachers and administrators who have the willingness and the tools to verify and support those intuitions with empirical data.

Tradeoffs are necessary, but never trade away the important stuff.

One of the first things we learn as budding psychometricians is that nothing is more important than validity and that validity is limited by reliability. Somewhere along the line that message is reshaped to “you cannot have validity without reliability” (yet another discussion for another day) and suddenly our focus shifts to reliability. Soon all sorts of decisions are made to preserve reliability at the expense of validity. That is our contribution to the tradeoffs fiasco. After we are done, there are all of the non-measurement tradeoffs driven by concerns such as cost, testing time, fairness, and acceptability. In the end, we end up with an assessment instrument that is ill-suited for its intended purpose, one whose results can never support the claims that are being made. Of course, we seldom go back and adjust those claims or statements of purpose to reflect the final instrument. It is critical, therefore, to understand what you may be sacrificing in the name of standardization or to save time and money.
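
The sense in which validity is limited by reliability has a compact classical-test-theory expression (a textbook result, stated here in standard notation rather than anything specific to the assessments discussed above): the correlation between test scores X and a criterion Y cannot exceed the square root of the product of their reliabilities,

    \[
      r_{XY} \;\le\; \sqrt{r_{XX'}\, r_{YY'}} \;\le\; \sqrt{r_{XX'}}
    \]

So a test with reliability 0.64, for example, cannot show a validity coefficient above 0.80 no matter how well the construct is defined, which is part of why the temptation to chase reliability at the expense of everything else is so strong.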

It is much easier for teachers to understand psychometrics than for psychometricians to understand teaching.

With some basic training and practice, teachers (and administrators and even policy makers) can acquire the knowledge and skills that they need to make sound assessment choices, interpret and use assessment results effectively, and even construct assessments of sufficient quality for their uses. Most important, with some basic training and practice, teachers will know what not to do with assessments and assessment results. On the other hand, we are far, far away from a point where psychometricians can understand and model teaching, learning, and student achievement with anywhere near the accuracy and precision of meteorologists predicting the weather. What psychometricians can do, however, is work to better understand the types of questions that educators are asking of assessments and assessment results, and then use that information to produce better reports, design better assessments, and convey better information about the limitations of assessment.

This is only a test!

Assessment is a powerful tool, and it can be a dangerous tool if used improperly or recklessly. The bottom line, however, is that assessment is only a tool. A deep understanding of content and pedagogy is necessary to make the tool useful in the classroom. Beyond the classroom, we need policy makers who have a deep enough understanding of assessment and measurement to know when to seek expert support before establishing assessment-based policies and laws that have a negative impact on the classroom and ruin the fun for all of us. An ongoing dialogue among psychometricians, educators, and policy makers is critical. At this point in time, psychometricians cannot do nothing and do no harm.