Making Meaning from Mean Differences

There are few constants in this quickly- and ever-changing world. One of those, however, has been that there will be group differences on state tests, and those differences will be quite predictable.

So, we find group differences in mean scores on the state test. Differences that we knew that we would find, the very same differences that we have found for the past twenty years. Now what?

How do we interpret those differences?  What do they mean?

You might think that we would have good answers to those questions. Generating group scores and checking for group differences, after all, is the raison d’être for state assessment programs.

Don’t be fooled by all of the emphasis placed on individual student results, or by the fact that the measurement models and procedures associated with state tests most often are based on individual rather than group performance, or by technical reports for state tests that focus almost exclusively on the technical quality of individual student scores.

State tests are administered to produce district and school scores, scores which for the past two decades have been fed into federally mandated accountability systems.  

Sadly, however, we don’t have good answers to questions about how to interpret group differences on state tests and it doesn’t appear that we will have any in the near future. If anything, those answers have become more elusive and our attempts to provide them more convoluted over the past 20 years.

Becoming more circumspect about how to interpret group differences on state tests may be a sign of growth.  A willingness to provide simplistic answers to complex questions is not a hallmark of a strong field. Perhaps we are doing a much better job of acknowledging the complexity of the question now than we were two decades ago. If true, that would be progress. At some point, however, we have to be able to simply communicate complex answers to complex questions. That ability is the hallmark of a strong field.

Let’s take a quick look at where we have been and where we seem to be headed with regard to group differences on state tests.

Pre-NCLB – The Statistical Era

Prior to NCLB and its assessment and accountability requirements, the norm was to approach group differences on state tests from a statistical perspective, focusing on identifying statistically significant differences between groups. I cut/pasted into countless interpretive guides boilerplate language that talked about the importance of sample size and the size of the mean difference when making comparisons between groups within a year or within a group across years. A handy table showing the minimum difference in scale scores or in the percentage of students classified as Proficient or higher to consider statistically significant was also provided.

Usually missing from these guides was actual guidance on how to interpret a statistically significant difference to improve instruction or student learning. There was language cautioning that very small differences in scores might be statistically significant given enough students and additional language contrasting statistical significance with some non-quantified concept called practical significance or educational significance (i.e., differences that might be attributed to some difference in program, policy, or instruction).
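The gap between statistical and practical significance is easy to see with a little arithmetic. The sketch below is only an illustration, not the procedure behind any actual interpretive guide: the function name, the 2-point difference, and the standard deviation of 40 are hypothetical, and it assumes two independent groups of equal size with a common, known standard deviation.

```python
import math

def two_sample_z(mean_diff, sd, n_per_group):
    """Return (z, two-sided p-value, standardized effect size).

    Simplifying assumptions (for illustration only): two independent
    groups of equal size and a common, known standard deviation.
    """
    se = sd * math.sqrt(2.0 / n_per_group)   # standard error of the difference
    z = mean_diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided p-value under a normal model
    d = mean_diff / sd                       # effect size in standard deviation units
    return z, p, d

# Hypothetical values: the same 2-point difference at several group sizes.
for n in (50, 500, 5000, 50000):
    z, p, d = two_sample_z(mean_diff=2.0, sd=40.0, n_per_group=n)
    print(f"n per group = {n:>6}  z = {z:5.2f}  p = {p:.4f}  effect size = {d:.2f}")
```

Under these assumptions the effect size never moves from 0.05 standard deviations, yet the same difference goes from nowhere near significant with 50 students per group to significant at the .05 level with 5,000. Nothing about its educational importance has changed along the way.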

Few educators found this information to be particularly helpful, statisticians questioned its accuracy, and in general, the advice was ignored as the public and policymakers made all sorts of comparisons  – the ineffective communication trifecta!

Small gains in test scores were celebrated and small losses were cause for alarm. We could, however, sleep well at night able to point to advice about exercising caution in the interpretation of group differences (just kidding, we never slept well during reporting season).

NCLB and ESSA – The Off the Hook Era

I describe the period since the NCLB assessment requirements kicked in as the off the hook era in interpreting group differences.

First, state tests and state assessment specialists were let off the hook from explaining group differences because all of the attention shifted to accountability systems and their classifications of school effectiveness. Let the accountability folks handle it.

Second, the amount of data available for computing and interpreting group differences was off the hook as the sheer number of possible comparisons exploded with the requirements for testing at every grade level, reporting disaggregated results, and making adequate yearly progress toward annual measurable objectives.

Third, advances in approaches to describing changes in group performance over time were off the hook as expanded state testing and technological advances in states’ school and student information management systems supported the development of a variety of growth models and other models that attempted to account for the complexity of district and school.

None of this, however, necessarily moved us any closer to making meaning from group mean differences.

The Present – The Spaghetti Era

Although still operating under the requirements of ESSA, the field has clearly entered a new era with regard to group differences. Once again, the assessments that reveal group differences are being scrutinized as much as the way that those differences are used in various accountability systems.

I refer to the current period as the spaghetti era for two reasons. The first reason refers to the term spaghetti junction – “a nickname sometimes given to a complicated or massively intertwined road traffic interchange that is said to resemble a plate of spaghetti.” The second reason refers to the apocryphal practice of throwing spaghetti against the wall to see what sticks.

I think that the combination of the two describes the current situation quite well. The good news is that society is beginning to recognize and acknowledge the complexity of the massively intertwined factors that contribute to the group differences state tests reveal year after year. The bad news is that we have not yet reached the acceptance phase of looking for systemic solutions to a complex problem, but rather are still in the anger or bargaining phases, throwing a plateful of individual reasons that might explain (i.e., explain away) group differences against the wall in the hope that some of them will stick.

The mean differences in test scores are real, but…

  • The tests aren’t measuring the right/important/necessary knowledge and skills
  • We shouldn’t focus on mean differences and gaps in performances
  • They are outside of our control and can be ignored if we condition on the right variables (nobody ever actually says this, they just implement policies that reflect this viewpoint).

The mean differences in test scores are not real, because …

  • They are simply a result of the way that the tests are designed/administered/used
  • They are the result of inequities in access to test preparation
  • They are not real (just say it often enough and loudly enough to create an alternative reality – different than denial)

This rhetoric surrounding group differences on state tests in many ways is an accurate reflection of the reality of the complexity of the situation. Computing and interpreting group differences is a challenge. Understanding the causes of and acting to reduce/eliminate group differences, if that is, in fact, the desired, or even a desirable, outcome, is a wicked societal problem.

In my next post, I will devote more attention to the rhetoric and reality of the societal response to group differences and the impact that response has had over the years on the way that we report and use test results. To conclude this post, however, I want to focus on the more basic issue of our role and responsibilities in generating the test scores which form the basis for those group differences.

Moving Forward As a Measurement Community

We first must accept that the measurement community is not going to solve the problem of group differences on its own. In fact, our role is likely to be quite limited.

We can be part of the solution.  We can build better tests and we can even build very different tests. Part of contributing to the solution, however, requires coming to grips with the limitations of who we are and what we do.

Most measurement theorists and practitioners bristle at the oft-cited Stevens (1946) definition of measurement as the assignment of numerals to events or objects according to a rule(s). There must be more than that to educational measurement, this noble field that we have chosen as our vehicle to improve the world – or at a minimum to help prepare citizens who will change the world for the better.

Well, yes and no. Yes, the thought of merely assigning numerals to events or objects sounds too simplistic. It is the rules that define what we measure and how we measure it, however, that make all the difference.

The cited portion of the Stevens definition leaves out the important discussion that follows about those rules:

“The problem as to what is and is not measurement then reduces to the simple question: What are the rules, if any, under which numerals are assigned? If we can point to a consistent set of rules, we are obviously concerned with measurement of some sort, and we can then proceed to the more interesting question as to the kind of measurement it is. In most cases a formulation of the rules of assignment discloses directly the kind of measurement…”

For Stevens, this led to a discussion of four types of scales. We may not be concerned with scales at the moment, but we should be very much concerned with understanding the rules that define what we measure and how we measure it.

We don’t make those rules. We implement the rules.

We measure what society wants to measure. We can do, and have to do, a better job of that. Most importantly, we cannot let our limitations get in the way. Our decisions and our tests should reflect, not define, the constructs we measure.

We cannot re-define the construct that we have been asked to measure to fit our technical limitations or other constraints. We cannot allow our lack of understanding of content or learning development to affect the way that we interpret and implement the rules. Allowing our limitations to drive practice can have serious consequences. For example, the state test that we construct often becomes the de facto instantiation of the state content standards. If we cannot faithfully measure the state standards in a way that meets our own Standards, we need to say so.

We need to be clear on what we can and cannot measure. More appropriately, we need to be clear on how well we can measure it and under what circumstances. We need to provide educators and policymakers with the information they need to make informed decisions about how best to use what we have to offer.

It’s humbling, but that is our lot in life. We have a role to play and our task is to play that role to the best of our ability – nothing more and nothing less.

Image by Gerd Altmann from Pixabay

Published by Charlie DePascale

Charlie DePascale is an educational consultant specializing in the area of large-scale educational assessment. When absolutely necessary, he is a psychometrician. The ideas expressed in these posts are his (at least at the time they were written), and are not intended to reflect the views of any organizations with which he is affiliated personally or professionally.