Testonyms

In elementary school, first I learned about synonyms and antonyms. Synonym – Same; Antonym – Opposite; Got it. And synonym and antonym are such cool words – “and sometimes y.”

It got a little more complicated with homonyms, which might be homophones (pronounced the same, differ in meaning or spelling), homographs (spelled the same, differ in meaning or pronunciation), or both.

None of that, however, was adequate preparation for what I would encounter working in large-scale testing, namely TESTONYMS.

What are testonyms?

One word that is used to refer to several different things. The words standard and grade-level come to mind.
Words with very different meanings which are used interchangeably.

Note: Those two bullet points might appear to cover a lot of ground, but testonyms are not to be confused with an even larger category of problematic words in our field: those that we use regularly but with virtually no agreement on what they mean. Refer here for more information on those words.

In this post, I am going to focus on four words:

Measure, Indicator, Stability, Reliability

The appropriate use of these words is particularly important as policymakers and educators attempt to fully understand and recover from the effects of the disruptions of the past two years on student learning and as states restart or reinvent their state accountability systems.

Measure for Measure

We’ve all heard stories of languages or cultures that have many words for snow (apparently, the Scots with 421 have the most). Psychometricians, on the other hand, are more “efficient” with our language, getting the most bang for the buck out of every word.

It’s not unusual for people in our field to be at a loss for words, but that doesn’t mean we also have to be skimpy on vocabulary.

Take the word measure.

We use it as a transitive verb with an object; that is to measure something. (Of course, don’t get us started on what it actually means to measure something. Again, there are books on that topic.)

But we also use it as a noun to refer to

the instrument used to measure,
sometimes individual items/tasks on that instrument,
the score obtained via the instrument, and
the construct that we are measuring.

As I said, efficient.

Consider Mathematics Achievement.

When discussing student achievement in mathematics, it would not be unusual to see the word measure used as a verb to describe the process of gathering information about a student’s level of mathematics achievement and also as a noun referring to both the test instrument used to measure it and the test score obtained on that instrument. Then we will refer to Mathematics Achievement itself, perhaps expressed as some status classification or growth statistic, as a measure of student performance or school effectiveness.

I get it: it’s educational measurement, and we want to convince others and ourselves that we are actually measuring something real. But go to your bosom; knock there, and ask your heart: can’t we do better?

Measures v. Indicators

I have never understood how it became commonplace to use the words measures and indicators interchangeably (and don’t get me started on metrics). The meanings of the two words are not even all that similar. But it’s not just those of us in education who have adopted this practice. A Google search of ‘measures v. indicators’ returns pages of articles on the topic.

As one of the first articles in the search results explains succinctly:

A measure measures something.

An indicator indicates something.

Could a difference be any clearer than that? (a hint of sarcasm)

Accepting the definition that measures measure and indicators indicate, it should be clear that not all measures are indicators, at least not very good indicators, of something other than the construct they were designed to measure (a very narrow definition of the word indicator). That doesn’t make them any less of a measure.

It’s also unlikely that all indicators have to be measures, but many are.

It’s a little like “all squares are rectangles, but not all rectangles are squares,” but not at all like the relationship between “Kleenex” and tissue – although it wouldn’t surprise me to learn that CTB tried to trademark “measure” and/or “indicator” at some point in the past.

As I thought about it, it occurred to me that we probably understand the difference between measure and indicator but are just mentally skipping over some steps when we use the words interchangeably.

Back to Mathematics Achievement.

Is Mathematics Achievement a measure or an indicator?

At first glance, it seems like a measure.

We deduce, however, that we wouldn’t have put all of the time, effort, and other valuable resources into measuring something like Mathematics Achievement if it weren’t an indicator of something important – more important than simply student proficiency on the state’s math standards. Flawed logic perhaps, but logic, nonetheless.

And if we went to all that trouble to measure Mathematics Achievement, and it’s an indicator of one thing, it’s not all that difficult to convince ourselves that it could also be an indicator of something else.

That, girls, boys, et al., is the way that a measure of Mathematics Achievement, which began life as an indicator of whether money provided to districts and schools serving high percentages of children from low-income families was being used effectively to close the gap in mathematics achievement, can become an indicator of School Effectiveness and then Teaching Effectiveness.

And applying the previously established transitive property of the word “measure”: If Mathematics Achievement is an indicator of School Effectiveness and Teaching Effectiveness, then the state test and its test scores are also indicators of School Effectiveness and Teaching Effectiveness.

It’s difficult for me to see that thought process going the same way in the opposite direction. That is, I doubt that “individual students’ scores on the state test in Mathematics” would make the top 10 list of answers arrived at by a committee starting with the question “What measures would be good lagging or leading indicators for Teaching Effectiveness?”

See, this is why the Logic section was so prominent on the GRE.

Stability v. Reliability

Our experience with Teaching Effectiveness, when teacher evaluation became the rage under the Obama-Duncan era regulations in the 2010s, also serves as an example of the problems associated with using stability and reliability interchangeably.

In short, test-based scores measuring teachers’ “impact on student learning” fluctuated from one year to the next. Cries rang out that the measures weren’t reliable.

The more likely explanation, however, is that Teaching Effectiveness, at least as defined in terms of “impact on student learning,” is not as stable as policymakers and psychometricians thought it to be; and certainly not as stable as we needed it to be in a system making annual ratings of teaching effectiveness.

There may be some random (or non-random) variation in Teaching Effectiveness from one year to the next due to factors solely related to the teacher (e.g., family issues, physical health, work relationships). There may be even more variation in Teaching Effectiveness from one class to the next due to a teacher-by-student interaction. There are reasons (beyond scheduling, seniority, local politics, and pettiness) why teachers are not randomly assigned to students.

All of those factors are related to the stability of the construct or trait being measured. None of them are related to the reliability of the measure or instrument used – its accuracy, precision, or consistency. In fact, the better and more reliable the measure is, the more likely it will be to capture year-to-year or class-to-class fluctuations in teachers’ impact on student learning.
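To make that last point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration, not something drawn from real teacher data: the function names `pearson` and `simulate`, the standard deviations, the normal distributions, and the idea of modeling instability as a random year-to-year "drift" in each teacher's true effectiveness. It compares two correlations: one between two error-laden measurements of the same year (a stand-in for reliability) and one between measurements taken in consecutive years.

```python
import random
from statistics import mean


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_teachers, trait_drift_sd, error_sd, seed=0):
    """Hypothetical model: observed score = true effectiveness + error.

    trait_drift_sd controls how much the trait itself changes between
    years (stability); error_sd controls measurement error (reliability).
    """
    rng = random.Random(seed)
    # True effectiveness in year 1, and year 2 = year 1 plus real change.
    true_y1 = [rng.gauss(0, 1) for _ in range(n_teachers)]
    true_y2 = [t + rng.gauss(0, trait_drift_sd) for t in true_y1]
    # Two independent measurements in year 1, one measurement in year 2.
    obs_y1_a = [t + rng.gauss(0, error_sd) for t in true_y1]
    obs_y1_b = [t + rng.gauss(0, error_sd) for t in true_y1]
    obs_y2 = [t + rng.gauss(0, error_sd) for t in true_y2]
    return {
        "reliability": pearson(obs_y1_a, obs_y1_b),
        "year_to_year": pearson(obs_y1_a, obs_y2),
    }


# Precise instrument, unstable trait.
print(simulate(n_teachers=5000, trait_drift_sd=1.0, error_sd=0.2))
# Noisy instrument, perfectly stable trait.
print(simulate(n_teachers=5000, trait_drift_sd=0.0, error_sd=1.0))
```

Under these assumptions, the first scenario should show a high same-year correlation alongside a noticeably lower year-to-year correlation, while in the second the two correlations should be roughly equal. Fluctuating scores alone cannot tell you which scenario you are in; you need some evidence about reliability that does not depend on the trait holding still.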

Recall that although we have formulas available to compute an array of reliability coefficients, reliability as a property of a test, particularly test-retest reliability, is largely theoretical.

We discuss test-retest reliability in terms of what would happen to the test score if a student were able to take a test multiple times (at least two) with absolutely nothing happening between test administrations and prior administrations having no impact at all on subsequent administrations.

We assume that the construct or trait being measured is unchanged (i.e., stable) between test administrations.
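For readers who like to see the assumption written down, here is a sketch in standard classical test theory notation. The symbols are mine, not part of the original argument: $X$ is an observed score, $T$ the true score, $E$ the error, and the subscripts 1 and 2 mark two occasions. I assume errors are uncorrelated with true scores and with each other, and that score variances are equal across occasions.

```latex
\begin{align*}
X &= T + E,
  & \rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2}
  && \text{(reliability)} \\
X_1 &= T_1 + E_1, \quad X_2 = T_2 + E_2,
  & \rho_{X_1 X_2} &= \rho_{T_1 T_2}\,\rho_{XX'}
  && \text{(observed correlation across occasions)}
\end{align*}
```

When the trait is perfectly stable, $\rho_{T_1 T_2} = 1$ and the correlation across occasions equals the reliability. When the trait changes, $\rho_{T_1 T_2} < 1$ and the observed correlation drops even though $\rho_{XX'}$ has not moved at all.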

In the real world, things change between measurement opportunities.

Understanding the stability, or lack thereof, of a trait like Teaching Effectiveness is a prerequisite for identifying good lagging or leading indicators of it.

Restricting our use of stability to discussions of the trait or construct being measured and reliability to discussions of the measurement instrument might be a good first step.

Saying What We Mean and What We Meant to Say

It’s time to strike a better balance between the effort we put into improving the precision and accuracy of our measures and measurement instruments and the effort we put into the precision and accuracy of the way that we communicate about them and use them.

The four words discussed in this post are just the tip of the iceberg with regard to recovery and reinvention in public education. Appropriate use and a common understanding of words and phrases like growth, progress, test, assessment, opportunity-to-learn, equity, justice, and even mathematics achievement and reading achievement will be just as important, if not more important.

Image by Extravis_Marketing from Pixabay