Validity. Reliability. Fairness.
Reliability is one of the big three foundations of educational measurement and testing. It’s right up there in Part 1 of the joint Standards with a chapter of its own, alongside Validity and Fairness. We’re supposed to love and respect them all the same. But we all know that Reliability is treated differently than the other two.
Validity is the firstborn and most important. Measurement is nothing without Validity. Measurement’s journals are filled with articles about Validity. Everybody wants their pet issue to be considered a part of Validity’s inner circle (or two-by-two table, as it were).
Fairness is the baby. It took its spot in the front section of the Standards well after the others. (Perhaps it was an accident, or the unintended consequence of one too many drinks at the NCME President’s Reception.) Nobody really understands what Fairness is. Nobody even really pays it much attention; that is, until it starts to throw a tantrum. Then the only way to calm it down is to give it whatever it wants – NOW!
Reliability sits there in the middle. It doesn’t even really have its own identity anymore. It’s not enough to title the chapter simply Reliability. In 1999, it was Reliability and Errors of Measurement. In 2014, it became Reliability/Precision and Errors of Measurement.
Not convinced? Consider how each of those chapters in the Standards begins.
Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing and evaluating tests.
This chapter addresses the importance of fairness as a fundamental issue in protecting test takers and test users in all aspects of testing. [and after a couple hundred words explaining that fairness is so important and so complex that this chapter cannot possibly do it justice] Fairness is a fundamental validity issue and requires attention throughout all stages of test development and use.
We get it. Validity and Fairness, fundamental and important to all aspects of testing.
Contrast those openings with the beginning of the chapter on reliability.
A test, broadly defined, is a set of tasks or stimuli designed to elicit responses that provide a sample of an examinee’s behavior or performance in a specified domain. Coupled with the test is a scoring procedure that enables the scorer to evaluate the behavior or work samples and generate a score. [We’re two sentences and 52 words in and no mention of reliability yet.] In interpreting and using test scores, it is important to have some indication of their reliability.
That’s it. That’s the opening paragraph on reliability. It is important to have some indication of reliability. That’s a far cry from fundamental.
[Shoot. Even I am starting to feel sorry for Reliability, and I’m writing this stuff.]
Seriously, if you close your eyes and listen carefully, can’t you just picture Reliability throwing itself down on the sofa, its frustration overflowing, and lamenting “All I hear all day long is we need more evidence of Validity. Validity, Validity, Validity! I’m tired of being in Validity’s shadow all the time.”
Yeah, reliability is like the tagalong little sibling who always wants to hang out with you and your friends, or worse yet, the one that your parents send to the movies with you and your date.
Reliability – A Short Course
In short, the best you can hope for is that reliability doesn’t make the situation worse. It cannot possibly make it better.
And that’s what we were taught about reliability.
Those of us from the grayscale, line drawing, pre-digital image era remember the thermometer-like “validity bar” that demonstrated how the “total amount of validity” (whatever that means) was capped by reliability. Plain and simple, you couldn’t have any more validity than you had reliability.
(Younger folks undoubtedly were subjected to some fancy bullseye image and analogy, but I like the bar.)
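For readers who never met the bar: the cap comes straight out of classical test theory. If an observed score is a true score plus independent error, its correlation with any criterion cannot exceed the square root of its reliability. Here is a minimal simulation of that bound – a sketch, not a proof, using a hypothetical criterion set equal to the true score itself:

```python
import math
import random

random.seed(0)

def simulate_validity_cap(reliability: float, n: int = 100_000) -> tuple[float, float]:
    """Classical test theory sketch: observed = true score + independent error.
    Returns (correlation of observed scores with a perfect criterion,
    sqrt(reliability), i.e., the theoretical ceiling on that correlation)."""
    # Pick the error SD so that Var(true) / Var(observed) = reliability
    error_sd = math.sqrt((1 - reliability) / reliability)
    true_scores = [random.gauss(0, 1) for _ in range(n)]
    observed = [t + random.gauss(0, error_sd) for t in true_scores]
    mean_t = sum(true_scores) / n
    mean_o = sum(observed) / n
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(true_scores, observed))
    var_t = sum((t - mean_t) ** 2 for t in true_scores)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_t * var_o), math.sqrt(reliability)

validity, cap = simulate_validity_cap(0.64)
print(round(validity, 3), round(cap, 3))  # observed validity hugs the 0.8 ceiling
```

Crank the reliability down and the ceiling on validity drops with it; no amount of clever criterion-hunting raises it back.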
Trying to Find a Place in this World
So, reliability did what any middle child would do. It tried to find its niche, its place in the world, the thing that it could do better than validity and fairness.
The answer was obvious – statistics. Validity and fairness had no statistic, no indicator, no index. DIF statistics, bias indices, and so-called “validity coefficients” don’t count. Really, they don’t. Validity and fairness were conceptual, kind of like a state of mind. You had to gather evidence to support their existence, but you couldn’t really see them. It was all very context dependent. And it was a never-ending process.
Reliability would have statistics. Lots of statistics. Concrete statistics. Statistics that you can calculate right away to tell you if your measurement instrument is reliable or not reliable.
Reliability was giddy.
Those geeky, quantitative nerds love statistics – the poor bastards.
And… and… and, I’ll report every single one of those so very different reliability statistics on a scale that goes from 0.00 to 1.00, which they’ll internalize as 0 to 100, so that everybody and their sibling will think they understand it. Assessment contractors will latch onto reliability statistics to promote their tests. Technical Reports will feature reliability and barely even mention validity. Even the USED will require tables and tables of reliability statistics. It will be great.
When they want to administer a single-prompt writing test and worry that it cannot possibly be reliable, I’ll give them inter-rater agreement and they’ll call it reliability. Exact agreement not good enough, no worries, adjacent agreement on a 4-point scale is fine. Trust me.
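To see how generous that move is, here is a sketch of the two agreement rates, computed on hypothetical scores from a 4-point rubric (any resemblance to an operational scoring table is coincidental):

```python
def agreement_rates(rater_a: list[int], rater_b: list[int]) -> tuple[float, float]:
    """Exact agreement and adjacent agreement (within one score point)
    between two raters scoring the same responses."""
    n = len(rater_a)
    exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / n
    return exact, adjacent

# Hypothetical scores on a 4-point rubric
a = [1, 2, 2, 3, 4, 3, 2, 1, 4, 3]
b = [1, 3, 2, 2, 4, 4, 2, 2, 3, 3]
print(agreement_rates(a, b))  # → (0.5, 1.0)
```

On a 4-point scale, two raters have to disagree by two or more points to miss “adjacent,” so near-perfect adjacent agreement is cheap – which is exactly the point.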
Worried about alternate assessments? Relax, I love them most of all.
Oh, they’ve become so fixated on metrics, it will be easy.
I’ll have computer output that shows them how each and every item contributes to reliability and how much reliability will be improved when a “bad” item is eliminated. Will removing the item detract from validity? They won’t even ask the question.
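That output is easy to reproduce. A bare-bones sketch of Cronbach’s alpha and alpha-if-item-deleted, on made-up data where two items are perfectly parallel and a third is mostly noise:

```python
def cronbach_alpha(items: list[list[float]]) -> float:
    """items: one list of scores per item, all over the same examinees."""
    k, n = len(items), len(items[0])
    def var(xs):  # unbiased sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(var(item) for item in items) / var(totals))

def alpha_if_deleted(items: list[list[float]]) -> list[float]:
    """Alpha recomputed with each item removed in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

# Toy data: items 1 and 2 parallel, item 3 mostly noise
items = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [3, 1, 4, 2, 5]]
print(round(cronbach_alpha(items), 3))                 # 0.857
print([round(a, 3) for a in alpha_if_deleted(items)])  # [0.667, 0.667, 1.0]
```

The output cheerfully reports that dropping item 3 pushes alpha to 1.0; it stays silent on whether item 3 was the only item measuring something the other two missed.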
G-theory and performance assessments? Same thing. If they want to increase the “generalizability” of their assessment, I’ll show them that they just have to reduce task variance. Oh, a few may worry about violating the assumptions, but I’ll just tell them you know what happens when you ASSUME. That will be enough to confuse them.
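For the record, the arithmetic behind that sales pitch is just as tidy. Here is a sketch of the absolute (phi) coefficient for a one-facet persons-by-tasks design; the variance components are made-up numbers, not estimates from any real G study:

```python
def phi_coefficient(var_person: float, var_task: float,
                    var_pt_error: float, n_tasks: int) -> float:
    """Absolute (phi) generalizability coefficient for a persons x tasks design:
    person variance over person variance plus absolute error variance."""
    absolute_error = (var_task + var_pt_error) / n_tasks
    return var_person / (var_person + absolute_error)

# Hypothetical variance components from an imagined G study
print(round(phi_coefficient(4.0, 1.0, 3.0, n_tasks=2), 3))  # 0.667
print(round(phi_coefficient(4.0, 1.0, 3.0, n_tasks=8), 3))  # 0.889 (more tasks)
print(round(phi_coefficient(4.0, 0.2, 3.0, n_tasks=2), 3))  # 0.714 (less task variance)
```

Shrink a variance component or add tasks and the coefficient climbs; whether the assumptions behind those components hold is, as promised, a question the output never asks.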
I’ll tell them what they need to hear, and they’ll love me.
Is it valid? Well, you see, that’s not really the right question…
Is it fair? Fair to whom and for what purpose? At this time, under those conditions …
Is it reliable? Yes, of course. The reliability is .95.
It Didn’t Have to Be That Way
It wouldn’t have been hard for the Standards chapter on Reliability (oh, I’m sorry, Reliability/Precision, Errors of Measurement, & The Perils of Poor Judgment) to begin something like this:
Reliability refers to understanding the uncertainty that is to some degree associated with every test score. Reliability, therefore, is fundamental to the interpretation and use of tests and test scores.
We can quibble over where to place the word ‘understanding’ but you get the idea.
Would it have helped? Would it have made things better?
It couldn’t have hurt. It couldn’t have made them worse.
If the Standards had been written by or for test users (i.e., the people who have to interpret test scores and live with the consequences of their use) instead of by and for the people who design, develop, administer, and have to defend tests in court, things might have worked out differently for reliability.
We might have been able to skip much of the middle-child drama and focus on the fact that reliability is just as conceptual as validity and fairness. With reliability, we are interested in understanding how consistent a student’s score might be “across different occasions of testing, different editions of the test, or different raters scoring the test taker’s responses.” The range of possibilities under each of those conditions is infinite, to some extent hypothetical, and offers its own unique set of reliability challenges.
Take a Deep Breath in the Mirror
There is not one type of reliability. Reliability is not a statistic that can be computed precisely to several decimal places and reported on a scale from 0 to 100. And we cannot consider reliability in isolation, separate from validity and fairness. Sorry, it just doesn’t work that way.
We need to rethink our understanding and treatment of reliability as a concept and as an evidence-gathering process. We need to think about how to effectively communicate uncertainty (hint: probably stop calling it error). We need to embrace uncertainty in educational measurement, and not approach it from a deficit mindset.
I’ve spent the last eight years thinking that our infatuation with, misuse of, and misunderstanding of reliability would never end. But now, with the extensive interest in the interpretation and use of test scores and the increasing focus on the utility of testing programs, on a Monday in late January, I am hopeful that I am watching the field begin again.