During the first pandemic summer I attended a virtual NCME session organized by Derek Briggs titled, “Teaching and Learning ‘Educational Measurement’: Defining the Discipline?” In that session, distinguished panelists addressed the critical question, “What should it mean to be an educational measurement expert in the future?” Later in 2020, as president of NCME, Derek convened a Task Force whose charge was:
- To develop and maintain foundational competencies for the field of educational measurement.
- To illustrate one or more curricular models for a graduate program in educational measurement.
- To engage NCME membership and the field with Task Force findings through conference presentations and published journal articles.
Following that virtual session, and again last week as I read through the draft report of the Task Force, I began to look back on my three-decade career in large-scale assessment and the decade of school-based training that preceded it and ask myself a series of questions:
- What were the competencies that I relied on most in my various roles, and when and how did I acquire them?
- Which competencies were foundational?
- Which competencies were developed over time, and perhaps can only be developed through time and experience?
- Which of those competencies could be described as competencies in educational measurement?
Staying Ahead of The Curve
Not Falling (too far) Behind the Curve
(I couldn’t decide which heading was the better choice for this section.)
All of my school-based training in education and educational measurement occurred in the 1980s
- From undergraduate courses in 1980 and 1981 (sociology, psychology, philosophy, social psychology, statistics for the social sciences, education policy),
- to my M.Ed. in educational research in 1983,
- to my Ph.D. in measurement and evaluation with an unofficial minor in industrial/organizational psychology, defending my dissertation in November 1989.
Accordingly, I learned how to submit statistical analyses on IBM punch cards or from remote terminals, wrote my own 3-way ANOVA and correlation programs in FORTRAN, took a series of courses based on Classical Test Theory, and learned a lot about the three real types of validity plus face validity.
The foundational competencies in all three of those critical domains of educational measurement (technology, psychometric theory, and validity) were drastically different when my career in large-scale testing began at Advanced Systems in the fall of 1989.
On my first day at Advanced Systems, I was placed in front of a PC, given a SAS manual, and tasked with writing programs to perform school-level IRT analyses. And the field was abuzz with talk of the Messick chapter that had been published while I was busy completing and defending my dissertation. Shortly thereafter, I received the floppy disk containing BILOG and the “manual” which contained a list of commands; and a little later a floppy disk containing MULTILOG along with a nice letter from David Thissen, but no manual.
Rich Hill’s boast that I didn’t know how to spell IRT when he hired me was not quite accurate. I had walked over to the psychology building in Minnesota to audit a couple of courses by David Weiss. But it was close.
Most people are not going to begin their careers in a year like 1989, a defining moment, with that trifecta of change in the field and in technology, but change will happen.
At some point competencies in IRT became foundational; then it was modeling, with courses in SEM and HLM. Then perhaps Bayesian statistics and some data science-y, learning analytics tools became foundational competencies. Soon, if not already, it will be machine learning and computational psychometrics. Which competencies are jettisoned, and which are integrated into the growing mix? Does the level of foundational competency become thinner as the list of foundational competencies grows longer?
And then there’s data visualization, with its principles and software, which has to be a foundational competency in any educational measurement program. A picture’s worth a thousand words. Possessing the competencies needed to conceive of and produce the picture – priceless.
It’s a cliché, but learning how to learn is probably the most important foundational competency in any K-12, undergraduate, or graduate program. The field is going to continue to evolve, and required knowledge and skills are going to change.
When considering foundational competencies and graduate programs, therefore, it may be important to determine which educational measurement competencies and principles
- are universal and constant,
- are universal but will likely have to be applied in different settings and contexts, and
- are likely to be replaced by different competencies in the near future.
A Catch-22, however, is that the period when you are most likely to need the concrete, procedural competencies that are most susceptible to change is early in your career, when you are fresh out of graduate school.
A Place in This World
At the outset of his recent book Derek Briggs emphasizes that “testing and measurement are two distinct activities” and as I discussed extensively in a post earlier this year, I view myself as a testing specialist (or assessment specialist). The broad definitions that the Task Force offers for educational measurement and careers in educational measurement, however, suggest to me that educational measurement involves both testing and measurement.
That perspective on educational measurement makes me wonder where a career like mine fits in this conversation about foundational competencies in educational measurement – what I would need from a graduate program in educational measurement, or more generally, where I fit within the world of educational measurement.
At various points in my career, I was called a psychometrician, which only further clouds the issue in my mind. Psychometrics, applied to large-scale testing, seems to fall somewhere in the intersection of the disciplines of statistics and measurement; and the role of the practicing psychometrician is defined broadly enough to include the work of both testing and measurement specialists.
The distribution of people laying claim to the title psychometrician includes the special few, those truly scary people (in the best sense of the word scary) who develop new psychometric models; a larger group who apply those various models in traditional and innovative ways to design, develop, administer, and evaluate results from measurement instruments; and then the big pile of people in the fat part of the curve at the far left who simply run psychometric procedures in operational and research settings. It is without a doubt a positively skewed distribution.
And what to make of those people who spend the better part of their career as managers and then advisors, far removed from data, interpreting the outcomes of psychometric analyses?
The question “Who should be called a psychometrician?” or “What qualifies someone to be a psychometrician?” is one that the late Kevin Sweeney and I discussed on several occasions during down times at standard setting meetings or TAC meetings. For Kevin, the defining characteristic was conducting an equating analysis. If you had been responsible for equating tests in an operational large-scale assessment program, then you had earned the right to be called a psychometrician. That criterion works for me because truly conducting an equating analysis from design to interpretation, not simply running data through STUIRT, is a cumulative, culminating competency.
So, when thinking about foundational competencies for educational measurement, I might ask what competencies are necessary to be able to equate two or more test forms, then explain and stand behind the results. Because the reality is that a psychometrician is likely going to be required to perform that task fairly early in their career.
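Equating from design to interpretation involves far more than arithmetic, but a minimal sketch of the simplest case, linear (mean-sigma) equating under a random-groups design, gives a flavor of the mechanics that sit beneath the design and interpretation work. All scores and form names below are invented for illustration.

```python
# Minimal sketch of linear equating under a random-groups design:
# place Form X scores on the Form Y scale by matching means and SDs.
# All numbers here are invented for illustration.

from statistics import mean, pstdev

form_x = [12, 15, 18, 20, 22, 25, 27, 30]  # scores from the group taking Form X
form_y = [14, 16, 19, 22, 24, 26, 29, 31]  # scores from the group taking Form Y

def linear_equate(x_scores, y_scores):
    """Return a function mapping a Form X score onto the Form Y scale."""
    mx, sx = mean(x_scores), pstdev(x_scores)
    my, sy = mean(y_scores), pstdev(y_scores)
    slope = sy / sx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

to_y_scale = linear_equate(form_x, form_y)
equated = [to_y_scale(x) for x in form_x]
```

An operational equating would add design choices (anchor items, equivalent groups), model fit checks, and equating error estimates; none of that appears in this toy transformation, which is precisely the point about "running data through" a program versus conducting the analysis.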
Building a Strong Foundation –
Education, Measurement, and a Whole Lot More
I cannot think about the term foundational competencies without thinking about the distinction that Kane made between licensure and certification on several occasions, including in an article on the bar exam:
Bar examination scores can be interpreted as measures of competencies that are critical for practice in the sense that they are necessary, but not sufficient, for effective performance in practice. A high level of achievement in the critical competencies does not guarantee success in practice, but lack of competence would be a serious impediment in practice and would tend to put clients at risk.
With that distinction in mind, I returned to the question of foundational competencies in educational measurement and arrived at ten understandings that a person practicing educational measurement must possess so as not to put clients at risk.
Distinguishing between measurement and statistics – it’s all rock and roll to me
The line that demarcates statistical and measurement processes and procedures is quite fuzzy to begin with, and I can only imagine that the distinction between the two is totally lost on many graduate students learning those procedures for the first time. We set sail with the good ship S.S. Stevens, showing students that the mean of two numbers might be meaningless. We need to continue to communicate that a procedure may not be appropriate, and its results are not necessarily meaningful, just because a program converges and produces output.
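The S.S. Stevens point can be made concrete in a few lines. The sketch below uses invented performance-level labels and codes; the "average" of ordinal codes depends entirely on an arbitrary coding choice.

```python
# Toy illustration (invented labels and codes) of the Stevens levels-of-
# measurement point: a mean of ordinal codes can be meaningless, because
# any order-preserving recoding is equally legitimate yet changes the mean.

rank_codes = {"novice": 1, "developing": 2, "proficient": 3, "advanced": 4}
responses = ["novice", "advanced"]

codes = [rank_codes[r] for r in responses]
mean_code = sum(codes) / len(codes)
# 2.5 sits "between" developing and proficient, but the labels carry only
# order, not distance, so that interpretation is not licensed by the data.

# A different order-preserving coding yields a different "average":
alt_codes = {"novice": 1, "developing": 2, "proficient": 3, "advanced": 10}
alt_mean = sum(alt_codes[r] for r in responses) / len(responses)
```

The program happily computes both means; nothing in the output warns the analyst that neither number has a guaranteed interpretation.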
Peel back the curtain, but look for the right things when you do
If you’re familiar with my posts, you know there is no bigger advocate than I for understanding and learning from the past. Still, I find myself bemused by people who are now hyper-concerned about the politics, religious beliefs, and breakfast choices of the person who developed or used the procedure decades or a century ago, but have no qualms about using the output from a statistical software package, blissfully unaware of what is taking place within that black box.
Personally, I am much more concerned about whether a particular procedure can capture differential growth within a subgroup of the population or how robust it is to inevitable violations of its assumptions than whether it may have been developed by someone who promoted eugenics or whether the results it produces have the potential to be misused. (Spoiler alert: Results can always be misused.)
There’s more than one type of error. They are all different – and possibly significant.
Sampling error, non-sampling error, measurement error, equating error, random error, systematic error, …. How are they different? Why does it matter? For which is sample size irrelevant? What is their impact?
And just for fun, try to come up with a good way to depict and communicate error on individual student score reports.
There’s more than one type of reliability. They are all different
We learn very early on that there are different types of reliability: test-retest, internal consistency, parallel forms, inter-rater. It’s imperative, however, that students understand that they are not interchangeable.
Although it will seem counterintuitive, particularly when dealing with output from statistical packages, it’s also critical that students understand that simply maximizing reliability is seldom the goal.
The data don’t fit the model v. The model doesn’t fit the data
Understand the difference between those two statements, and why that difference matters.
Why Am I such a Misfit?
There is often a lot to learn from examining which students, schools, items, etc. don’t fit the model.
Learned v. Learning
There is a fundamental difference between “measuring” what’s been learned, which has been a primary focus of those practicing educational measurement, and understanding the learning process, including understanding the barriers to and facilitators of learning, which is often the primary focus of those using measurement to support education. The former is but one piece of critical data for the latter.
Outside-In v. Inside-Out
Historically, educational measurement has sought patterns in large groups to make inferences about individual students. Other branches of psychological measurement explore deep within the brains of individuals in search of answers that might be applied to groups of students. Neither approach is complete or sufficient for educational purposes.
Tests are not testing programs
Without drifting into the debates about assessment v. testing or the role of consequences and use in validity, the simple fact is that tests are different from testing programs. Angoff noted this decades ago when the Standards changed their focus from tests to testing. It’s more important now than ever before that students understand the differences and the implications of those differences.
And accountability systems are neither tests nor testing programs.
Why? So What? What Else? What If? What About?
These five questions may be the most powerful tools in the educational measurement professional’s toolbox. Like all powerful tools, they can be dangerous when mishandled. Knowing when and how to use them and the ability to recognize when they are being used inappropriately is a critical competency that likely builds over time.
Just one more thing…
In closing, I have one final thought for those designing programs in educational measurement with the goal of producing the educational measurement experts of the future. I stated earlier that the Task Force had broadly defined educational measurement and careers in educational measurement:
Educational measurement involves measurement of knowledge, skills, dispositions, and abilities for some educational purpose, such as supporting learning, certifying learning, or identifying policies and practices that improve learning.
The question that nags me, however, is whether that definition of educational measurement is actually too narrow. For a long time, I have wondered where measurement of the system of schooling fits within educational measurement. The question seemed particularly relevant a few years back when we tried to measure teacher effectiveness and teachers’ impact on student learning.
We may model that students are nested within classes within schools within districts, etc., but we rarely directly address how, or whether, system dynamics fit within educational measurement. As the field moves toward more complex modeling of student learning, I can only assume that it will become increasingly important to address this issue head on. If school system dynamics don’t reside within educational measurement, where do they reside?