Matrix Sampling: Resurrected

It is impossible to read an article or hear a presentation about the future of large-scale state testing without some discussion of matrix sampling.

If your primary concerns about large-scale testing are time and cost, the answer is matrix sampling.
If your primary concern is coverage of comprehensive and complex standards like the Next Generation Science Standards, the answer is matrix sampling.
If your primary concern is too much standardization, matrix sampling offers alternatives.
If your primary concern is engagement and equity, consider matrix sampling.

Matrix sampling is the pill that will make everything better.

Would you like the red pill or the blue pill?

Welcome to The Matrix

Before you make your choice, perhaps a word or two about what is meant by the term matrix sampling would be helpful.

In common parlance, you may see the term matrix sampling used to describe any testing plan which does not require

a) testing all students every year in every grade level in English language arts and mathematics and/or

b) testing all students at a particular “year x grade x content” combination with the same test.

What can be sampled in the matrix?

You may sample items with designs in which all students are tested, but do not take the same set of items. There are a variety of approaches to sampling items.
You may sample students such that not all students are tested in a particular content area within a given year. Students within a school might be randomly assigned to the Reading, Writing, Mathematics, Science, or Social Studies test – or perhaps two of those tests.
You may sample content areas, deciding, for example, that it’s important to test English language arts at grades 3, 5, 7, and 10, and to test mathematics at grades 4, 6, 8, and 9.
If the mood strikes, you may even decide that it makes sense to sample districts or schools, testing one-third of the schools each year on a three-year cycle, or one-fifth of the schools each year on a five-year accountability and accreditation cycle.
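
For readers who think in code, the first approach above (sampling items) can be sketched in a few lines. This is a minimal illustration, not any state's actual design: the function name `assign_forms`, the pool of 400 items, the 10 forms, and the 120 students are all hypothetical numbers chosen for the example.

```python
import random

def assign_forms(item_ids, student_ids, n_forms, seed=0):
    """Split an item pool into disjoint forms and give each student one form.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Shuffle the pool, then deal items round-robin so the forms
    # are the same size (within one item of each other).
    items = list(item_ids)
    rng.shuffle(items)
    forms = [items[i::n_forms] for i in range(n_forms)]

    # Shuffle students, then hand out forms in rotation so each form
    # is taken by roughly the same number of students in the school.
    students = list(student_ids)
    rng.shuffle(students)
    assignment = {s: i % n_forms for i, s in enumerate(students)}
    return forms, assignment

# Assumed numbers: 400 items, 10 forms, 120 students in one school.
forms, assignment = assign_forms(range(400), range(120), n_forms=10)
```

Under these assumptions every student answers 40 items, while the school as a whole responds to all 400. That is the trade: broad coverage for the school score, at the price of students who did not take the same test.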

Any, all, or a combination of any and all of the approaches to sampling items, students, content areas, or schools may be considered, constrained only by your imagination, your budget, and what you are trying to accomplish.

Ah, there’s the catch. You do still have to identify what it is that you are trying to accomplish via your large-scale testing program and determine whether, and if so, how, sampling might help you reach your goal.

Ah, another catch. Will matrix sampling actually help you reach your goal? There are technical underpinnings, inferences supported and not well-supported, and consequences (intended and unintended) associated with each form of sampling.

Now

Would you like the red pill or the blue pill?

This is your last chance. After this there is no turning back.

The Blue Pill – Continue to Believe Testing is the Problem

You take the blue pill… you wake up in your bed and believe whatever you want to believe.

What our field and the broader education community have wanted to believe is that the problem to be solved is a testing problem. As I listed above, large-scale testing takes too much time and costs too much money, it doesn’t measure the right things, it’s not engaging or culturally relevant. Matrix sampling can help with all of those issues.

Really, it can. We’ve done it before.

I know that it was real because I saw it.

We used matrix sampling successfully in the 1980s and 1990s.

In Massachusetts, school scores in grade 8 mathematics were based on approximately 400 items – instead of the 40 that remained when we abandoned matrix sampling in favor of reporting of individual student scores.
In Maine, we tested and reported school scores in reading, mathematics, science, social studies, arts and humanities, and writing.
In New Hampshire, we convinced people to accept individual student scores based on matrix-sampled items.
In Kentucky … no, I’m still not ready to talk about Kentucky.

With matrix sampling, narrowing of the curriculum was not as much of a concern as it later became. As expressed by Jim Popham in the 1993 Kappan article, Circumventing the High Costs of Authentic Assessment,

The recommendation for matrix sampling is based on the assumption that educators are influenced by what is eligible to be assessed just as much as they are influenced by what is actually assessed on any given form of a test.

As the summary of the article states, the goal is to assess only the numbers of students and assessment tasks necessary to influence educators’ instructional efforts.

“the trick is to sample the smallest number of students possible in a school so that a meaningful per-school estimate can be calculated…”

In an ideal world there would be so many performance-based items available covering the entire curriculum. There was also the view expressed by my TAC Yoda, Dale Carlson,

“In advance all 153 writing prompts release I could. Hmmm.”
“To write well kids would learn. Yes.”

(At least that’s how I remember it now. Fun fact: In 1990-91, Dale became the second and last NCME President to come from a state Department of Education.)

Authentic, performance-based assessment is the answer. Matrix sampling makes it feasible.

Ignorance is bliss.

The Red Pill – Can You Handle the Truth?

You take the red pill… and I show you how deep the rabbit hole goes.

You take the red pill, and the meaning of the term statistical techniques suddenly becomes clear. It was there in front of you all along, but now you know – there is no spoon.

The first thing you understand about matrix sampling is that you get a different answer when you estimate a school score directly with IRT than when you aggregate individual student scores. Methods and models matter.

But that’s OK, there are statistical techniques.

Next you will fully realize that content isn’t the only thing that our IRT models assume is unidimensional. So much rests on the assumption that all of the students thrown into your model are drawn from a single, homogeneous population – that they are interchangeable widgets as they interact with items.

But, but… What about the socio-cognitive-cultural? But NAEP does it?

There are statistical techniques. Throwing out items with ‘c-level DIF’ was the tip of the iceberg. You’ll be amazed at what is done in the matrix. Why, in Kentucky, we …. No, I just can’t go there. Too soon.

It will start to dawn on you that it’s not the statistical techniques that are the problem, but rather the assumptions about students, schools, and systems we are making blindly, because, you know, the blue pill.

Then you will think about the primary reasons for administering large-scale tests that we have been reciting repeatedly from Horace Mann to Arne Duncan.

You want assurance that all schools are providing effective instruction to all students in the state.
You want to close the “Honesty Gap” – states or schools or teachers painting a far rosier picture than reality for parents and the public.
You want to demonstrate good assessment methods that teachers should use in the classroom (seriously, we have said that).
You want to provide better and more timely information to teachers to inform instruction and improve student learning.
You want to be able to know that “things are getting better” – that more students each year are proficient and competent and engaged and well-fed; are early to bed, early to rise and are finding themselves happy, wealthy, and wise.

I’m sorry. This is a dead end.

Then, after about nine years in The Matrix, it’s going to hit you that none of those are problems that can be solved by improving the large-scale state test.

In fact, none of those ever was a testing problem.

Even if you improve the underlying (or are they overarching?) systemic, societal problems affecting public schools, these are still not testing problems.

They are certainly not measurement problems.

If we want to apply the distinctions between measurement, testing, and assessment that have been described recently by Greg Cizek, Derek Briggs and others (and why wouldn’t you?), school effectiveness and accountability may be an assessment problem.

There is, and always was, a data collection problem.

And at one point, a long, long time ago, large-scale standardized tests may have offered the best solution to that data collection problem. But that was then, this is now, and yes, This is Us – The Final Chapter, Tuesdays at 9.

There is a communication problem.

There is a curriculum-instruction-assessment problem.

There are problems with educational measurement and the use of measurement in education.

As Chris Brandt and Juan D’Brot wrote recently, there are no silver-bullet solutions to these complex and layered problems. That’s the problem with problems.

After careful reflection, you will likely realize that large-scale testing (improved large-scale testing) can, at best, play a small role in the solution of your problems.

You will want to put your fist through a concrete wall, but the walls surrounding you aren’t made of concrete.

At this point, you may be so entrenched in your career and/or so close to retirement that you are tempted to simply take a drink, perhaps a few more pills, and continue on with all the others who took the blue pill at the beginning.

Ignorance is bliss.

People always prefer security and comfort.

We do only what we’re meant to do.

But you look around and see people breaking free from The Matrix.

Some leave large-scale testing and go to work more closely with schools and districts, some even returning to teaching.

Some leave large-scale testing to work on testing, measurement, and assessment problems and solutions more closely tied to instruction and student learning.

Some, who have always seen themselves as large-scale testing people, seek out problems with large-scale testing solutions.

You even recognize a precious few people who took the red pill still fighting valiantly to improve K-12 large-scale state testing from inside the matrix.

And you realize that your mission is to just remind people what a free mind can do.

Image by Gerd Altmann from Pixabay