Organizations, institutions, and individuals who have been called to educational measurement and/or educational assessment as their pathway through life have joined others this summer in an exercise in introspection and self-reflection. The goal of this exercise is to emerge with a better understanding of how we, through our life’s work, have contributed to, endorsed, promulgated, perpetuated, or otherwise facilitated racist beliefs, actions, policies, and systemic racism. This is not a small task for a field that has struggled to define itself and its place in education, let alone in society as a whole.
As summer ends, a new school year begins, and people begin to report out on what they have discerned, an image of educational measurement and assessment is coming into focus. After listening to virtual NCME sessions and other webinars, and reading articles, blog posts, and Zoom chats over the past month, if I were to create a word cloud of the terms used most often to describe our field, the two largest words by far would be eugenics and Jim Crow. A solid line is being drawn connecting Galton to Terman to aptitude tests to tracking to literacy testing in the southern United States to No Child Left Behind to current test-based accountability policies and the “new Jim Crow.” In addition, the bodies of knowledge that we measure and the U.S. school systems in which we practice our craft are being characterized by some as fundamentally racist, branding educational measurement and assessment as racist simply by association and continued participation.
Viewing this from the perspective of a lifelong practicing Catholic, I can appreciate, and even respect, healthy doses of self-doubt, guilt, and penitence accompanied by a satisfying amount of self-flagellation. I must, however, draw the line at self-immolation, and that is what I fear is taking place before our eyes. I think that we can all agree that it is not a good omen when a self-generated list of the defining characteristics of your field evokes emotions so strong that they effectively shut down conversations and further explorations that might lead to a deeper understanding of our history and how we arrived at our current policies and practices.
Some fields may be able to use this summer’s epiphany as a jumping-off point for renewal and growth. We must remember, however, that the events of this summer are not occurring in a vacuum (even though it feels like the pandemic has sucked the air out of everything we know and do). Even prior to COVID-19 and George Floyd, many already viewed 2020 as an inflection point for educational measurement and assessment. (Yes, I know that “inflection point” is already one of the most overused buzzwords of 2020, but who has a better claim to it than we do?) We were already in a fight for survival – under attack from the outside and stressed to a breaking point from the inside.
From outside of the field, 2019 was the year that began with AERA president Amy Stuart Wells using the platform of her presidential address and a multi-modal format to portray educational assessment policies as the new Jim Crow. We entered 2020 with a lawsuit challenging the use of college admissions tests, the current Democratic candidate for president promising to ban high-stakes standardized testing, and the University of California system eliminating the use of the ACT and SAT for admissions.
Internally, our traditional approaches to large-scale assessment are being stressed on several fronts. New, complex content and performance standards require similarly complex assessments and measurement models. The ongoing effort to provide actionable information to inform instruction has been joined by a new focus on assessment in the classroom and through-course testing, and testing on demand is replacing once-a-year, on-demand testing. The availability of process data has increased the pressure to expand educational measurement beyond the measurement of student outcomes.
I come here today neither to bury educational measurement and assessment nor to praise it. I come here not to defend historical and current practices nor to suggest that we back away from the challenges set forth by my colleague Susan Lyons in her blog post, We Are Part of The Problem.
Rather, I am here as a (re)tired, old, white male with a blog that reaches nearly tens of people to suggest that if we hope to contribute to anti-racism and a more equitable system of education, we need to begin with a much deeper and more nuanced understanding of our field than eugenics and Jim Crow (old and new).
In so many ways, we are a people and a field defined by the unintended consequences of our actions. I suppose that is fitting. After all, so many advances in science and opportunities for personal growth are byproducts of lessons learned from unintended consequences. It would be nice, however, if the next generation of psychometricians and assessment specialists can learn from the lessons of our past and do a better job than we did in proactively defining our work and its purpose. In that spirit of hope, I offer the following six pieces of advice to a field trying to find itself and find its way into the future.
Know Thyself
Know Thyself
inscription at the Temple of Apollo at Delphi
It has become commonplace to cite Francis Galton and his work in eugenics as the beginning of modern educational measurement. It is certainly true that there is a strong association between Galton’s statistical innovations and the unidimensional IRT models that are the foundation of so many of our current assessments. However, a reminder that correlation is not causation seems uniquely suited to attempts to link current IRT models to Galton’s work in eugenics. Ironically, it would also be a case of the genetic fallacy to condemn the current use of Galton’s statistical methods solely on the basis of his use of them for his work in eugenics.
It is also true that aptitude tests, which also can be traced back to Galton’s interest in individual differences, made their way into testing in the United States in the first part of the 20th century. Although those tests are much better known for their use outside of K-12 education (e.g., college admissions, military job classifications), there were certainly deleterious effects on groups and individuals from the use of aptitude tests for placement in K-12 programs that must be understood as part of our history. There is ample evidence, however, that in many cases the use of aptitude or intelligence tests in K-12 and higher education was intended to promote meritocracy over aristocracy; that is, to enhance equity. Perhaps the appropriate place for aptitude testing in our history is as the first of many unintended consequences of assessment policy.
The bigger problems with focusing on Galton and aptitude testing as the foundation of our field, however, are that a) it positions us as psychologists interested in the study of individual differences, which clearly has not been the case in recent history, b) it ignores the longstanding role that behavioral psychology has played in education in the design of curriculum, instruction, and assessment, and, perhaps most relevant to the current conversation, c) it ignores the beginnings of test-based accountability in U.S. education and the intense desire for efficiency in testing, which has been a major force driving assessment for the past 100 years.
A history of educational measurement and assessment in the United States must go back a few decades prior to Galton and eugenics to include Horace Mann and his introduction into public education of written achievement tests for accountability purposes.
A history of educational measurement and assessment in the United States must also acknowledge the conditions that demanded an efficient method of assessing students. The quest for efficiency in testing has been constant, from the influx of children of non-English speaking European immigrants into the public education system around the beginning of the 20th century through to the demands of No Child Left Behind to test every student in grades 3 through 8 (and once in high school) every year, which led Assistant Secretary of Education Susan Neuman to proclaim, “let them use multiple-choice tests.”
Because Science
Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less.
Marie Curie
Many of us involved in educational measurement and assessment fancy ourselves to be scientists. By nature and definition, scientists are a curious and inquisitive lot – even a casual read of the Next Generation Science Standards (NGSS) will confirm this. As scientists involved in educational measurement we must seek to better understand factors that influence student learning and apply that knowledge to educational practice. As scientists involved in educational measurement we must ask uncomfortable questions about individual and group differences and be prepared to deal with the answers. It has always been so and will always be so (another bit of historical context that we lose by truncating our history to the late 1800s).
At this point in our history, society seems to be comfortable with research that finds that environmental factors influence student learning and will excitedly embrace information that demonstrates that cultural factors influence student learning. Societal values and perhaps historical consequences, however, predispose us to reject genetic explanations for group differences.
For the past 30 years or so, the field has not had to address questions related to genetic differences. Since the shift from classical test theory to IRT as the dominant force in educational measurement and assessment design there has been a growing separation of psychometrics (i.e., measurement) from educational psychology (again, correlation does not necessarily indicate causation, but it might). To the chagrin of many, psychometrics has functioned relatively free of research related to human development. There are signs, however, that learning sciences, cognitive science, and neuroscience will play a much greater role in educational measurement moving forward. As a field, we must be prepared to ask hard questions, question easy answers, and be prepared to change course based on knowledge gained from new information, because science.
It’s Supposed to Be Hard
“It’s supposed to be hard. If it wasn’t hard, everyone would do it. The hard is what makes it great.”
Jimmy Dugan, A League of Their Own
Educational measurement, or the measurement of education, is hard, and it’s supposed to be hard. Education is quite complex. As a field, however, we must acknowledge that much of what we have attempted to measure throughout our history has been the lowest-hanging fruit on the tree of educational knowledge. We have focused almost exclusively on student-level educational outcomes; and, driven by the need for efficiency, those outcomes have been limited to the most basic educational outcomes that can be measured within tight constraints of time, cost, and ease of administration.
Beyond outcomes, we have paid little attention since the 1980s to measurement related to learning and the instructional process. We have also paid little attention to the measurement and assessment of non-cognitive skills such as social-emotional skills and their interaction with traditional cognitive knowledge and skills. That is not to suggest that research on instructional practices and their effectiveness has not continued over the past 30 years; only that the measurement and assessment community has been largely absent from that research and that research has been largely absent from our assessments.
To remain relevant, the measurement community will be asked to do much more in the near future. As we near the quarter mark of the 21st century, stakeholders are demanding more of educational measurement and assessment, including measurement of more complex individual knowledge and skills, measurement of 21st Century Skills such as collaboration and critical thinking, and measurement of social-emotional skills. Retooling current psychometric models and developing new models to meet these demands while also ensuring equity in measurement will be quite difficult and will likely require dramatic changes in the field.
Dream Things That Never Were
There are those that look at things the way they are, and ask why? I dream of things that never were, and ask why not?
Robert Kennedy, paraphrasing George Bernard Shaw
Given the significant effects that the federal requirement for test-based accountability has had on education and educational measurement in the United States, it is important to understand the role of race in its origins in 1965 and renewal in 2001.
The passage of the Elementary and Secondary Education Act of 1965 (ESEA) is regarded as the beginning of the current era of federal involvement in education, including educational measurement and assessment. As part of a package of Civil Rights legislation passed in the mid-1960s, Title I of ESEA sought to improve equity by providing funding to economically disadvantaged school districts so that all children could have access to a basic education. Senator Robert Kennedy of New York was a leading proponent of including requirements for testing in Title I to hold states and LEAs accountable to their local communities and Congress for the effective use of Title I funds. If you have questions regarding the motivation of Sen. Kennedy with regard to race, I urge you to read or listen to his Day of Affirmation Address delivered at the University of Cape Town in South Africa or his Speech on Race delivered at the University of California, Berkeley in 1966.
A similar sentiment regarding improving equity can be found in the words of George W. Bush leading up to the passage of NCLB in 2001.
Too many American children are segregated into schools without standards, shuffled from grade-to-grade because of their age, regardless of their knowledge. This is discrimination, pure and simple — the soft bigotry of low expectations. And our nation should treat it like other forms of discrimination: We should end it. (Republican National Convention, 2000)
It is only through understanding the original intent of these programs (i.e., to enhance equity) that we can fully understand and evaluate the unintended consequences that followed.
Know Your Limitations
A man’s got to know his limitations.
Harry Callahan (aka Dirty Harry), Magnum Force
A good test cannot fix an inappropriate measurement model or bad policy. The implementation of a good policy can be derailed by a poorly designed test. A good measurement model and a good test will not save an ill-conceived policy. I will expand on that final point in an upcoming post, AYP Would Have Sucked Just As Much Without Tests.
There is a tendency to treat every problem related to the use of tests as a measurement problem; and in recent years, the measurement community has wholeheartedly accepted this yoke upon its shoulders. Unfortunately, conflating problems of policy or procedure with actual measurement problems not only makes it more difficult to identify and correct policy and procedural problems, but also serves to draw attention away from problems that actually are measurement problems – poorly defined constructs and non-falsifiable validity arguments come to mind as examples.
A field has to know its limitations and boundaries. Individuals within the field have to know their limitations and boundaries. Although educational measurement can certainly contribute to inequities in education, it cannot eliminate them on its own. This sentiment should not be interpreted as the testing equivalent of “guns don’t kill people, people do.” Rather, it is intended as an admonition that we cannot simply measure our way to equity and improved student learning.
In previous presentations and posts, I have cited the story that has become known as the parable of the three stonecutters:
An old story tells of three stonecutters who were asked what they were doing. The first replied, ‘I am making a living.’ The second kept on hammering while he said, ‘I am doing the best job of stonecutting in the entire country.’ The third one looked up with a visionary gleam in his eyes and said, ‘I am building a cathedral.’
In most retellings of the story, the third stonecutter is held in the highest regard; and that is a message that we hear often in educational measurement and assessment, for example, “It’s all about student learning. Period.” My advice to the measurement community, however, is to be aware of the big picture or larger goal, but to understand our role in accomplishing it and focus on doing the best job of stonecutting in the entire country.
Remember the Past
“Those who cannot remember the past are condemned to repeat it.”
George Santayana
As the field moves forward, it will be critical to remember and address one of the most important lessons from our past (and present). There is a fundamental disconnect between the personalized, content-based, actionable information that stakeholders seek from assessment and psychometric procedures designed to model group performance on a unidimensional construct. More complex psychometric models, more thoughtful reporting of assessment results, and better communication with stakeholders can only partially solve the problem.
We will likely attempt to address the issue through increased “personalization” of assessments. To the extent, however, that personalization is based on modeling driven by psychometrics, data science, or machine learning we must be sufficiently cautious and humble in the claims that we make about individuals, aware that such model-based and data-driven approaches will privilege “typical” behavior and can exacerbate built-in biases.
We must always be wary of making assumptions or claims about individual students based on the central tendencies of the group, or groups, to which we assign them based on coarse characteristics. There is often more variation within groups than between them. That fact will not change as we attempt to design assessments, report results, and use assessments in ways that are more culturally aware and appropriate.