Methodology, measurement and data

Kelli A. Bird, Benjamin L. Castleman, Yifeng Song.

Predictive analytics are increasingly pervasive in higher education. However, algorithmic bias has the potential to reinforce racial inequities in postsecondary success. We provide a comprehensive and translational investigation of algorithmic bias in two separate prediction models -- one predicting course completion, the second predicting degree completion. Our results show that algorithmic bias in both models could result in at-risk Black students receiving fewer success resources than White students at comparatively lower risk of failure. We also find the magnitude of algorithmic bias to vary within the distribution of predicted success. With the degree completion model, the amount of bias is nearly four times higher when we define at-risk using the bottom decile than when we focus on students in the bottom half of predicted scores. Between the two models, the magnitude and pattern of bias and the efficacy of basic bias mitigation strategies differ meaningfully, emphasizing the contextual nature of algorithmic bias and attempts to mitigate it. Our results moreover suggest that algorithmic bias is due in part to currently available administrative data being less useful at predicting Black student success compared with White student success, particularly for new students; this suggests that additional data collection efforts have the potential to mitigate bias.


Ashley Edwards, Justin C. Ortagus, Jonathan Smith, Andria Smythe.

Using data from nearly 1.2 million Black SAT takers, we estimate the impacts of initially enrolling in a Historically Black College or University (HBCU) on educational, economic, and financial outcomes. We control for the college application portfolio and compare students with similar portfolios and levels of interest in HBCUs and non-HBCUs who ultimately make divergent enrollment decisions, often enrolling in a four-year HBCU in lieu of a two-year college or no college. We find that students initially enrolling in HBCUs are 14.6 percentage points more likely to earn a BA degree and have 5 percent higher household income around age 30 than those who do not enroll in an HBCU. Initially enrolling in an HBCU also leads to $12,000 more in outstanding student loans around age 30. We find that some of these results are driven by an increased likelihood of completing a degree from relatively broad-access HBCUs and also relatively high-earning majors (e.g., STEM). We also explore new outcomes, such as credit scores, mortgages, bankruptcy, and neighborhood characteristics around age 30.


Brendan Bartanen, Aliza N. Husain, David D. Liebowitz.

School principals are viewed as critical actors to improve student outcomes, but there remain important methodological questions about how to measure principals’ effects. We propose a framework for measuring principals’ contributions to student outcomes and apply it empirically using data from Tennessee, New York City, and Oregon. As commonly implemented, value-added models misattribute to principals changes in student performance caused by unobserved time-varying factors over which principals exert minimal control, leading to biased estimates of individual principals’ effectiveness and an overstatement of the magnitude of principal effects. Based on our framework, which better accounts for bias from time-varying factors, we find that little of the variation in student test scores or attendance is explained by persistent effectiveness differences between principals. Across contexts, the estimated standard deviation of principal value-added is roughly 0.03 student-level standard deviations in math achievement and 0.01 standard deviations in reading.


Dorottya Demszky, Jing Liu, Heather C. Hill, Shyamoli Sanghi, Ariel Chung.

While recent studies have demonstrated the potential of automated feedback to enhance teacher instruction in virtual settings, its efficacy in traditional classrooms remains unexplored. In collaboration with TeachFX, we conducted a pre-registered randomized controlled trial involving 523 Utah mathematics and science teachers to assess the impact of automated feedback in K-12 classrooms. This feedback targeted “focusing questions” – questions that probe students’ thinking by pressing for explanations and reflection. Our findings indicate that automated feedback increased teachers’ use of focusing questions by 20%. However, there was no discernible effect on other teaching practices. Qualitative interviews revealed mixed engagement with the automated feedback: some teachers noticed and appreciated the reflective insights from the feedback, while others had no knowledge of it. Teachers also expressed skepticism about the accuracy of feedback, concerns about data security, and/or noted that time constraints prevented their engagement with the feedback. Our findings highlight avenues for future work, including integrating this feedback into existing professional development activities to maximize its effect.


Jing Liu, Megan Kuhfeld, Monica Lee.

Noncognitive constructs such as self-efficacy, social awareness, and academic engagement are widely acknowledged as critical components of human capital, but systematic data collection on such skills in school systems is complicated by conceptual ambiguities, measurement challenges and resource constraints. This study addresses this issue by comparing the predictive validity of the two most widely used metrics on noncognitive outcomes, namely observable academic behaviors (e.g., absenteeism, suspensions) and student self-reported social and emotional learning (SEL) skills, for the likelihood of high school graduation and postsecondary attainment. Our findings suggest that conditional on student demographics and achievement, academic behaviors are several-fold more predictive than SEL skills for all long-run outcomes, and adding SEL skills to a model with academic behaviors improves the model's predictive power minimally. In addition, academic behaviors are particularly strong predictors for low-achieving students' long-run outcomes. Part-day absenteeism (as a result of class skipping) is the largest driver behind the strong predictive power of academic behaviors. Developing more nuanced behavioral measures in existing administrative data systems might be a fruitful strategy for schools whose intended goal centers on predicting students' educational attainment.


Joshua B. Gilbert, James S. Kim, Luke W. Miratrix.

Longitudinal models of individual growth typically emphasize between-person predictors of change but ignore how growth may vary within persons because each person contributes only one point at each time to the model. In contrast, modeling growth with multi-item assessments allows evaluation of how relative item performance may shift over time. While traditionally viewed as a nuisance under the label of “item parameter drift” (IPD) in the Item Response Theory literature, we argue that IPD may be of substantive interest if it reflects how learning manifests on different items at different rates. In this study, we present a novel application of the Explanatory Item Response Model (EIRM) to assess IPD in a causal inference context. Simulation results show that when IPD is not accounted for, both parameter estimates and their standard errors can be affected. We illustrate with an empirical application to the persistence of transfer effects from a content literacy intervention on vocabulary knowledge, revealing how researchers can leverage IPD to achieve a more fine-grained understanding of how vocabulary learning develops over time.


Paul T. von Hippel.

Longitudinal studies can produce biased estimates of learning if children miss tests. In an application to summer learning, we illustrate how missing test scores can create an illusion of large summer learning gaps when true gaps are close to zero. We demonstrate two methods that reduce bias by exploiting the correlations between missing and observed scores on tests taken by the same child at different times. One method, multiple imputation, uses those correlations to fill in missing scores with plausible imputed scores. The other method models the correlations implicitly, using child-level random effects. Widespread adoption of these methods would improve the validity of summer learning studies and other longitudinal research in education.
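
As a rough illustration of the idea in this abstract (not the author's analysis), the following sketch simulates children with a fall and a spring score, makes the spring score more likely to be missing for children with low fall scores, and compares a complete-case mean with a simplified imputation that uses the fall-spring correlation. All names, sample sizes, and missingness rates here are hypothetical, and the imputation step is deliberately simplified: a full multiple imputation would also draw the regression parameters for each imputed data set.

```python
import random
import statistics


def simulate_scores(n=20000, seed=0):
    """Simulate (fall, spring, observed) rows; the true average spring score is 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)
        fall = ability + rng.gauss(0.0, 0.5)
        spring = ability + rng.gauss(0.0, 0.5)
        # Assumption: children with low fall scores miss the spring test more often.
        p_missing = 0.6 if fall < 0 else 0.1
        observed = rng.random() >= p_missing
        rows.append((fall, spring, observed))
    return rows


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def imputed_mean(rows, n_imputations=5, seed=1):
    """Fill missing spring scores from fall scores plus noise, then average."""
    rng = random.Random(seed)
    xs = [fall for fall, _, obs in rows if obs]
    ys = [spring for _, spring, obs in rows if obs]
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = statistics.pstdev(residuals)
    means = []
    for _ in range(n_imputations):
        filled = [
            spring if obs else intercept + slope * fall + rng.gauss(0.0, resid_sd)
            for fall, spring, obs in rows
        ]
        means.append(statistics.fmean(filled))
    # Pool the point estimates by averaging across imputed data sets.
    return statistics.fmean(means)


rows = simulate_scores()
true_mean = statistics.fmean([spring for _, spring, _ in rows])
complete_case_mean = statistics.fmean([spring for _, spring, obs in rows if obs])
mi_mean = imputed_mean(rows)

print(f"true mean of spring scores:   {true_mean:.3f}")
print(f"complete-case mean (biased):  {complete_case_mean:.3f}")
print(f"imputation-based mean:        {mi_mean:.3f}")
```

In this toy setup the complete-case mean overstates the spring average because low scorers are under-represented, while the imputation-based mean lands close to the full-sample value. The correction works here because missingness depends only on the observed fall score.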


Arielle Boguslav, Julie Cohen.

Teacher preparation programs are increasingly expected to use data on pre-service teacher (PST) skills to drive program improvement and provide targeted supports. Observational ratings are especially vital, but also prone to measurement issues. Scores may be influenced by factors unrelated to PSTs’ instructional skills, including rater standards and mentor teachers’ skills. Yet we know little about how these measurement challenges play out in the PST context. Here we investigate the reliability and sensitivity of two observational measures. We find measures collected during student teaching are especially prone to measurement issues; only 3-4% of variation in scores reflects consistent differences between PSTs, while 9-17% of variation can be attributed to the mentors with whom they work. When high scores stem not from strong instructional skills, but instead from external circumstances, we cannot use them to make consequential decisions about PSTs’ individual needs or readiness for independent teaching.


Kirsten Slungaard Mumma.

The recent spike in book challenges has put school libraries at the center of heated political debates. I investigate the relationship between local politics and school library collections using data on books with controversial content in 6,631 public school libraries. Libraries in conservative areas have fewer titles with LGBTQ+, race/racism, or abortion content and more Christian fiction and discontinued Dr. Seuss titles. This is true even though most libraries have at least some controversial content. I also find that state laws that restrict curricular content are negatively related to some kinds of controversial books. Finally, I present descriptive short-term evidence that book challenges in the 2021-22 school year have had “chilling effects” on the acquisition of new LGBTQ+ titles.


Zachary Himmelsbach, Heather C. Hill, Jing Liu, Dorottya Demszky.

This study provides the first large-scale quantitative exploration of mathematical language use in U.S. classrooms. Our approach employs natural language processing techniques to describe variation in the use of mathematical language in 1,657 fourth and fifth grade lessons by teachers and students in 317 classrooms in four districts over three years. Students’ exposure to mathematical language varies substantially across lessons and between teachers. Students whose teachers use more mathematical language are more likely to use it themselves, and they perform better on standardized tests. These findings suggest that teachers play a substantial role in students’ mathematical language use.
