Methodology, measurement and data
School principals are viewed as critical mechanisms by which to improve student outcomes, but there remain important methodological questions about how to measure principals' effects. We propose a framework for measuring principals' contributions to student outcomes and apply it empirically using data from Tennessee, New York City, and Oregon. We find that using contemporaneous student outcomes to assess principal performance is flawed. Value-added models misattribute to principals changes in student performance caused by factors that principals minimally control. Further, little to none of the variation in average student test scores or attendance is explained by persistent effectiveness differences between principals.
After near-universal school closures in the United States at the start of the pandemic, lawmakers and educational leaders made plans for when and how to reopen schools for the 2020-21 school year. Educational researchers quickly assessed how a range of public health, political, and demographic factors were associated with school reopening decisions and parent preferences for in-person and remote learning. I review this body of literature to highlight what we can learn from its findings, limitations, and influence on public discourse. Studies consistently highlighted the influence of partisanship, teachers’ unions, and demographics, with mixed findings on COVID-19 rates. The literature offers useful insight but requires more evidence, and it highlights benefits and limitations to rapid research with large-scale quantitative data.
Analyses that reveal how treatment effects vary allow researchers, practitioners, and policymakers to better understand the efficacy of educational interventions. In practice, however, standard statistical methods for addressing Heterogeneous Treatment Effects (HTE) fail to address the HTE that may exist within outcome measures. In this study, we present a novel application of the Explanatory Item Response Model (EIRM) for assessing what we term “item-level” HTE (IL-HTE), in which a unique treatment effect is estimated for each item in an assessment. Results from data simulation reveal that when IL-HTE are present but ignored in the model, standard errors can be underestimated and false positive rates can increase. We then apply the EIRM to assess the impact of a literacy intervention focused on promoting transfer in reading comprehension on a digital formative assessment delivered online to approximately 8,000 third-grade students. We demonstrate that allowing for IL-HTE can reveal treatment effects at the item-level masked by a null average treatment effect, and the EIRM can thus provide fine-grained information for researchers and policymakers on the potentially heterogeneous causal effects of educational interventions.
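One way to write down a model of this kind, offered as a sketch rather than the authors' exact specification, is a logistic item response model in which the treatment effect has an item-specific random component. All symbols below are illustrative assumptions: y_ij is the response of student j to item i, T_j is the treatment indicator, theta_j is student ability, b_i is item easiness, beta is the average treatment effect, and zeta_i is the deviation of item i from that average.

```latex
\operatorname{logit} \Pr(y_{ij} = 1) = \theta_j + b_i + (\beta + \zeta_i)\, T_j,
\qquad
\theta_j \sim N(0, \sigma_\theta^2), \quad
b_i \sim N(\mu_b, \sigma_b^2), \quad
\zeta_i \sim N(0, \sigma_\zeta^2)
```

Under this formulation, item-level heterogeneity corresponds to \sigma_\zeta^2 > 0, and a model that fixes \zeta_i = 0 for every item estimates only the average effect \beta. That is how a null average effect can coexist with nonzero effects on individual items.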
Given recent evidence challenging the replicability of results in the social and behavioral sciences, critical questions have been raised about appropriate measures for determining replication success in comparing effect estimates across studies. At issue is the fact that conclusions about replication success often depend on the measure used for evaluating correspondence in results. Despite the importance of choosing an appropriate measure, there is still no widespread agreement about which measures should be used. This paper addresses these questions by describing formally the most commonly used measures for assessing replication success, and by comparing their performance in different contexts according to their replication probabilities, that is, the probability of obtaining replication success given study-specific settings. The measures may be characterized broadly as conclusion-based approaches, which assess the congruence of two independent studies’ conclusions about the presence of an effect, and distance-based approaches, which test for a significant difference or equivalence of two effect estimates. We also introduce a new measure for assessing replication success called the correspondence test, which combines a difference and equivalence test in the same framework. To help researchers plan prospective replication efforts, we provide closed formulas for power calculations that can be used to determine the minimum detectable effect size (and thus, sample sizes) for each study so that a predetermined minimum replication probability can be achieved. Finally, we use a replication dataset from the Open Science Collaboration (2015) to demonstrate the extent to which conclusions about replication success depend on the correspondence measure selected.
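The idea of combining a difference test and an equivalence test can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. It assumes two independent, approximately normal effect estimates, uses a two-sided z-test for the difference and two one-sided tests (TOST) for equivalence, and treats the function name `correspondence_test`, the `tolerance` parameter, and the four outcome labels as hypothetical choices made for the example.

```python
from math import sqrt
from statistics import NormalDist


def correspondence_test(est1, se1, est2, se2, tolerance, alpha=0.05):
    """Classify the correspondence of two independent effect estimates.

    Assumes normal sampling distributions. `tolerance` is the largest
    absolute difference still treated as practically equivalent.
    """
    norm = NormalDist()
    diff = est1 - est2
    se_diff = sqrt(se1 ** 2 + se2 ** 2)  # independence of the two studies assumed

    # Difference test: two-sided z-test of H0: diff == 0.
    p_diff = 2 * (1 - norm.cdf(abs(diff / se_diff)))

    # Equivalence test (TOST): reject H0: |diff| >= tolerance only if
    # both one-sided tests reject, so the larger p-value is the one that counts.
    p_lower = 1 - norm.cdf((diff + tolerance) / se_diff)  # H0: diff <= -tolerance
    p_upper = norm.cdf((diff - tolerance) / se_diff)      # H0: diff >= +tolerance
    p_equiv = max(p_lower, p_upper)

    differ = p_diff < alpha
    equivalent = p_equiv < alpha

    if equivalent and not differ:
        return "equivalence"
    if differ and not equivalent:
        return "difference"
    if differ and equivalent:
        return "trivial difference"  # detectable but inside the tolerance
    return "indeterminate"           # too little precision to decide either way


# Two precise estimates that nearly coincide.
print(correspondence_test(0.30, 0.02, 0.31, 0.02, tolerance=0.1))  # equivalence
```

Running both tests together separates "no detectable difference because the estimates agree" from "no detectable difference because the studies are underpowered", which a difference test alone cannot distinguish.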
How scholars name different racial groups has powerful salience for understanding what researchers study. We explored how education researchers used racial terminology in recently published, high-profile, peer-reviewed studies. Our sample included all original empirical studies published in the non-review AERA journals from 2009 to 2019. We found two-thirds of articles used at least one racial category term, with an increase from about half to almost three-quarters of published studies between 2009 and 2019. Other trends include the increasing popularity of the term Black, the emergence of gender-expansive terms such as Latinx, the popularity of the term Hispanic in quantitative studies, and the paucity of studies with terms connoting missing race data or including terms describing Indigenous and multiracial peoples.
We design a commitment contract for college students, "Study More Tomorrow," and conduct a randomized control trial testing a model of its demand. The contract commits students to attend peer tutoring if their midterm grade falls below a pre-specified threshold. The contract carries a financial penalty for noncompliance, in contrast to other commitment devices for studying tested in the literature. We find demand for the contract, with take-up of 10% among students randomly assigned a contract offer. Contract demand is not higher among students randomly assigned to a lower contract price, plausibly because a lower contract price also means a lower commitment benefit of the contract. Students with the highest perceived utility for peer tutoring have greater demand for commitment, consistent with our model. Contrary to the model's predictions, we fail to find evidence of increased demand among present-biased students or among those with higher self-reported tendency to procrastinate. Our results show that college students are willing to pay for study commitment devices. The sources of this demand do not align fully with behavioral theories, however.
A significant share of education and development research uses data collected by workers called “enumerators.” It is well documented that “enumerator effects” (inconsistent practices between the individual people who administer measurement tools) can be a key source of error in survey data collection. However, it is less understood whether this is a problem for academic assessments or performance tasks. We leverage a remote phone-based mathematics assessment of primary school students and survey of their parents in Kenya. Enumerators were randomized to students to study the presence of enumerator effects. We find that both the academic assessment and the survey were prone to enumerator effects, and we use simulation to show that these effects were large enough to lead to spurious results at a troubling rate in the context of impact evaluation. We therefore recommend assessment administrators randomize enumerators at the student level and focus on training enumerators to minimize bias.
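The reasoning behind the recommendation to randomize enumerators at the student level can be illustrated with a small simulation. This sketch is not the authors' simulation: the sample sizes, the enumerator-effect standard deviation of 0.3, the naive z-test, and the function name `false_positive_rate` are all assumptions chosen for illustration. The true treatment effect is zero, so every rejection is spurious.

```python
import random
from math import sqrt
from statistics import mean, variance


def false_positive_rate(randomize_enumerators, n_sims=500, n_students=200,
                        n_enumerators=10, enumerator_sd=0.3, seed=0):
    """Share of simulated null experiments that a naive z-test rejects."""
    rng = random.Random(seed)
    half = n_students // 2
    rejections = 0
    for _ in range(n_sims):
        # Each enumerator shifts every score they record by a fixed amount.
        effects = [rng.gauss(0, enumerator_sd) for _ in range(n_enumerators)]
        treated, control = [], []
        for i in range(n_students):
            is_treated = i < half
            if randomize_enumerators:
                # Student-level randomization: any enumerator, either arm.
                e = rng.randrange(n_enumerators)
            else:
                # Confounded design: each enumerator works in only one arm.
                offset = 0 if is_treated else n_enumerators // 2
                e = offset + rng.randrange(n_enumerators // 2)
            score = effects[e] + rng.gauss(0, 1)  # no true treatment effect
            (treated if is_treated else control).append(score)
        # Naive difference-in-means test that ignores enumerators entirely.
        se = sqrt(variance(treated) / len(treated)
                  + variance(control) / len(control))
        z = (mean(treated) - mean(control)) / se
        if abs(z) > 1.96:
            rejections += 1
    return rejections / n_sims


print("enumerators tied to arms:", false_positive_rate(False))
print("enumerators randomized:  ", false_positive_rate(True))
```

When enumerators are tied to study arms, their individual tendencies are indistinguishable from a treatment effect, so the naive test rejects far more often than its nominal 5 percent rate. Randomizing enumerators across students spreads those tendencies over both arms.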
This study introduces the signal weighted teacher value-added model (SW VAM), a value-added model that weights student-level observations based on each student’s capacity to signal their assigned teacher’s quality. Specifically, the model leverages the repeated appearance of a given student to estimate student reliability and sensitivity parameters, whereas traditional VAMs represent a special case where all students exhibit identical parameters. Simulation study results indicate that SW VAMs outperform traditional VAMs at recovering true teacher quality when the assumption of student parameter invariance is met but have mixed performance under alternative assumptions of the true data generating process depending on data availability and the choice of priors. Evidence using an empirical data set suggests that SW VAM and traditional VAM results may disagree meaningfully in practice. These findings suggest that SW VAMs have promising potential to recover true teacher value-added in practical applications and, as a version of value-added models that attends to student differences, can be used to test the validity of traditional VAM assumptions in empirical contexts.
Despite policy relevance, longer-term evaluations of educational interventions are relatively rare. A common approach to this problem has been to rely on longitudinal research to determine targets for intervention by looking at the correlation between children’s early skills (e.g., preschool numeracy) and medium-term outcomes (e.g., first-grade math achievement). However, this approach has sometimes over- or under-predicted the long-term effects (e.g., 5th-grade math achievement) of successfully improving early math skills. Using a within-study comparison design, we assess various approaches to forecasting medium-term impacts of early math skill-building interventions. The most accurate forecasts were obtained when including comprehensive baseline controls and using a combination of conceptually proximal and distal short-term outcomes (in the nonexperimental longitudinal data). Researchers can use our approach to establish a set of designs and analyses to predict the impacts of their interventions up to two years post-treatment. The approach can also be applied to power analyses, model checking, and theory revisions to understand mechanisms contributing to medium-term outcomes.
This paper introduces a new measure of the labor markets served by colleges and universities across the United States. About 50 percent of recent college graduates are living and working in the metro area nearest the institution they attended, with this figure climbing to 67 percent in-state. The geographic dispersion of alumni is more than twice as great for highly selective 4-year institutions as for 2-year institutions. However, more than one-quarter of 2-year institutions disperse alumni more diversely than the average public 4-year institution. In one application of these data, we find that the average strength of the labor market to which a college sends its graduates predicts college-specific intergenerational economic mobility. In a second application, we quantify the extent of “brain drain” across areas and illustrate the importance of considering migration patterns of college graduates when estimating the social return on public investment in higher education.