Search EdWorkingPapers


Methodology, measurement and data

Michael Gilraine, Jeffrey Penney.

An administrative rule allowed students who failed an exam to retake it shortly after, triggering strong 'teach to the test' incentives to raise these students' test scores for the retake. We develop a model that accounts for truncation and find that these students score 0.14 standard deviations higher on the retest. Using a regression discontinuity design, we estimate that thirty percent of these gains persist to the following year. These results provide evidence that test-focused instruction, or 'cramming', raises contemporaneous performance, but a large portion of these gains fades out. Our findings highlight that persistence should be accounted for when comparing educational interventions.
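
For context on the empirical design, the snippet below sketches a bare-bones local-linear regression discontinuity estimator of the kind such persistence estimates build on. It is a hypothetical illustration (invented variable names, fixed bandwidth, no truncation adjustment), not the authors' actual specification.

```python
# Minimal sketch of a local-linear regression discontinuity estimate of the
# kind described above. Variable names are invented for illustration.
import numpy as np
import statsmodels.api as sm

def rd_estimate(running, outcome, cutoff=0.0, bandwidth=0.5):
    """Estimate the jump in `outcome` at `cutoff` of the running variable.

    running : first-attempt exam score, centered so the fail/pass cutoff is 0
    outcome : e.g. the following year's test score (to gauge persistence)
    """
    r = np.asarray(running, dtype=float) - cutoff
    y = np.asarray(outcome, dtype=float)
    keep = np.abs(r) <= bandwidth
    r, y = r[keep], y[keep]
    treated = (r < 0).astype(float)  # failed the first attempt, hence retook the exam
    X = sm.add_constant(np.column_stack([treated, r, treated * r]))
    fit = sm.OLS(y, X).fit(cov_type="HC1")
    return fit.params[1], fit.bse[1]  # discontinuity estimate and its standard error
```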



Ishtiaque Fazlul, Todd R. Jones, Jonathan Smith.

Millions of high school students who take an Advanced Placement (AP) course in one of over 30 subjects can earn college credit by performing well on the corresponding AP exam. Using data from four metro-Atlanta public school districts, we find that 15 percent of students' AP courses do not result in an AP exam. We predict that up to 32 percent of the AP courses that do not result in an AP exam would have earned a score of 3 or higher, which generally commands college credit at colleges and universities across the United States. Next, we examine disparities in AP exam-taking rates by demographics and course-taking patterns. Most immediately relevant to policy, we find evidence consistent with a positive impact of school district exam subsidies on AP exam-taking rates. In fact, students on free and reduced-price lunch (FRL) in districts that provide a higher subsidy to FRL students than to non-FRL students are more likely to take an AP exam than their non-FRL counterparts, after controlling for demographic and academic covariates.



Kelli A. Bird, Benjamin L. Castleman, Zachary Mabel, Yifeng Song.

Colleges have increasingly turned to predictive analytics to target at-risk students for additional support. Most predictive analytics applications in higher education are proprietary, with private companies offering little transparency about their underlying models. We address this lack of transparency by systematically examining two important dimensions: (1) how different approaches to sample and variable construction affect model accuracy; and (2) how the choice of predictive modeling approach, ranging from methods many institutional researchers would be familiar with to more complex machine learning methods, affects model performance and the stability of predicted scores. The relative ranking of students' predicted probability of completing college varies substantially across modeling approaches. While we observe substantial gains in performance from models trained on a sample structured to represent students' typical enrollment spells and with a robust set of predictors, we observe similar performance between the simplest and most complex models.
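
As a rough illustration of the second dimension, the toy sketch below fits a simple and a more complex classifier on synthetic data and compares both their accuracy and how similarly they rank predicted probabilities. The data, features, and models are stand-ins, not those used in the paper.

```python
# Toy comparison on synthetic data: a simple model vs. a more complex one,
# judged on accuracy (AUC) and on how similarly they rank predicted
# completion probabilities (rank stability).
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

simple = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
complex_ = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

p_simple = simple.predict_proba(X_te)[:, 1]
p_complex = complex_.predict_proba(X_te)[:, 1]

print("AUC, simple :", roc_auc_score(y_te, p_simple))
print("AUC, complex:", roc_auc_score(y_te, p_complex))
print("Rank stability (Spearman):", spearmanr(p_simple, p_complex).correlation)
```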



David M. Houston, Michael B. Henderson, Paul E. Peterson, Martin R. West.

Do Americans hold a consistent set of opinions about their public schools and how to improve them? From 2013 to 2018, over 5,000 unique respondents participated in more than one consecutive iteration of the annual, nationally representative Education Next poll, offering an opportunity to examine individual-level attitude stability on education policy issues over a six-year period. The proportion of participants who provide the same response to the same question over multiple consecutive years greatly exceeds what would be expected by chance alone. We also find that teachers offer more consistent responses than their non-teaching peers. By contrast, we do not observe similar differences in attitude stability between parents of school-age children and their counterparts without children.



Peter Q. Blair, Papia Debroy, Justin Heck.

Over the past four decades, income inequality grew significantly between workers with bachelor's degrees and those with high school diplomas (often called "unskilled"). We argue that, rather than being unskilled, these workers are STARs because they are skilled through alternative routes, namely their work experience. Using the skill requirements of a worker's current job as a proxy for their actual skill, we find that although both groups of workers transition to occupations requiring skills similar to those of their previous occupations, workers with bachelor's degrees have dramatically better access to higher-wage occupations whose skill requirements exceed the workers' observed skill. This measured opportunity gap offers a fresh explanation of income inequality by degree status and reestablishes the important role of on-the-job training in human capital formation.



Dorottya Demszky, Jing Liu, Zid Mancenido, Julie Cohen, Heather C. Hill, Dan Jurafsky, Tatsunori Hashimoto.

In conversation, uptake happens when a speaker builds on the contribution of their interlocutor by, for example, acknowledging, repeating, or reformulating what they have said. In education, teachers' uptake of student contributions has been linked to higher student achievement. Yet measuring and improving teachers' uptake at scale is challenging, as existing methods require expensive annotation by experts. We propose a framework for computationally measuring uptake by (1) releasing a dataset of student-teacher exchanges extracted from US math classroom transcripts and annotated for uptake by experts; (2) formalizing uptake as pointwise Jensen-Shannon Divergence (pJSD), estimated via next utterance classification; (3) conducting a linguistically motivated comparison of different unsupervised measures; and (4) correlating these measures with educational outcomes. We find that although repetition captures a significant part of uptake, pJSD outperforms repetition-based baselines, as it is capable of identifying a wider range of uptake phenomena, such as question answering and reformulation. We apply our uptake measure to three different educational datasets with outcome indicators. Unlike the baseline measures, pJSD correlates significantly with instruction quality in all three, providing evidence for its generalizability and for its potential to serve as an automated professional development tool for teachers.
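
For readers unfamiliar with the divergence the measure is built on, the snippet below computes a plain Jensen-Shannon divergence between two discrete distributions. This is only an illustration of the underlying quantity; the paper's pointwise variant is estimated from a next-utterance classifier, which is not reproduced here.

```python
# Illustration only: the Jensen-Shannon divergence underlying the uptake
# measure. The paper estimates a *pointwise* variant via next utterance
# classification; this snippet just shows the basic divergence computation.
import numpy as np

def jensen_shannon_divergence(p, q, eps=1e-12):
    """JSD between two discrete distributions p and q (base 2, so in [0, 1])."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Example: two utterance-level word distributions over a shared vocabulary.
print(jensen_shannon_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
```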



Daniel Kreisman, Jonathan Smith, Bondi Arifin.

How do college non-completers list schooling on their resumes? The negative signal of not completing might outweigh the positive signal of attending but not persisting. If so, job-seekers might hide non-completed schooling on their resumes. To test this, we match resumes from an online jobs board to administrative educational records. We find that fully one in three job-seekers who attended college but did not earn a degree omit their only post-secondary schooling from their resumes. We further show that these are not casual omissions but strategic decisions systematically related to schooling characteristics, such as selectivity and years of enrollment. We also find evidence of lying and show which degrees listed on resumes are most likely untrue. Lastly, we discuss implications. Not only does this imply that a commonly held assumption, that employers perfectly observe schooling, does not hold; it also means we can learn which college experiences students believe are most valued by employers.



Sophie Litschwartz, Luke W. Miratrix.

In multisite experiments, we can quantify treatment effect variation with the cross-site treatment effect variance. However, there is no standard method for estimating cross-site treatment effect variance in multisite regression discontinuity designs (RDDs). This research fills this gap in the literature by systematically exploring and evaluating methods for estimating the cross-site treatment effect variance in multisite RDDs. Specifically, we formalize a fixed intercepts/random coefficients (FIRC) RDD model and develop a random effects meta-analysis (Meta) RDD model for estimating cross-site treatment effect variance. We find that a restricted FIRC model works best when the running variable's relationship to the outcome is stable across sites but can be biased otherwise. In those instances, we recommend using either the unrestricted FIRC model or the meta-analysis model: the unrestricted FIRC model generally performs better when the average number of in-bandwidth observations is less than 120, and the meta-analysis model performs better when it is above 120. We apply our models to a high school exit exam policy in Massachusetts that required students who passed the exit exam but were still deemed nonproficient to complete an "Education Proficiency Plan" (EPP). We find that the EPP policy had a positive local average treatment effect, on average across sites, on whether students completed a math course in their senior year, but that the impact varied enough that a third of schools could have had a negative impact.
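
As a rough illustration of the meta-analytic route (not the authors' exact model), the sketch below combines hypothetical site-level RD estimates and standard errors with a DerSimonian-Laird random-effects estimator, whose tau-squared term plays the role of the cross-site treatment effect variance.

```python
# Hedged sketch: combine site-by-site RD estimates with a DerSimonian-Laird
# random-effects estimator; tau^2 is the cross-site treatment effect variance.
import numpy as np

def dersimonian_laird(effects, std_errors):
    effects = np.asarray(effects, dtype=float)
    var = np.asarray(std_errors, dtype=float) ** 2
    w = 1.0 / var                                   # fixed-effect weights
    theta_fe = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - theta_fe) ** 2)       # Cochran's Q
    df = len(effects) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                   # cross-site effect variance
    w_re = 1.0 / (var + tau2)
    theta_re = np.sum(w_re * effects) / np.sum(w_re)
    return theta_re, tau2

# `effects` and `std_errors` would come from within-site RD fits.
print(dersimonian_laird([0.10, 0.25, -0.05, 0.30], [0.08, 0.10, 0.09, 0.12]))
```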



Susana Claro, Valentina Paredes, Verónica Cabezas, Gabriel Cruz.

Growing evidence shows that a student's growth mindset (the belief that intelligence is malleable) can benefit their academic achievement. However, due to limited information, little is known about how a teacher's growth mindset affects their students' academic achievement. In this paper, we study the impact of teacher growth mindset on academic achievement for a nationwide sample of 8th and 10th grade students in Chile in 2017. Using a student fixed effect model that exploits data from two subject teachers for each student, we find that being assigned to a teacher with a growth mindset increases standardized test scores by approximately 0.02 standard deviations, with larger effects for students with high GPAs and particularly for students in low socioeconomic status schools.
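
The within-student design lends itself to a simple differencing illustration: with exactly two subject teachers per student, the student fixed effect drops out of the difference across subjects. The sketch below uses hypothetical column names and is not the authors' estimation code.

```python
# Illustrative sketch (invented column names): difference the two subjects
# within each student to remove the student fixed effect, then regress the
# score difference on the difference in teacher growth mindset.
import statsmodels.api as sm

def within_student_estimate(df):
    """df: one row per student x subject, with columns
    ['student_id', 'score', 'teacher_growth_mindset'] (hypothetical names)."""
    wide = df.groupby("student_id").agg(list)  # two entries per student
    d_score = wide["score"].apply(lambda s: s[0] - s[1])
    d_mindset = wide["teacher_growth_mindset"].apply(lambda m: m[0] - m[1])
    X = sm.add_constant(d_mindset.astype(float))
    return sm.OLS(d_score.astype(float), X).fit(cov_type="HC1")
```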



Lina Anaya, Nagore Iriberri, Pedro Rey-Biel, Gema Zamarro.

Standardized assessments are widely used to determine access to educational resources, with important consequences for later economic outcomes in life. However, many design features of the tests themselves may trigger psychological reactions that influence performance. In particular, the difficulty of the earlier questions in a test may affect performance on later questions. How should we order test questions by level of difficulty so that test performance offers an accurate assessment of the test taker's aptitudes and knowledge? We conduct a field experiment with about 19,000 participants in collaboration with an online teaching platform, randomly assigning participants to different orders of difficulty, and find that ordering the questions from easiest to most difficult yields the lowest probability of abandoning the test, as well as the highest number of correct answers. Consistent results emerge when we exploit the random variation of difficulty across test booklets in the Programme for International Student Assessment (PISA), a triennial international test, for the years 2009, 2012, and 2015, providing additional external validity. We conclude that the order of difficulty of the questions in a test should be considered carefully, particularly when comparing performance between test-takers who faced different question orders.
