When analyzing treatment effects on test score data, education researchers face many choices for scoring tests and modeling results. This study examines the impact of those choices through Monte Carlo simulation and an empirical application. Results show that estimates vary across analytic methods applied to the same data: as predicted by Classical Test Theory, two-step models using sum or IRT-based scores yield downwardly biased standardized treatment effect coefficients relative to latent variable models. This bias dominates any other differences between models or features of the data generating process, such as the variability of item discrimination parameters. An errors-in-variables (EIV) correction successfully removes the bias from two-step models. Models do not differ substantially in precision, standard error calibration, false positive rates, or statistical power. An empirical application to data from a randomized controlled trial of a second-grade literacy intervention demonstrates the sensitivity of results to model selection and the tradeoffs between model choice and interpretation. This study shows that the psychometric principles most consequential for causal inference concern attenuation bias rather than optimal scoring weights.
Keywords: causal inference, latent variable models, item response theory, psychometrics, educational measurement
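The attenuation mechanism the abstract describes can be illustrated with a minimal Monte Carlo sketch (not from the paper; the effect size, reliability, and scoring setup below are assumed for illustration). When a treatment effect on a latent trait is standardized against observed scores that contain measurement error, the estimate shrinks by roughly the square root of the score reliability; dividing by that factor is the classic EIV-style disattenuation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
d_true = 0.30          # assumed standardized treatment effect on the latent trait
reliability = 0.64     # assumed share of observed-score variance due to the true score

treat = rng.integers(0, 2, n)
theta = rng.normal(0.0, 1.0, n) + d_true * treat   # latent trait, shifted by treatment

# Observed score = latent trait + classical measurement error, with error
# variance chosen so the score has the assumed reliability.
error_sd = np.sqrt(1.0 / reliability - 1.0)
observed = theta + rng.normal(0.0, error_sd, n)

# Naive standardized effect from observed scores (a stand-in for a two-step model):
# scaled by the control-group SD, it is attenuated toward zero.
control_sd = observed[treat == 0].std(ddof=1)
d_naive = (observed[treat == 1].mean() - observed[treat == 0].mean()) / control_sd

# EIV-style correction: disattenuate by the square root of the reliability.
d_corrected = d_naive / np.sqrt(reliability)

print(round(d_naive, 3), round(d_corrected, 3))
```

With reliability 0.64, the naive estimate converges to about 0.30 × √0.64 = 0.24, while the corrected estimate recovers the latent-scale effect of 0.30, mirroring the bias pattern and its removal described above.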