- Yifeng Song
Data science applications are increasingly entwined in students’ educational experiences. One prominent application of data science in education is to predict students’ risk of failing a course or dropping out of college. There is growing interest among higher education researchers and administrators in whether learning management system (LMS) data, which capture very detailed information on students’ engagement in and performance on course activities, can improve model performance. We systematically evaluate whether incorporating LMS data into course performance prediction models improves model performance. We conduct this analysis within an entire state community college system. Among students with prior academic history in college, administrative data-only models substantially outperform LMS data-only models and are quite accurate at predicting whether students will struggle in a course. Among first-time students, LMS data-only models outperform administrative data-only models. We achieve the highest performance for first-time students with models that include data from both sources. We also show that models achieve similar performance with a small and judiciously selected set of predictors, and that models trained on system-wide data perform similarly to models trained on individual courses.
Colleges have increasingly turned to predictive analytics to target at-risk students for additional support. Most predictive analytics applications in higher education are proprietary, with private companies offering little transparency about their underlying models. We address this lack of transparency by systematically comparing two important dimensions: (1) different approaches to sample and variable construction and how these affect model accuracy; and (2) how the selection of predictive modeling approaches, ranging from methods many institutional researchers would be familiar with to more complex machine learning methods, impacts model performance and the stability of predicted scores. The relative ranking of students’ predicted probability of completing college varies substantially across modeling approaches. While we observe substantial gains in performance from models trained on a sample structured to represent the typical enrollment spells of students and with a robust set of predictors, we observe similar performance between the simplest and most complex models.