Predicting Marathon Finishing Times Using Ensemble Learning: An Empirical Study on Boston Marathon Data

VERSION OF RECORD ONLINE: 11/09/2025

Authors

Corresponding author's email:

phuongttn@hcmute.edu.vn

DOI:

https://doi.org/10.54644/jte.2025.1924

Keywords:

Ensemble Learning, Marathon Prediction, Boston Marathon, Machine learning, Performance forecasting

Abstract

This study proposes an ensemble machine learning model to predict marathon finishing times, using empirical data from the Boston Marathon for 2015–2017. After preprocessing and feature engineering (intermediate checkpoint times at 5K, 10K, and the half marathon, plus age, gender, nationality, and year of participation), six models were implemented and evaluated: K-Nearest Neighbors (KNN), Artificial Neural Network (ANN), Case-Based Reasoning (CBR), a prior benchmark model (FA-PP-R-ML), Long Short-Term Memory (LSTM), and a novel ensemble model that combines Linear Regression, Random Forest, and MLPRegressor through a meta-learning approach. On the test set, the proposed ensemble achieved the best predictive performance, with a Mean Absolute Error (MAE) of 7.32 minutes, a Root Mean Squared Error (RMSE) of 11.06 minutes, and an R² score of 0.928, outperforming all baseline models in both accuracy and robustness. Scatter plots and boxplots further confirmed close agreement between predicted and actual values. The study nevertheless acknowledges several limitations: a dataset restricted to three years of a single event, a narrow scope of model comparison, simplifications in algorithmic assumptions, and limited hyperparameter tuning. Future work should explore more diverse datasets, incorporate exogenous factors (e.g., weather, elevation), adopt advanced modeling techniques such as attention mechanisms, graph-based learning, or AutoML, and improve model interpretability to support real-world applications in athlete coaching and performance forecasting.
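For readers less familiar with the three evaluation metrics named in the abstract, the following is a minimal sketch in plain Python of how MAE, RMSE, and R² are conventionally computed. The function name regression_metrics and the sample finishing times are illustrative assumptions; they are not taken from the paper's code or from the Boston Marathon data.

```python
import math


def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    # Mean Absolute Error: average size of the error, in the target's units.
    mae = sum(abs(e) for e in errors) / n

    # Root Mean Squared Error: penalizes large errors more heavily than MAE.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R^2: share of the variance in the actual values explained by the model.
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    r2 = 1.0 - ss_res / ss_tot

    return mae, rmse, r2


# Made-up finishing times in minutes (assumed values, not real race results).
actual_minutes = [200.0, 220.0, 240.0, 260.0]
predicted_minutes = [205.0, 215.0, 250.0, 260.0]

mae, rmse, r2 = regression_metrics(actual_minutes, predicted_minutes)
print(f"MAE={mae:.2f} min, RMSE={rmse:.2f} min, R2={r2:.3f}")
# prints: MAE=5.00 min, RMSE=6.12 min, R2=0.925
```

Because RMSE squares each error before averaging, it is never smaller than MAE; a gap between the two, as in the reported 11.06 versus 7.32 minutes, suggests that a minority of runners are predicted with comparatively large errors.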


Author Biographies

Anh Khoa Mai, Ho Chi Minh City University of Technology and Education, Vietnam

Anh Khoa Mai is a fourth-year student in Information Technology, majoring in Artificial Intelligence at Ho Chi Minh City University of Technology and Education. He is currently working as an intern at FPT Software Co., Ltd., Ho Chi Minh City. This paper is his first publication, developed from the idea of his undergraduate thesis. It gives him an opportunity to study Artificial Intelligence and Deep Learning further while practicing scientific research skills in an academic environment. Research areas: machine learning, deep learning, reinforcement learning, chatbots.

Email: anhkhoamai11040307@gmail.com. ORCID: https://orcid.org/0009-0007-2204-2040

Ha Quynh Giao Nguyen, Ho Chi Minh City University of Technology and Education, Vietnam

Ha Quynh Giao Nguyen is a fourth-year student in Information Technology, majoring in Software Engineering at Ho Chi Minh City University of Technology and Education. She is currently working as an intern at FPT Software Co., Ltd., Ho Chi Minh City.

This paper is her first publication during her studies at HCMUTE, serving as an opportunity for her to practice research skills and synthesize specialized knowledge.

Research areas: Mobile Programming, Deep Learning.

Email: nguyenhaquynhgiao9569@gmail.com. ORCID: https://orcid.org/0009-0004-5643-207X

Cong Manh Hoang, Hung Yen University of Technology and Education, Vietnam

Cong Manh Hoang is currently a fourth-year student in Information Technology, majoring in Software Engineering at Ho Chi Minh City University of Technology and Education. This report is his first academic work during his studies, serving as an opportunity to practice research skills and synthesize specialized knowledge. Research areas: Mobile Programming, Web Programming.

Email: hoangmanh6889@gmail.com. ORCID: https://orcid.org/0009-0005-6456-2613

Thi Ngoc Phuong Truong, Ho Chi Minh City University of Technology and Education, Vietnam

Thi Ngoc Phuong Truong is currently a lecturer at Ho Chi Minh City University of Technology and Education. She graduated from the University of Science, Ho Chi Minh City, in 2005 and pursued a Master’s degree in Information Technology at Kookmin University, South Korea. She is now a Ph.D. candidate at the Computer Science Laboratory, University of Information Technology.

Her research interests include Computer Vision, Deep Learning, and Mobile Programming.

Email: phuongttn@hcmute.edu.vn. ORCID: https://orcid.org/0009-0003-9963-9874. Phone: +84 – 942920912.

Published

11-09-2025

How to Cite

Mai Anh Khoa, Nguyễn Hà Quỳnh Giao, Hoàng Công Mạnh, & Trương Thị Ngọc Phượng. (2025). Predicting Marathon Finishing Times Using Ensemble Learning: An Empirical Study on Boston Marathon Data. Journal of Technical Education Science. https://doi.org/10.54644/jte.2025.1924