Runs like the Zürcher Silvesterlauf, the Escalade in Geneva, and the Lausanne Marathon attract 20,000 to 40,000 runners every year who compete against themselves, and especially against time.
This challenge aims to develop an individual betting system for long-distance races. To realistically estimate the time each runner aims to beat, Datasport needs an algorithm that uses data from previous events to predict finishing times in the current race.
Have you ever run a half-marathon or marathon or cheered someone on at the finish line? If so, you've definitely come to the right place for this challenge!
Datasport has been in the timekeeping business since 1983. Through innovative ideas, speed and professionalism, it has grown into the leading IT service provider for popular and mass sports events in Switzerland and abroad. Its products are used in running, cycling, multisport, winter sports, walking and serial running, and other events.
The aim was to develop a model that predicts a runner’s time from data collected on previous runners between 2016 and 2021.
The data analysis used several variables about runners, including age, nationality, gender, finishing times, and type of run.
Hackday participants used various approaches to develop the most accurate model possible based on the data. They applied clustering and machine-learning algorithms as well as linear models to structure the data and predict a runner’s finishing time. They then tested the general models based on the entire dataset and the individual models for each runner.
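One way to picture an individual model is a least-squares line fitted separately for each runner, relating race distance to finishing time. The sketch below is illustrative only: using distance as the sole predictor, the function names, and the sample history are all assumptions, not the participants' actual models.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_individual_models(results):
    """Fit one line per runner.

    results: iterable of (runner_id, distance_km, finish_minutes).
    Runners with fewer than two distinct distances are skipped.
    """
    by_runner = defaultdict(list)
    for runner_id, distance_km, minutes in results:
        by_runner[runner_id].append((distance_km, minutes))

    models = {}
    for runner_id, rows in by_runner.items():
        xs, ys = zip(*rows)
        if len(set(xs)) >= 2:
            models[runner_id] = fit_line(xs, ys)
    return models


def predict(models, runner_id, distance_km):
    """Predict finishing time in minutes for a runner with a fitted model."""
    intercept, slope = models[runner_id]
    return intercept + slope * distance_km


# Made-up race history, for illustration only.
history = [
    ("r1", 5.0, 25.0),
    ("r1", 10.0, 52.0),
    ("r1", 21.1, 115.0),
    ("r2", 10.0, 45.0),  # only one race: no individual model is fitted
]
models = fit_individual_models(history)
print(predict(models, "r1", 10.0))
```

A runner with too little history, like the hypothetical "r2" here, gets no individual model and would have to fall back on a general model.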
For the linear approaches, it turned out that data from 10 to 20 runs was needed to make an individual prediction more accurate. For the general model across all runners, a gradient boosting model or an extra trees model produced good values on the respective metrics.
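A comparison of the two general models could look roughly like the following sketch. It assumes scikit-learn, which the write-up does not name, and uses synthetic data. The features, the pace formula, and the choice of mean absolute error as the metric are assumptions made for illustration and do not reproduce the hackday results.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the race data (assumption: not the Datasport dataset).
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 70, size=n)
is_female = rng.integers(0, 2, size=n)
distance_km = rng.choice([5.0, 10.0, 21.1, 42.2], size=n)

# Made-up pace model in minutes per kilometre, plus noise.
pace = (4.5 + 0.03 * (age - 18) + 0.4 * is_female
        + 0.02 * distance_km + rng.normal(0.0, 0.3, size=n))
finish_min = pace * distance_km  # dependent variable

X = np.column_stack([age, is_female, distance_km])
X_train, X_test, y_train, y_test = train_test_split(
    X, finish_min, test_size=0.25, random_state=0
)

models = {
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=200, random_state=0),
}

# Fit each model on the same split and record its error on held-out runners.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in scores.items():
    print(f"{name}: MAE = {mae:.2f} min")
```

Evaluating both models on the same held-out split keeps the comparison fair. Which one comes out ahead depends on the data and the metric chosen.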
The challenge demonstrated that accurate predictions are possible only when individual values for each runner are available. Improving the quality of the predictions therefore requires additional variables: physical condition, high-level training data, standardized split times, and the topography of the runs.
A prediction web application was set up in which runners can predict their own time. In the background, the application uses the trained model to predict each runner’s finishing time.
Properly modeling the wide variance of the variables (e.g. the runner’s age and gender, or the race’s distance and topography) proved challenging. Another key part involved defining and deriving the project’s output from the dataset. Here, it was decisive to abandon the one-size-fits-all machine learning model and instead build and tweak simpler models that take the runner’s perspective into account more accurately.
Potential next steps
- Enhance existing user profiles by adding more variables for more sophisticated individual models (see the last slide in the presentation, e.g. mixed models)
- Use a web application to calculate the finishing times of runs (topography and distance)
- Tailor the statistical model for production
- Adapt Datasport’s business model for coaching
The finishing time of a race is the dependent variable. The data set includes runner characteristics such as age and gender, as well as split times, type of race, and other variables. Developing an effective model requires data collected over several years and from different races.
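The split between explanatory variables and the dependent variable can be sketched as a small record type with an encoding step. All field names, the race type categories, and the binary gender encoding below are hypothetical simplifications. The actual Datasport schema is not described here, and split times are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class RaceResult:
    # Hypothetical fields; the real dataset's columns may differ.
    runner_id: str
    year: int
    age: int
    gender: str            # "F" or "M" in this simplified sketch
    race_type: str         # e.g. "road" or "trail" (assumed categories)
    distance_km: float
    finish_seconds: float  # dependent variable


def to_features(result, race_types):
    """Return (feature_row, target) for one result.

    race_type is one-hot encoded against the given list of known types.
    """
    row = [
        float(result.age),
        1.0 if result.gender == "F" else 0.0,
        result.distance_km,
    ]
    row += [1.0 if result.race_type == t else 0.0 for t in race_types]
    return row, result.finish_seconds


# Made-up record, for illustration only.
example = RaceResult("r1", 2019, 34, "F", "trail", 10.0, 3300.0)
features, target = to_features(example, ["road", "trail"])
print(features, target)
```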