In the Money: Predicting the Outcomes of Horse Races with RaceQuant
RaceQuant is a startup specializing in consulting for horse race betting. RaceQuant enlisted our team to use machine learning to predict the outcomes of horse races more accurately and so inform betting strategy. They provided three years' worth of Hong Kong Jockey Club (HKJC) horse racing data (2015-2017) from the tracks in Sha Tin and Happy Valley, including public data from the HKJC website and an enhanced dataset with 35 additional variables.
The payout for horse race bets is based on the amount of money bet on each horse, after the HKJC has taken an 18% commission. So, in order to turn a profit, your bets must beat this 18% hurdle. Our approach was to model the probability of each horse winning a given race, compare it to the market-implied probability, and recommend bets on horses whose modeled chances exceeded the market's. We discuss this in detail below.
The raw data contained over 29,000 observations, covering 2,384 HKJC races between 2015 and 2017. Before this data could be fed into models, some transformation was necessary. The data cleaning and processing was done using the Pandas library in Python.
Handling Missing Data
Some features did not have data for all horses. Where reasonable, missing values were imputed: for example, horses with only one race did not yet have an `avg_horse_body_weight`, so it was imputed as `horse_body_weight` (the horse's current body weight). In other cases, features with missing data were replaced altogether by new features: for example, `avg_winning_body_weight` was dropped in favor of the new features `prev_won` (whether the horse has previously won a race) and `wtdf_prev_avg_win_wt` (the difference between the horse's weight in the current race and its average winning weight).
We created several new features to make the data more manageable and better capture important information about the horses’ previous performance:
- `prev_won`: whether the horse has won a previous race.
- `previously_raced`: whether the horse has raced before (win or lose).
- `wtdf_prev_race`: the weight difference between this race and the previous one (0 if the horse has no previous race).
- `wtdf_prev_avg_win_wt`: the weight difference between this race and the horse's average winning body weight (0 if the horse has no previous wins).
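The imputation and feature engineering above can be sketched with Pandas. The function and the raw column names (`num_prev_races`, `num_prev_wins`, `prev_race_body_weight`) are illustrative assumptions, not the actual names in the HKJC data:

```python
import numpy as np
import pandas as pd

def add_history_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning steps described above. Assumes illustrative raw
    columns: horse_body_weight, avg_horse_body_weight, prev_race_body_weight,
    avg_winning_body_weight, num_prev_races, num_prev_wins."""
    df = df.copy()
    # First-time racers have no average body weight yet; fall back to current weight.
    df["avg_horse_body_weight"] = df["avg_horse_body_weight"].fillna(
        df["horse_body_weight"]
    )
    # Indicator features for prior history.
    df["previously_raced"] = (df["num_prev_races"] > 0).astype(int)
    df["prev_won"] = (df["num_prev_wins"] > 0).astype(int)
    # Weight difference vs. the previous race; 0 when there is no previous race.
    df["wtdf_prev_race"] = np.where(
        df["previously_raced"] == 1,
        df["horse_body_weight"] - df["prev_race_body_weight"],
        0.0,
    )
    # Weight difference vs. the average winning weight; 0 when the horse has never won.
    df["wtdf_prev_avg_win_wt"] = np.where(
        df["prev_won"] == 1,
        df["horse_body_weight"] - df["avg_winning_body_weight"],
        0.0,
    )
    return df
```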
Because of the nature of horse races (many discrete races with 7-14 horses each), it is difficult to build a model that predicts finishing rank outright. Furthermore, many betting strategies rely on predicting the probability of a given horse winning a race and comparing it to the market-implied probability to decide what to bet. Consequently, our approach was to build models that predict horse run times and use them to simulate each race many times, allowing us to extract a win probability for each horse in each race. Breaking it down:
- Predict the mean run time for a given horse under a given set of conditions.
- Predict the variance in run time for this horse/conditions. Combined with the mean prediction, this gives us a distribution of possible times for this horse under these conditions.
- Repeat for each horse in the race.
- Using the predicted time distributions for each horse in the race, simulate 100,000 races. Treat the fraction of simulated races won by each horse as its probability of winning the actual race (e.g., if horse A wins 20,000 of the 100,000 simulated races, its predicted probability of winning is 0.2).
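The simulation step can be sketched as follows. The function name is ours, and sampling run times from a normal distribution is an assumption; the post specifies only a predicted mean and variance per horse:

```python
import numpy as np

def simulate_win_probs(mean_times, std_times, n_sims=100_000, seed=0):
    """Monte Carlo race simulation: sample a finish time for every horse from
    a normal distribution with its predicted mean and standard deviation
    (illustrative assumption), then treat each horse's share of simulated
    wins as its probability of winning the actual race."""
    rng = np.random.default_rng(seed)
    mean_times = np.asarray(mean_times, dtype=float)
    std_times = np.asarray(std_times, dtype=float)
    # times[i, j] = simulated run time of horse j in simulation i
    times = rng.normal(mean_times, std_times, size=(n_sims, len(mean_times)))
    winners = times.argmin(axis=1)  # the fastest horse wins each simulated race
    counts = np.bincount(winners, minlength=len(mean_times))
    return counts / n_sims
```

The returned probabilities sum to 1 across the field, which lets them be compared directly against market-implied probabilities.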
We used several types of models with this approach, including:
| Model | Implementation |
|---|---|
| Linear Regression | scikit-learn `LinearRegression` |
| Ridge and Lasso Regression | scikit-learn `LassoCV`, `ElasticNetCV`, `RidgeCV` |
| K-Nearest Neighbors (KNN) | scikit-learn `KNeighborsRegressor` |
| Random Forest | scikit-learn `RandomForestRegressor` |
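As a sketch of fitting one of these models to predict run times, the snippet below trains a random forest on synthetic data (the features, target, and hyperparameters are illustrative, not the project's actual configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features (X) and observed run
# times in seconds (y); the real project trained on the HKJC features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
y = 70 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Held-out error of the predicted run times.
mae = mean_absolute_error(y_test, model.predict(X_test))
```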
We took several approaches to measuring the success of our model:
- Measuring the percent of correctly predicted first and second place horses for our test set of races.
- Measuring the return on investment assuming a flat bet of $10 on only the winner of each race.
- Measuring the return on investment using a betting strategy derived from the Kelly Criterion.
Predicting Winners and ROI For $10 Bets
The first measure of success that we used was the number of races for which we correctly predicted the winner. Out of 805 races in our test set, our best model correctly predicted the first place horse in 17.52% of races (141 races). It correctly predicted the second place horse in 12.80% of races (103 races). These results are stronger than betting randomly, which would be expected to pick the winner in roughly 9% of races.
We then applied a flat betting strategy to these predictions, betting $10 on each horse we predicted to win. The return on investment for this strategy was -1.13% (a loss of $91 on an investment of $8,050). Note that the payout on each race depends on how much the betting population bet on each horse (a proxy for each horse's perceived chance of winning). The more bets placed on an individual horse, the lower the payout, so there is less reward for picking horses that are widely perceived as likely to win.
Return For Kelly Betting Strategy
The Kelly Criterion is a formula used to optimize betting strategy. The Kelly Criterion weighs the payout for each bet against the probability of winning and recommends what fraction of your bankroll to wager. It can be described with the following equation:
f* = (bp - q) / b

where:
- f* is the fraction of the current bankroll to wager.
- b is the net odds received for the wager (the return is "b to 1": on a bet of $1 you would receive $b in addition to getting the $1 back).
- p is the probability of winning.
- q is the probability of losing (q = 1 - p).
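A minimal sketch of this formula (the function name and the example odds are ours):

```python
def kelly_fraction(b: float, p: float) -> float:
    """Kelly Criterion: fraction of bankroll to wager, given net odds b
    ("b to 1") and win probability p. A negative result means the bet has
    negative expected value and should be skipped."""
    q = 1.0 - p  # probability of losing
    return (b * p - q) / b

# e.g. a modeled win probability of 0.25 against 5-to-1 net odds:
f = kelly_fraction(b=5.0, p=0.25)  # (5 * 0.25 - 0.75) / 5 = 0.10
```

In practice a betting strategy would wager only when the fraction is positive, i.e. when the modeled probability exceeds the market-implied break-even probability 1 / (b + 1).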
Using this approach, our return on investment was -100%: our bankroll hit $0 after 69 races.
Areas for Improvement
There are several ways this project could be improved:
- Exploring additional modeling techniques, particularly for modeling the error in race time.
- Employing a multinomial conditional logistic regression model (see below for more detail).
- Considering more conservative betting strategies.
- Adding more historical data to train the model.
- Additional feature engineering.
Multinomial Conditional Logistic Regression
The Multinomial Conditional Logistic Regression model (MCLR) is an alternative to our approach. Instead of modeling run times and subsequently assessing the error, MCLR would directly provide the probability that each horse in a given race finishes in 1st place, which is precisely our target. While conceptually appealing, we hesitated to implement it because of its complexity for our specific use case.
About the Team
This was completed as a capstone project at NYC Data Science Academy. The members of this team are Kevin Cannon, Howard Chang, Julie Levine, Lavanya Gupta, and Max Schoenfeld. Thanks to RaceQuant for working with us!