
We Built an AI Horse Racing Model. Here's What Actually Happened.

A detailed, honest look at building an AI model to predict UK horse racing. What worked, what didn't, and what it tells punters about market efficiency.

18 min read · Updated 2026-04-10 · Pillar guide

Every month, someone in our inbox asks the same question: "Can AI predict horse races?" Usually they've read a breathless piece about a neural network that "found value the bookies missed" and they want to know if they should be betting on machine picks. We got tired of giving the same hedge-everything answer, so we built one. An actual model. UK National Hunt, a real database of two racing seasons, every runner's form, the weather at every course, the trainer and jockey figures — the lot. Then we tested whether it could beat bookmaker starting prices. This article is the honest story of what happened: what the model got right, what it got wrong, and what the whole exercise told us about how efficient the betting market actually is. No hype, no selling anything, no "subscribe for our AI picks" footer. Just the numbers.

What we actually built

We wanted to answer a specific, falsifiable question: can a sensible machine learning model, built from freely available UK racing data, consistently find value against bookmaker starting prices? To test that properly, we needed four things.

First, data. Not just this weekend's card — years of it. We ended up scraping public Sporting Life race pages for the complete 2023/24 and 2024/25 UK National Hunt seasons, from October through to April each year. That gave us roughly 6,000 races and 56,000 runners, with every horse's finish position, starting price, weight carried, official rating, jockey, trainer, and how it finished (won, placed, fell, pulled up). We cross-referenced every race against 200,000 hourly weather readings from Open-Meteo's free historical API, so the model could see the temperature, rainfall and wind at each course on each race day.

Second, features. Raw data doesn't feed into a model — it has to be turned into numbers that represent what a punter would actually think about. For every horse in every race, we computed about 30 features: age, weight, handicap rating, days since last run, career strike rate, average finish position in last three runs, strike rate at this specific course, strike rate at this distance, strike rate on this going, trainer's 14-day form, jockey's 14-day form, trainer's lifetime win rate at this particular track, field size, race class, distance, going firmness, weather, and whether the horse had been in handicap or class company before. Crucially, every feature only used information that was available before the race started. Otherwise we'd be cheating and any result would be fantasy.

Third, a model. We used XGBoost — a well-regarded gradient boosting library that's effectively the gold standard for tabular prediction problems. We trained it to predict whether each horse would win, given the 30 features describing it. We then added an isotonic calibrator — a second-stage fitting step that makes sure the model's probability numbers mean what they say. Without it, models trained on small datasets tend to produce probabilities that are too flat (everything looks like a 10% chance), which matters hugely when you're comparing to bookmaker prices.

And fourth, because we wanted to test the model properly, a backtester. The rule is simple: train the model on everything that happened strictly before date D, then predict the races on date D onwards, and check whether the model's "value bets" would actually have made money if you'd placed them at starting price.
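To make that backtest rule concrete, here's a minimal sketch of the walk-forward loop, assuming a pandas DataFrame `runners` with one row per runner, a `race_date` column, the feature columns, and a binary `won` label (column names and model settings are illustrative, not our exact pipeline):

```python
import pandas as pd
from xgboost import XGBClassifier

def walk_forward(runners: pd.DataFrame, feature_cols: list[str],
                 test_dates: list) -> pd.DataFrame:
    """Train strictly on the past, predict each test date, collect predictions."""
    out = []
    for d in sorted(test_dates):
        train = runners[runners["race_date"] < d]   # only information available before date D
        test = runners[runners["race_date"] == d]
        model = XGBClassifier(n_estimators=400, max_depth=4,
                              learning_rate=0.05, eval_metric="logloss")
        model.fit(train[feature_cols], train["won"])
        day = test.copy()
        day["model_prob"] = model.predict_proba(test[feature_cols])[:, 1]
        out.append(day)
    return pd.concat(out)
```

Retraining every single day is slow in practice (weekly steps give nearly the same guarantee), but the strict `< d` cut on the training set is the part that keeps the test honest.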

The first result looked brilliant (it wasn't)

We trained on the first four months of the 2024/25 season and tested on February through April 2025 — the final stretch that includes Cheltenham Festival and the Grand National. The first backtest gave us a very tempting number:

+15.73% ROI on 330 bets over three months. At 1-unit stakes that's a profit of about 52 units, on an 18% strike rate across the value-flagged horses. The filter was "bet whenever the model thinks the horse is at least 20% more likely to win than the market implies, the expected value is at least +15%, and the starting price is 8/1 or shorter." That sounds like a proper working strategy.

Here's the problem. When you sweep over 150 different combinations of filter settings and pick the one that made the most money on your test data, you've essentially tested 150 different strategies against the same period and crowned the winner. Some of them were going to look good by chance alone. The statistics name for this trap is selection bias — if you search hard enough for a streetlight to stand under, you will always find one. The only way to know whether a result is real or just a lucky parameter combination is to test it on a period you haven't touched yet.

So we did exactly that, on the December-January window of the same season. The same filter settings, the same model, on a period the sweep had never seen: -17% ROI. The "edge" disappeared the moment we stepped outside the optimised window. The Feb-Apr result had been mostly coincidence.
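For the record, that winning filter reduces to three comparisons per runner. A sketch, assuming `model_prob` is the calibrated model probability and `sp` is the fractional starting price (8.0 for 8/1):

```python
def is_value_bet(model_prob: float, sp: float,
                 min_prob_edge: float = 1.20,   # model at least 20% more likely than market
                 min_ev: float = 0.15,          # expected value at least +15%
                 max_sp: float = 8.0) -> bool:
    """The Feb-Apr filter that looked brilliant -- and didn't survive out-of-sample."""
    market_prob = 1.0 / (sp + 1.0)              # implied probability from fractional odds
    ev = model_prob * sp - (1.0 - model_prob)   # profit per 1-unit stake: win sp, else lose 1
    return (model_prob >= min_prob_edge * market_prob
            and ev >= min_ev
            and sp <= max_sp)
```

Three innocuous thresholds — and 150 combinations of them is more than enough rope to hang a backtest with.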

What we did next, and what we learned

Rather than throw the model away, we tried three separate fixes in sequence. Each is a standard tool in anyone's ML toolkit, and each helped a little, but none of them turned the model into a reliable money-printer. That's worth spelling out, because the failure modes are the interesting part.

Fix 1: more features. We added horse-level features for beaten lengths in the last run, going-specific win rate, a days-off-since-last-run bucket, the trainer's strike rate at this exact course, each horse's official rating relative to the field average, and separate counts for course experience and distance experience. The feature count went from 19 to 30. Result: the Feb-Apr window ticked up slightly (+18% best case); Dec-Jan stayed broken.

Fix 2: more data. We went back and scraped the entire 2023/24 UK National Hunt season as well, nearly doubling our training pool. The logic: XGBoost with 30 features really wants 5,000+ training races to generalise properly, and in the first attempt the Dec-Jan test had only seen two months of training data. Result: Feb-Apr 2025 got better again (+26.9% on the same filter); Dec-Jan stayed at about -17%, essentially unchanged. More data didn't fix the inconsistency between windows.

Fix 3: isotonic calibration. The textbook solution to "the model's probabilities don't match reality". We held out the last 20% of the training data, fit the model on the first 80%, then used the held-out fifth to teach a second-stage mapping from raw model scores to realistic probabilities (a code sketch of this step appears after the result below). Result: a small improvement everywhere, no step change.

Combining all three fixes and re-running the sweep, we found what looked like a stable pattern: the filter "maximum starting price 12/1, model probability at least twice the market probability" produced positive ROI on both the Feb-Apr 2024 window (+29%) and the Feb-Apr 2025 window (+26%). Same strategy, same profit direction, in two independent years: about 319 bets combined, averaging +27% ROI. That looked like the real deal. So we pre-registered the strategy — wrote down exactly what filter we were going to use — and tested it on a window we had not touched at any point during the tuning: October-November 2024, the early part of the 2024/25 season.

-16.81% ROI on 119 bets. The strategy lost money on the untouched window. The "cross-window stability" we thought we'd found was actually just coincidence across two correlated test windows from the same part of the racing calendar.
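For reference, the calibration step from Fix 3 is only a few lines. A sketch using scikit-learn's IsotonicRegression, where `model` is the fitted XGBoost classifier and `holdout_raw_probs` / `holdout_won` come from the 20% of training data the model never saw (names illustrative):

```python
from sklearn.isotonic import IsotonicRegression

# Fit a monotone mapping from raw model scores to observed win frequencies,
# using only the held-out slice the model was never trained on.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(holdout_raw_probs, holdout_won)

# At prediction time, pass every raw score through the learned mapping.
calibrated = calibrator.predict(model.predict_proba(X_test)[:, 1])
```

Calibration makes the probabilities honest; as we found, it can't conjure an edge that isn't there.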

What this actually tells you about betting markets

The result we ended up with is, paradoxically, the real finding. It's more interesting than if the model had worked, because it tells you something that almost nobody in the AI-picks-winners marketing world will admit. The UK National Hunt market is very efficient. A sensible machine learning model trained on every publicly available feature — the things a thoughtful punter would look at — doesn't find persistent edges over starting prices. Every time we thought we saw an edge, wider testing dissolved it. That's not because our model was stupid. The headline accuracy numbers are actually fine:

  • Top-1 accuracy: 22%. When the model nominates a single horse as the most likely winner, that horse wins about 22% of races. For typical National Hunt fields of 8-12 runners, random guessing would get around 10-12%. So the model is roughly twice as accurate as random. It's genuinely learning what a winner looks like.
  • Log-loss 0.35 (against a uniform baseline around 2.3). The model's confidence is well-placed: when it says a horse has a 30% chance it's roughly right, when it says 5% it's roughly right.
  • Feature importance matches what a human punter would prioritise. The top five drivers were jockey 14-day strike rate, field size, trainer 14-day strike rate, recent form (average finish in last three runs), and career strike rate. Nothing weird. The model independently rediscovered what the form book already tells you.

So why doesn't a model that's twice as accurate as random make money? Because the bookies are also twice as accurate as random, and then some. The starting price you see on your screen isn't one bookmaker's guess — it's the consensus of massive, well-funded pricing teams, backed by years of historical data and settled against actual money flowing through the market in the minutes before the off. The market converges on something very close to the true probability, then adds about 10-20% overround as its margin.

We measured the overround ourselves. Simply backing every favourite at flat stakes across the 2024/25 season lost 12.8% of turnover — and the favourite-only strategy on Dec-Jan 2025 lost 21%. That's the market's built-in advantage, sitting there waiting to eat any strategy that doesn't clearly beat it. Our model, for all its effort, couldn't clear that bar consistently.

The pros who do beat these markets aren't doing it with 30 features and a free weather API. They're using:
  • Pace data — speed figures and sectional times, not just finishing positions. A horse that won last time by sticking on at the end is very different from one that won by blazing from the front.
  • Dam and sire breeding — for jumps racing especially, bloodlines tell you what type of ground a horse will handle.
  • Stable reports and pre-race comments — the "stable whisper" layer that never appears in public data.
  • Historical horse-specific ratings from Timeform, Racing Post, or their own proprietary models, going back a decade or more.
  • Live exchange prices rather than starting prices, because the exchange is less efficient in the final minutes and executable prices sometimes differ from SP meaningfully.
  • A lot more data than we had — ten to fifteen years of every race, not two.

And even the pros don't claim to find value on every race. The ones who make a living at it bet selectively, sometimes just a handful of races per week, at specific kinds of edges they've verified over thousands of historical bets. They don't blast a flat-ROI strategy across 500 random bets a quarter.
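One aside on the metrics quoted above: top-1 accuracy is the easiest honesty check to run on any model — or on any tipping service whose picks you've logged. A sketch, assuming a predictions DataFrame with `race_id`, `model_prob` and `won` columns (illustrative names):

```python
import pandas as pd

def top1_accuracy(preds: pd.DataFrame) -> float:
    """Fraction of races in which the highest-probability runner actually won."""
    top_picks = preds.loc[preds.groupby("race_id")["model_prob"].idxmax()]
    return float(top_picks["won"].mean())
```

If a model or tipster can't clear random guessing on this number over a few hundred races, nothing downstream matters.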

The three things we learned that any punter can use

Even though the model didn't turn into a money-printer, the project did produce three genuinely useful rules of thumb that any sensible punter should internalise.

1. Long-shot "value" is nearly always an illusion

The single most catastrophic strategy we tested across all our backtests was: "bet horses where the model thinks they're at least three times more likely to win than the market implies." That's the definition of an apparent monster value bet. Across every window we tried, that strategy lost between 52% and 64% of turnover. Why? Because when the market prices a horse at 25/1, the market is usually right — that horse is almost certainly a 2% or 3% chance, not the 6% or 8% your model thinks. The market has priced in things your model doesn't know: the horse has been off for 400 days, the stable has had a virus, the trainer's form has collapsed since the last run. Any time a model disagrees dramatically with a bookmaker at long prices, the correct default is to trust the bookmaker. It's almost never the other way round. If you take nothing else from this article, take this rule: do not chase model-flagged value at prices longer than about 10/1 unless you have extremely specific, verifiable reasons why the market is wrong. It's a reliable way to lose money.
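The arithmetic behind that failure is worth seeing once. Take a horse priced at 25/1 that the market thinks is about a 3% chance and a model thinks is an 8% chance (hypothetical numbers):

```python
def expected_value(prob: float, sp: float) -> float:
    """Expected profit on a 1-unit stake at fractional price sp."""
    return prob * sp - (1.0 - prob)

sp = 25.0
print(expected_value(0.08, sp))   # model's view:  +1.08, an apparent +108% monster value bet
print(expected_value(0.03, sp))   # market's view: -0.22, an ordinary -22% losing bet
```

A five-point probability error that would barely matter at even money is the difference between +108% and -22% at 25/1. Long prices amplify model error, and the market's estimate is usually the better one.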

2. Trainer and jockey recent form are the two biggest signals

In every version of our model, the top two features by importance were the trainer's 14-day strike rate and the jockey's 14-day strike rate. Nothing else came close. Not weight, not class, not distance, not even the horse's own career record. That tells you something important about how racing actually works: trainers and jockeys go through measurable hot streaks and cold streaks, and those streaks are the single biggest adjustment you can make to a raw analysis of the form book. If a horse has okay-looking form but its yard has won 2 from 30 in the last fortnight, the odds should drift. If the same horse is running from a yard on 10 from 30, the odds should tighten. This is already priced into SP, but it's a useful lens when you're looking at an early morning market that hasn't fully formed. A practical consequence: when you're reading a race card, look up the trainer's 14-day form box (it's on most racing sites) before you read anything else about the horses. That single number will sharpen your analysis more than any single feature of an individual runner.
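If you want to compute that form box yourself from a results file, it's just a trailing 14-day window, with care taken to use only runs strictly before race day so the number stays pre-race. A deliberately unoptimised sketch, assuming a `results` DataFrame with `trainer`, `race_date` (datetime) and `won` columns (illustrative names):

```python
import pandas as pd

def trainer_14d_strike_rate(results: pd.DataFrame) -> pd.Series:
    """Trailing 14-day win rate per trainer, from runs strictly before each race day."""
    out = pd.Series(index=results.index, dtype=float)
    for _, idx in results.groupby("trainer").groups.items():
        runs = results.loc[idx]
        for i, row in runs.iterrows():
            window = runs[(runs["race_date"] < row["race_date"]) &
                          (runs["race_date"] >= row["race_date"] - pd.Timedelta(days=14))]
            # NaN when the yard has had no runners in the window
            out[i] = window["won"].mean() if len(window) else float("nan")
    return out
```

Slow but transparent; a production version would vectorise it. The same function with `jockey` swapped in gives the other half of the pair.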

3. The overround is brutal and you cannot outrun it with volume

We ran several "blind" reference strategies through the backtester: bet every favourite, bet every second-favourite, back the model's top pick in every race, and so on. Every single one of them lost 10-20% of turnover over 90 days. That's not because any individual strategy was bad — it's because the bookmaker's margin is built into every price you see, and volume betting just means paying that margin more times. A punter betting five races a day at flat stakes, with no discrimination, is being charged roughly 12-20% by the market on every bet. That adds up fast: £10 a bet × 5 bets × 90 days = £4,500 of turnover, and roughly £540-£900 of that is simply the overround being deducted from your expected return. You can't win a long-run game at those rates without a genuine edge. This is why professional punters are so picky. Fewer, better-reasoned bets at prices where you have a genuine information advantage is the only approach that survives the overround over time. Volume is the enemy.
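Measuring the margin yourself takes one line of arithmetic per race: convert each starting price to an implied probability and sum them. Anything above 1.0 is the book's cut. A sketch with an illustrative eight-runner card:

```python
def overround(fractional_prices: list[float]) -> float:
    """Sum of implied probabilities minus 1: the bookmaker's built-in margin."""
    return sum(1.0 / (sp + 1.0) for sp in fractional_prices) - 1.0

# An illustrative book: 2/1 favourite out to a 25/1 outsider.
book = [2, 3, 5, 7, 10, 12, 16, 25]
print(f"{overround(book):.1%}")   # ~14% margin on this example card
```

Run that over a season of real cards and you'll see why every blind strategy bleeds at roughly the same rate.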

What we'd do differently (or: the honest next steps)

We haven't given up on the model. The pipeline we built is reproducible and extensible — we can re-run the whole training and backtest in about a minute, which makes iterating cheap. Here's the list of things we think could meaningfully change the result, in roughly descending order of how promising they look.

More data. Two seasons is almost certainly too few. The pros who beat these markets do it with a decade of training history minimum. Our next move will be backfilling the Sporting Life archive back to the 2015/16 season. That's potentially 25,000+ additional races and might finally give XGBoost enough examples to stabilise calibration across any window.

Pace and sectional features. Our current features treat all winners as equal. A horse that won by 10 lengths running from the front is a completely different proposition from one that nicked it on the line from last place. Pace data — where in the field a horse ran, and how strongly it finished — would probably be our single biggest leverage point. It's not in the free Sporting Life JSON, but we could compute rough approximations from the race commentary strings.

A rank-aware model instead of a win-probability model. Right now we train XGBoost to predict "did this horse win or not?" as an independent binary question for each runner. A more principled approach for racing is LightGBM's LambdaRank, which trains directly on the within-race ordering. It's a few days of work to convert and could matter materially; a sketch of what the change looks like is below.

Retrieval-augmented generation from tipster articles. This is the one that goes beyond pure structured ML. The idea is to scrape free preview articles from Sporting Life, At The Races, and Timeform's free tier, embed them into a vector database, and at prediction time ask Claude (or another LLM) to pull the relevant commentary for each runner and synthesise an adjustment to the model's probability. Not a replacement for the structured model — a complement, adding qualitative signal the structured data can't capture. It's a much bigger project than pure feature engineering, but it's the most interesting angle.

None of these is guaranteed to produce a profitable model. They might all lead to the same answer: the market is efficient, deal with it. But each one is a meaningful test that sharpens the question.
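For what the rank-aware conversion looks like: LightGBM's LambdaRank consumes races as groups and learns the within-race ordering directly, rather than treating each runner as an independent yes/no. A sketch, reusing the same `runners` frame and `feature_cols` list with a `race_id` column (names illustrative, not our production code):

```python
import lightgbm as lgb

# Rows must be contiguous per race; group sizes tell LambdaRank where each race ends.
train = runners.sort_values("race_id")
group_sizes = train.groupby("race_id").size().to_numpy()

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=400, learning_rate=0.05)
ranker.fit(
    train[feature_cols],
    train["won"].astype(int),   # relevance grade: 1 for the winner, 0 for everyone else
    group=group_sizes,
)

# Output is a relative score within each race, not a probability --
# good for ordering runners, but the calibration step would need rethinking.
scores = ranker.predict(train[feature_cols])
```

The catch, noted in the comments: ranker scores aren't probabilities, so comparing against market prices would need a new calibration layer on top.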

If you're reading this as a punter

The most useful takeaways from this whole exercise, if you're a recreational punter who's read about AI betting and wondered whether to try it:

  • Be extremely sceptical of services selling AI picks. If a model genuinely worked on UK racing, the people running it would be betting the picks themselves, not selling them. The maths of that is brutal: if an AI could find 15% ROI bets reliably, one person with modest capital would turn it into millions in a single season. They don't because it doesn't.
  • Flat-stake betting on model picks is a faster way to lose money than random betting, because models tend to land on the same horses the market already likes, which concentrates your losses on short-priced favourites getting beaten.
  • If you want to use a model as part of your own process, use it as a ranking aid, not a value detector. Our model's top-picked horse wins about 22% of races — that's a useful starting filter for where to look harder, not a signal to bet.
  • The genuine edge in betting is at the margins: specific situations a computer is unlikely to see (local knowledge, rumour, first-time headgear, stable form turnaround), and markets less efficient than horse racing (lower-league sports, niche events, in-play exchange layers).

Responsible gambling

This article is about the mathematics of an ML backtest, not an invitation to start betting more. The model we built lost money on a pre-registered test window. Anyone reading this as a green light to start pumping volume through a racing market is reading it wrong. If you're going to bet, do it with money you can afford to lose, set clear limits before you start, and stop when you hit them. If betting is becoming a problem, BeGambleAware offers free, confidential support and advice.

What's next on this project

We'll keep iterating on the model in public. When the next round of features and backfilled data is in, we'll post a follow-up with the new numbers — whatever they are. If we find a genuinely reproducible edge, we'll share exactly what it is. If we don't, we'll say so plainly. One thing we can promise: the next update won't be behind a subscription. The whole point of doing this properly is to show punters what's actually possible with a sensible model on public data, and you can't do that honestly behind a paywall. If you want to be notified when the follow-up lands, check back on our betting strategies hub. And if you take one rule of thumb from this piece, let it be this: the long-shot that "looks like value" almost never is, and the trainer's form box tells you more than any single feature of an individual horse.

Please gamble responsibly. If you feel you may have a problem, visit BeGambleAware.org or call the National Gambling Helpline on 0808 8020 133.