
Moneyball was one of the first movies that showed people that there’s more to professional sports than physical fitness, instinct, and competitiveness. It tells the story of how general manager Billy Beane, with the help of Peter Brand, revived the Oakland A’s by embracing sabermetrics, the empirical analysis of baseball through statistics. This disruptive data-driven approach forever changed how baseball was played.
Today, all elite professional sports teams use predictive, data-driven analytics to guide strategic decisions that shape team composition and game strategy, and that ultimately determine winners and losers. Their analysis considers a wide range of factors such as matchups, weather, injuries, unexpected events, and other external drivers: the same factors that Las Vegas sportsbooks use to set their lines, which are essentially predictions of game outcomes.

Fast forward to 2025. Artificial intelligence models can now ingest enormous amounts of data and make predictions – and they’re becoming increasingly powerful. But can they play moneyball better than Las Vegas? We wanted to find out, so we put six leading LLMs into an arena to try to predict the outcomes of NFL games on Thanksgiving.
American football is dynamic, complex, and notoriously tough to predict, which is why we believed it would be such a good challenge for evaluating the predictive capabilities of AI models. Unlike baseball, where analysis can lean heavily on a single stat like on-base percentage (OBP), no single metric dominates NFL analysis because of the game’s inherently unpredictable nature. For reference, the NFL point spreads set by Las Vegas sportsbooks are correct only about 55% of the time, barely better than a coin toss. To succeed at this challenge, AI models would need to ingest many different datasets and perform complex analysis to see if they could beat the benchmarks set by the bookmakers in Vegas.

Recall’s NFL Arena isn’t the first time ML and AI models have been tested on football predictions. Previous experiments applying machine learning to NFL data have reported classification accuracies of 75-86% for predicting game winners, while LLMs predicting continuous outcomes like point spreads have claimed accuracies of 72-77%. Some models tested on recent NFL seasons have even reported accuracy scores as high as 85%. Though these benchmarks sound impressive, they have several fundamental problems:
Unrealistic Backtesting and Data Leakage: Published accuracy rates come from backtesting, not live predictions. For example, a model is trained on 2010-2020 data and then tested on 2021 games, all of which are in the past. This approach is susceptible to data leakage, where information about the supposedly unseen games seeps into the model’s training data or features, and it sidesteps a core difficulty of sports prediction: results from one slate of historical games can’t be cleanly compared to another. Live game predictions are a more honest way to evaluate a model’s ability to predict truly unseen outcomes. (A small illustration of how leakage creeps into backtests follows this list.)

Dataset Heterogeneity: Past a threshold of model quality, the data an AI model can access to generate predictions matters more than the model itself. The lack of publicly available NFL benchmark datasets has made model evaluation challenging, since much of this data is available only to professional teams. Requiring models to react to the same live, dynamic game scenarios puts them all on a level playing field for evaluation.
Confidence Matters More than Accuracy: For sports betting and prediction applications, model calibration matters more than raw accuracy. Most benchmarks today optimize for accuracy while ignoring whether models actually understand the uncertainty of their own predictions. Well-calibrated models have achieved an average return on investment (ROI) of +34.69%, versus -35.17% for accuracy-optimized models. (A sketch of how calibration can be measured also follows this list.)

Unpredictability of Live Games: Predicting sports events is hard because of inherent uncertainty and constantly shifting conditions. Conventional prediction techniques that rely on static attributes fail to capture these dynamics. This is why traditional AI benchmarks can’t tell us whether models are doing predictive reasoning or just sophisticated pattern matching. A better benchmark needs to run in real time on live games.
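
To make the leakage problem concrete, here is a minimal sketch of how future information can creep into a backtest: a team-strength feature is computed over the full game log, including the “test” season, before the data is split by year. The game log, column names, and feature are invented for illustration and aren’t taken from any published benchmark.

```python
import pandas as pd

# Hypothetical game log: one row per game, ordered by season.
games = pd.DataFrame({
    "season":   [2018, 2019, 2020, 2021, 2021],
    "home":     ["KC", "KC", "DAL", "KC", "DAL"],
    "home_won": [1, 1, 0, 1, 1],
})

# LEAKY: team strength computed over ALL seasons, including the 2021 "test"
# games, then attached to the 2018-2020 rows used for training.
games["home_strength_leaky"] = games.groupby("home")["home_won"].transform("mean")

# CLEAN: each game's feature uses only games played before it.
games = games.sort_values("season")
games["home_strength_clean"] = (
    games.groupby("home")["home_won"]
         .transform(lambda s: s.shift().expanding().mean())
)
print(games)
```

In the leaky version, the strength feature on the 2018-2020 training rows already reflects how each team performed in 2021, which quietly inflates backtest accuracy. A live benchmark rules this out by construction, because the outcomes simply don’t exist yet.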
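
And to make the calibration point concrete, here is a minimal sketch comparing two hypothetical models that pick the same winners but express different levels of certainty. The predictions and results are invented, and the Brier score is used as one standard calibration-sensitive metric; it is not the metric behind the ROI figures cited above.

```python
import numpy as np

def accuracy(confidences, outcomes):
    """Fraction of games where the model put >50% on the side that won."""
    picks = np.asarray(confidences) > 0.5
    return float(np.mean(picks == np.asarray(outcomes).astype(bool)))

def brier_score(confidences, outcomes):
    """Mean squared gap between predicted win probability and the 0/1 result."""
    diffs = np.asarray(confidences, dtype=float) - np.asarray(outcomes, dtype=float)
    return float(np.mean(diffs ** 2))

# Probability each model assigned to the favorite, and what actually happened
# (1 = favorite won, 0 = upset). Both models pick the favorite every time.
overconfident_model = [0.95, 0.95, 0.90, 0.90, 0.90]
hedged_model        = [0.70, 0.55, 0.75, 0.70, 0.60]
results             = [1, 0, 1, 1, 0]

for name, preds in [("overconfident", overconfident_model), ("hedged", hedged_model)]:
    print(f"{name}: accuracy={accuracy(preds, results):.2f}, "
          f"brier={brier_score(preds, results):.3f} (lower is better)")
```

Both models land on the same 60% accuracy, but the hedged model’s much lower Brier score shows it understood its own uncertainty, which is the property that matters once predictions are turned into bets.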

On November 27, 2025, we launched NFL Arena to evaluate how well six of the most powerful AI models could predict the outcomes of all three NFL games on Thanksgiving. Unlike previous NFL prediction benchmarks, models in the NFL Arena competed against each other in real time on live games and dynamic scenarios, not on backtested datasets.
We tracked time-weighted confidence throughout each game, penalizing late conviction shifts and rewarding models that reasoned consistently rather than reactively chasing scoreboard changes. And because these games hadn’t happened yet, no amount of memorized training data could help. To succeed, models had to reason accurately about genuinely uncertain futures.
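
As a rough illustration of that scoring idea, the sketch below weights each in-game prediction by how much game time remained when it was made, so early, correct conviction is worth more than piling on after the outcome is obvious. This is not the Arena’s actual scoring code; the weighting scheme, class, and sample numbers are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    minutes_remaining: float  # game clock when the prediction was made
    p_home_win: float         # model's probability that the home team wins

def time_weighted_score(predictions, home_team_won, total_minutes=60.0):
    """Brier-style penalty per prediction, weighted so that calls made with
    more game time left count for more. Returns a 0-1 score, higher is better."""
    outcome = 1.0 if home_team_won else 0.0
    weighted_error = total_weight = 0.0
    for pred in predictions:
        weight = pred.minutes_remaining / total_minutes  # early calls weigh more
        weighted_error += weight * (pred.p_home_win - outcome) ** 2
        total_weight += weight
    return 1.0 - weighted_error / total_weight

# A model that backed the eventual winner early beats one that only flipped
# its pick in garbage time, even though both ended up "right" at the whistle.
early_conviction = [Prediction(55, 0.62), Prediction(30, 0.70), Prediction(5, 0.90)]
late_flip        = [Prediction(55, 0.35), Prediction(30, 0.40), Prediction(5, 0.85)]
print(time_weighted_score(early_conviction, home_team_won=True))  # ~0.88
print(time_weighted_score(late_flip, home_team_won=True))         # ~0.62
```

Down-weighting late predictions is what keeps a model from scoring well simply by chasing the scoreboard.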

When we compared the pregame predictions to the actual outcomes, all six AI models had picked every game incorrectly. The LLMs unanimously favored the KC Chiefs, BAL Ravens, and DET Lions, all of which lost their respective games. If they had been placing pregame bets, they would have lost money.
DAL Cowboys 31, KC Chiefs 28 — Upset
CIN Bengals 32, BAL Ravens 14 — Major upset
GB Packers 31, DET Lions 24 — Upset
Pregame prediction accuracy wasn’t the only way we scored models in the NFL Arena. We also considered how well they adapted to the flow of each game in real time by adjusting their live predictions. Below are the final results of our NFL Arena benchmark.

Lower scores than static benchmarks. Live testing is a tougher, more realistic test than static benchmarking: our scores fell below the accuracy figures reported by static NFL prediction benchmarks, and models couldn’t overfit to or memorize the answers to this challenge.
Wide capability differences. The spread between first place (Claude Opus 4.5 at 0.651) and last place (Grok 4.1 at 0.409) works out to a 59% relative performance gap ((0.651 - 0.409) / 0.409). Live predictions surfaced real capability differences between models that static benchmarks mask.
Confidence ≠ Accuracy. Models expressed varying confidence levels in their predictions, but confidence scores didn’t reliably correlate with accuracy. This mirrors the calibration problem identified in academic research: models optimized for benchmark accuracy often develop poor uncertainty estimation.
So what can we extract from this experiment, and what does it tell us more broadly about the behavior and predictive performance of AI models?
We analyzed the reasoning patterns behind every prediction. The results revealed a reliance on betting markets (a simple signal) rather than game-state analysis (a complex one). Notably, GPT-5.1 cited betting lines in 99.6% of its predictions. Even with the DAL Cowboys leading in the 4th quarter and momentum on their side, it reasoned: "Live betting still has Kansas City as a solid favorite around -190 on the moneyline."

DeepSeek and Grok never predicted the DAL Cowboys would win, even when Dallas was ahead by 11 points with 5 minutes left or when the game was about to conclude. Similarly, Gemini, GPT-5.1, and Grok never predicted the CIN Bengals would win, even as the Bengals led 32-14 in the 4th quarter. These models maintained confidence in the pregame favorite despite overwhelming contrary evidence.

We measured the "overconfidence gap": the difference between a model’s average confidence when it was wrong and its average confidence when it was right. Most models were more confident when they were wrong. DeepSeek was the exception, showing higher confidence when correct, but that edge came from extreme late-game confidence (80-95%) once outcomes had become obvious.
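
For anyone who wants to run the same check on their own prediction logs, here is a minimal sketch of the calculation; the confidence values and outcomes are hypothetical, not drawn from the Arena data.

```python
import numpy as np

def overconfidence_gap(confidences, correct):
    """Average confidence on wrong predictions minus average confidence on
    right ones. A positive gap means the model was, on average, more sure of
    itself when it was wrong."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    return float(confidences[~correct].mean() - confidences[correct].mean())

# Hypothetical log for one model: confidence in each pick and whether the
# pick turned out to be right.
confidences = [0.78, 0.82, 0.75, 0.66, 0.91, 0.58]
correct     = [False, False, False, True, False, True]
print(f"overconfidence gap: {overconfidence_gap(confidences, correct):+.3f}")
```

A positive gap is the signature of the behavior described above: the model is most certain precisely when it is wrong.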

NFL Arena was our first experiment in using arenas to evaluate the predictive reasoning of AI models. The results confirmed our hypothesis: frontier LLMs show meaningful capability differences in predictive reasoning that static benchmarks don’t capture. Moving forward, we will expand this framework to cover additional skill domains, more models, and more complex prediction tasks. Our goal is to build extensible AI evaluation infrastructure that reveals what these models can actually do when facing genuine uncertainty.
If you have an idea for an arena or want us to run one for you, get in touch.