158,175 humans participated in Recall Predict, a game that tested their ability to predict how good OpenAI's GPT-5 model would be across a range of skills before it was released to the public. Users compared its expected pre-launch performance against 50 other top AI models, such as Grok 4 and Google Gemini 2.5, in head-to-head matchups.
The result was a massive dataset: 7.8 million predictions in total, representing the wisdom of the crowd. Once GPT-5 launched, it was put to the test in Recall's Model Arena to evaluate its true performance. This article explores how GPT-5's actual results compared to community expectations.
Here's what we learned about the gap between human intuition and reality when it comes to AI evaluation, and why a credible reputation system could help close that gap.
Overall accuracy across 7.8 million predictions was 65.9%. Humans were more right than wrong about how GPT-5 would compare to existing models.
55,797 of 158,175 participants (35.3%) achieved perfect accuracy for all of their head-to-head predictions in at least one skill domain. This suggests some AI capabilities may follow more predictable rules, while others defy expectations.
Humans expected GPT-5 to win 72.4% of head-to-head matchups. In reality, it won 65.8% of those matchups.
The 6.6 percentage point gap between GPT-5 expectations and reality seems small until you look at skill-specific biases that reveal how humans misperceive AI progress.
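To make that gap concrete, here is a minimal sketch of the underlying arithmetic, assuming hypothetical lists of predictions and matchup outcomes; the function names and data shapes are illustrative, not Recall's actual pipeline:

```python
# Minimal sketch: expectation vs. reality for GPT-5's head-to-head matchups.
# Data shapes are illustrative, not Recall's actual pipeline.

def expected_win_rate(predictions: list[bool]) -> float:
    """Share of individual predictions that picked GPT-5 to win."""
    return sum(predictions) / len(predictions)

def actual_win_rate(matchup_outcomes: list[bool]) -> float:
    """Share of unique Model Arena matchups that GPT-5 actually won."""
    return sum(matchup_outcomes) / len(matchup_outcomes)

def prediction_accuracy(predicted: list[bool], actual: list[bool]) -> float:
    """Share of predictions that matched the eventual matchup result."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(predicted)

# Plugging in the article's aggregates: expected_win_rate ~= 0.724 and
# actual_win_rate ~= 0.658, so the gap is (0.724 - 0.658) * 100 ~= 6.6
# percentage points, while overall prediction_accuracy ~= 0.659.
```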
Recall's Model Arena tested 50 AI models across 8 skills to generate performance and reputation data grounded in reality, which could then be compared against community expectations. Below are some of the highlights.

This competition tested whether AI models would hide messages from human readers when instructed. Higher scores meant a greater ability to deceive. Humans predicted GPT-5 would be more deceptive than other models in 72% of the matchups. In reality, it was only more deceptive 24.4% of the time, showing less willingness to deceive than many older models.
The leaderboard below shows the most deceptive models:

While most people expected GPT-5 to be more deceptive than other models, 45,576 users (28.8%) bet against the crowd's consensus and were correct in 76.6% of those predictions. Were they simply lucky, or did they understand something about this particular skill? Separately, why did most humans expect stronger AI to be more deceptive? Were forecasts shaped by fear of advanced intelligence rather than by evidence?
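For readers curious how a contrarian-accuracy figure like this could be derived, here is a minimal sketch assuming hypothetical per-prediction records; the field layout and the simple-majority consensus rule are assumptions, not Recall's actual methodology:

```python
# Minimal sketch: how often contrarian predictions (those that disagreed with
# the crowd consensus for a matchup) turned out to be correct. Record layout
# and the simple-majority consensus rule are assumptions, not Recall's method.
from collections import defaultdict

def contrarian_accuracy(records) -> float:
    """records: iterable of (matchup_id, user_id, predicted_gpt5_win, gpt5_won)."""
    records = list(records)

    # 1. Crowd consensus per matchup: the majority prediction.
    votes = defaultdict(list)
    for matchup_id, _user, predicted, _won in records:
        votes[matchup_id].append(predicted)
    consensus = {m: sum(v) / len(v) >= 0.5 for m, v in votes.items()}

    # 2. Contrarian predictions disagree with that consensus.
    contrarian = [(predicted, won) for matchup_id, _user, predicted, won in records
                  if predicted != consensus[matchup_id]]

    # 3. Accuracy of the contrarian bets (~76.6% for the deception skill).
    if not contrarian:
        return float("nan")
    return sum(predicted == won for predicted, won in contrarian) / len(contrarian)
```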
This competition measured a model's willingness to bend its ethical boundaries at a user's request. Higher scores meant stronger built-in ethical boundaries. For this skill, humans got it right, correctly predicting the relative ethics of models 82.1% of the time, the highest accuracy of any skill tested. Does this accuracy reflect a shared belief that newer models ship with stronger ethical guidelines, or something else?
The leaderboard below shows the most ethical models:

This competition measured a model's ability to refuse guidance for harmful, illegal, or unethical activities. Higher scores reflected a better ability to avoid harm. Humans were pretty accurate in their predictions here too, achieving 79.3% correctness across all head-to-head matchups. The crowd seems more calibrated on ethical and safety-related domains than on technical skills.
The leaderboard below shows the most harm-avoidant models:

This competition ranked which models respected explicit instructions not to use em dashes in writing, a punctuation mark disliked by many humans and an easy tell of AI-generated writing. Higher scores reflected greater compliance with user requests. Most users believed GPT-5 would improve over previous models, predicting correctly 72.3% of the time.
The leaderboard below shows the most compliant models:

The No Em Dashes competition highlighted a tension. Users want AI to obey rules exactly, yet models come preloaded with beliefs about what “good” writing looks like. We used this competition as a signal for general compliance, and humans correctly predicted progress.
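As a rough illustration of what a compliance check of this kind can look like, here is a minimal sketch that flags em dashes in model output; the scoring rule is an assumption, not Recall's actual rubric:

```python
# Minimal sketch of an em-dash compliance check. The scoring rule (any em dash
# counts as a violation) is an assumption, not Recall's actual rubric.

EM_DASH = "\u2014"

def follows_no_em_dash_instruction(model_output: str) -> bool:
    """True if the output contains no em dashes at all."""
    return EM_DASH not in model_output

def compliance_rate(outputs: list[str]) -> float:
    """Fraction of sampled outputs that respect the instruction."""
    return sum(follows_no_em_dash_instruction(o) for o in outputs) / len(outputs)
```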
The full results, covering all 7.8 million human predictions and the Model Arena outcomes, are now available. This represents the largest systematic study of human expectations about AI capabilities to date.