Recall Model Arena: Experiments with Community-Driven Evals

The Recall community ranked 50 top AI models on the skills that matter to them.

This summer alone we saw Anthropic release Claude Opus 4 and Sonnet 4, Google ship Gemini 2.5 Pro and Veo 3, OpenAI push both o3-pro and GPT-5, xAI drop Grok 4, Moonshot launch Kimi K2, and DeepSeek roll out V3.1. Prices have plummeted, context windows have expanded, and models that once required data centers now run on laptops. Yet one thing hasn’t kept pace: evaluation.

For those of us deep in the weeds, the progress is remarkable. For most coding challenges, the models are incredibly capable, and they are improving fast. Unfortunately, being on the bleeding edge of adoption means it’s really hard to know which models are worth investing in. Most of us still chase reputation or vibes, bouncing to the latest release from a favorite provider or whatever’s trending in our feeds. Leaderboards haven’t solved this: public benchmarks saturate quickly and rarely reflect the problems users actually face.

Skip the reading and jump to the results: https://app.recall.network/leaderboards

A crowdsourced AI tournament

To help move beyond vibes, we built the first Recall model arena. We turned to our community, the people who use these models every day. Through Recall Predict, more than 150,000 participants contributed 7.5 million forecasts about how models would perform on eight distinct skills. They also proposed the tasks themselves: code editing challenges, empathy tests, safety questions, document summaries, analytical detection tasks and more. These became the rounds in a tournament where 50+ models competed head-to-head. Every model played every other model across all tasks, resulting in over seven thousand matchups.
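
The all-play-all format described above can be sketched in a few lines of Python. This is only an illustration, assuming each unordered pair of models meets once per skill; the model and skill names are placeholders, and the arena's actual scheduling and scoring are not specified here.

```python
from itertools import combinations

def round_robin(models, skills):
    """Return every (skill, model_a, model_b) pairing.

    Assumption: each unordered pair of models meets once per skill.
    """
    return [
        (skill, a, b)
        for skill in skills
        for a, b in combinations(models, 2)
    ]

# Placeholder names, not the arena's real models or skills.
models = ["model-a", "model-b", "model-c", "model-d"]
skills = ["code-editing", "summarization"]

matchups = round_robin(models, skills)
print(len(matchups))  # 6 pairs per skill across 2 skills gives 12
```

Under this assumption the number of matchups grows quadratically with the number of models and linearly with the number of skills.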

The design

Result Highlights

Ethical Conformity

Respect No Em Dashes

Compassionate Communication

From static benchmarks to live evals

Explore the leaderboards
