Announcing ARC-AGI-3
The only unsaturated agentic intelligence benchmark in the world
Humans score 100%, AI <1%
This human-AI gap demonstrates we do not yet have AGI
Most benchmarks test what models already know; ARC-AGI-3 tests how they learn
Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%
This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA
New verified ARC-AGI-Pub SoTA!
@OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation.
And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval.
1/4
After the o3 price reduction, we retested the o3-2025-04-16 model on ARC-AGI to determine whether its performance had changed.
We compared the retest results with the original results and observed no difference in performance.
Today we are announcing ARC-AGI-2, an unsaturated frontier AGI benchmark that challenges AI reasoning systems (same relative ease for humans).
Grand Prize: 85%, ~$0.42/task efficiency
Current Performance:
* Base LLMs: 0%
* Reasoning Systems: <4%
New SOTA on ARC-AGI
- V1: 79.6%, $8.42/task
- V2: 29.4%, $30.40/task
Custom submissions by @jeremyberman and @_eric_pang_ are now the best known solutions to ARC-AGI
Both:
* Are open source
* Use Grok 4
* Implement program-synthesis outer loops with test-time adaptation
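To make "program-synthesis outer loop with test-time adaptation" concrete, here is a minimal sketch of the general idea: search for a program that reproduces a task's demonstration pairs, then apply it to the test input. This is not the code from either submission. The four-primitive DSL, the function names, and the toy task are all hypothetical, and brute-force enumeration stands in for the LLM (Grok 4) that the real submissions use to propose candidate programs.

```python
from itertools import product

# Hypothetical mini-DSL of grid primitives (illustrative only).
def identity(g):
    return [row[:] for row in g]

def flip_h(g):
    return [row[::-1] for row in g]

def flip_v(g):
    return [row[:] for row in g[::-1]]

def transpose(g):
    return [list(col) for col in zip(*g)]

PRIMITIVES = {
    "identity": identity,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}

def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def synthesize(train_pairs, max_depth=2):
    """Outer loop: return the first program that reproduces every
    demonstration pair of this task, or None if none is found."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, x) == y for x, y in train_pairs):
                return program
    return None

# Toy task: each output is the input mirrored left to right.
train_pairs = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
    ([[5, 0, 0]], [[0, 0, 5]]),
]
program = synthesize(train_pairs)
prediction = run(program, [[7, 8, 9]])
print(program, prediction)
```

The adaptation happens at test time because the search runs per task, against that task's own demonstration pairs, rather than relying on a fixed mapping learned beforehand.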
Today, we're announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI
We’re releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API
Starting scores - Frontier AI: 0%, Humans: 100%
Analyzing the Hierarchical Reasoning Model by @makingAGI
We verified scores on hidden tasks, ran ablations, and found that performance comes from an unexpected source
ARC-AGI Semi Private Scores:
* ARC-AGI-1: 32%
* ARC-AGI-2: 2%
Our 4 findings:
New ARC-AGI SOTA: GPT-5 Pro
- ARC-AGI-1: 70.2%, $4.78/task
- ARC-AGI-2: 18.3%, $7.41/task
@OpenAI’s GPT-5 Pro now holds the highest verified frontier LLM score on ARC-AGI’s Semi-Private benchmark
Clarifying o3’s ARC-AGI Performance
OpenAI has confirmed:
* The released o3 is a different model from what we tested in December 2024
* All released o3 compute tiers are smaller than the version we tested
* The released o3 was not trained on ARC-AGI data, not even the train set
* The released o3 is tuned for chat/product use, which introduces both strengths and weaknesses on ARC-AGI
What ARC Prize will do:
* We will re-test the released o3 (all compute tiers) and publish updated results. Prior scores will be labeled “preview”
* We will test and release o4-mini results as soon as possible
* We will test o3-pro once available
Introducing SnakeBench, an experimental benchmark side quest
We made 50 LLMs battle each other in head-to-head snake 🐍
2.8K matches showed which models are the best at snake real-time strategy and spatial reasoning
Here’s the top match between o3-mini and DeepSeek-R1
🧵
Gemini-2.5-Pro Experimental Preview Results
ARC-AGI-1
* Public Eval: 24.3%
* Semi Private: 12.5%
ARC-AGI-2
* Public Eval: 0.8%
* Semi Private: 1.3%
These results are on par with DeepSeek's R1
o3 and o4-mini on ARC-AGI's Semi Private Evaluation
* o3-medium scores 53% on ARC-AGI-1
* o4-mini shows state-of-the-art efficiency
* ARC-AGI-2 remains virtually unsolved (<3%)
Through analysis we highlight differences from o3-preview and other model behavior
We put OpenAI o1 to the test against ARC Prize.
Results: both o1 models beat GPT-4o. And o1-preview is on par with Claude 3.5 Sonnet.
Can chain-of-thought scale to AGI? What explains o1's modest scores on ARC-AGI?
Our notes:
arcprize.org/blog/openai-o1-…
Tiny Recursion Model (TRM) results on ARC-AGI
- ARC-AGI-1: 40%, $1.76/task
- ARC-AGI-2: 6.2%, $2.10/task
Thank you to @jm_alexia for contributing TRM to the community: well-written, open-source, and thorough research that builds on the HRM from @makingAGI
AGI is reached when the capability gap between humans and computers is zero
ARC Prize Foundation measures this to inspire progress
Today we preview the unbeaten ARC-AGI-2 + open public donations to fund ARC-AGI-3
TY Schmidt Sciences (@ericschmidt) for $50k to kick us off!
Claude Opus 4 on ARC-AGI Semi Private Eval
Base
* ARC-AGI-1: 22.5%, $0.40/task
* ARC-AGI-2: 1.3%, $0.63/task
Thinking 16K
* ARC-AGI-1: 35.7%, $1.25/task
* ARC-AGI-2: 8.6%, $1.93/task
Opus 4 sets new SOTA (8.6%) on ARC-AGI-2
This performance on ARC-AGI highlights a genuine breakthrough in novelty adaptation.
This is not incremental progress. We're in new territory.
Is it AGI? o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
2/4
We tested every major AI reasoning system. There is no clear winner.
Accuracy goes up as you stack modern CoT techniques, but efficiency goes way down.
This gives rise to a Pareto frontier on accuracy vs. cost using ARC-AGI as a consistent measuring stick.
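As a rough illustration of what a Pareto frontier on accuracy vs. cost means, the sketch below keeps only the systems that no other system beats on both axes. The function name and data layout are assumptions. The first four entries reuse ARC-AGI-1 figures quoted elsewhere in this feed, and the last entry is a made-up point added only to show a dominated system being filtered out.

```python
def pareto_frontier(results):
    """Return the (name, cost, accuracy) entries that are not dominated,
    i.e. no other entry is at least as cheap and more accurate."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, try the
    # more accurate entry first.
    for name, cost, accuracy in sorted(results, key=lambda r: (r[1], -r[2])):
        if accuracy > best_accuracy:
            frontier.append((name, cost, accuracy))
            best_accuracy = accuracy
    return frontier

results = [
    # (system, cost in $ per task, ARC-AGI-1 score in %)
    ("Claude Opus 4 (Base)", 0.40, 22.5),
    ("Claude Opus 4 (Thinking 16K)", 1.25, 35.7),
    ("TRM", 1.76, 40.0),
    ("GPT-5 Pro", 4.78, 70.2),
    # HYPOTHETICAL entry, not a real result: costs more than TRM, scores lower.
    ("Hypothetical system X", 3.00, 30.0),
]

frontier = pareto_frontier(results)
for name, cost, accuracy in frontier:
    print(f"{name}: {accuracy}% at ${cost:.2f}/task")
```

A system sits on the frontier only if improving its accuracy would require paying more per task, so two systems can both be "best" at different budgets.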
ARC-AGI-3 Preview - 30-Day Learnings
30 days ago we released a preview of our first Interactive Reasoning Benchmark
Our goal was to ship quickly, learn from the community, and inform the next >100 games.
Here’s what we learned after 100s of agents and >3,900 game plays:
ARC-AGI-3 Preview: +3 Games Released
We’ve opened 3 previously private holdout games from the Preview Agent Competition
Now 6 games are available to play online and via Agents API
Each game was selected to expand the novelty of ARC-AGI-3 public games
Can you beat them?
R1-Zero matches performance of R1 on ARC-AGI
We’ve verified that R1-Zero scored 14% on ARC-AGI-1 (vs 15% on R1)
@mikeknoop explains why R1-Zero is more important than R1, why scaling inference isn’t going away, and what happens when “inference becomes training”
1/4
ARC-AGI-3 Developer Preview
* Hands-on first look at ARC-AGI-3 (live demos & API access)
* Fireside with @fchollet moderated by @dwarkesh_sp
7/17, San Francisco
Open to sponsors & researchers of @arcprize (very limited public slots available)
Impressive work by @makingAGI and team
No pre-training or CoT with material performance on ARC-AGI
> With only 27 million parameters, HRM achieves exceptional performance on complex reasoning tasks using only 1000 training samples
🚀Introducing Hierarchical Reasoning Model🧠🤖
Inspired by brain's hierarchical processing, HRM delivers unprecedented reasoning power on complex tasks like ARC-AGI and expert-level Sudoku using just 1k examples, no pretraining or CoT!
Unlock next AI breakthrough with neuroscience. 🌟
📄Paper: arxiv.org/abs/2506.21734
💻Code: github.com/sapientinc/HRM
Today, alongside our analysis of o3's ARC-AGI-Pub performance, we're also releasing data (results, attempts, and prompt) from our high-compute testing.
o3 was unable to solve a set of Public Eval tasks (~9%) that are straightforward for humans. Curious to see why?
We invite the community to help assess the characteristics of both solved and unsolved tasks.
arcprize.org/blog/oai-o3-pub…
ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
Our paper introduces the leading benchmark for evaluating AI’s abstract reasoning capabilities
- Humans solve 100% of tasks
- Frontier AI scores <5%
@fchollet @mikeknoop @GregKamradt @bryanlanders Henry Pinkard
We’re working to reproduce Qwen 3’s reported 41% on ARC-AGI-1. This score is not yet verified.
Reminder, all scores on the ARC-AGI Leaderboard reflect our own verified testing on our semi-private holdout set.
On Claude Opus 4 results
We're currently unable to finish testing Claude Opus 4 on ARC-AGI due to consistent timeouts and rate limits
We're actively trying to get this unblocked by the @AnthropicAI team. If you can help get us in touch, please do.
Once resolved, we'll complete and share the results
Wow! One of our donors has anonymously decided to materially increase their support to $1M!
This fully funds our 2025 goal in just 1 day
With this support, we’ll launch v2, build v3, and continue driving progress in measuring AGI
"I've updated my AGI timeline."
One year later, @dwarkesh_sp and @fchollet meet on camera again.
Both of them have shifted their AGI timelines.
They dive into AGI macroeconomics, the singularity, and ARC-AGI-3 preview.
Interview filmed July 17, 2025 in San Francisco, CA
Learn more about ARC-AGI-3: https://arcprize.org/arc-agi/3/
Play the games: https://three.arcprize.org/
As previously shared, ARC-AGI-2 (same format: verified easy for humans, harder for AI) will launch alongside ARC Prize 2025.
We're committed to running the Grand Prize competition until a high-efficiency, open-source solution scoring 85% on the latest ARC-AGI is created.
3/4
Read our full o3 testing report and @fchollet's perspective on this exciting breakthrough, the future of the ARC-AGI benchmark, and the path to AGI.
arcprize.org/blog/oai-o3-pub…
4/4
On Dec. 6...
We'll announce the winners of ARC Prize 2024, including top score & paper award progress prizes.
And we'll publish a paper documenting state-of-the-art approaches to ARC-AGI.
We're now reviewing paper submissions and verifying the leaderboard.
Stay tuned...
The Next Chapter: ARC Prize Foundation
Beyond the benchmark - the North Star for AGI
We're excited to announce important updates to our leadership, entity structure, and initiatives for 2025
1/5
Introducing *ARC‑AGI Without Pretraining* – ❌ No pretraining. ❌ No datasets. Just pure inference-time gradient descent on the target ARC-AGI puzzle itself, solving 20% of the evaluation set. 🧵 1/4
[Paper] One approach to solve ARC-AGI is to learn a domain-specific language from the training set and add to the DSL on-the-fly when faced with novel tasks.
arxiv.org/abs/2410.06209
[Paper] Dreamcoder's inductive program synthesis has inspired many ARC-AGI approaches.
By combining neural networks + symbolic abstractions, it can tackle tasks from programming to physics.
arxiv.org/abs/2006.08381
Finding #1: The "hierarchical" architecture had minimal performance impact when compared to a similarly sized transformer
A drop-in transformer comes within a few points without any hyperparameter optimization.
See our full post: arcprize.org/blog/hrm-analys…
Interactive Reasoning Benchmarks are the next step in frontier evaluations
Hear @GregKamradt share why measuring human-like intelligence requires multi-turn environments
Including a sneak peek of ARC-AGI-3
Want to help us build interactive evaluations? We're hiring
ARC Prize is now 3 months old - we're announcing:
🏆 +$100K Grand Prize (now $600k)
📜 +$25K Paper Awards (now $75k)
And we're committing funds for a US university tour in October and the development of the next iteration of ARC-AGI.
arcprize.org/blog/3-month-up…
Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2.
Specialized AI reasoning systems (like R1 and o3-mini) score <4%. Even AI systems with high adaptation, like o1 pro and o3 low, score in the single digits (est.).
Our belief is that once we can no longer come up with quantifiable problems that are relatively easy for humans, yet hard for AI, we have reached AGI.
ARC-AGI-2 proves that we do not have AGI. New ideas are still needed!
ARC Prize Foundation @ MIT
We're hosting an evening with top researchers to explore measuring sample efficiency in humans and machines
Join us to hear from Francois Chollet along with a world class panel: Josh Tenenbaum, Samuel Gershman, Laura Schulz, Jacob Andreas
ARC-AGI-1 was designed to challenge deep learning
ARC-AGI-2 challenges reasoning systems – while still maintaining a 100% human solve rate
Early results show frontier AI systems scoring 10-20% on ARC-AGI-2, and we're launching it in March 2025
This gap demonstrates that we have not yet achieved AI systems that reach human-level general intelligence
nitter.app/fchollet/status/187017…
Does this mean the ARC-AGI benchmark has saturated?
Yes -- the v1 version of the benchmark is starting to saturate. There were already signs of this in the Kaggle competition this year -- an ensemble of all submissions would score 81%.
The competition next year will run on ARC-AGI-2, an updated version of the dataset that keeps the same format as v1, but features fewer tasks that can be easily brute-forced.
Early indications are that ARC-AGI-v2 will represent a complete reset of the state-of-the-art, and it will remain extremely difficult for o3. Meanwhile, a smart human or a small panel of average humans would still be able to score >95%.
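The post does not say how the ensemble figure was computed. One plausible reading, assumed here, is an "oracle" ensemble that counts a task as solved if at least one submission solved it. The sketch below shows that calculation on made-up data. The submission names and task IDs are hypothetical and are not the Kaggle results.

```python
def ensemble_score(solved_by_submission, total_tasks):
    """Fraction of tasks solved by at least one submission.
    ASSUMPTION: 'ensemble' means the union of solved tasks."""
    solved_by_any = set().union(*solved_by_submission.values())
    return len(solved_by_any) / total_tasks

# HYPOTHETICAL data: which of 10 task IDs each submission solved.
solved_by_submission = {
    "submission_a": {0, 1, 2, 3},
    "submission_b": {2, 3, 4, 5},
    "submission_c": {5, 6, 7},
}

best_single = max(len(s) for s in solved_by_submission.values()) / 10
combined = ensemble_score(solved_by_submission, total_tasks=10)
print(f"best single: {best_single:.0%}, ensemble: {combined:.0%}")
```

Under this reading, an ensemble can score far above any single entry whenever different submissions solve different tasks.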
ARC Prize will present at @OpenAI DevDay this October
We'll be sharing ARC-AGI-3 progress, including first results on human performance and how interactive evaluations open a new axis for measuring intelligence
OpenAI DevDay
Oct 6, 2025 in San Francisco
Our biggest one yet:
- 1500+ developers
- Livestreamed opening keynote
- Hands-on building with our latest models & tools
- More stages & more demos
devday.openai.com
📣 Competition Launch Alert! NeurIPS 2025 hosted by @GoogleDeepMind
🎯 To create Python programs that solve abstract reasoning tasks from the ARC-AGI benchmark
💰 $100,000 Prize Pool
⏰ Entry Deadline: October 23, 2025
kaggle.com/competitions/goog…
Are You Smarter Than A.I.?
An interactive article by @nytimes covers @arcprize and @fchollet
"Some experts predict that A.I. will surpass human intelligence within the next few years.
Play this puzzle to see how far the machines have to go."