Thanks for the support @AndrewYNg! Completely agree, faster token generation will become increasingly important as a greater proportion of output tokens are consumed by models, such as in multi-step agentic workflows, rather than being read by people.
Shoutout to the team that built artificialanalysis.ai/ . Really neat site that benchmarks the speed of different LLM API providers to help developers pick which models to use. This nicely complements the LMSYS Chatbot Arena, Hugging Face open LLM leaderboards and Stanford's HELM that focus more on the quality of the outputs. I hope benchmarks like this encourage more providers to work on fast token generation, which is critical for agentic workflows!
49
41
623
505,335
xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model. We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68. Full results breakdown below. This is the first time that @elonmusk's @xai has the lead the AI frontier. Grok 3 scored competitively with the latest models from OpenAI, Anthropic and Google - but Grok 4 is the first time that our Intelligence Index has shown xAI in first place. We tested Grok 4 via the xAI API. The version of Grok 4 deployed for use on X/Twitter may be different to the model available via API. Consumer application versions of LLMs typically have instructions and logic around the models that can change style and behavior. Grok 4 is a reasoning model, meaning it ‘thinks’ before answering. The xAI API does not share reasoning tokens generated by the model. Grok 4’s pricing is equivalent to Grok 3 at $3/$15 per 1M input/output tokens ($0.75 per 1M cached input tokens). The per-token pricing is identical to Claude 4 Sonnet, but more expensive than Gemini 2.5 Pro ($1.25/$10, for <200K input tokens) and o3 ($2/$8, after recent price decrease). We expect Grok 4 to be available via the xAI API, via the Grok chatbot on X, and potentially via Microsoft Azure AI Foundry (Grok 3 and Grok 3 mini are currently available on Azure). Key benchmarking results: ➤ Grok 4 leads in not only our Artificial Analysis Intelligence Index but also our Coding Index (LiveCodeBench & SciCode) and Math Index (AIME24 & MATH-500) ➤ All-time high score in GPQA Diamond of 88%, representing a leap from Gemini 2.5 Pro’s previous record of 84% ➤ All-time high score in Humanity’s Last Exam of 24%, beating Gemini 2.5 Pro’s previous all-time high score of 21%. Note that our benchmark suite uses the original HLE dataset (Jan '25) and runs the text-only subset with no tools ➤ Joint highest score for MMLU-Pro and AIME 2024 of 87% and 94% respectively ➤ Speed: 75 output tokens/s, slower than o3 (188 tokens/s), Gemini 2.5 Pro (142 tokens/s), Claude 4 Sonnet Thinking (85 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s) Other key information: ➤ 256k token context window. This is below Gemini 2.5 Pro’s context window of 1 million tokens, but ahead of Claude 4 Sonnet and Claude 4 Opus (200k tokens), o3 (200k tokens) and R1 0528 (128k tokens) ➤ Supports text and image input ➤ Supports function calling and structured outputs See below for further analysis 👇
445
1,648
7,971
3,425,413
DeepSeek takes the lead: DeepSeek V3-0324 is now the highest scoring non-reasoning model This is the first time an open weights model is the leading non-reasoning model, a milestone for open source. DeepSeek V3-0324 has jumped forward 7 points in Artificial Analysis Intelligence Index, now sitting ahead of all other non-reasoning models. It sits behind DeepSeek’s own R1 in Intelligence Index, as well as other reasoning models from OpenAI, Anthropic and Alibaba, but this does not take away from the impressiveness of this accomplishment. Non-reasoning models answer immediately without taking time to ‘think’, making them useful in latency-sensitive use cases. Three months ago, DeepSeek released V3 and we we wrote that there is a new leader in open source AI - noting that V3 came close to leading proprietary models from Anthropic and Google but did not surpass them. Today, DeepSeek are not just releasing the best open source model - DeepSeek are now driving the frontier of non-reasoning open weights models, eclipsing all proprietary non-reasoning models, including Gemini 2.0 Pro, Claude 3.7 Sonnet and Llama 3.3 70B. This release is arguably even more impressive than R1 - and potentially indicates that R2 is going to be another significant leap forward. Most other details are identical to the December 2024 version of DeepSeek V3, including: ➤ Context window: 128k (limited to 64k on DeepSeek’s first-party API) ➤ Total parameters: 671B (requires >700GB of GPU memory to run in native FP8 precision - still not something you can run at home!) ➤ Active parameters: 37B ➤ Native FP8 precision ➤Text only - no multimodal inputs or outputs ➤ MIT License
65
613
3,377
486,757
DeepSeek’s R1 leaps over xAI, Meta and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently across all leading models. That’s the same magnitude of increase as the difference between OpenAI’s o1 and o3 (62 to 70). This positions DeepSeek R1 as higher intelligence than xAI’s Grok 3 mini (high), NVIDIA’s Llama Nemotron Ultra, Meta’s Llama 4 Maverick, Alibaba’s Qwen 3 253 and equal to Google’s Gemini 2.5 Pro. Breakdown of the model’s improvement: 🧠 Intelligence increases across the board: Biggest jumps seen in AIME 2024 (Competition Math, +21 points), LiveCodeBench (Code generation, +15 points), GPQA Diamond (Scientific Reasoning, +10 points) and Humanity’s Last Exam (Reasoning & Knowledge, +6 points) 🏠 No change to architecture: R1-0528 is a post-training update with no change to the V3/R1 architecture - it remains a large 671B model with 37B active parameters 🧑‍💻 Significant leap in coding skills: R1 is now matching Gemini 2.5 Pro in the Artificial Analysis Coding Index and is behind only o4-mini (high) and o3 🗯️ Increased token usage: R1-0528 used 99 million tokens to complete the evals in Artificial Analysis Intelligence Index, 40% more than the original R1’s 71 million tokens - ie. the new R1 thinks for longer than the original R1. This is still not the highest token usage number we have seen: Gemini 2.5 Pro is using 30% more tokens than R1-0528 Takeaways for AI: 👐 The gap between open and closed models is smaller than ever: open weights models have continued to maintain intelligence gains in-line with proprietary models. DeepSeek’s R1 release in January was the first time an open-weights model achieved the #2 position and DeepSeek’s R1 update today brings it back to the same position 🇨🇳 China remains neck and neck with the US: models from China-based AI Labs have all but completely caught up to their US counterparts, this release continues the emerging trend. As of today, DeepSeek leads US based AI labs including Anthropic and Meta in Artificial Analysis Intelligence Index 🔄 Improvements driven by reinforcement learning: DeepSeek has shown substantial intelligence improvements with the same architecture and pre-train as their original DeepSeek R1 release. This highlights the continually increasing importance of post-training, particularly for reasoning models trained with reinforcement learning (RL) techniques. OpenAI disclosed a 10x scaling of RL compute between o1 and o3 - DeepSeek have just demonstrated that so far, they can keep up with OpenAI’s RL compute scaling. Scaling RL demands less compute than scaling pre-training and offers an efficient way of achieving intelligence gains, supporting AI Labs with fewer GPUs See further analysis below 👇
65
449
2,666
623,184
xAI has released Grok 4 Fast - breaking through our intelligence vs cost frontier by achieving Gemini 2.5 Pro level intelligence at a ~25X cheaper cost Intelligence: @xai shared with us pre-release access to Grok 4 Fast. In reasoning mode, the model scores an impressive 60 on our Artificial Analysis Intelligence Index, in line with Gemini 2.5 Pro and Claude 4.1 Opus, while sitting as expected below the prior Grok 4 release and GPT-5 (high). Grok 4 Fast performed especially well on coding evaluations, taking the number one spot on our leaderboard for LiveCodeBench, even outperforming its larger sibling Grok 4. Cost: xAI is offering Grok 4 Fast at a very competitive price of only $0.2/1M Input Tokens and $0.5/1M output tokens. The model is also quite token efficient compared to other reasoning models, taking 61M tokens to complete our intelligence index, significantly less than Gemini 2.5 Pro’s 93M and Grok 4’s 120M. This competitive pricing and efficiency translates to the cost of running Artificial Analysis Intelligence Index being ~25X lower than Gemini 2.5 Pro and ~23X lower than GPT-5 (reasoning mode high). Speed: When benchmarking the pre-release API, xAI’s endpoint for the model was very fast, achieving 344 output tokens per second - ~2.5X faster than OpenAI’s GPT-5 API. This also allows for End to End Latency results that are faster than most non-reasoning models for many workloads. Speeds may drop as traffic on the API increases - keep an eye on our live performance benchmarking to see how this evolves. Congratulations to the @xai team and @elonmusk on this new release! See below for more details and in-depth analysis 👇
86
283
2,337
731,585
🇰🇷 South Korean AI Lab Upstage AI has just launched their first reasoning model - Solar Pro 2! The 31B parameter model demonstrates impressive performance for its size, with intelligence approaching Claude 4 Sonnet in 'Thinking' mode and is priced very competitively Key details: ➤ Hybrid reasoning: The model offers optionality between 'reasoning' mode and standard non-reasoning mode ➤ Korean-language ability & Sovereign AI: Based in Korea, Upstage announced superior performance in Korean language evaluations. This release aligns with countries' interests to develop sovereign AI capabilities ➤ Pricing: Competitively priced at $0.5/1M tokens (input & output), significantly cheaper than comparable models including Claude 4 Sonnet Thinking ($3/$15/M input/output tokens) and Magistral Small ($0.5/$1.5/M input/output tokens) ➤ Proprietary: @upstageai has not released the model weights, though they have open-sourced previous Solar Pro models. Whether they will release Solar Pro 2's weights remains unclear as it wasn't mentioned in their announcement
184
355
1,878
7,468,629
MoonshotAI has released Kimi K2 Thinking, a new reasoning variant of Kimi K2 that achieves #1 in the Tau2 Bench Telecom agentic benchmark and is potentially the new leading open weights model Kimi K2 Thinking is one of the largest open weights models ever, at 1T total parameters with 32B active. K2 Thinking is the first reasoning model release within @Kimi_Moonshot's Kimi K2 model family, following non-reasoning Kimi K2 Instruct models released previously in July and September 2025. Key takeaways: ➤ Strong performance on agentic tasks: Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct and it appears this new Thinking variant makes substantial gains ➤ Reasoning variant of Kimi K2 Instruct: The model, as per its naming, is a reasoning variant of Kimi K2 Instruct. The model has the same architecture and same number of parameters (though different precision) as Kimi K2 Instruct and like K2 Instruct only supports text as an input (and output) modality ➤ 1T parameters but INT4 instead of FP8: Unlike Moonshot’s prior Kimi K2 Instruct releases that used FP8 precision, this model has been released natively in INT4 precision. Moonshot used quantization aware training in the post-training phase to achieve this. The impact of this is that K2 Thinking is only ~594GB, compared to just over 1TB for K2 Instruct and K2 Instruct 0905 - which translates into efficiency gains for inference and training. A potential reason for INT4 is that pre-Blackwell NVIDIA GPUs do not have support for FP4, making INT4 more suitable for achieving efficiency gains on earlier hardware. Our full set of Artificial Analysis Intelligence Index benchmarks are in progress and we will provide an update as soon as they are complete.
79
276
1,916
1,433,962
DeepSeek’s first reasoning model has arrived - over 25x cheaper than OpenAI’s o1 Highlights from our initial benchmarking of DeepSeek R1: ➤ Trades blows with OpenAI’s o1 across our eval suite to score the second highest in Artificial Analysis Quality Index ever ➤ Priced on DeepSeek’s own API at just $0.55/$2.19 input/output - significantly cheaper than not just o1 but o1-mini ➤ Served by DeepSeek at 71 output tokens/s (comparable to DeepSeek V3) ➤ Reasoning tokens are wrapped in <thinking> tags, allowing developers to easily decide whether to show them to users Stay tuned for more detail coming next week - big upgrades to the Artificial Analysis eval suite launching soon.
39
261
1,565
1,459,043
GPT-4o Mini, announced today, is very impressive for how cheap it is being offered 👀 With a MMLU score of 82% (reported by TechCrunch), it surpasses the quality of other smaller models including Gemini 1.5 Flash (79%) and Claude 3 Haiku (75%). What is particularly exciting is that it is also to be offered at a cheaper price than these models. The reported price is $0.15/1M input tokens and $0.6/1M output tokens. With such a cheap price for input tokens and its large 128k context window, it will be very compelling for long context use-cases (including large document RAG). @OpenAI have clearly made a very high quality model relative to its size (pricing can indicate size due to the direct relationship to compute cost). The model seems a worthy successor to GPT3.5 Turbo as OpenAI's smallest model and the model used for ChatGPT's free version.
29
158
1,035
1,086,190
Independent benchmarks of OpenAI’s gpt-oss models: gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits OpenAI has released two versions of gpt-oss: ➤ gpt-oss-120b (116.8B total parameters, 5.1B active parameters): Intelligence Index score of 58 ➤ gpt-oss-20b (20.9B total parameters, 3.6B active parameters): Intelligence Index score of 48 Size & deployment: OpenAI has released both models in MXFP4 precision: gpt-oss-120b comes in at just 60.8GB and gpt-oss-20b just 12.8GB. This means that the 120B can be run in its native precision on a single NVIDIA H100, and the 20B can be run easily on a consumer GPU or laptop with >16GB of RAM. Additionally, the relatively small proportion of active parameters will contribute to their efficiency and speed for inference: the 5.1B active parameters of the 120B model can be contrasted with Llama 4 Scout’s 109B total parameters and 17B active (a lot less sparse). This makes it possible to get dozens of output tokens/s for the 20B on recent MacBooks. Intelligence: Both models score extremely well for their size and sparsity. We’re seeing the 120B beat o3-mini but come in behind o4-mini and o3. The 120B is the most intelligent model that can be run on a single H100 and the 20B is the most intelligent model that can be run on a consumer GPU. Both models appear to place similiarly across most of our evals, indicating no particular areas of weakness. Comparison to other open weights models: While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models. DeepSeek R1 has 671B total parameters and 37B active parameters, and is released natively in FP8 precision, making its total file size (and memory requirements) over 10x larger than gpt-oss-120b. Both gpt-oss-120b and 20b are text-only models (similar to competing models from DeepSeek, Alibaba and others). Architecture: The MoE architecture at appears fairly standard. The MoE router selects the top 4 experts for each token generation. The 120B has 36 layers and 20B has 24 layers. Each layer has 64 query heads, uses Grouped Query Attention with 8 KV heads. Rotary embeddings and YaRN are used to extend context window to 128k. The 120B model activates 4.4% of total parameters per forward pass, whereas the 20B model activates 17.2% of total parameters. This may indicate that OpenAI’s perspective is that a higher degree is of sparsity is optimal for larger models. It has been widely speculated that most top models from frontier labs have been sparse MoEs for most releases since GPT-4. API Providers: A number of inference providers have been quick to launch endpoints. We are currently benchmarking @GroqInc, @cerebras, @FireworksAI_HQ and @togethercompute on Artificial Analysis and will add more providers as they launch endpoints. Pricing: We’re tracking median pricing across API providers of $0.15/$0.69 per million input/output tokens for the 120B and $0.08/$0.35 for the 20B. These prices put the 120B close to 10x cheaper than OpenAI’s proprietary APIs for o4-mini ($1.1/$4.4) and o3 ($2/$8). License: Apache 2.0 license - very permissive! See below for further analysis:
43
154
1,000
173,171
Today’s GPT-4o update is actually big - it leapfrogs Claude 3.7 Sonnet (non-reasoning) and Gemini 2.0 Flash in our Intelligence Index and is now the leading non-reasoning model for coding This makes GPT-4o the second highest scoring non-reasoning model (excludes o3-mini, Gemini 2.5 Pro, etc), coming in just behind DeepSeek’s V3 0324 release earlier this week. Key benchmarking results: ➤ Significant jump in the Artificial Analysis Intelligence Index from 41 to 50, putting GPT-4o (March 2025) ahead of Claude 3.7 Sonnet ➤ Now the the leading non-reasoning model for coding: 🥇#1 in the Artificial Analysis Coding Index and in LiveCodeBench, surpassing DeepSeek V3 (March 2025) and Claude 3.7 Sonnet @OpenAI has committed an all-new AI model naming sin of simply refusing to name the model at all, so we will be referring to it as GPT-4o (March 2025). This update has also been released in a fairly confusing way - the March 2025 version of GPT-4o is currently available: ➤ In ChatGPT, when users select GPT-4o in the model selector ➤ Via API on the chatgpt-4o-latest endpoint - a non-dated endpoint that OpenAI described at launch as intended for research use only, with developers encouraged to use the dated snapshot versions of GPT-4o for most API use cases As of today, this means that the chatgpt-4o-latest endpoint is serving a significantly better model than the proper API versions GPT-4o (ie. the August 2024 and November 2024 snapshots). We recommend some caution for developers considering moving workloads to the chatgpt-4o-latest endpoint given OpenAI’s previous guidance, and note that OpenAI will likely release a dated API snapshot soon. We also note that OpenAI prices the chatgpt-4o-latest endpoint at $5/$15 per million input/output tokens, whereas the API snapshots are priced at $2.5/$10. See below for further analysis 👇
29
94
960
132,833
DeepSeek's R1 update consolidates the lead of 🇨🇳 Chinese AI Labs in open weights intelligence
17
102
917
74,201
Kimi K2 Thinking is the new leading open weights model: it demonstrates particular strength in agentic contexts but is very verbose, generating the most tokens of any model in completing our Intelligence Index evals @Kimi_Moonshot's Kimi K2 Thinking achieves a 67 in the Artificial Analysis Intelligence Index. This positions it clearly above all other open weights models, including the recently released MiniMax-M2 and DeepSeek-V3.2-Exp, and second only to GPT-5 amongst proprietary models. It used the highest number of tokens ever across the evals in Artificial Analysis Intelligence Index (140M), but with MoonShot’s official API pricing of $0.6/$2.5 per million input/output tokens (for the base endpoint), overall Cost to Run Artificial Analysis Intelligence Index comes in cheaper than leading frontier models at $356. Moonshot also offers a faster turbo endpoint priced at $1.15/$8 (driving a Cost to Run Artificial Analysis Intelligence Index result of $1172 for the turbo endpoint - second only to Grok 4 as the most expensive model). The base endpoint is very slow at ~8 output tokens/s while the turbo is somewhat faster at ~50 output tokens/s. The model is one of the largest open weights models ever at 1T total parameters with 32B active. K2 Thinking is the first reasoning model release in Moonshot AI’s Kimi K2 model family, following non-reasoning Kimi K2 Instruct models released previously in July and September 2025. Moonshot AI only refers to post-training in their announcement. This release highlights the continued trend of post-training & specifically RL driving gains in performance for reasoning models and in long horizon tasks involving tool calling. Key takeaways: ➤ Details: text only (no image input), 256K context window, natively released in INT4 precision, 1T total with 32B active (~594GB) ➤ New leader in open weights intelligence: Kimi K2 Thinking achieves a 67 in the Artificial Analysis Intelligence Index. This is the highest open weights score yet and significantly higher than gpt-oss-120b (61), MiniMax-M2 (61), Qwen 235B A22B 2507 (57) and DeepSeek-V3.2-Exp (57). This release continues the trend of open weights models closely following proprietary models in intelligence achieved ➤ China takes back the open weights frontier: Releases from China based AI labs have led in open weights intelligence offered for most of the past year. OpenAI’s gpt-oss-120b release in August 2025 briefly took back the leadership position for the US. Moonshot AI’s K2 Thinking takes back the leading open weights model mantle for China based AI labs ➤ Strong agentic performance: Kimi K2 Thinking demonstrates particular strength in agentic contexts, as showcased by its #2 position in the Artificial Analysis Agentic Index - where it is second only to GPT-5. This is mostly driven by K2 Thinking achieving 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct and it appears this new Thinking variant makes substantial gains ➤ Top open weights coding model, but behind proprietary models: K2 Thinking does not score a win in any of our coding evals - it lands in 6th place in Terminal-Bench Hard, 7th place in SciCode and 2nd place in LiveCodeBench. Compared to open weights models, it is in first or first equal for each of these evals - and therefore comes in ahead of previous open weights leader DeepSeek V3.2 in our Artificial Analysis Coding Index ➤ Biggest leap for open weights in Humanity’s Last Exam: K2 Thinking’s strongest results include Humanity’s Last Exam, where we measured a score of 22.3% (no tools) - an all time high for open weights models and coming in only behind GPT-5 and Grok 4 ➤ Verbosity: Kimi K2 Thinking is very verbose - taking 140M total tokens are used to run our Intelligence Index evaluations, ~2.5x the number of tokens used by DeepSeek V3.2 and ~2x compared to GPT-5. This high verbosity drives both higher cost and higher latency, compared to less verbose models. On Mooshot’s base endpoint, K2 Thinking is 2.5x cheaper than GPT-5 (high) but 9x more expensive than DeepSeek V3.2 (Cost to Run Artificial Analysis Intelligence Index) ➤ Reasoning variant of Kimi K2 Instruct: The model, as per its naming, is a reasoning variant of Kimi K2 Instruct. The model has the same architecture and same number of parameters (though different precision) as Kimi K2 Instruct. It continues to only support text inputs and outputs ➤ 1T parameters but INT4 instead of FP8: Unlike Moonshot’s prior Kimi K2 Instruct releases that used FP8 precision, this model has been released natively in INT4 precision. Moonshot used quantization aware training in the post-training phase to achieve this. The impact of this is that K2 Thinking is only ~594GB, compared to just over 1TB for K2 Instruct and K2 Instruct 0905 - which translates into efficiency gains for inference and training. A potential reason for INT4 is that pre-Blackwell NVIDIA GPUs do not have support for FP4, making INT4 more suitable for achieving efficiency gains on earlier hardware ➤ Access: The model is available on @huggingface with a modified MIT license. @Kimi_Moonshot is serving an official API (available globally) and third party inference providers are already launching endpoints - including @baseten, @FireworksAI_HQ, @novita_labs, @parasail_io
27
123
941
112,598
Wait - is the new GPT-4o a smaller and less intelligent model? We have completed running our independent evals on OpenAI’s GPT-4o release yesterday and are consistently measuring materially lower eval scores than the August release of GPT-4o. GPT-4o (Nov) vs GPT-4o (Aug): ➤ Artificial Analysis Quality Index decrease from 77 to 71 (now equal to GPT-4o mini) ➤ GPQA Diamond decrease from 51% to 39%, MATH decrease from 78% to 69% ➤ Speed increase from ~80 output tokens/s to ~180 tokens/s ➤ No pricing change Our Output Speed benchmarks are currently measuring ~180 output tokens/s for the Nov 20th model, while the August model shows ~80 tokens/s. We have generally observed significantly faster speeds on launch day for OpenAI models (likely due to OpenAI provisioning capacity ahead of adoption), but previously have not seen a 2x speed difference. Based on this data, we conclude that it is likely that OpenAI’s Nov 20th GPT-4o model is a smaller model than the August release. Given that OpenAI has not cut prices for the Nov 20th version, we recommend that developers do not shift workloads away from the August version without careful testing.
47
105
871
189,165
Releasing our Q2 2025 State of AI - China Report 🇨🇳: Chinese AI labs have achieved close to parity with US labs, led by DeepSeek's leap to world #2 in intelligence and backed by a deep ecosystem of 10+ players Key findings from our analysis: 🇨🇳 The Chinese AI Ecosystem has depth and has demonstrated consistent innovation with DeepSeek and Alibaba now releasing models within weeks of global counterparts, with comparable or superior performance across benchmarks. 10+ Chinese AI labs have models with impressive intelligence scores, including DeepSeek, Alibaba, ByteDance, Tencent, Moonshot, Zhipu, Stepfun, Xiaomi, Baichuan, MiniMax and 01 AI 👐 An open weights approach has supported international adoption: Several Chinese AI labs have embraced strategies of releasing open weights models, allowing broad accessibility and supporting adoption by developers worldwide 🏆 DeepSeek achieves impressive technical breakthroughs, with DeepSeek R1-0528 achieving frontier AI performance. This places it amongst the world's highest-performing models alongside Google's Gemini 2.5 Pro and above models from xAI, Meta, and Anthropic A highlights version of the report is freely available on the Artificial Analysis website for a limited time. Below we share key excerpts:
19
173
859
158,420
Google’s Gemini 2.5 Native Audio Thinking is the new leading Speech to Speech model per our Artificial Analysis Big Bench Audio benchmark The new model achieves a score of 92% on Big Bench Audio, the highest result recorded by Artificial Analysis to date. This not only places it ahead of all previously tested native Speech to Speech systems, but also above a GPT-4o pipeline approach (Whisper transcription → GPT-4o text reasoning → speech generation). Benchmark context: Big Bench Audio is the first dedicated dataset for evaluating reasoning performance of speech models. Big Bench Audio comprises 1,000 audio questions adapted from the Big Bench Hard text test set, chosen for its rigorous testing of advanced reasoning, translated into the audio domain. Performance: ➤ Reasoning: Achieves 92% on Big Bench Audio, setting a new state-of-the-art for native Speech to Speech reasoning ➤ Latency: At an average time to first token of 3.87 seconds, the new model is slower than leading OpenAI models including GPT Realtime (0.98 seconds), due to the thinking component. The non-thinking equivalent still leads on latency at 0.63 seconds Model details: ➤ Processes audio, video, and text inputs directly, generating both text and natural speech outputs ➤ Reasons over spoken input without transcription ➤ Supports function calling, search grounding, and thinking budgets ➤ 128k input and 8k output token limits with a knowledge cut-off of January 2025
30
105
824
148,658
Google is firing on all cylinders across AI - Gemini 2.5 Pro is equal #2 in intelligence, Veo 3 and Imagen 4 are amongst the leaders in media generation, and with TPUs they're the only vertically integrated player 🧠 Google is now equal #2 Artificial Analysis Intelligence Index with the recent release of the Gemini 2.5 Pro (June 2025) model, rivaling others including OpenAI, DeepSeek and Grok 📽️ Google Veo 3 now ranks second in the Artificial Analysis Video Arena Leaderboard only behind ByteDance’s new Seedance 1.0 model 🖼️ Google Imagen 4 now occupies 2 out of the top 5 positions on the Artificial Analysis Image Arena Leaderboard 👨‍🏭 Google has a full stack AI offering with offerings across the application layer, models, cloud inference and hardware TPUs)
26
120
867
434,641
We've launched benchmarks of the accuracy of providers offering APIs for gpt-oss-120b We compare providers by running GPQA Diamond 16 times, AIME25 32 times, and IFBench 8 times. We report the median score across these runs alongside minimum, 25th percentile, 75th percentile and maximum results. The number of repeats we run has been calibrated based on our confidence interval calculations. This is the first version of our endpoint accuracy testing. We plan to iterate over time to ensure it provides the fairest possible basis for comparing providers’ accuracy. Link to benchmarks below 👇
Inference providers have worked hard in the last week to make gpt-oss work well on their platforms. We just released a guide to help you verify API-compatibility & run your own evals. Additionally, @ArtificialAnlys started releasing per-provider evals for AIME, GPQA & IFBench 🧵
58
78
855
466,309
Google’s Gemini 2.5 Flash costs 150x more than Gemini 2.0 Flash to run Artificial Analysis Intelligence Index The increase is driven by: ➤ 9x more expensive output tokens - $3.5 per million with reasoning on ($0.6 with reasoning off) vs $0.4 for Gemini 2.0 Flash ➤ 17x higher token usage across our evals due to adding reasoning - the greatest volume of tokens used in reasoning that we have observed for any model to date This doesn’t mean Gemini 2.5 Flash is not a compelling value proposition - its 12 point bump in Artificial Analysis Intelligence Index makes it suitable for a range of use cases that may not perform sufficiently well on Gemini 2.0 Flash. With per-token pricing still slightly below OpenAI’s o4-mini, Gemini 2.5 Flash may still be a cost-effective option for certain use cases. It does mean that Gemini 2.5 Flash with Reasoning may not be a clear upgrade for everyone - for many use cases, developers may want to stay with 2.0 Flash or use 2.5 Flash with reasoning off.
33
83
842
445,802
Alibaba’s updated Qwen3 Max is the most intelligent non-reasoning model, placing ahead of Kimi K2 0905! Key takeaways: ➤ Intelligence uplift: Intelligence increased by +6 points to 55 in our Artificial Analysis Intelligence Index. Qwen3 Max is currently the most intelligent non-reasoning model. The previous most intelligent non-reasoning model, Kimi K2 0905, scored 50 on the index ➤ GA: Alibaba’s upgraded Qwen3 Max model is now in GA, the prior version was in Preview ➤ Broader capability improvements: Improvements across agentic tool use (𝜏²-Bench Telecom scores increased from 33% to 74%), coding (LiveCodeBench from 65% to 77%), and long context reasoning (AA-LCR from 40% to 47%). ➤ Higher token usage: Running the Artificial Analysis Intelligence Index required ~21M output tokens, ~7M more than Qwen3 Max (Preview). This continues the trend of non-reasoning models becoming more verbose, though there remains a distinction with Qwen3 Max still significantly below reasoning models Key model information: ➤🧠 Reasoning: Qwen3 Max is a non-reasoning model. Alibaba has indicated that a reasoning version, Qwen3-Max-Thinking, is under active training. ➤🔒 Proprietary: Like the Preview version, Qwen3 Max is proprietary, since Alibaba has not released the weights. ➤⚙️ Context window: The model supports a 256k-token context window. ➤📷 Multimodality: Qwen3 Max is text-only, with no multimodal inputs or outputs. ➤💲Pricing: The model is priced at $1.2/$6 per 1M input/output tokens Qwen3 Max is currently available in Qwen Chat and via Alibaba Cloud.
23
107
841
254,588
Less than 500 days since releasing their first model in March 2024, xAI has released the leading AI model
18
74
832
62,159
DeepSeek launches V3.1, unifying V3 and R1 into a hybrid reasoning model with an incremental increase in intelligence Incremental intelligence increase: Initial benchmarking results for DeepSeek V3.1 show Artificial Analysis Intelligence Index of 60 in reasoning mode, up from the R1’s score of 59. In non-reasoning mode, V3.1 achieves a score of 49, a greater increase from the earlier V3 0324 score of 44. This leaves V3.1 (reasoning) behind Alibaba’s latest Qwen3 235B 2507 (reasoning) - DeepSeek has not taken back the lead. Hybrid reasoning: @deepseek_ai has moved to a hybrid reasoning model for the first time - supporting both reasoning and non-reasoning modes. DeepSeek’s move to a unified hybrid reasoning model mimics the approach taken by OpenAI, Anthropic and Google. It is interesting to note, however, that Alibaba recently abandoned their the hybrid approach they favored for Qwen3 with their separate releases of Qwen3 2507 reasoning and instruct models. Function calling / tool use: While DeepSeek claims improved function calling for the model, DeepSeek V3.1 does not support function calling when in reasoning mode. This is likely to substantially limit its ability to support agentic workflows with intelligence requirements, including in coding agents. Token usage: DeepSeek V3.1 scores incrementally higher in reasoning mode than DeepSeek R1, and uses slightly fewer tokens across the evals we use for Artificial Analysis Intelligence Index. In non-reasoning mode, it uses slightly more tokens than V3 0324 - but still several times fewer than in its own reasoning mode. API: DeepSeek’s first party API now serves the new DeepSeek V3.1 model on both their chat and reasoning endpoints - simply changing whether the end thinking </think> token is provided to the model in the chat template to control whether the model will reason. Architecture: DeepSeek V3.1 is architecturally identical to prior V3 and R1 models, with 671B total parameters and 37B active parameters. Implications: We would advise caution in making any assumptions about what this release implies about DeepSeek’s progress toward a future model referred to in rumors as V4 or R2. We note that DeepSeek previously released the final model built on their V2 architecture on December 10 2024, just two weeks before releasing V3.
28
106
779
73,559
Google’s new Gemini 2.5 Pro Experimental takes the #1 position across a range of our evaluations that we have run independently Gemini 2.5 Pro is a reasoning model, it ‘thinks’ before answering questions. Google has released it as an experimental API in AI Studio only, and has not yet disclosed pricing. If Google prices Gemini 2.5 Pro at a similar level to Gemini 1.5 Pro ($1.25/$5 per million input/output tokens), Gemini 2.5 Pro will be significantly cheaper than leading models from OpenAI and Anthropic ($15/$60 for o1 and $3/$15 for Claude 3.7 Sonnet). Key benchmarking results: 🥇All time high scores in MMLU-Pro and GPQA Diamond of 86% and 83% respectively 🥇All time high and significant leap in Humanity’s Last Exam, scoring 17.7% - a leap from o3-mini-high’s previous 12.3% record 🥇All time high score in AIME 2024 of 88% 🏁 Speed: 195 output tokens/s, much faster than Gemini 1.5 Pro’s 92 tokens/s and nearly as fast as Gemini 2.0 Flash’s 253 tokens/s Gemini 2.5 Pro continues to support key features that the Gemini family is known for, including: ➤ 1 million token context window (2 million token context window, as supported by Gemini 1.5 Pro, coming soon) ➤ Multimodal inputs: image, video and audio (text output only) Additional benchmark results are provided below. Stay tuned for the Artificial Analysis Intelligence Index once we finish running all 7 evaluations. 👇
30
185
761
101,951
MiniMax’s M2 achieves a new all-time-high Intelligence Index score for an open weights model and offers impressive efficiency with only 10B active parameters (200B total) Key takeaways: ➤ Efficiency to serve at scale: MiniMax-M2 has 200B total parameters and is very sparse with only 10B active parameters per forward pass. Such few active parameters allow the model to be served efficiently at scale (DeepSeek V3.2 has 671B total and 37B active, Qwen3 has 235B total and 22B active). The model can also easily fit on 4xH100s at FP8 precision ➤ Strengths focus on agentic use-cases: The model’s strengths include tool use and instruction following (as shown by Tau2 Bench and IFBench). As such, while M2 likely excels at agentic use cases it may underperform other open weights leaders such as DeepSeek V3.2 and Qwen3 235B at some generalist tasks. This is in line with a number of recent open weights model releases from Chinese AI labs which focus on agentic capabilities, likely pointing to a heavy post-training emphasis on RL. Similar to most other leading open weights models, M2 is a text only model - Alibaba’s recent Qwen3 VL releases remain the leading open weights multimodal models ➤ Cost & token usage: MiniMax’s API is offering the model at a very competitive per token price of $0.3/$1.2 per 1M input/output tokens. However, the model is very verbose, using 120M token to complete our Intelligence Index evaluations - equal highest along with Grok 4. As such, while it is a low priced model this is moderated by high token usage ➤ Continued leadership in open source by Chinese AI labs: MiniMax’s release continues the leadership of Chinese AI labs in open source that DeepSeek kicked off in late 2024, and which has been continued by continued DeepSeek releases, Alibaba, Z AI and Moonshot AI See below for further analysis and a link to the model on Artificial Analysis
22
101
780
430,483
The cost of intelligence continues to fall rapidly after new frontiers are reached: Grok 4 Fast brings the cost of the Intelligence Index >60 category down to just $0.2/$0.5 per million input/output tokens We track the lowest priced model for different tiers of intelligence to measure the decline in the cost of intelligence over time. The cost of each new level of intelligence appears to be falling much faster in 2025 than 2023-2024. OpenAI’s o3 was the earliest model to be in our >60 category when it launched in April. At launch, o3’s pricing was $10/$40 - but OpenAI later cut its price to $2/$8 in June, which is what is shown on the chart below. Comparing blended prices (3:1 input to output tokens, as shown on the chart below), Grok 4 Fast is priced 64 times lower than o3’s release price and 12 times lower than o3’s current blended price (3:1 input to output tokens). Prices for reference: ➤ Grok 4 Fast: $0.2/$0.5 per million input/output tokens ➤ o3 original launch price: $10/$40 ➤ o3 updated June price: $2/$8 Please note chart below is based on token pricing - not our Cost to Run Artificial Analysis Intelligence Index metric.
21
111
764
71,918
Anthropic’s new Claude 4.5 Sonnet is now the #4 most intelligent model, beats 4.1 Opus, and places Anthropic in the top 3 in the race for frontier intelligence Claude 4.5 Sonnet offers a clear upgrade for Claude 4.1 Opus and Claude 4 Sonnet users, with greater intelligence at the same price and token efficiency as Claude 4 Sonnet. Claude 4.5 Sonnet’s token efficiency, even in its maximum reasoning mode, makes it cheaper to use for many tasks than GPT-5, Grok 4 or Gemini 2.5 Pro. Key benchmarking takeaways: ➤🧠 Anthropic’s most intelligent model: In reasoning mode, Claude 4.5 Sonnet scores 61 on the Artificial Analysis Intelligence Index. This is a jump of +4 points from Claude 4 Sonnet (Thinking) which was released in May 2025, and +2 points from Claude 4.1 Opus (Thinking). Claude 4.5 Sonnet (Thinking) now places ahead of Gemini 2.5 Pro (60) and Grok 4 Fast (60), but behind GPT-5 (high, 68) and Grok 4 (65). ➤📈 Largest increases: we see the biggest uplifts in individual evaluation scores in 𝜏²-Bench Telecom (+13 p.p.), Humanity's Last Exam (+14 p.p.) and Humanity's Last Exam (+7 p.p.). Claude 4.5 Sonnet achieves Anthropic’s best score yet TerminalBench-Hard, but only gains +1 p.p compared to Claude 4.1 Opus and remains behind Grok 4 and GPT-5 Codex (High. Interestingly, Claude 4.5 Sonnet does not achieve the highest score yet in any individual evaluation across the 10 evaluations in Artificial Analysis Intelligence Index. ➤⚡ Non-reasoning performance: In non-reasoning mode, Claude 4.5 Sonnet jumped from 44 to 49 on the Artificial Analysis Intelligence Index. We see the largest improvement in Agentic Tool Use (increase in 𝜏²-Bench Telecom score from 52% to 71%) with smaller improvements across other evals. ➤⚙️ Token efficiency: Anthropic have increased Claude’s evaluation scores without increasing output token usage and the Claude models continue to be more token efficient than all other reasoning models. For Claude 4.5 Sonnet (Thinking) - evaluated with a maximum reasoning budget of 64k tokens - we see a slight decrease in token usage to run Artificial Analysis Intelligence Index from 43M to 42M, compared to Claude 4 Sonnet. This is different to other model upgrades we have seen where increase in intelligence is often correlated with increase in output token usage ➤💲 Pricing: Claude 4.5 Sonnet is priced the same as Claude 4 Sonnet at $3/$15 per 1M input/output tokens. This represents a more compelling option, compared to Claude 4.1 Opus, offering higher intelligence in thinking mode at 1/5th the blended price (3:1 input to output token ratio) Key model details: ➤📏 Context window: 200K tokens ➤🪙 Max output tokens: 64K tokens ➤🌐 Availability: Claude 4.5 Sonnet is available via Anthropic‘s API, Google Vertex and Amazon Bedrock. Claude 4.5 Sonnet is also available via Claude, and Claude Code (v2 of which has also been released today)
37
75
774
141,204
Google’s updated Gemini 2.5 Pro now leads the AI intelligence frontier, matching OpenAI's o3 in our independent benchmarks Google’s May update of Gemini 2.5 Pro regressed in some performance evaluations compared to the initial March release. This June update not only fixes previous regressions but delivers significant improvements across our independent benchmarks. Key highlights: ➤ #1 Position across evals: Gemini 2.5 Pro (June) now leads across a range of evals including MMLU-Pro (86%), GPQA Diamond (84%), Humanity's Last Exam (21%) ➤ Leading coding performance: 80% on LiveCodeBench, matching o4-mini (high) ➤ Variable reasoning budget: Users can vary the maximum amount of output tokens that can be used for ‘thinking’, allocating more tokens as needed for harder tasks Gemini 2.5 Pro retains Google's signature capabilities: ➤ Ultra-long 1 million token context window ➤ Multimodal inputs: image, video, and audio support (text output only)
16
84
752
65,847
Llama 4 Intelligence Index Update: We have now replicated Meta’s claimed values for MMLU Pro and GPQA Diamond, pushing our Intelligence Index scores for both Scout and Maverick higher Key update details: ➤ We noted in our first post 48 hours ago that we noticed discrepancies between our measured results and Meta’s claimed scores for our multi-choice eval datasets (MMLU Pro and GPQA Diamond) ➤ After further experiments and and close review, we have decided that in accordance with our published principle against unfairly penalizing models where they get the content of questions correct but format answers differently, we will allow Llama 4’s answer style of ‘The best answer is A’ as legitimate answer for our multi-choice evals ➤ This leads to a jump in score for both Scout and Maverick (largest for Scout) in 2/7 of the evals that make up Artificial Analysis Intelligence Index, and therefore a jump in their Intelligence Index scores ➤ Scout’s Intelligence Index has moved from 36 to 43, and Maverick’s Intelligence Index has moved from 49 to 50. Overall, we continue to conclude that both Scout and Maverick are very impressive models and a significant contribution to the open weights AI ecosystem. While DeepSeek V3 0324 maintains a small lead over Maverick, we continue to note that Maverick has ~half the active parameters (17B vs 37B), and ~60% of the total parameters (402B vs 671B), while also supporting image inputs. All our tests have been performed on the Hugging Face release version of the Llama 4 weights for both Scout and Maverick, including testing via a range of third party cloud providers. None of our eval results are based on the experimental chat-tuned model provided to LMArena (Llama-4-Maverick-03-26-Experimental). We can also share that we have observed third party cloud APIs generally stabilizing over the last 48 hours. We will soon release endpoint-level comparison data to allow developers to understand whether any cloud providers are still serving versions of Llama 4 with accuracy issues.
48
155
716
149,194
Seedream 4.0 is the new leading image model across both the Artificial Analysis Text to Image and Image Editing Arena, surpassing Google's Gemini 2.5 Flash (Nano-Banana), across both! Seedream 4.0 is the latest release from Bytedance Seed, and is a substantial improvement on their Seedream 3.0 model with improved text rendering performance. It is the new leader in the Text to Image leaderboard, and achieves parity with Gemini 2.5 Flash in Image Editing. Seedream 4.0 maintains the same pricing as Seedream and SeedEdit 3.0 at $30 per 1k generations, and is currently available on @FAL , @replicate , and @BytePlusGlobal See the thread below to see Seedream 4.0 in the Artificial Analysis Image Arena for yourself!
32
88
748
119,036
Alibaba has released Qwen3 Next 80B: an open weights hybrid reasoning model that achieves DeepSeek V3.1-level intelligence with only 3B active parameters Key takeaways: 💡 Novel architecture: First model to introduce @Alibaba_Qwen's ‘Qwen3-Next’ foundation models, with several key architecture decisions such as a hybrid attention mechanism of Gated DeltaNet and Gated Attention, and high sparsity with a 3.8% active parameters share, compared to 9.4% for Qwen3 235B 🧠 Intelligence: Qwen3 Next 80B (Reasoning) scores 54 on the Artificial Analysis Intelligence Index, placed alongside DeepSeek V3.1 (Reasoning). The non-reasoning variant scores 45, in line with gpt-oss-20B and Llama Nemotron Super 49B v1.5 (Reasoning) 💲 Pricing model: Per token pricing on @alibaba_cloud is $0.5/$6 per 1M input/output tokens for reasoning and $0.5/$2 for the non-reasoning variant. This compares to higher prices for Qwen3 235B 2507 of $0.7/$8.4 with reasoning and $0.7/$2.8 without - a ≥25% reduction depending on workloads ⚙️ Model details: The model has a native context window of 256k tokens and is text-only, with no multimodal inputs or outputs. At only 80B parameters at FP8 the model fits on a single H200 GPU
32
98
703
160,744
OpenAI gave us early access to GPT-5: our independent benchmarks verify a new high for AI intelligence. We have tested all four GPT-5 reasoning effort levels, revealing 23x differences in token usage and cost between the ‘high’ and ‘minimal’ options and substantial differences in intelligence We have run our full suite of eight evaluations independently across all reasoning effort configurations of GPT-5 and are reporting benchmark results for intelligence, token usage, and end-to-end latency. What @OpenAI released: OpenAI has released a single endpoint for GPT-5, but different reasoning efforts offer vastly different intelligence. GPT-5 with reasoning effort “High” reaches a new intelligence frontier, while “Minimal” is near GPT-4.1 level (but more token efficient). Takeaways from our independent benchmarks: ⚙️ Reasoning effort configuration: GPT-5 offers four reasoning effort configurations: high, medium, low, and minimal. Reasoning effort options steer the model to “think” more or less hard for each query, driving large differences in intelligence, token usage, speed, and cost. 🧠 Intelligence achieved ranges from frontier to GPT-4.1 level: GPT-5 sets a new standard with a score of 68 on our Artificial Analysis Intelligence Index (MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, IFBench & AA-LCR) at High reasoning effort. Medium (67) is close to o3, Low (64) sits between DeepSeek R1 and o3, and Minimal (44) is close to GPT-4.1. While High sets a new standard, the increase over o3 is not comparable to the jump from GPT-3 to GPT-4 or GPT-4o to o1. 💬 Token usage varies 23x between reasoning efforts: GPT-5 with High reasoning effort used more tokens than o3 (82M vs. 50M) to complete our Index, but still fewer than Gemini 2.5 Pro (98M) and DeepSeek R1 0528 (99M). However, Minimal reasoning effort used only 3.5M tokens which is substantially less than GPT-4.1, making GPT-5 Minimal significantly more token-efficient for similar intelligence. 📖 Long Context Reasoning: We released our own Long Context Reasoning (AA-LCR) benchmark earlier this week to test the reasoning capabilities of models across long sequence lengths (sets of documents ~100k tokens in total). GPT-5 stands out for its performance in AA-LCR, with GPT-5 in both High and Medium reasoning efforts topping the benchmark. 🤖 Agentic Capabilities: OpenAI also commented on improvements across capabilities increasingly important to how AI models are used, including agents (long horizon tool calling). We recently added IFBench to our Intelligence Index to cover instruction following and will be adding further evals to cover agentic tool calling to independently test these capabilities. 📡 Vibe checks: We’re testing the personality of the model through MicroEvals on our website which supports running the same prompt across models and comparing results. It’s free to use, we’ll provide an update with our perspective shortly but feel free to share your own! See below for further analysis:
44
125
708
105,393
Reflection Llama 3.1 70B independent eval results: We have been unable to replicate the eval results claimed in our independent testing and are seeing worse performance than Meta’s Llama 3.1 70B, not better. These evaluations were conducted using our standard methodology, including using our standard system prompt and accessing the model via DeepInfra’s API, which claims bf16 precision. Our evaluation methodology uses a 0-shot prompt with a think step by step instruction. This is not to say there is no merit in Reflective's prompting approach for achieving higher evaluation results as claimed. We are aware that the Glaive team has been updating the model, and we would be more than happy to test further releases. We also ran tests comparing our standard system prompt to Glaive’s provided system prompt and we did not observe any differences in the evaluation results on Reflection Llama 3.1 70B, Llama 3.1 70B, GPT-4o or Claude 3.5 Sonnet. This does not mean the claimed results were not achieved, but we look forward to hearing more about the evaluation approach that led to these results, particularly regarding the exact prompt used and how the evaluation answers were extracted.
28
74
688
124,222
Veo 3 is now the first model to top both the Image to Video and Text to Video leaderboards, outperforming Kling 2.0 and Runway Gen 4 to secure the #1 spot across both modalities! Veo 3 represents a significant leap in Image to Video generation, where Google's previous Veo 2 had lagged behind competitors like Kling 2.0. Despite its leading position, Veo 3 appears to still produce less realistic generations in image to video compared to its strong text to video performance, suggesting room for improvement when working from reference images. Veo 3 Preview is available through Google Cloud Vertex AI Studio, Flow (Google's new AI video editing app), and Gemini Advanced for Ultra subscribers in the US. See the thread below for Veo 3 generations compared to other leading models in our Video Arena 🧵
16
92
698
88,545
Reflection 70B update: Quick note on timeline and outstanding questions from our perspective Timeline: - We tested the initial Reflection 70B release and saw worse performance than Llama 3.1 70B. - We were given access to a private API which we tested and saw impressive performance but not to the level of the initial claims. As this testing was performed on a private API, we were not able to independently verify exactly what we were testing. - Since then, there have been additional HF releases which some providers have hosted. The latest version appears to be: huggingface.co/mattshumer/re…. We are seeing significantly worse results when benchmarking ref_70_e3 than what we saw via the private API. Outstanding questions: - We are not clear on why a version would be published which is not the version we tested via Reflection’s private API. - We are not clear why the model weights of the version we tested would not be released yet. As soon as the weights are released on Hugging Face, we plan to re-test and compare to our evaluation of the private endpoint.
37
54
657
186,825
xAI arrives at the frontier: Grok 3 is poised to be the world’s new leading model, likely only surpassed by OpenAI’s unreleased o3 model Key takeaways: ➤ Grok 3 is now the leading non-reasoning model, pushing pre-training to new limits ➤ Grok 3 Reasoning likely beats o3-mini and DeepSeek R1 to claim the mantle of the world’s leading reasoning model, pending the release of OpenAI’s full o3 model (still several weeks away) ➤ The Grok 3 models will first be available via the Grok chatbot on X and on grok.com; API access is expected in the coming weeks. Pricing is currently unknown ➤ xAI revealed Grok 3 was trained on their 100k H100 ‘Colossus’ cluster - likely the first model released that has been trained on 100k H100 GPUs (or equivalent). The impressive results suggest scaling laws are holding in delivering improvements (fantastic news for @nvidia) This is not a small deal - @elonmusk's xAI was founded less than two years and has already arrived at the frontier. xAI has joined world of OpenAI, Anthropic, Google and Meta and we would suggest should now be considered a ‘Big 5’ American AI lab. Please note the above analysis is based on xAI’s claimed scores and we have not yet been able to independently evaluate the Grok 3 family of models.
36
93
671
76,165
DeepSeek’s new R1-0528-Qwen3-8B is the most intelligent 8B parameter model yet, but not by much: Alibaba’s own Qwen3 8B is just one point behind Alongside last week’s R1-0528 launch, DeepSeek released a distilled 8B model that aims to bring the advanced reasoning capabilities from their flagship R1 model into smaller, more accessible models for on-device deployment. R1-0528-Qwen3-8B was trained on reasoning chain of thought examples from the full-size R1-0528. DeepSeek R1-0528-Qwen3-8B achieves a score of 52 on the Artificial Analysis Intelligence Index. 🧠 Similar intelligence to Qwen3 8B: DeepSeek’s new distill matches the intelligence of Qwen3 8B (Reasoning), Alibaba's post-trained version of the same base model - it scores one point higher in Artificial Analysis Intelligence Index but this is unlikely to translate to noticeable gains in real world use. Unlike Alibaba's hybrid approach to the Qwen3 series, DeepSeek's model does not support inference-time control of whether the model reasons. 📈 Huge leap from DeepSeek R1 (January) distilled models: This distillation achieves equivalent intelligence to original R1 distilled version of Qwen2.5 32B - this means that in just 5 months there is now an 8B model performing as well as Qwen2.5 32B distilled did in January. ⚙️ Efficiency advantages over the full DeepSeek R1: This model is only ~1.2% of DeepSeek R1's total size (8B vs. 671B) - although the comparison is less dramatic when looking only at DeepSeek R1’s active parameters, with R1-0528-Qwen3-8B activating 21.6% of the full R1’s parameters for each token (8B vs. 37B active). While it trails the larger R1 in raw intelligence, the much smaller size allows significantly faster inference (smaller number of active parameters) and requires dramatically less memory (smaller number of total parameters). Deployment and availability: ➤ DeepSeek R1-0528-Qwen3-8B is currently hosted by @novita_labs and is priced at $0.06/$0.09 per 1M input/output tokens ➤ Users can deploy this model on a single GPU (e.g. 1 x H100) or consumer-grade hardware, as it requires just ~16GB of memory in native BF16 precision for storing model weights (plus memory for KV cache and other overhead)
19
86
666
58,802
Today we’re updating Artificial Analysis Intelligence Index to V3, now incorporating agentic evaluations Terminal-Bench Hard and 𝜏²-Bench Telecom! Tool calling and agentic workflows are increasingly the norm for how language models are used by both developers and consumers. Adding Terminal-Bench and 𝜏²-Bench to our Intelligence Index reflects this trend and allows us to see where models have strengths for agentic use cases, compared to prior evaluations that are more focused on knowledge and reasoning. Artificial Analysis Intelligence Index is our synthesis metric for language model intelligence. Along with recent updates to include IFBench for instruction following and our own Artificial Analysis Long Context Reasoning (AA-LCR) benchmark, V3 of the Index captures more depth of model capabilities than ever. As always, we’ll continue to iterate on our methodology to keep the Index relevant as new models and use cases emerge. Key details of Intelligence Index V3: ➤ Terminal-Bench Hard: a terminal-based evaluation of agentic capability - we have incorporated the latest set of ‘hard’ tasks, and use the Terminus 2 reference harness from the Terminal-Bench team to standardize comparisons across models. Tasks include a breadth of subjects from system administration and software engineering to game-playing, and require execution across long horizons (up to 30+ steps) and the ability to interact and self-correct errors in a stateful real-world terminal environment ➤ 𝜏²-Bench Telecom: a conversational agent benchmark from @SierraPlatform. This evaluation tests the tool use, planning, and communication skills of models in an environment where models have to work with a simulated ‘user’ counterpart to solve their customer service issue. This approach mirrors real-world customer service interactions and requires models to interpret user requests and use tools and communication to discover and resolve the source of problems. We have implemented only the telecom scenario (recently added in the v2 release of 𝜏-Bench), as we assess it to be a more robust proxy for real-world agent deployments than the earlier airline and retail scenarios from the initial release of 𝜏-Bench ➤ Standardization and testing: as always, we’ve independently conducted benchmarking using a standardized methodology across models. Please see our methodology page below for additional details on specific settings and configuration Impact: This update brings the index to a composite of 10 equally-weighted evaluation scores, and slightly reduces the top score to 67. GPT-5 remains the top-performing model on our Index, and its low reasoning and mini variants move up the leaderboard on the back of their strong agentic performance. Please see below for further details on performance and patterns we see in these new evaluations. We thank the Terminal-Bench and Sierra teams for their pioneering work on these datasets and test harnesses that underpin our independent testing.
30
83
672
67,156
Kimi K2 0905 upgrade: Substantial improvement in agentic capabilities, modest change in overall intelligence Key takeaways: ➤ Intelligence increased +2 pts in our Artificial Analysis Intelligence Index ➤ Agentic capabilities substantially improved as shown by our two new agentic benchmarks. Kimi K2 0905 increased from 14 to 23% in Terminal-Bench Hard (Agentic coding) and 61% to 73% in Tau2-Bench Telecom (Agentic chat & function calling) ➤ As per the previous release the model is 1T parameters and has the same architecture as the original Kimi K2 release. A key change is the context length supported has doubled from 128k to 256k, increasing support for very input heavy workloads
20
67
655
80,969
While Moonshot AI’s Kimi k2 is the leading open weights non-reasoning model in the Artificial Analysis Intelligence Index, it outputs ~3x more tokens than other non-reasoning models, blurring the lines between reasoning & non-reasoning Kimi k2 is the largest major open weights model yet - 1T total parameters with 32B active (this requires a massive 1TB of memory at native FP8 to hold the weights). We have k2 at 57 in Artificial Analysis Intelligence Index, an impressive score that puts it above models like GPT-4.1 and DeepSeek V3, but behind leading reasoning models. Until now, there has been clear a distinction between reasoning model and non-reasoning models in our evals - defined not only by whether the model uses <reasoning> tags, but primarily by token usage. The median number of tokens used to answer all the evals in Artificial Analysis Intelligence Index is ~10x higher for reasoning models than for non-reasoning models. @Kimi_Moonshot's Kimi k2 uses ~3x the number of tokens that the median non-reasoning model uses. Its token usage is only up to 30% lower than Claude 4 Sonnet and Opus when run in their maximum budget extended thinking mode, and is nearly triple the token usage of both Claude 4 Sonnet and Opus with reasoning turned off. We therefore recommend that Kimi k2 be compared to Claude 4 Sonnet and Opus in their maximum budget extended thinking modes, not to the non-reasoning scores for the Claude 4 models. Kimi k2 is available on @Kimi_Moonshot’s first-party API as well as @FireworksAI_HQ, @togethercompute, @novita_labs, and @parasail_io. See below and on Artificial Analysis for further analysis 👇
20
81
666
63,588
Llama 4 independent evals: Maverick (402B total, 17B active) beats Claude 3.7 Sonnet, trails DeepSeek V3 but more efficient; Scout (109B total, 17B active) in-line with GPT-4o mini, ahead of Mistral Small 3.1 We have independently benchmarked Scout and Maverick as scoring 36 and 49 in Artificial Analysis Intelligence Index respectively. Key results: ➤ Maverick sits ahead of Claude 3.7 Sonnet but behind DeepSeek’s recent V3 0324 ➤ Scout sits in line with GPT-4o mini, ahead of Claude 3.5 Sonnet and Mistral Small 3.1 ➤ Compared to DeepSeek V3, Llama 4 Maverick has ~half the active parameters (17B vs 37B), and ~60% of the total parameters (402B vs 671B). This means that Maverick achieves its score much more efficiently than DeepSeek V3. Maverick also supports image inputs, while DeepSeek V3 does not ➤ Both Maverick and Scout place consistently across evals, with no obvious weaknesses across general reasoning, coding and maths Key model details: ➤ The Llama 4 ‘herd’ includes Scout, Maverick and Behemoth; all are large Mixture of Experts (MoE) models - the first time that Meta has released MoE models ➤ Behemoth (2T total, 288B active) is not being released today but Meta discloses that it was used for co-distillation into Scout and Maverick ➤ Multimodal: All three models take Text and Image input, natively trained on image inputs (this likely varies from Meta’s adapter approach in Llama 3.2). They can take multiple images, and Meta claims they should work well with up to 8 images - stay tuned for visual reasoning benchmarks next week! ➤ Pricing: We’re tracking 6 providers and are benchmarking a median price $0.24/$0.77 per million input/output tokens for Maverick, and $0.15/$0.4 for Scout lower than DeepSeek v3 and >10X cheaper than OpenAI’s leading GPT-4o endpoint ➤ Long context: Maverick supports a 1M token context window, Scout supports a 10M token context window - we will be monitoring availability of long context capabilities across providers and testing in greater detail in the coming days ➤ Style: In our early testing we have noticed responses are a lot more structured and uniform in their approach across prompts Key training details: ➤ Pre-training: Maverick is trained on ~22T tokens, and Scout on ~40T; Meta also shared the overall training dataset was >30T tokens (more than double Llama 3’s 15T, Llama 2 was only 1.8T) of more diverse data than previously (text, images, video stills) ➤ Post-training: Involved supervised fine-tuning, online reinforcement learning (RL), and direct preference optimization techniques to optimize performance. Meta shared that they achieved “a step change in performance” by filtering the dataset to focus on ‘hard’ prompts which improved coding, math and scientific reasoning capabilities ➤ Meta disclosed training consumed 1,999 tons of CO2, this represents ~99,950 oak tree-years 🌲 One note from our evals: we note that our results for multi-choice evals (MMLU Pro and GPQA Diamond) are materially lower than Meta’s claimed results. The key driver of the difference appears to be that Scout and Maverick frequently fail to follow our answer formatting instruction. We request an answer format of ‘Answer: A’. Full details of our prompts and answer extraction techniques are available in our methodology disclosure. Further analysis below 👇
27
87
629
127,251
Model demand change 2024 to 2025: Google (+49pts), DeepSeek (+53pts) and xAI (+31pts) have achieved massive gains in demand share over the past year @Google has transitioned from being an AI laggard to an AI leader with a ~2.5x increase in proportion of respondents using or considering the Gemini model series. A key driver of this has been Google making significant gains in intelligence: Gemini 2.5 Pro now sits at #3 in our Artificial Analysis Intelligence Index, compared to significantly lagging behind OpenAI and Anthropic in early 2024. @deepseek_ai in H1 2024 had only released DeepSeek 67B, a model that saw limited adoption and underperformed Llama 3 70B. DeepSeek first saw some uptake in late 2024 with the releases of their V2 model, and then saw rapid adoption in early 2025 with their V3 and R1 models that have taken them to leadership among open weights models. @xai released its first model Grok-1 in mid-H1 2024 and has since rapidly climbed to intelligence leadership across all models with successive releases, culminating in last week's launch of Grok 4. Source: Artificial Analysis AI Adoption Survey H1 2025 (report available on the Artificial Analysis website)
30
93
619
392,982
Alibaba’s upgraded Qwen3 235B Thinking 2507 is now the leading open weights model, beating DeepSeek R1 0528 on the Artificial Analysis Intelligence Index! Qwen3 235B 2507 (Reasoning) has jumped from the original Qwen3’s score of 62 to 69 in the Artificial Analysis Intelligence Index. This positions @Alibaba_Qwen's Qwen3 235B 2507 (Reasoning) one point above DeepSeek R1 0528: we would describe the new Qwen3 and R1 0528 as roughly equivalent in intelligence. Qwen3 235B 2507 sits only 1 point behind Gemini 2.5 Pro, o3 and o4-mini (high). Breakdown of the model’s improvement: 🧠 Intelligence increases across the board: Biggest jumps seen in LiveCodeBench (Code generation, +17 points), AIME 2024 (Competition Math, +10 points), GPQA Diamond (Scientific Reasoning, +9 points) with smaller gains across MATH-500 (Quantitative Reasoning, +6 points) and HLE (Reasoning & Knowledge, +3 points) ⚙️ Reasoning only model: Qwen3 235B 2507 (Reasoning) is a reasoning model (it is trained to ‘think’ before it answers). All of the initial release Qwen3 models were hybrid reasoning models that could be toggled to ‘think’ before answering - this version is no longer a hybrid model 🗯️ Increased token usage: Qwen3 235B 2507 (Reasoning) used 110 million tokens to run Artificial Analysis Intelligence, ~50% more than the 74 million tokens used by the original release of Qwen3 235B (Reasoning). It is equivalent to Grok4 usage and higher than DeepSeek R1 0528 (99M), Gemini 2.5 Pro (97M), o4-mini (high) (72M) and o3 (45M) 🇨🇳 China continues to lead the open-source AI race: with this release the top 3 open weights models in the world are all from Chinese labs. Key model details: ➤ Context window: 256K (May 2025 release supported 131K maximum) ➤ Total parameters: 235B (requires a minimum of ~500GB memory to run in native BF16 precision, can be run on 8xH100 node or more comfortably on an 8xH200 node) ➤ Active parameters: 22B ➤ Native BF16 training with an FP8 variant also made available by Alibaba ➤ Text only - no multimodal inputs or outputs ➤ Apache 2.0 License
15
71
597
53,878
Kling 2.0 is now the leading Image-to-Video Model, surpassing Veo 2 and Runway Gen 4 in the Artificial Analysis Video Gen Arena! Kling 2.0 is the latest video model by Kuaishou, who’s Kling 1.6 Pro had previously been leading the Image to Video Leaderboard. Kling 2.0 also excels in text to video, where it comes second only to Google’s Veo 2, surpassing OpenAI’s Sora, and the previous runner-up Kling 1.5 Pro. In our generations for Kling 2.0, we have observed strong prompt adherence and video quality, achieving realistic looking motion and physics relative to other models. Kling 2.0 is available through the @Kling_ai app or via public APIs. Although both fal and Replicate charge $2.80 for a 10-second clip (up substantially from $0.95 for Kling 1.6 Pro), it still costs less than its primary competitor: Google Veo 2, which is priced at $5.00 per 10 seconds on Google Vertex. See thread below for comparisons between Kling 2.0, Veo 2 and other leading models in our arena 🧵
21
85
611
102,811
o4-mini independent evals: o4-mini (high) claims the highest Artificial Analysis Intelligence Index score to date (o3 evals in-progress) and shows strong gains in coding ability Key takeaways: ➤ o4-mini is a clear upgrade to o3-mini, while not as dramatic as the leap from o1-mini to o3-mini (+12pts), the model, with reasoning effort set to high, achieves a +4pt gain the Artificial Analysis Intelligence Index ➤ o4-mini (high) made particular gains in coding intelligence, achieving the #1 position in our Coding Index. This was supported by a +7%pts gain in both LiveCodeBench and SciCode whereby o4-mini is now the clear leader ➤ Pricing: o4-mini is priced in-line with o3-mini ($1.10/$4.40 per 1M Input/Output tokens), though cached inputs are 1/2 the price of o3-mini ($0.275/1M Input tokens vs $0.55/1M) ➤ Context window: o4-mini’s context window of 200k tokens is the same as o3-mini. This is now notably smaller than 4.1’s massive 1M token context window ➤ Token usage: As a reasoning model, the model used a high amount of tokens compared to other models broadly, but marginally lower than o3-mini (72M for o4-mini (high), 77M for o3-mini (high)) Evals for o3 are in progress. While we expect o3 to offer greater intelligence, o3-mini may be the more practical choice for most developers considering the substantially lower price and lower end-to-end latency
15
72
603
86,363
Google Veo 2 has surpassed OpenAI’s Sora and Kling 1.5 Pro as the new leader in Artificial Analysis Video Arena! Google quietly launched their Veo 2 model via partner services @fal.ai and @freepik (not yet publicly accessible on Vertex). We have observed strengths in rendering people and realistic physics interactions. Key details regarding Veo 2: ➤ The model is able to generate minutes of 4K video. However, the limited release currently offers 720p video with a 8s maximum duration. ➤ Pricing is $0.50 per second of video generated. This is generally more expensive than other models - a right Google may have earned with its impressive model! ➤ Videos are watermarked with SynthID to ensure it is possible to identify the videos as AI-generated See thread below for comparisons between Veo 2 and other leading models in our arena 🧵
26
51
586
77,259
Claude 3.7 Sonnet is an impressive model. We have independently benchmarked it as the best non-reasoning model for coding (reasoning model results coming shortly). Across our coding evals SciCode and LiveCodeBench, Claude 3.7 Sonnet consistently outperformed other leading non-reasoning models including DeepSeek v3, Gemini 2.0 Pro and GPT-4o. However, in our Artificial Analysis Intelligence Index which includes generalist evals, Claude 3.7 Sonnet remains second to Gemini 2.0 Flash as the leading non-reasoning model that has been released and is accessible via an API for benchmarking (excludes Grok 3). We are currently running our evaluations on the reasoning model - stay tuned to see if it takes the lead in our Intelligence Index or Coding Index. Key details for Claude 3.7 (non-reasoning): ➤ Intelligence: Artificial Analysis Intelligence Index score of 48, landing alongside Google’s Gemini 2.0 Flash and Gemini 2.0 Pro Experimental - the current highest scoring non-reasoning models on Artificial Analysis Intelligence Index ➤ Speed: 85 output tokens/s, almost exactly equivalent to Claude 3.5 Sonnet ➤ Price: No change from Claude 3.5 Sonnet ($3 per 1M tokens input, $15 per 1M tokens output) ➤ Overall recommendation: Developers should generally upgrade from Claude 3.5 Sonnet to 3.7 immediately
25
72
580
72,774
Alibaba’s upgraded Qwen3 235B-A22B 2507 is now the most intelligent non-reasoning model - beating Kimi K2 and Claude 4 Opus (non-reasoning) on the Artificial Analysis Intelligence Index! Qwen3 235B 2507 is a non-reasoning model (it is not trained to ‘think’ before it answers). However, similar to Kimi K2, Qwen3 235B 2507 uses significantly more tokens than other non-reasoning models. Its token usage is comparable to the Claude 4 models with reasoning on - despite being a non-reasoning model, Qwen3 now uses more tokens than Claude 4 Opus with its maximum reasoning budget. Output token usage to run Artificial Analysis Intelligence is 3.5x of the total output tokens of the original release of Qwen3 235B (non-reasoning), but just 1/3x the usage of Qwen3 235B (reasoning). This is why Qwen3 235B 2507 shows up approximately in the middle of the original Qwen3 reasoning and non-reasoning models on our token count charts. All of the initial release Qwen3 models were hybrid reasoning models that could be toggled to ‘think’ before answering. Alibaba appear to have decided to go back to releasing separate instruct and reasoning variants of models to achieve better performance for each separately. Qwen3 235B 2507 scores 60 on the Artificial Analysis Intelligence Index, surpassing Claude 4 Opus and Kimi K2 (both 58), and DeepSeek V3 0324 and GPT-4.1 (both 53). This marks a 13-point leap over the May 2025 non-reasoning release and brings it within two points of the May 2025 reasoning variant. The model is already live in Alibaba’s Qwen Chat. It is accessible via @alibaba_cloud and is also available via several third-party providers including @togethercompute, @parasail_io, @FireworksAI_HQ , and @DeepInfra, with more expected soon - stay tuned for our upcoming provider benchmarks! Key model details: ➤ Context window: 256K (May 2025 release supported 131K maximum) ➤ Total parameters: 235B (requires ~500GB of GPU memory to run in native BF16 precision, can be run on 8xH200 node) ➤ Active parameters: 22B ➤ Native BF16 training with an FP8 variant also made available by Alibaba ➤ Text only - no multimodal inputs or outputs ➤ Apache 2.0 License
19
77
572
62,060
There is a new leader in open source AI. Our independent benchmarks show China-based DeepSeek’s V3 model ahead of all open weights models released to date, beating OpenAI’s GPT-4o (Aug) and approaching Anthropic’s Claude 3.5 Sonnet (Oct). DeepSeek V3 scores an Artificial Analysis Quality Index of 80, ahead of models like OpenAI’s GPT-4o and Meta’s Llama 3.3 70B. The only current models still ahead of DeepSeek are Google’s Gemini 2.0 Flash and OpenAI’s o1 series models. Landing ahead of Alibaba’s Qwen2.5 72B, DeepSeek is now 🇨🇳 China’s AI leader. DeepSeek V3 uses an MoE architecture with 671B total parameters (37B active). The total parameter count is ~2.8x larger than DeepSeek V2.5. Key benchmarking results: ➤ DeepSeek V3 outscores all leading open weights models in Artificial Analysis Quality Index, including Meta’s Llama 3.3 70B and Alibaba’s Qwen2.5 72B. ➤ DeepSeek V3 matches Anthropic’s Claude 3.5 Sonnet (Oct) and sits just below Google’s Gemini 2.0 Flash and OpenAI’s o1 series. Notably, DeepSeek V3 likely has particularly strong coding and mathematical reasoning capabilities with scores of 92% in HumanEval and 85% in MATH-500. ➤ DeepSeek’s first party API for V3 is fast, achieving an output speed of 89 tokens/sec — 4x faster than DeepSeek V2.5 (18 tokens/sec). In their Technical Report, DeepSeek discloses extensive inference optimization work they have undertaken to increase speed and efficiency for serving DeepSeek V3 on their H800 cluster. DeepSeek achieves this speed increase on a ~2.8x larger model, with only a modest increase in price (pricing details below). Key training details: ➤ DeepSeek V3 was trained on 14.8T tokens in just 2.788M NVIDIA H800 GPU hours - implying a cost of $5.6M (based on rental pricing of NVIDIA H800 at $2/hr). That’s just 57 days on DeepSeek’s 2048 H800 cluster. ➤ DeepSeek used their DeepSeek-R1 reasoning inference model for distillation. While reasoning models like OpenAI’s o1 series may not suit many use cases due to their cost and latency, this is less of a barrier for generating training data. DeepSeek’s approach of using R1 for this purpose likely has been and will be used by all major labs in 2025. ➤ DeepSeek V3 was trained on a cluster of 2048 NVIDIA H800 GPUs. As a Chinese company, DeepSeek is limited in their ability to use H100s and other NVIDIA chips by export controls. A key limitation of H800s is the reduced interconnect bandwidth (300 GB/s vs. 900 GB/s) which can impact training performance as node-to-node communication is a bottleneck. DeepSeek in their paper discussed various ways of optimizing training including through writing their own communication kernels rather than using tensor parallelism and using mixed precision (FP8) training. We assess DeepSeek V3 to be a highly significant release. It reflects @deepseek_ai's significant contribution to the open source AI community, as well as the continuation of the trend of Chinese AI labs ascending to a clear global second place behind the US. Further analysis below.
9
103
557
84,667
OpenAI has reduced o3 pricing by 80% - o3 is now competitive in price to Gemini 2.5 Pro & Claude 4 Sonnet and 8x cheaper than Claude 4 Opus o3 is now priced at $2/$8 per 1M input/output tokens (down from $8/$40) and offers a 75% discount for cached input tokens. Key implications: - ⚖️ o3 now matches GPT-4.1 per-token pricing: OpenAI's leading reasoning and non-reasoning models now have equivalent per-token pricing. However, since o3 outputs ~7x more tokens than GPT-4.1, it costs significantly more per query (demonstrated by the cost to run our intelligence index shared below) - 💸 o3 is comparable to Gemini 2.5 Pro and Claude 4 Sonnet Thinking: With the pricing update, o3 now matches Gemini 2.5 Pro in both pricing and our intelligence index, while offers higher intelligence than Claude 4 Sonnet Thinking at a lower per-token price. Further analysis below 👇
22
64
542
111,634
Kimi K2 Thinking takes back the open weights intelligence frontier for 🇨🇳 Chinese AI Labs gpt-oss-120b from 🇺🇸 US-based OpenAI previously held the title of most intelligent open weights model, MiniMax's M2 model reached parity and Kimi K2 Thinking has reached a new intelligence frontier. Intelligence measured based on our Artificial Analysis Intelligence Index of 10 evaluations that we run independently and like-for-like on all models. Link to further analysis below
21
64
540
59,237
Google's Gemini 2.5 Flash Image (Nano-Banana) takes the crown as the leading image editing model, beating GPT-4o and Qwen-Image-Edit in the Artificial Analysis Image Editing Arena! We were given early access and have been testing it in our arena under the pseudonym 'rex' for the last week. The model secured first place in image editing and ranks #3 overall for image generation. Gemini 2.5 Flash Image succeeds Gemini 2.0 Flash Preview as Google's flagship editing model, narrowly beating their latest dedicated Imagen 4 model in generation quality. See below for comparisons between Gemini 2.5 Flash Image and other leading models 🧵
11
52
536
54,059
NVIDIA has released Nemotron Nano 9B V2, a small 9B reasoning model that scores 43 on the Artificial Analysis Intelligence Index, the highest yet for <10B models Nemotron 9B V2 is the first Nemotron model pre-trained by @NVIDIA. Previous Nemotron models have been developed by post-training on Meta Llama models. Architecture & Training: The model uses a hybrid Mamba-Transformer architecture. NVIDIA pre-trained a 12B parameter base model and applied post-training with a range of techniques including RLHF and GRPO. The final 9B size was pruned from this model and re-trained with the base model as a teacher. Small-model frontier: with only 9B parameters, Nemotron Nano 9B V2 is placed ahead of Llama 4 Maverick on our leaderboard, equal to Solar Pro 2 with reasoning and trails just behind gpt-oss-20B (high). Along with this model, NVIDIA released a 6.6-trillion token subset of their pre-training data for public use on @huggingface Key model details: ➤ 128k token context window ➤ Supports reasoning and non-reasoning modes (with ‘/no_think’ settings in the system prompt) ➤ Released under the NVIDIA Open Model License, and not additionally covered by Meta’s Llama license like prior Nemotron models - this means that there is no limitation on use by large companies or requirement to keep ‘Nemotron’ in the name of derivative models ➤ No serverless inference providers are yet serving the model, but it is available now on Hugging Face for local inference or self-deployment See below for our full analysis and key announcement links from NVIDIA 👇
21
54
520
69,533
Qwen3 is a win for open weights & efficiency - hybrid reasoning models that approach DeepSeek R1’s GPQA score with 1/3 the total parameters and a range of smaller models suited for compute limited environments Today, Alibaba announced eight hybrid reasoning models of varying sizes and architectures (i.e. models that can be toggled to ‘think’ before answering), ranging in size from 0.6B dense to a 235B MoE with 22B active parameters. Our initial results show that all models appear to be competitive for their size class, with 253B-A22B coming close to DeepSeek R1 despite fewer parameters (compared to DeepSeek R1’s 671B total and 37B active). We’ve started running our evals and have completed GPQA Diamond across three models with reasoning on: ➤ Qwen3 235B-A22B (Reasoning): 70%, placing it inline with DeepSeek R1 and Gemini 2.5 Flash (Reasoning). This represents a significant leap from Alibaba’s previous leading model, QwQ-32B which scored 59% in our GPQA Diamond eval. ➤ Qwen3 30B-A3B (Reasoning): 62%, placing it just behind leading non-reasoning models DeepSeek V3 0324 and Llama 4 Maverick. This is very impressive considering this model has only 3B active parameters - peers are much larger (DeepSeek V3 03-24 has 671B total and 37B active, Llama 4 Maverick has 402B total and 17B active). Qwen3-32B dense is coming shortly. ➤ Qwen3-14B (Reasoning): 60% placing in line with Llama 4 Scout depspite having less total/active parameters (14B/14B vs 109B/17B for Scout) The wide range of model sizes will support a range of deployment environments from on-device (8B, 4B, 1.7B, 0.6B) to 8xH100 DGX nodes (235B). A major win for the open weights community. Stay tuned for our full suite of 7 evaluations across the entire Qwen3 family and with reasoning turned on & off! We will also continue to monitor the availability of these models across inference providers and share out performance benchmarks soon! Additional details include: ➤ Hybrid reasoning: Qwen3 models are the first set of models from Alibaba that feature a hybrid approach to problem solving supporting “Thinking” and “Non-Thinking” modes. We have seen commonly across new model releases, notably NVIDIA Nemotrons, Google Gemini Flash, xAI Grok 3 and Claude 3.7 Sonnet ➤ Multilingual support: Alibaba claims support for 119 languages and dialects ➤ Expanded pre-training: Qwen3 was trained on 36 trillion tokens. This is higher than 22 trillion training tokens used for Llama 4 Maverick but lower than 40 trillion training tokens used for Llama 4 Scout ➤ Open Weights: Models are available under the Apache 2.0 license Further analysis below
20
85
497
34,900
Veo 3 debuts on the Artificial Analysis Video Arena Leaderboard in first place, with a significant lead over Google’s own Veo 2 After a day of voting, we can confidently declare Veo 3 Preview to be substantially better than Veo 2, putting Google well ahead of both Kuaishou's Kling 2.0, and OpenAI's Sora in our text to video leaderboard. We have begun testing Veo 3's image to video capabilities and will be releasing those results in the next few days! Veo 3 Preview is now available in Google Cloud Vertex AI Studio, Flow (Google's new AI video editing app), and Gemini for AI Ultra subscribers in the US. See the thread below for Veo 3 generations compared to other leading models in our Video Gen Arena 🧵
11
58
502
57,546
Mistral delivers with a very impressive Mistral Small 3.2 release: GPT-4o level intelligence in a 24B open weights model that can be run on-device Key details: ➤ Substantial intelligence increase: Mistral Small 3.2 jumps from 35 to 42 in Artificial Analysis Intelligence Index. @MistralAI's upgrade leapfrogs Mistral Small 3.2 above Gemma 3 27B, GPT-4o, GPT 4.1 nano and Phi-4 ➤ 24B parameters (requires minimum of 48GB of ram at BF16 or 24GB at FP8) ➤ Support image input / vision capabilities ➤ Mistral has claimed improved instruction following and function calling capabilities See below for further analysis 👇
14
60
498
85,231
Qwen-Image-Edit is the new open weights leader in Image Editing, with quality comparable to GPT-4o and FLUX.1 Kontext [max] Qwen-Image-Edit is the image editing variant of the recent Qwen-Image release from Alibaba, also released under the Apache 2.0 license with weights available on @huggingface. Qwen-Image-Edit is priced at $30/1k images on both @fal and @replicate, similar to its open weights peers HiDream-E1.1 at ~$30/1k and FLUX.1 Kontext [dev] at $25/1k images, while much cheaper than FLUX.1 Kontext [max] at $80/1k images, and GPT-4o at ~$167/1k images. See below for comparisons between Qwen-Image-Edit and other leading models in our Image Editing Arena 🧵
14
63
487
130,228
Alibaba launches QwQ-32B, an open weights reasoning model that may approach DeepSeek R1’s level of intelligence We’ve been running evals on it all night and we’ve only gotten our scores back for GPQA Diamond and AIME 2024 thus far: ➤ GPQA Diamond: 59.5%, placing QwQ materially behind DeepSeek R1’s score of 71% and just behind Gemini 2.0 Flash’s score of 62% ➤ AIME 2024: 78%, matching Alibaba’s claims and placing QwQ-32B ahead of DeepSeek R1’s score, besting all other models we have tested except o3-mini-high Stay tuned for our suite of benchmarks soon! Further context in the meantime: ➤ QwQ-32B has 20x fewer parameters than DeepSeek R1’s 671B total parameter count, and even fewer than DeepSeek R1’s 37B active parameter count ➤ We should note, however, that QwQ-32B was trained and released in BF16, whereas DeepSeek R1 was trained and released natively in FP8 ➤ This means that the native versions of QwQ-32B and DeepSeek R1 take up 65GB and 671GB of storage respectively - but on hardware with native FP8 support like NVIDIA’s H100, DeepSeek R1 may actually use less effective compute per forward pass
10
65
487
285,739
There is a new leader in open weights intelligence! Qwen2.5 72B tops our independent evals amongst open weights models, including compared to the much larger Llama 3.1 405B Qwen 2.5 72B released yesterday by @Alibaba_Qwen has topped our Artificial Analysis Quality Index of evaluations. While MMLU is 1%ppt below Llama 3.1 405B in MMLU, it has strengths in Coding and Math where it challenges OpenAI's GPT-4o. Further, given the model is much smaller than Llama 3.1 405B it should also run faster on the same hardware. It is a dense model and supports a 128k context window, the same as the Llama 3.1 series, and 8k output tokens, double the Llama 3.1 series' 4k. @hyperbolic_labs and @DeepInfra have been quick to launch the model and are both offering the model at $0.4/M input & output tokens. This is ~10X cheaper than GPT-4o's price and the median price of Llama 3.1 405B across providers. See below for links to our analysis 👇
13
83
495
154,413
MiniMax launches their first reasoning model: MiniMax M1, the second most intelligent open weights model after DeepSeek R1, with a much longer 1M token context window @minimax_ai M1 is based on their Text-01 model (released 14 Jan 2025) - an MoE with 456B total and 45.9B active parameters. This makes M1’s total parameter count smaller than DeepSeek R1’s 671B total parameters but larger than Qwen3 235B-A22B. Both Text-01 and M1 only support text input and output. MiniMax M1 80K scores 63 on the Artificial Analysis Intelligence Index. This lags DeepSeek R1 0528, but is slightly ahead of Alibaba’s Qwen3 235B-A22B and NVIDIA’s Llama 3.1 Nemotron Ultra. MiniMax M1 is offered in two variants: M1 40K and M1 80K, offering 40k and 80k token thinking budgets respectively. MiniMax discloses that their full RL training on Text-01 to create M1 used 512 H800 GPUs for three weeks - equivalent to a rental cost of $0.53M. This number is an interesting datapoint for the current degree of scaling of reinforcement learning. We note that it is not comparable to DeepSeek’s famous $5.6M training cost claim for DeepSeek V3, as DeepSeek’s number referred to full pre-training of the model not the reinforcement learning step. MiniMax offers models across multiple modalities on their Talkie app and API, including their Artificial Analysis Speech Leaderboard topping Speech-02 model, and Video models (T2V-01, and I2V-01). MiniMax M1 is the first of five announcements in their MiniMax Week. Availability: ➤ MiniMax M1 is available via MiniMax’s first-party API, priced at $0.4/$2.1 per 1M input/output tokens for ≤200k input tokens. The price increases to $1.2/$2.1 per 1M input/output tokens for >200k input tokens ➤ M1 is also currently available on @Novita, priced at $0.55/$2.2 per 1M input/output tokens with a 128k token context window ➤ M1 40k and M1 80k are both open weights models released under the Apache 2.0 license and we expect to see more third-party APIs supporting these models
14
78
485
95,755
Google shared pre-release access for the new Gemini 2.5 Flash & Flash-Lite Preview 09-2025 models. We’ve independently benchmarked gains in intelligence (particularly for Flash-Lite), output speed and token efficiency compared to predecessors Key takeaways from our intelligence and performance benchmarking: ➤ 🧠 Gemini 2.5 Flash Preview 09-2025 scores 54 in reasoning mode on the Artificial Analysis Intelligence Index, and 47 in the non-reasoning mode, representing a 3 point and 8 point jump respectively compared to Gemini 2.5 Flash released in May 2025 ➤ 🧠 Gemini 2.5 Flash-Lite Preview 09-2025 scores 48 in reasoning mode on the Artificial Analysis Intelligence Index, representing a 8 point uplift compared to Gemini 2.5 Flash-Lite (Reasoning) released in June 2025. In non-reasoning, Gemini 2.5 Flash-Lite Preview 09-2025 scores 42, a 12 point uplift compared to the July version. ➤💲 In reasoning mode, Gemini 2.5 Flash and Flash-Lite Preview 09-2025 are more token-efficient, using fewer output tokens than their predecessors to run the Artificial Analysis Intelligence Index. Gemini 2.5 Flash-Lite Preview 09-2025 uses 50% fewer output tokens than its predecessor, while Gemini 2.5 Flash Preview 09-2025 uses 24% fewer output tokens. ➤⚡ Google Gemini 2.5 Flash-Lite Preview 09-2025 (Reasoning) is ~40% faster than the prior July release, delivering ~887 output tokens/s on Google AI Studio in our API endpoint performance benchmarking. This makes the new Gemini 2.5 Flash-Lite the fastest proprietary model we have benchmarked on the Artificial Analysis website Key model information: ➤ Hybrid reasoning/non-reasoning modes with variable thinking budget ➤ Tool support (e.g. Google Search, code execution) ➤ 1M token context window ➤ Multimodal input (text, audio, image and video) and text only output ➤ Gemini 2.5 Flash-Lite 09-2025 is priced at $0.1/$0.4 per 1M input/output tokens and Gemini 2.5 Flash 09-2025 is priced at $0.3/$2.5 per 1M input/output tokens
24
37
496
53,409
ServiceNow has released Apriel-v1.5-15B-Thinker, a 15B open weights reasoning model that leads our Small Models category (<40B parameters) 💼 Overview: Apriel-v1.5-15B-Thinker is a dense, 15B parameter open weights reasoning model. This is not the first model ServiceNow has released but is a substantial jump in intelligence achieved compared to past releases 🧠 Intelligence: The model scores 52 in the Artificial Analysis Intelligence Index. This puts it on par with DeepSeek R1 0528, which has a much larger 685B parameter architecture. ServiceNow’s model scores particularly well within important behaviors for enterprise agents, such as instruction following (62% in IFBench, ahead of gpt-oss-20B, reasoning) and multi-turn conversions & tool use (68% in 𝜏²-Bench Telecom, ahead of gpt-oss-120B, reasoning). This makes it particularly well-suited to agentic use cases, which was likely a focus given ServiceNow is active in the enterprise agents space ⚙️ Output tokens and verbosity: The model produces a large number of output tokens even among reasoning models - using ~110M combined reasoning and answer tokens to complete the Artificial Analysis Intelligence Index 🖥️ Access: No serverless inference providers are yet serving the model, but it is available now on Hugging Face for local inference or self-deployment. The model has been released under an MIT license, supporting unrestricted commercial use ℹ️ Context window: The model has a native context window of 128k tokens. Congratulations to @ServiceNowRSRCH on this impressive result!
19
60
492
65,605
gpt-oss-120b is now the leading 🇺🇸 US open weights model. Qwen3 235B from Alibaba is the leading 🇨🇳 Chinese model and offers greater intelligence, but is much larger in size (235B total parameters, 22B active, vs gpt-oss-120B's 117B total, 5B active) Link below to further analysis 👇
11
59
479
44,810
DeepSeek’s updated V3.1 Terminus ties with gpt-oss-120b (high) as the most intelligent open weights model and offers increased instruction following and long context reasoning capabilities 🧠 Our benchmarking results indicate DeepSeek V3.1 Terminus shows a greater intelligence uplift over DeepSeek V3.1 in reasoning mode compared to non-reasoning mode: ➤ DeepSeek V3.1 Terminus scores 58 in reasoning mode on the Artificial Analysis Intelligence Index, up from V3.1’s score of 54 in reasoning mode. The largest improvements are seen across instruction following (increase of 15 percentage points in IFBench), long context performance (increase of 12 p.p. in AA-LCR) and agentic coding & terminal use (increase of 4 p.p in Terminal-Bench Hard) ➤ In non-reasoning mode, DeepSeek V3.1 Terminus achieves a score of 46, a slight increase over the earlier V3.1 score of 45. 🤖 Other benchmarking takeaways: ➤ Function calling / tool use: Similar to DeepSeek V3.1, V3.1 Terminus does not support function calling when in reasoning mode. This is likely to substantially limit its ability to support agentic workflows with intelligence requirements, including in coding agents. ➤ Token usage: DeepSeek V3.1 Terminus scores higher in reasoning mode than V3.1, and uses more tokens across the evals in the Artificial Analysis Intelligence Index (67M for V3.1 Terminus in reasoning mode vs. 63M for V3.1 in reasoning mode). In non-reasoning mode, V3.1 Terminus uses fewer tokens than V3.1 (11M and 14M respectively). Both V3.1 Terminus and V3.1 use fewer tokens in reasoning mode than DeepSeek’s earlier R1 and R1 0528 reasoning models ➤ Availability: DeepSeek’s first party API now serves the new DeepSeek V3.1 Terminus model on both their chat and reasoning endpoints ➤ Architecture: DeepSeek V3.1 Terminus is architecturally identical to prior V3 and R1 models, with 671B total parameters and 37B active parameters ➤ Providers: There are select third-party providers that are hosting this model such as @DeepInfra (FP4 quantized) and @novita_labs (FP8 quantized)
16
45
479
74,855
Google's Veo 3, released today, is the first major video generation model to support native audio generation - allowing users to create dialogue, sound effects, music ambient noise and more. Veo 3 is now live in the Artificial Analysis Video Arena - get voting to see where it will land! Our initial testing confirms Google's claims of improved physics, prompt understanding, and overall video quality. Veo 3 also continues Google’s strict moderation of video prompts which cannot be turned off. Veo 3 is priced at $0.5 per second of video - the same price as Veo 2, and rises to $0.75 per second of video with audio. We show video-only generations in the Artificial Analysis Video Arena to allow it to be compared with other video models that generate no audio. Veo 3 is available today via: Vertex AI API on Google Cloud (although confusingly not in the Vertex AI Studio - which is also different from Google’s standalone AI Studio and AI Studio API) Flow, Google’s new AI Video editing app that integrates image, video, and music generation The Gemini consumer app, although only for Ultra subscribers in the US See below for Veo 3 generations with audio, and compare it to other leading models in our Video Gen Arena 🧵
47
63
464
70,060
OpenAI’s GPT-4o Image Generation debuts with an ELO score in equal first-place in the Artificial Analysis Image Arena, outperforming Recraft V3, FLUX 1.1 [pro] and Gemini 2.0 Flash @OpenAI last week launched GPT-4o Image Generation, upgrading ChatGPT’s built-in image generation from the previous system that used OpenAI’s DALL-E dedicated image generator. GPT-4o Image Generation supports both text and image prompt input, allowing image editing with instruction prompting. In our category breakdowns, the model excels particularly at Text & Typography, People: Portraits, Anime and SciFi whereby it holds the top ranking. OpenAI has disclosed that 4o Image Generation is an “autoregressive model natively embedded” within the GPT-4o model used by ChatGPT. However, in their launch ‘demo’ images, OpenAI hints at a hybrid architecture. This could look like an autoregressive transformer generating latent space representations, which are then converted into pixels using diffusion techniques. OpenAI first demonstrated GPT-4o’s ability to output images in May 2024 when GPT-4o was first launched. Google beat OpenAI to a public release of native image generation capability in a modern language model with their Gemini 2.0 Flash native image generation in early March. However, Gemini 2.0 Flash is ranked 27th to GPT-4o’s 2nd in Image Arena. Beyond image generation, we have anecdotally found Gemini 2.0 Flash to be better than GPT-4o for certain image editing tasks where keeping an input image consistent is critical. Link below to see the leaderboard and participate in our Image Arena 🔽
15
42
460
95,632
Z ai’s updated GLM 4.6 (Reasoning) is one of the most intelligent open weights models, with near DeepSeek V3.1 (Reasoning) and Qwen3 235B 2507 (Reasoning) level intelligence 🧠 Key intelligence benchmarking takeaways: ➤ Reasoning Model Performance: GLM 4.6 (Reasoning) scores 56 on the Artificial Analysis Intelligence Index, up from GLM 4.5’s score of 51 in reasoning mode ➤ Non-Reasoning Model Performance: In non-reasoning mode, GLM 4.6 achieves a score of 45, placing it 2 points ahead of GPT-5 (minimal, non-reasoning) ➤ Token efficiency: Z ai has increased GLM’s evaluation scores while decreasing output tokens. For GLM 4.6 (Reasoning), we see a material decrease of 14% in token usage to run Artificial Analysis Intelligence Index from 100M to 86M, compared to GLM 4.5 (Reasoning). This is different from other model upgrades we have seen where increase in intelligence is often correlated with increase in output token usage. In non-reasoning mode, GLM 4.6 uses 12M output tokens for the Artificial Analysis Intelligence Index Other model details: ➤🪙 Context Window: 200K token context. This is larger compared to GLM 4.5’s context window of 128K tokens ➤📏 Size: GLM 4.6 has 355B total parameters and 32B active parameters - this is the same as GLM 4.5. For self-deployment, GLM 4.6 will require ~710GB of memory to store the weights in native BF16 precision and cannot be deployed on a single NVIDIA 8xH100 node (~640GB of memory) ➤©️ Licensing: GLM 4.6 is available under the MIT License ➤🌐 Availability: GLM 4.6 is available on Z ai’s first-party API and several third-party APIs such as DeepInfra (FP8), Novita (BF16), GMI Cloud (BF16) and Parasail (FP8)
9
53
454
38,250
GPT-5 occupies both the #1 and #2 positions in our long context reasoning benchmark (AA-LCR) 🤯 AA-LCR tests long context performance through testing reasoning capabilities across multiple long documents (~100k tokens). Questions typically require considering multiple documents in the set and require analysis to reach the answer. More information on our eval below 👇
22
60
453
100,732
Google has released an updated version of Gemini 2.5 Flash today at I/O 2025! In reasoning mode, the model’s intelligence rises above Qwen3 235B-A22B, and in non-reasoning mode it is now equivalent to GPT-4.1 and DeepSeek V3 Gemini 2.5 Flash Preview (05-20), is an upgraded version of 2.5 Flash Preview (04-17) which was released in April 2025. Gemini 2.5 Flash is a hybrid model, meaning users can choose to enable reasoning mode or not (when in reasoning mode the model outputs tokens to 'think' before answering) Results from our independent benchmarking: ➤ Gemini 2.5 Flash (Reasoning) scores 65, up from 60. This places the model ahead of Qwen3 235B A22B (Reasoning) and Llama 3.1 Nemotron Ultra Reasoning, but remains less intelligence than Google’s leading model, Gemini 2.5 Pro, which scores 69 ➤ Gemini 2.5 Flash (Non Reasoning) scores 53, up from 49, leapfrogging as one of the leading non-reasoning models. It is now equivalent to DeepSeek V3 0324 and OpenAI’s GPT 4.1 in intelligence. This is the first major jump in non-reasoning intelligence for Google’s Gemini Flash models since Gemini 2.0 Flash - the earlier 04-17 preview of Gemini 2.5 Flash did not significantly improve from Gemini 2.0 Flash’s non-reasoning score Google has not changed the pricing for this new release. Like the earlier release, it is priced differently based on if the model is in reasoning mode or not: ➤ Gemini 2.5 Flash (Reasoning): $0.15/$3.50 per million input/output tokens ➤ Gemini 2.5 Flash (Non Reasoning): $0.15/$0.6 per million input/output tokens Analysis of Gemma 3n is coming soon 👀
25
54
439
48,353
Announcing Artificial Analysis Intelligence Index V2 - the biggest upgrade to our eval suite yet Summary of Intelligence Index V2: ➤ Harder evals: MMLU-Pro, HLE (Humanity's Last Exam), GPQA Diamond, MATH-500, AIME 2024, SciCode, and LiveCodeBench - see below for a description of each evaluation. ➤ Independent: As always, Artificial Analysis has independently run every eval on every model - no inconsistent lab-claim results anywhere to be seen ➤ Standardized: We evaluate models under identical conditions with consistent prompting, temperature settings and answer extraction techniques ➤ Extensive sensitivity testing: We’ve run every eval in Index V2 dozens of times in our pre-launch assessment phase to understand variability, and set the number of repeats we use to achieve our target confidence intervals ➤ More robust software stack: This one is a little inside baseball but is actually a pretty big deal - we’re running tens of thousands of queries on hundreds of models so our entire benchmarking stack has to be extremely robust, and allow our team to monitor evals for errors and anomalies so we can have confidence in every number published Artificial Analysis has independently run thousands of evals across hundreds of models to support this launch - today, we already have Intelligence Index scores for all leading models published on our updated website. For further information regarding how models perform, the evals we have chosen to include and our methodology, see below.
26
65
442
45,742
Amazon has launched Nova, a highly competitive family of foundation models. Nova Pro, Lite and Flash set new standards for the intelligence that can be accessed at the price and speed these models are offered at. Nova Pro, the flagship model, ranks amongst the leading frontier models in the Artificial Analysis Quality Index. With a score of 75, Pro ranks higher than GPT-4o (November release), Mistral Large 2 and Llama 3.1 405B. Access is priced competitively at $0.8/1M Input tokens and $3.2/1M output tokens, ~1/3 the cost of GPT-4o ($2.5/$10). Nova Lite and Micro are smaller and faster models that offer competitive intelligence for their price class. Micro can be accessed at 157 output tokens/s, faster than Gemini 1.5 Flash, Llama 3.1 8B (median of providers) and GPT-4o mini. Lite and Micro are competitively priced at $0.06 and $0.1 per 1M tokens respectively (blended 3:1, input:output) positioning them well for speed and/or price-sensitive use-cases. See below for deep dives on the performance and capabilities of these models.
11
84
438
67,582
Announcing Artificial Analysis State of AI: China 🇨🇳 For the first time, we are charting the rise of China’s top AI companies to the AI frontier. In this benchmarking report, we map the Chinese AI ecosystem and present comparisons to leading US models. DeepSeek R1 didn’t happen overnight - we’ve been benchmarking each release over the previous year. One year ago the AI frontier was overwhelmingly dominated by US companies. Today, nearly a dozen Chinese companies have models matching or exceeding current generation models from most US labs. Key topics addressed: ‣ The great catch-up: Over the past year, Chinese AI labs have mostly caught up to US labs in terms of the intelligence of their released models ‣ Leading with open weights: Many of the leading models from Chinese AI labs, including DeepSeek’s R1 and V3, are open weights. This contrasts with leading US labs whose models are predominately closed source ‣ Regulation: The timeline of US export controls and the resulting restrictions on which NVIDIA chips can be exported to China
18
106
434
59,496
OpenAI’s GPT-4.1 series is a solid upgrade: smarter and cheaper across the board than the GPT-4o series @OpenAI's GPT-4.1 family includes three models: GPT-4.1, GPT-4.1-mini and GPT-4.1 nano. We have independently benchmarked these with our Artificial Analysis Intelligence Index and the results are impressive: ➤ GPT-4.1 scores 53 - beating out Llama 4 Maverick, Claude 3.7 and GPT-4o to score identically to DeepSeek V3 0324. ➤ GPT-4.1 mini, likely a smaller model, actually matches GPT-4.1’s Intelligence Index score while being faster and cheaper. Across our benchmarking, we found that GPT-4.1 mini performs marginally better than GPT-4.1 across coding tasks (scoring equivalent highest on SciCode and matching leading reasoning models). ➤ GPT-4.1 nano scores 41 on Intelligence Index, approximately in line with Llama 3.3 70B and Llama 4 Scout. This release represents a material upgrade over GPT 4o-mini which scores 36. Developers using GPT-4o and GPT-4o mini should consider immediately upgrading to get the benefits of greater intelligence at lower prices. Further model details: ➤ Long context: All versions of GPT-4.1 support a 1 million token context window, the largest that we have seen across OpenAI models and in line with a new emerging 1M token standard, matching Gemini 2.0 Flash, Gemini 2.5 Pro and Llama 4 Maverick (Llama 4 Scout supports up to a 10M context window but this capability is not widely available via cloud providers yet). ➤ Pricing: OpenAI has priced the new models competitively at $2/$8 per million input/output tokens for GPT-4.1, $0.40/$1.60 for GPT-4.1 mini and $0.10/$0.40 for GPT-4.1 nano. These prices represent a significant cut to overall API pricing for OpenAI’s highest volume models.
27
41
430
93,828
Grok 4 scores higher in Artificial Analysis Intelligence Index than any other model. Its pricing is higher than OpenAI’s o3, Google’s Gemini 2.5 Pro and Anthropic’s Claude 4 Sonnet - but lower than Anthropic’s Claude 4 Opus and OpenAI’s o3-pro.
8
24
418
60,218
Grok 3 mini Reasoning’s recently launched API has highly compelling Intelligence vs. Price positioning @xai has recently launched APIs for Grok 3 and Grok 3 mini, after initially only making the models available via the Grok chat interface at launch. We have now completed benchmarking for both Grok 3 and Grok 3 mini. Grok 3 mini (high reasoning) stands out in the top left of our Intelligence vs Price chart, achieving one of the highest Artificial Analysis Intelligence Index scores ever, with pricing well below even DeepSeek R1. Grok 3 family overview: We’re initiating coverage of 6 versions of the Grok 3 family: Grok 3, Grok 3 Fast, Grok 3 mini (low reasoning), Grok 3 mini Fast (low reasoning), Grok 3 mini (high reasoning) and Grok 3 mini Fast (high reasoning). Today’s post focuses on our intelligence results; performance breakdown of all 6 versions will be available soon. Intelligence: As we highlighted at launch, Grok 3 marks xAI’s arrival at the AI frontier. Both Grok 3 mini Reasoning (reasoning effort high) and Grok 3 are amongst the top 5 models in their reasoning and non-reasoning categories respectively. Grok 3 mini Reasoning (high) ranks higher in our Artificial Analysis Intelligence Index than DeepSeek R1 and Claude 3.7 Sonnet (64k reasoning budget). Pricing: Grok 3 mini Reasoning is almost an order of magnitude cheaper than other models of similar intelligence (o4-mini, Gemini 2.5 Pro) at $0.3/$0.5 per 1M Input/Output tokens for the ‘base’ speed version. xAI are charging $0.6/$4 for the faster Grok 3 mini endpoint. OpenAI’s o4-mini is $1.1/$4.4 and Gemini 2.5 Pro $1.25/$10. Grok 3 is more expensive at $3/$15 per 1M Input/Output tokens. End to End Latency: Grok 3 offers significantly faster response times compared to Grok 3 mini Reasoning due to ‘thinking’ time. We measure Grok 3 as taking 9.5s to output 500 tokens, and Grok 3 mini Reasoning as taking 27.4s to complete reasoning and output 500 tokens (standard speed endpoints, comparison to Fast endpoints coming soon).
14
44
422
92,848
Alibaba has released Qwen3-Max: the second most intelligent non-reasoning model. At over 1T parameters, it is Alibaba’s largest model to date, though the weights have not been released Key takeaways: 🔒 Proprietary: Qwen3-Max (Preview) is a proprietary model as Alibaba has not released the weights. All other models in the Qwen3 family are open weights 🧠 Intelligence: Qwen3-Max scores 49 on the Artificial Analysis Intelligence Index. This places the model ahead of DeepSeek V3.1 (Non Reasoning Mode) but equivalent to Kimi K2 0905 released last week. Qwen3-Max scores higher than Qwen3 235B 2507 (Non Reasoning), which scores 45 ⚙️ Output tokens and verbosity: Uses ~14M output tokens for the Intelligence Index, less than Qwen3-235B-2507 (~24M) and in line with Kimi K2 0905 (16M) and DeepSeek V3.1 (14M, Non-Reasoning) 💲Pricing model: Per token pricing scales with context at $1.2/$6 per 1M input/output tokens (0-32K input tokens), $2.4/$12 per 1M tokens (32K-128K), and $3/$15 per 1M tokens (128K-252K context windows). Even at the lowest pricing tier, Qwen3-Max is more expensive on a per token basis compared to Qwen3 235B 2507 (Non Reasoning) which is priced at $0.7/$2.8 per 1M input/output tokens ⚙️ Model details: The model has a context window of 256k context and is text-only, with no multimodal inputs or outputs Qwen3-Max (Preview) is currently available in Qwen Chat and via Alibaba Cloud.
13
50
412
30,960
DeepSeek has launched V3.2 Exp with their new DeepSeek Sparse Attention (DSA) architecture that claims to reduce the impact of the quadratic scaling of compute with context length We’ve independently benchmarked V3.2 Exp as achieving similar intelligence to DeepSeek V3.1 Terminus; DeepSeek have switched to using V3.2 for their main API endpoint and have reduced API pricing by >50%. With DeepSeek’s updated first party API pricing, cost to run Artificial Analysis Intelligence Index falls from $114 to $41. DeepSeek claims to have “deliberately aligned” the training configurations of V3.1 Terminus and V3.2 Exp. Matching V3.1 Terminus’ performance appears to demonstrate that the performance benefits of the DeepSeek Spare Attention architecture do not come at a cost to intelligence. Key benchmarking takeaways: ➤🧠  No change in aggregate intelligence: In reasoning mode, DeepSeek V3.2 Exp scores 57 on the Artificial Analysis Intelligence Index. We see this as equivalent in intelligence to DeepSeek V3.1 Terminus (Reasoning) ➤📈 No decline in long context reasoning: Despite DeepSeek’s architecture changes, V3.2 Exp (Reasoning) appears not to exhibit any decline in long context reasoning - scoring a slight uplift in AA-LCR. ➤⚡ Non-reasoning performance: In non-reasoning mode, DeepSeek V3.2 Exp shows no degradation in intelligence, matching DeepSeek V3.1 Terminus with a score of 46 on the Artificial Analysis Intelligence Index ➤⚙️ Token efficiency: For DeepSeek V3.2 Exp (Reasoning), token usage to run the Artificial Analysis Intelligence Index decreases slightly from 67M to 62M compared to V3.1 Terminus. Token usage remains unchanged for the non-reasoning variant ➤💲Pricing: DeepSeek has significantly reduced the per token pricing for their first-party API from $0.56/$1.68 to $0.28/$0.42 per 1M input/output tokens - a 50% and 75% reduction in pricing of input and output tokens respectively. Other model details: ➤©️ Licensing: DeepSeek V3.2 Exp is available under the MIT License ➤🌐 Availability: DeepSeek V3.2 Exp is available via DeepSeek API, which has replaced DeepSeek V3.1 Terminus. Users can still access DeepSeek V3.1 Terminus via a temporary DeepSeek API until 15th October ➤📏 Size: DeepSeek V3.2 Exp has 671B total parameters and 37B active parameters. This is the same as all previous models in the DeepSeek V3 and R1 series
13
37
403
32,241
OpenAI’s o3-mini is here - a significant jump forward from o1-mini Initial results (full benchmarking coming soon): ➤ Artificial Analysis Quality Index of 89, matching DeepSeek R1 and just below o1 ➤ Cheaper - $1.1/$4.4 input/output pricing per million tokens, lower than many DeepSeek R1 APIs (higher than DeepSeek’s first party R1 API) ➤ Fast - similar speed to o1-mini at 170 tokens/s, although that means 2000 tokens of ‘thinking’ time will still take ~12 seconds
24
59
400
79,770
A new leader in the Artificial Analysis Image Arena? 👀
28
31
393
86,811
OpenAI’s o1-preview is the first model to substantially push the frontier of language model intelligence since the original GPT-4 over 18 months ago Since @OpenAI GPT-4’s March 2023 launch, we’ve seen dozens of companies scramble to catch up with OpenAI. But until o1, we haven’t seen the next leap forward for intelligence. Our independent evaluations on o1-preview confirm that o1 achieves the most substantial leap in Artificial Analysis Intelligence Index since GPT-4. We continue to believe that o1 is generally unsuitable for the majority of production use-cases today due to speed and cost trade-offs (links below to previous detailed coverage), but the significance of OpenAI taking back the intelligence lead should not be understated. Our Frontier Language Models Over Time chart below tracks the Artificial Analysis Intelligence Index scores of the leading frontier-class model from OpenAI, Anthropic, Google, Mistral and Meta.
11
66
396
54,023
Whisper no longer wears the open weights transcription accuracy crown with new entrants achieving better Artificial Analysis Word Error Rate scores Once considered the default choice for open weights transcription, OpenAI’s Whisper has now been surpassed by newer open weights models on the Artificial Analysis Word Error Rate (AA-WER) benchmark measuring transcription accuracy. AA-WER comprises three challenging datasets aligned with real-world use cases: AMI-SDM (multi-speaker meetings), Earnings-22 (earnings calls), and VoxPopuli (parliamentary proceedings). Top open weights performers: @NVIDIA’s Canary Qwen 2.5B and Parakeet TDT 0.6B V2, followed by @Mistral’s Voxtral Small and Mini, and @IBM Granite Speech 3.3 8B. Open weights Speech to Text models offer deployment flexibility, cost benefits, the potential for customization/fine-tuning, and enable use-cases such as privacy-sensitive workloads that need to run locally.
26
52
401
41,425
We benchmarked Apple's new On-Device model: trails most Gemma and Qwen on-device suitable models but still very useful GPQA Diamond performance trailed models that are suitable for on-device use such as the smaller Gemma models (3n E4B, 4B, 12B) and Qwen3 models (1.7B, 4B, 8B). The model is also quite slow, we measured ~15 output tokens/s on a M1 Pro 8 core with 16GB RAM (GPT-4o, the default in ChatGPT, is ~170 output tokens/s). 🚨 However, despite being 3B parameters, the model’s weights are stored at just 2 bits (core decoder layers), with 4-bit embeddings and an 8-bit KV cache. As such, the memory footprint is much closer to Gemma 1B or Gemma 3n E2B than to any 4B model. Furthermore, this model will be baked into future MacOS and iOS (Apple Intelligence) releases. As such, while it likely isn't the optimal solution as a primary AI assistant, it should be more than capable of many background tasks (expect this will represent most use) and facilitating interactions with a device (schedule appointment, initiate actions in various software, use by software for offline tasks, etc).
33
41
391
64,349
Meta launches Llama 3.3 70B, achieving a level of intelligence previously reserved for Llama 3.1 405B and leapfrogging the November release of GPT-4o We have completed our first round of independent evals on Llama 3.3 70B and are seeing a jump in Artificial Analysis Quality Index from 68 to 74, now matching Llama 3.1 405B’s score. Congratulations to @AIatMeta on an excellent update! Details: ➤ Biggest increases in MATH-500 (64% to 76%), GPQA Diamond (43% to 49%) and HumanEval (80% to 85%) ➤ Smaller increase in MMLU (84% to 86%) ➤ Llama 3.3 70B now leads Llama 3.1 405B in Math-500, and scores nearly equal to 405B in MMLU, GPQA Diamond and HumanEval ➤ With no change to model size, we anticipate that most providers serving Llama 3.1 70B APIs will imminently launch Llama 3.3 70B endpoints at equivalent price and speed to the 3.1 70B endpoints See below for a full breakdown.
12
51
382
68,238
OpenAI's Sora is now the leader in the Artificial Analysis Video Generation Arena! After 3,710 appearances or 'battles' in the arena over the past 2 days, Sora now has an ELO score of 1,151. This places it as the clear #1 in the Artificial Analysis Video Generation Arena Leaderboard. See the below posts for comparisons taken from the arena between @OpenAI 's Sora and the other top 3 models, @Kling_ai 1.5, @Hailuo_AI and @genmoai 's Mochi 1. Comparisons below 🔽
14
51
375
95,594
Seedream 4.0 is #1 in the Artificial Analysis Image Editing Leaderboard, and has pushed forward the state of the art along with the recent Gemini 2.5 Flash Image (Nano-Banana) release! We've compiled some examples so you can see for yourself just how much they've improved over GPT-4o
5
43
380
31,004
Inception Labs has launched the first production-ready Diffusion LLMs. Mercury Coder Mini achieves >1,000 output tokens/s on coding tasks while running on NVIDIA H100s - over 5x faster than competitive autoregressive LLMs on similar hardware Inception’s Diffusion LLMs (“dLLMs”) use a new architecture compared to traditional autoregressive LLMs. This launch signifies a major step forward for LLM architectures and we expect it to accelerate research into this new paradigm. Previous research has demonstrated the potential of Diffusion LLMs but this is the first time we are seeing models released with production-ready performance and intelligence. Inception’s Mercury Coder Mini and Mercury Coder Small are ready now for latency-sensitive coding use cases. Architecture breakdown: ➤ Traditional autoregressive LLMs generate text one token at a time, while diffusion language models use a "coarse-to-fine" approach, refining the entire output through multiple denoising steps. This approach is similar to how most image and video models work. ➤ Critically, this allows parallelization of output token generation - allowing faster output speeds because many output tokens are generated at the same time. For autoregressive LLMs, GPUs can process input tokens in parallel but have to generate output tokens one-by-one - and are therefore face a memory bandwidth bottleneck on their maximum output speed. ➤ The Diffusion LLM generation process starts with noise and iteratively refines the output using a Transformer model that can modify multiple tokens in parallel. Model details: ➤ Inception has launched two models under the name of ‘Mercury’: Mercury Coder Small and Mercury Coder Mini. ➤ These models are fast. For coding workloads, we have benchmarked Mercury Small and Mini as achieving 737 and 1,109 output tokens/s respectively. These speeds have historically been reserved for non-GPU architectures, such as on Groq, SambaNova and Cerebras. ➤ The models are focused on coding and are positioned well for latency-sensitive coding use cases. Mercury Coder Small and Mini achieve scores of 23 and 16 in the Artificial Analysis Coding Index (average across LiveCodeBench and SciCode), positioning them as competitive with Gemini 2.0 Flash-Lite and GPT-4o mini. Inception Labs background: ➤ This is the first release from Inception Labs. The founders were previously professors from Stanford, UCLA, and Cornell and have contributed to AI research & technologies including Flash Attention, Decision Transformers, and Direct Preference Optimization. See below for further analysis and a link to try out the models yourself.
8
53
367
79,851
Alibaba has released Qwen3 Omni and Qwen3 Omni Realtime - two natively end-to-end "omni"-modal models that process text, images, audio, and video in a single unified architecture. Artificial Analysis benchmarking shows competitive Speech to Speech performance, as well as high-speeds for Realtime version. Qwen3 Omni and Qwen3 Omni Realtime process audio, video, and text inputs directly in a unified architecture, enabling native multimodal reasoning and simultaneous generation of both text and natural speech responses. The architecture uses a "Thinker" MoE and "Talker" MoE, with the Talker decoupled from the Thinker's text representations to enable independent control of response style and audio characteristics. Both models supports 119 text languages, 19 speech input languages, and 10 speech output languages. Speech to Speech capabilities: ➤ Reasoning: Qwen3 Omni 30B scores 58% on Big Bench Audio while Qwen3 Omni Realtime scores 59% - both ahead of Gemini 2.0 Flash (36%) but trailing GPT-4o Realtime (68%) ➤ Latency: Time to first audio averages 4.8 seconds for Qwen3 Omni 30B and 0.9 seconds for Qwen3 Omni Realtime, compared to leading models at 0.6s. There is still a way to go for models to reach human levels of responsiveness, averaging 0.2-0.3 seconds Additional model details: ➤ Availability: Qwen3 Omni 30B is available via Alibaba Cloud DashScope API Qwen3-Omni-Flash endpoint. Qwen3-Omni-30B-A3B model weights (Instruct, Thinking, and Captioner variants) are available on Hugging Face and ModelScope under Apache 2.0 license ➤ Voice options: 17 voice types available via API with 24kHz audio output
8
36
359
82,092
Qwen3 model family overview: full benchmarks for all 8 Qwen3 models in both reasoning and non-reasoning modes Key results: ➤ Qwen3 235B-A22B (Reasoning): The largest Qwen3 model scores 62 on the Artificial Analysis Intelligence Index, becoming the most intelligent open weights model ever. This is very impressive considering the model has only 22B active parameters with 235B total, very few compared to its nearest competitors - NVIDIA’s Llama Nemotron Ultra (dense, 253B) and DeepSeek R1 (37B active, 671B total). One thing Qwen3 is missing is multimodal inputs - Llama 4 and Gemma 3 remain the best open weights models for vision capability. ➤ Qwen3 32B (Reasoning): The largest dense model in the Qwen3 family scores 59 on our Intelligence Index, just behind DeepSeek R1. While 235B-A22B will be both more intelligent and efficient for large scale inference, the 32B is highly compelling for deployments constrained by total memory (including local inference). ➤ Qwen3 30B-A3B (Reasoning): The smaller MoE scores 56 in Intelligence Index, matching the dense 14B. With just 3B active parameters, this model can achieve incredible speed compared to other models of similar intelligence. ➤ Smaller Qwen3 models: 0.6B, 1.7B, 4B and 8B are each independently strong models for their size when used in reasoning mode. These are particularly compelling for on-device use cases. ➤ Non-reasoning performance: We tested all 8 Qwen3 models in non-reasoning mode (using the /no_think soft switch) and overall find that while the models remain effective in non-reasoning mode, they are generally not in a clear leadership position compared to competing non-reasoning models. This may indicate that there continues to be a real cost of a hybrid reasoning approach, as opposed to separate dedicated models. Observations from our detailed analysis of the Qwen3 models: ➤ Consistent uplift from reasoning: we see a significant jump for all models, resulting in interesting consequences like 4B (reasoning) matching the score of 235B-A22B (non-reasoning). We would caution that 235B-A22B is likely to outperform significantly in real world use where reasoning provides a less consistent uplift ➤ Clear demonstration of benefits of MoE models: on the Active Parameters chart, the two MoE models (235B-A22B and 30B-A3B) clearly sit above the trendline formed by the dense models Detailed breakdowns of the full Qwen3 family follow - including token usage.
9
55
363
35,693
Announcing Artificial Analysis Long Context Reasoning (AA-LCR), a new benchmark to evaluate long context performance through testing reasoning capabilities across multiple long documents (~100k tokens) The focus of AA-LCR is to replicate real knowledge work and reasoning tasks, testing capability critical to modern AI applications spanning document analysis, codebase understanding, and complex multi-step workflows. AA-LCR is 100 hard text-based questions that require reasoning across multiple real-world documents that represent ~100k input tokens. Questions are designed so answers cannot be directly found but must be reasoned from multiple information sources, with human testing verifying that each question requires genuine inference rather than retrieval. Key takeaways: ➤ Today’s leading models achieve ~70% accuracy: the top three places go to OpenAI o3 (69%), xAI Grok 4 (68%) and Qwen3 235B 2507 Thinking (67%) ➤👀 We also already have gpt-oss results! 120B performs close to o4-mini (high), in-line with OpenAI claims regarding model performance. We will be following up shortly with a Intelligence Index for the models. ➤ 100 hard text-based questions spanning 7 categories of documents (Company Reports, Industry Reports, Government Consultations, Academia, Legal, Marketing Materials and Survey Reports) ➤ ~100k tokens of input per question, requiring models to support a minimum 128K context window to score on this benchmark ➤ ~3M total unique input tokens spanning ~230 documents to run the benchmark (output tokens typically vary by model) ➤ Link to dataset on 🤗 @HuggingFace is below We’re adding AA-LCR to the Artificial Analysis Intelligence Index, and taking the version number to v2.2. Artificial Analysis Intelligence Index v2.2 now includes: MMLU-Pro, GPQA Diamond, AIME 2025, IFBench, LiveCodeBench, SciCode and AA-LCR. All numbers are updated on the site now. Find out which models Artificial Analysis Intelligence Index v2.2 👇
14
31
367
29,890
Nvidia has broken through prior barriers with their B200 GPUs We have conducted independent benchmarking and are seeing >1,000 output tokens/s on Llama 4 Maverick, >10X the speed of some other providers. This represents the fastest Maverick endpoint that we have benchmarked yet. Exciting times ahead for developers when B200-based APIs are publicly available.
10
29
354
65,319
Recent open weights releases are reducing the gap to proprietary frontier models on agentic workflows On the Terminal-Bench Hard evaluation for agentic coding and terminal use, open-weights models such as DeepSeek V3.2 Exp, Kimi K2 0905, and GLM-4.6 have made large strides, with DeepSeek surpassing Gemini 2.5 Pro. These advances reflect significantly higher capability for use in coding and other agent use cases, and developers have a wider range of model options than ever for these applications. See below for our analysis of the price and performance of providers to help you make use of these leading models 👇
13
35
352
96,888
Anthropic launches their first Haiku model in 11 months - Claude 4.5 Haiku jumps 35 points in Artificial Analysis Intelligence Index to become relevant again Claude 4.5 Haiku is 3x cheaper per token than Claude 4.5 Sonnet. Running the Artificial Analysis Intelligence Index costs $262 with Haiku vs. $817 with Sonnet (~3x more) - full cost breakdown in the thread below. There is a stronger case for 4.5 Haiku’s intelligence/cost positioning than 3.5 Haiku, but there is a competitive field of cheaper alternatives with similar intelligence that may make more sense for developers without a reason to need a Claude model specifically - including gpt-oss-120b (reasoning high, ~$75) and Grok 4 Fast (~$40). Claude 4.5 Haiku is the most cost-effective way to access ‘Claude’ but is not the most cost effective model for its level of intelligence. Key benchmarking results: ➤🧠 Model Intelligence: In reasoning mode, Claude 4.5 Haiku scores 55 on the Artificial Analysis Intelligence Index. This is 8 points lower than Claude 4.5 Sonnet (Thinking) and 4 points lower than Claude 4.1 Opus (Thinking). Claude 4.5 Haiku (Thinking) places marginally ahead of Gemini 2.5 Flash (Reasoning, 54) but behind other reasoning models such as Qwen3 235B 2507 (57), DeepSeek V3.2 Exp (57), and GLM 4.6 (56) ➤📈 Intelligence Uplift: Anthropic released Claude 3.5 Haiku in November 2024, ~11 months before the release of Claude 4.5 Haiku. Claude 4.5 Haiku is a significant improvement in intelligence compared to Claude 3.5 Haiku, scoring 67% on GPQA Diamond in reasoning mode (compared to 41% for Claude 3.5 Haiku) ➤⚙️ Notable Strengths: In reasonign mode, Claude 4.5 Haiku performs well in long-context reasoning (70% on AA-LCR, behind only GPT-5 High) and coding (43%, matching GPT-5 High and Gemini 2.5 Pro) ➤⚡ Non-Reasoning Performance: In non-reasoning mode, Claude 4.5 Haiku scores 42 on the Artificial Analysis Intelligence Index. This places the model in line with GPT-5 (minimal, 43) but behind Gemini 2.5 Flash (non-reasoning, 47) ➤💲 Pricing: Claude 4.5 Haiku is priced at $1/$5 per 1M input/output tokens, which makes it 3x cheaper compared to Claude 4.5 Sonnet (priced at $3/$15 per 1M input/output tokens) ➤⚙️ Token Efficiency: Anthropic’s Claude models continue to be more token-efficient than all other reasoning models. For Claude 4.5 Haiku (Thinking) -evaluated with a maximum reasoning budget of 64k tokens - we see the model use 39M output tokens to run the Artificial Analysis Intelligence Index. Its token usage is lower than Claude 4.5 Sonnet (42M) but higher than Claude 4.1 Opus (30M) in thinking mode Key model details: ➤📏 Context Window: 200K tokens. This is equivalent to Claude 4.5 Sonnet ➤🌐 Availability: Claude 4.5 Haiku is available via Anthropic‘s API, Google Vertex and Amazon Bedrock. Claude 4.5 Haiku is also available via Claude, and Claude Code
11
40
360
33,258
Kling 2.5 Turbo takes the top spot in both Text to Video and Image to Video in the Artificial Analysis Video Arena, surpassing Hailuo 02 Pro, Google’s Veo 3, and Luma Labs’ Ray 3! Kling 2.5 Turbo is the latest release from @Kling_ai , representing a significant leap from Kling 2.1. The model supports 5s and 10s video generations at resolutions up to 1080p. It's available directly on the Kling AI app at 25 Credits for a 5s video, with videos costing approximately 15c each on the highest "Ultra" plan. The model is also accessible via API on major API platforms. At $4.20 per minute of video on @fal , Kling 2.5 Turbo is slightly cheaper than its primary competitors - Hailuo 02 Pro at $4.90 per minute and Seedance 1.0 at approximately $7.32 per minute - while delivering superior quality. See below for comparisons between Kling 2.5 Turbo and other leading models in our Artificial Analysis Video Arena 🧵
10
43
353
55,498
Base, with no tools. We have not tested Grok 4 Heavy yet.
8
7
348
21,736
🇰🇷 LG recently launched EXAONE 4.0 32B - it scores 62 on Artificial Analysis Intelligence Index, the highest score for a 32B model yet @LG_AI_Research's EXAONE 4.0 is released in two variants: the 32B hybrid reasoning model we’re reporting benchmarking results for here, and a smaller 1.2B model designed for on-device applications that we have not benchmarked yet. Alongside Upstage's recent Solar Pro 2 release, it's exciting to see Korean AI labs join the US and China near the top of the intelligence charts. Key results: ➤ 🧠 EXAONE 4.0 32B (Reasoning): In reasoning mode, EXAONE 4.0 scores 62 on the Artificial Analysis Intelligence Index. This matches Claude 4 Opus and the new Llama Nemotron Super 49B v1.5 from NVIDIA, and sits only 1 point behind Gemini 2.5 Flash ➤ ⚡ EXAONE 4.0 32B (Non-Reasoning): In non-reasoning mode, EXAONE 4.0 scores 51 on the Artificial Analysis Intelligence Index. It matches Llama 4 Maverick in intelligence despite having only ~1/4th total parameters (although has ~2x the active parameters) ➤ ⚙️ Output tokens and verbosity: In reasoning mode, EXAONE 4.0 used 100M output tokens for the Artificial Analysis Intelligence Index. This is higher than some other frontier models, but aligns with recent trends of reasoning models using more output tokens to 'think more' - similar to Llama Nemotron Super 49B v1.5, Grok 4, and Qwen3 235B 2507 Reasoning. In non-reasoning mode, EXAONE 4.0 used 15M tokens - high for a non-reasoner, but not as high as Kimi K2’s 30M. Key details: ➤ Hybrid reasoning: The model offers optionality between 'reasoning' mode and 'non-reasoning' mode ➤ Availability: Hosted by @friendliai currently, and competitively priced (especially compared to proprietary options) by FriendliAI at $1 per 1M input and output tokens ➤ Open weights: EXAONE 4.0 is an open weights model available under the EXAONE AI Model License Agreement 1.2. The license limits commercial use. ➤ Multimodality: Text only input and output ➤ Context window: 131k tokens ➤ Parameters: 32B active and total parameters, available in 16bit and 8bit precision (means the model can be run on a single H100 chip in full precision)
12
63
353
45,727
Kimi K2 Providers: Groq is serving Kimi K2 at >400 output tokens/s, 40X faster than Moonshot’s first-party API Congratulations to a number of providers to being quick to launch APIs for Kimi K2, including @GroqInc , @baseten , @togethercompute, @FireworksAI_HQ, @parasail_io, @novita_labs, @DeepInfra, and of course @Kimi_Moonshot. This is impressive considering the size of the model at 1 trillion total parameters. Groq stands out for blazing fast speed. DeepInfra, Novita and Baseten stand out for their pricing, being the only providers pricing similarly to or more cheaply than Moonshot’s first party API. See below for further comparisons between the providers. We’re expecting fast increases in speed across some providers as teams optimize for the K2 model - our numbers below show median speeds over the last 72 hours but we’re already seeing DeepInfra jump up to 62 tokens/s in today’s measurements
17
28
340
83,284
HiDream-I1-Dev is the new leading open-weights image generation model, overtaking FLUX1.1 [pro] in the Artificial Analysis Image Arena! HiDream, the Chinese company behind @vivago_ai, has just open-sourced their HiDream-I1 family of models under the MIT License. This impressive 17B parameter model comes in three variants: Full, Dev, and Fast HiDream is launching their API soon and we will provide coverage of their endpoints once it launches. See the below thread for image comparisons or see the generations for yourself in the Artificial Analysis Image Arena!
16
50
339
36,087
We’re updating the Artificial Analysis Intelligence Index! We’ve added IFBench to cover instruction following, done some housekeeping and are adding more benchmarks over the next couple of weeks 👀 The focus of the Artificial Analysis Intelligence Index is to provide a useful synthesis metric to compare the overall intelligence of language models. We look to continually keep it up to date to ensure it stays relevant to developers and in this context we are making some upgrades over the next couple of weeks. How we’re approaching these upgrades: ➤ IFBench: the @allen_ai team released IFBench in July 2025 to test how well models follow user instructions in single- and multi-turn contexts. We see instruction following as a key usability factor that’s not always reflected in standard intelligence-focused tests, so IFBench is a great supplement as we try to better capture model capabilities beyond raw intelligence. Link to IFBench research paper by Allen Institute team: github.com/allenai/IFBench/b… ➤ Housekeeping: MATH-500 has been a useful part of our quantitative reasoning tests, but given frontier models are consistently reaching near-perfect scores, we have decided to sunset it as part of the Intelligence Index. At the same time, we have updated from AIME 2024 to AIME 2025, reflecting the latest questions and reducing data contamination risks as we evaluate models. ➤ Looking forward to coverage of more capabilities: over the next couple of weeks, we’ll be further expanding our benchmark coverage to how models are evolving. In particular, we’ll be looking to add coverage of agentic functionality (incl. tool calling) and long context reasoning Impact: This update reduces frontier scores to <70 while generally maintaining model rankings. Grok 4 remains our top-rated model on the Index, and OpenAI’s o3 and o4-mini move ahead of Gemini 2.5 Pro compared to our prior index formulation. Link to our methodology: artificialanalysis.ai/method…
25
29
335
1,648,906
NVIDIA has released the latest member of its Nemotron language model family, Llama Nemotron Super (49B) v1.5, reaching a score of 64 on the Artificial Analysis Intelligence Index. The model is an evolution of Super 49B v1 from earlier this year, with advances from post-training on new reasoning datasets generating a 13-point increase in the Intelligence Index. This puts @NVIDIA’s latest Super 49B release ahead of their previous Ultra 253B parameter model, despite having less than 1/4 the parameters. Leading dense model performance: with this latest iteration, Nemotron Super 49B v1.5 is the only dense model in the top 5 open weights models, competitive with much larger recent MoEs from Alibaba, Deepseek and MiniMax. Key model details: ➤ Retains the same 131k context window as Nemotron Super v1 ➤ Supports reasoning or non-reasoning modes with ‘/no_think’ settings in the system prompt ➤ Released under the NVIDIA Open Model License, as with previous Nemotron models
13
45
335
29,211
Google has quietly upgraded Imagen 4! Imagen 4 Ultra now ranks #3 in the Artificial Analysis Image Arena, rivaling GPT-4o and Seedream 3.0 as one of the world's best image models! This substantial update brings Imagen 4 much closer to the leading models in our arena. We continue to observe that Imagen 4 Ultra and Standard often produce very similar outputs, though the difference is more pronounced than in previous versions. Key details: ➤ Imagen 4 remains more affordable than GPT-4o at $40/1k images (Standard) and $60/1k images (Ultra), compared to GPT-4o's ~$167/1k images, while being slightly above Seedream 3.0's $30/1k images ➤ Imagen 4 Ultra generates in ~9.5s vs GPT-4o's ~53s and Seedream 3.0's ~4.5s ➤ You can access Imagen 4 via the Gemini app, Vertex AI, @fal, and @replicate See below for comparisons between the updated Imagen 4 and other leading models in our Artificial Analysis Image Arena 🧵
6
35
333
54,257
Tencent’s latest open weights model Hunyuan-A13B (80B total, 13B active) achieves an Artificial Analysis Intelligence Index of 56 - impressive given that it can be run on a single H200 (in FP8 precision) Hunyuan-A13B scores ahead of Qwen3 8B and 14B, but behind larger models like Qwen3 235B (235B total, 22B active) and DeepSeek R1 (671B total, 37B active). Hunyuan-A13B supports a 256K context window - larger than Qwen3’s 128K context window but smaller than MiniMax M1’s 1M context window. Hunyuan-A13B can be run in FP8 precision on just one H200 or two H100s - a much smaller deployment than the 8xH100 deployment required for Qwen3 235B or MiniMax M1 (456B).
14
33
331
31,022