Wenhu Chen · May 1, 2024 · 2:44 AM UTC

Wenhu Chen

1 May 2024

If your model is weak, your paper might end up getting more citations because people are always happy to include your model as a baseline.

144

2,185

184,578

Wenhu Chen · Sep 2, 2025 · 3:55 AM UTC

Wenhu Chen @WenhuChen

2 Sep 2025

Lots of my Chinese friends came to the US due to the fear of 996 culture in Chinese tech companies. Now it seems that US companies (especially in the Bay Area) are gradually adopting the 996 culture.

1,378

211,087

Wenhu Chen · Jan 1, 2025 · 3:18 PM UTC

Wenhu Chen @WenhuChen

1 Jan 2025

Another Microsoft paper revealing the size of GPT-4, GPT-o1 and Claude Sonnet. I'm not sure how trustworthy these numbers are, but they do make a lot of sense to me. Source: arxiv.org/pdf/2412.19260

131

1,125

207,840

Wenhu Chen · Jul 2, 2023 · 2:46 AM UTC

Wenhu Chen @WenhuChen

2 Jul 2023

I honestly think undergrads should spend more time on understanding the fundamentals of math and science instead of chasing all this AI hype. Knowing how Vicuna differs from Alpaca probably won't matter much in a year, but knowing how to do SVD will always help you ...

Yi Ma

@YiMaTweets

1 Jul 2023

Mathematics is the art of giving the same name to different things (Henri Poincaré). Machine learning is the art of giving different names to the same thing.

130

1,068

364,896

Wenhu Chen · Jan 12, 2025 · 4:28 AM UTC

Wenhu Chen @WenhuChen

12 Jan 2025

I spent the weekend reading some recent great math+reasoning papers:
1. AceMath (arxiv.org/abs/2412.15084)
2. rStar-Math (arxiv.org/pdf/2501.04519)
3. PRIME (arxiv.org/abs/2412.01981)
Here are some of my naive thoughts! They could be wrong.
All of these papers show possible ways to reach o1. The secret sauce is pretty much the same thing: **high-quality/difficult prompt with verifiable answer**
1. AceMath takes a simple approach (rejection fine-tuning, RFT) to scale up the SFT dataset to massive size based on verifiable answer matching. No RM is necessary, but you can still use an outcome RM to help boost the performance.
2. rStar-Math uses a self-evolving SFT approach to gradually boost the data quality and the process preference model (PPM) performance. rStar-Math is still RFT, where the samples come from MCTS guided by the PPM. Still, it requires strong supervision from the verifiable reward in the end. rStar-Math also scales up inference compute by utilizing the PPM at each step.
3. PRIME takes a very different angle! PRIME actually uses PPO to train the model, but the major contribution is how to assign the outcome's reward to each intermediate step. It also relies heavily on using the verifiable answer to obtain the "correct" on-policy model outputs.
The results are quite interesting. It seems that all these approaches are reaching similar results. Eurus-2 might seem weaker due to its smaller training set size. These results are all somewhat on par with o1-mini already. Given some leakage that o1-mini is ~20B, it basically says that there is no gap with o1, at least on math problems now. However, o1-mini might win significantly on other broader reasoning tasks, like physics, puzzles, etc.
These results might reveal that reaching o1 is more of a data or infra problem than an algorithm problem. As we find great ways to scale up the (good and difficult prompt, verifiable answer) pairs from different domains, the actual algorithm might not matter too much. Some algorithms are more data-efficient than others, but many of them will take us to o1 or even o3.
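The rejection fine-tuning idea mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical convention in which each sampled response ends with an "Answer:" marker; the function names and the sample data are made up for illustration and are not taken from any of the three papers.

```python
def extract_final_answer(response: str) -> str:
    """Hypothetical convention: the final answer follows the last 'Answer:' marker."""
    return response.rsplit("Answer:", 1)[-1].strip()


def rejection_sample(prompt: str, gold_answer: str, sampled_responses: list) -> list:
    """Keep only the sampled responses whose final answer matches the verifiable answer."""
    kept = []
    for response in sampled_responses:
        if extract_final_answer(response) == gold_answer.strip():
            kept.append({"prompt": prompt, "response": response})
    return kept


# Toy samples standing in for model generations (illustrative only).
samples = [
    "2 + 2 = 4, times 3 is 12. Answer: 12",
    "2 + 2 * 3 = 8. Answer: 8",
    "Parentheses first: 4 * 3. Answer: 12",
]
sft_data = rejection_sample("Compute (2 + 2) * 3.", "12", samples)
print(len(sft_data), "of", len(samples), "responses kept for SFT")
```

The only supervision used here is the answer check, which is why the quality and difficulty of the (prompt, verifiable answer) pairs matter so much in this kind of pipeline.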

156

917

81,667

Wenhu Chen · Sep 15, 2025 · 10:46 PM UTC

Wenhu Chen @WenhuChen

15 Sep 2025

The Internet latency is no joke. It took three years to open an Arxiv link.

This Post is from a suspended account.

864

93,888

Wenhu Chen · Feb 23, 2024 · 4:12 PM UTC

Wenhu Chen @WenhuChen

23 Feb 2024

My course "Recent Advances on Foundation Models" at Waterloo is public. Check out cs.uwaterloo.ca/~wenhuche/te…. In the course, we cover lots of interesting topics including transformers, LLM, pre-training, quantization, sparse attention, instruction tuning, RLHF, prompting, Vision transformers, diffusion models, multimodal models, agents, RAG, etc. I will continue to upload the slides (ppt) to the website. Some of them will also have recorded videos soon. There are already 12 lecture slides available now. These slides are made by the awesome attendees of the course.

157

730

104,045

Wenhu Chen · Dec 7, 2024 · 2:03 AM UTC

Wenhu Chen @WenhuChen

7 Dec 2024

Canada is losing its leadership in AI because your immigration office blocked almost all the talented grad students from China and other places. Cohere would have been much stronger if they could hire these talented grad students.

Justin Trudeau

@JustinTrudeau

6 Dec 2024

Canada is a leader in AI because of companies like @Cohere. We are working with Cohere to build a cutting-edge AI data centre here at home — essential infrastructure for powering AI.

672

175,356

Wenhu Chen · Jan 30, 2025 · 8:20 PM UTC

Wenhu Chen @WenhuChen

30 Jan 2025

Everyone is talking about RL these days. But are we done with SFT? The answer is NO. If we revive SFT in another form, it can even beat RL!
Very happy to introduce Critique Fine-Tuning (CFT), a new form of SFT, which can more efficiently activate language models' reasoning capabilities. The basic idea is simple: train the base language model to critique given noisy responses instead of imitating the correct responses. This is inspired by the human learning process, which emphasizes deep analysis and critical thinking.
By training Qwen2.5-Math-base with CFT for 8 GPU hours (1 hour on 8xH100) on 50K examples, we can improve it by 22 absolute points on six math reasoning datasets! It significantly outperforms standard SFT. It can match or outperform Qwen2.5-Math-Instruct trained with 2M+ examples. We also tried CFT on other backbones like Qwen2.5-base, where the gain of CFT over SFT is even larger.
What's more, CFT training can even match SimpleRL (github.com/hkust-nlp/simpleR…), which is the open replication of R1 training. Note that CFT training only requires 8 GPU hours while SimpleRL requires ~1152 GPU hours.
We believe there are a lot of variants of SFT we should explore. CFT is only one of them.
Paper: arxiv.org/abs/2501.17703
Code: github.com/TIGER-AI-Lab/Crit…
Website & Models & Data Link: tiger-ai-lab.github.io/Criti…
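To make the difference between the two training targets concrete, here is a minimal sketch of how an SFT example and a CFT example might be laid out. The prompt template and field names are assumptions for illustration, not the format used in the paper; the GPU-hour arithmetic uses only the numbers quoted in the post.

```python
def build_sft_example(question: str, reference_solution: str) -> dict:
    """Standard SFT: the model learns to imitate the correct response."""
    return {"input": question, "target": reference_solution}


def build_cft_example(question: str, noisy_response: str, critique: str) -> dict:
    """CFT (hypothetical template): the model sees a possibly wrong response
    and learns to produce a critique of it."""
    prompt = (
        f"Question:\n{question}\n\n"
        f"Candidate response:\n{noisy_response}\n\n"
        "Critique the candidate response."
    )
    return {"input": prompt, "target": critique}


example = build_cft_example(
    "What is 2 + 3?",
    "2 + 3 = 6",
    "The addition is wrong: 2 + 3 = 5, so the response is incorrect.",
)

# Compute comparison from the post: 1 hour on 8 GPUs vs. ~1152 GPU hours.
cft_gpu_hours = 1 * 8
speedup = 1152 / cft_gpu_hours
print(f"CFT uses {cft_gpu_hours} GPU hours, about {speedup:.0f}x fewer than ~1152")
```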

692

73,982

Wenhu Chen · Mar 14, 2025 · 10:01 PM UTC

Wenhu Chen @WenhuChen

14 Mar 2025

I have very mixed feelings when reading these recent search + RL papers. I understand the LLM folks are too young to read the 2020 RAG papers. But claims like "RAG lacks the flexibility for multi-turn, multi-query retrieval" and "We are the SoTA by achieving NQ = 41%, HotpotQA=38%" really baffle me. 1. Iterative RAG has been studied for quite a long time. There is plenty of great work done in this space. 2. The SoTA for NQ and HotpotQA are both around 70%! There are some good curated RAG paper lists in github.com/coree/awesome-rag and github.com/hymie122/RAG-Surv….

667

133,380

Wenhu Chen · May 15, 2024 · 4:19 AM UTC

Wenhu Chen @WenhuChen

15 May 2024

Tired of MMLU? The current models already hit the ceiling? It's time to upgrade MMLU! Introducing our new benchmark MMLU-Pro, a more robust and challenging massive multi-task language understanding benchmark with 12K questions. What's New? 1. MMLU-Pro uses 10 options instead of 4 options. So there is less room for random guessing. 2. MMLU-Pro significantly increases the complexity level by adding more college-level problems across different disciplines. 3. MMLU-Pro is also more robust and less sensitive to different prompts. We show our preview evaluation results in: huggingface.co/datasets/TIGE… We found that GPT-4o (71%) actually improves over GPT-4-turbo (62%) by 9 points! On the original MMLU, the improvement is only around 2 points.
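The "less room for random guessing" point is simple arithmetic, sketched below. The helper name is made up for illustration; the scores are the ones quoted in the post.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice question."""
    return 1 / num_options


mmlu_chance = chance_accuracy(4)       # original MMLU: 4 options
mmlu_pro_chance = chance_accuracy(10)  # MMLU-Pro: 10 options

# Absolute gap between the two scores quoted in the post (in points).
gap = 71 - 62

print(f"Chance level: {mmlu_chance:.0%} with 4 options, {mmlu_pro_chance:.0%} with 10")
print(f"GPT-4o vs GPT-4-turbo on MMLU-Pro: {gap} points")
```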

124

654

173,378

Wenhu Chen · Feb 5, 2024 · 9:59 PM UTC

Wenhu Chen @WenhuChen

5 Feb 2024

Working on LLMs is really stressful. I have seen my research ideas being scooped almost on a weekly basis 🥲.

609

104,664

Wenhu Chen · Feb 25, 2025 · 12:42 PM UTC

Wenhu Chen @WenhuChen

25 Feb 2025

Yay! Something to celebrate.

613

54,906

Wenhu Chen · Oct 23, 2025 · 4:20 AM UTC

Wenhu Chen @WenhuChen

23 Oct 2025

For people without a green card, being laid off is likely one of the most devastating experiences. I can certainly feel the pain.

567

49,782

Wenhu Chen · Oct 8, 2024 · 2:36 PM UTC

Wenhu Chen @WenhuChen

8 Oct 2024

Just updated my profile.

553

51,176

Wenhu Chen · Mar 22, 2025 · 3:10 PM UTC

Wenhu Chen @WenhuChen

22 Mar 2025

This paper provides some really interesting insights:
1. Previously, people found that Qwen base models are particularly good at R1 training and show strong exploration skills. - This paper shows that there is no magic about Qwen base models. They are likely pre-trained with concatenated Q+A data. Therefore, the base models will automatically answer questions instead of completing them. Likewise, pre-training Llama-3.2 a bit on similar concatenated Q+A data can also trigger it to explore and achieve better performance.
2. Previously, there was a common belief that the "Aha Moment" is a result of RL training. - This paper shows that some base models already exhibit some amount of self-reflection. RL is simply enhancing this behavior.
3. Previously, the increased output length was believed to be the key to performance improvement. - This paper argues that it's not the case. The responses with self-reflection get lower accuracy than the ones without self-reflection.
4. Previously, people were obsessed with the length increase under the GRPO algorithm. - This paper argues that this phenomenon is simply due to the length bias in GRPO. Basically, because the advantage is divided by the total response length, longer wrong responses are penalized less than shorter ones in GRPO. By removing the length normalization term, the length won't increase dramatically while the performance even increases slightly.
In a nutshell, this paper provides a critical perspective on the "obsession with long CoT".

Zichen Liu

@zzlccc

21 Mar 2025

🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full details: github.com/sail-sg/understan… 🛠️Code: github.com/sail-sg/understan…
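The length-bias argument can be seen with a toy calculation. This is a simplified sketch, assuming each token of a response is weighted by the response-level advantage divided by the response length; it ignores clipping, the KL term, and standard-deviation normalization, and the numbers are made up for illustration.

```python
def per_token_weight(advantage: float, length: int, length_normalize: bool = True) -> float:
    """Weight applied to each token of a response in this simplified view."""
    return advantage / length if length_normalize else advantage


wrong_advantage = -1.0  # a wrong response gets a negative advantage

short_wrong = per_token_weight(wrong_advantage, length=100)
long_wrong = per_token_weight(wrong_advantage, length=1000)
print(f"with length normalization: short={short_wrong}, long={long_wrong}")

# Without the per-response length term, both wrong responses are
# penalized equally per token, regardless of their length.
short_plain = per_token_weight(wrong_advantage, 100, length_normalize=False)
long_plain = per_token_weight(wrong_advantage, 1000, length_normalize=False)
print(f"without length normalization: short={short_plain}, long={long_plain}")
```

Under this view, a long wrong response receives a smaller per-token penalty than a short wrong one, which is the mechanism the post describes as nudging the policy toward longer outputs.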

568

70,354

Wenhu Chen · May 15, 2022 · 2:10 PM UTC

Wenhu Chen @WenhuChen

15 May 2022

As everyone is celebrating their ICML acceptance, I am here to celebrate the birth of my first baby. Highest respect to all the moms.

510

Wenhu Chen · Jan 15, 2024 · 9:27 PM UTC

Wenhu Chen @WenhuChen

15 Jan 2024

Improving LLMs' math reasoning ability is actually not as easy as people anticipated. There are some interesting observations: 1. Using textbooks does not improve models' reasoning a lot. We tried to use the 800 high-quality math textbooks from Mathpile to continue training

492

110,715

Wenhu Chen · May 23, 2023 · 2:19 AM UTC

Wenhu Chen @WenhuChen

23 May 2023

New Arxiv: arxiv.org/abs/2305.12524 GPT-4/PaLM-2 have both shown almost perfect performance on existing grade school math datasets. What about more challenging STEM questions, especially the ones which require specific theorems, like Stokes' theorem, Wiener Process, etc?

111

475

152,713

Wenhu Chen · Mar 1, 2025 · 2:47 AM UTC

Wenhu Chen @WenhuChen

1 Mar 2025

As a researcher, it's easy to get distracted by what others are working on. I've seen many people conducting research on problems they don't genuinely care about—just because the community values them (e.g., solving Math Olympiad problems). It's important to focus on research that truly matters to you and aligns with what you genuinely believe in.

434

40,105

Wenhu Chen · Sep 18, 2025 · 3:00 PM UTC

Wenhu Chen @WenhuChen

18 Sep 2025

Thrilled to announce that TIGER-Lab has 5 papers accepted to NeurIPS main track, with THREE spotlights! Congrats to all the lead student authors and collaborators! A (late) birthday gift to myself!

439

40,806

Wenhu Chen · Apr 29, 2021 · 2:41 PM UTC

Wenhu Chen @WenhuChen

29 Apr 2021

Personal news: I'm thrilled to be joining the Waterloo @UWCheritonCS and Vector @VectorInst as an Assistant Professor in 2022 Fall! Before that, I will work in @GoogleAI as a researcher in the gap year. Ping me if you are interested in working on NLP, Deep Learning.

401

Wenhu Chen · Dec 29, 2023 · 8:28 PM UTC

Wenhu Chen @WenhuChen

29 Dec 2023

The GPU crisis in academia is becoming more serious than ever. At NeurIPS, the most popular conversation between different faculty is probably "how did you get your GPUs?". Shameless plug: if you have free gpu (A100/H100) cycles and you need (good) papers, please contact me 😃

380

79,592

Wenhu Chen · Sep 10, 2025 · 2:22 PM UTC

Wenhu Chen @WenhuChen

10 Sep 2025

Ever wonder what's really happening when we use RL to teach LLMs to reason? 🤔 The process is full of mysteries. 🤯 What causes those sudden "aha moments" in training? 📏 Why does better reasoning often lead to longer answers ("length-scaling")? 📉 Why does token entropy often drop, even as the model gets smarter? These aren't random quirks. Our new paper reveals they're all signs of a single, coherent process: RL forges an emergent, human-like reasoning hierarchy in LLMs. 🧠 🌐 Project Page: tiger-ai-lab.github.io/Hiera… 📝 Paper: arxiv.org/abs/2509.03646

elvis

@omarsar0

9 Sep 2025

Emergent Hierarchical Reasoning in LLMs The paper argues that RL improves LLM reasoning via an emergent two-phase hierarchy. First, the model firms up low-level execution, then progress hinges on exploring high-level planning. More on this interesting analysis:

398

40,200

Wenhu Chen · Feb 6, 2024 · 3:42 AM UTC

Wenhu Chen @WenhuChen

6 Feb 2024

What the heck! This is the BEST paper I have seen in 2024 so far. Highly recommend it.

DeepSeek

@deepseek_ai

6 Feb 2024

🚀 DeepSeekMath: Approaching Mathematical Reasoning Capability of GPT-4 with a 7B Model. Highlights: - Continue pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math tokens from Common Crawl. - Introduce GRPO, a variant of PPO, that enhances mathematical reasoning and reduces training resources. More Details：arxiv.org/abs/2402.03300 Model Download：huggingface.co/deepseek-ai GitHub Repo：github.com/deepseek-ai/DeepS… #DeepSeek #DeepSeekMath

385

58,839

Wenhu Chen · May 23, 2025 · 3:35 PM UTC

Wenhu Chen @WenhuChen

23 May 2025

🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement learning. Current VLMs reason only in text — even when grounded in rich images or videos, their logical steps are verbalized in natural language. This restricts their ability to interrogate visual evidence and demonstrate how conclusions are drawn. 🔍 So we ask: What if we could make VLMs "show their work" by reasoning directly in the pixel space? Inspired by GPT-o3’s "think-in-image" ability, we propose a framework where VLMs use interactive visual operations — zoom, select-frame, highlight — to reason through complex visual inputs. To do this, we design a two-stage training process: (1) Instruction tuning with synthesized visual reasoning traces. (2) Reinforcement learning with curiosity-driven reward to balance exploration between pixel and text reasoning. ✨ With this, Pixel Reasoner achieves near-SoTA performance on many information-rich multimodal benchmarks: 📊 84% on InfographicsVQA 🧠 84% on V* benchmark 🧩 74% on TallyQA-Complex It also achieves strong accuracy of 68% on MVBench (a video benchmark). Website: tiger-ai-lab.github.io/Pixel… Paper: arxiv.org/abs/2505.15966 Code: github.com/TIGER-AI-Lab/Pixe… Demo: huggingface.co/spaces/TIGER-… (coming soon)

391

82,845

Wenhu Chen · Apr 12, 2025 · 7:42 PM UTC

Wenhu Chen @WenhuChen

12 Apr 2025

arxiv.org/pdf/2504.07086 is quite interesting. It standardizes the evaluation of all the existing math reasoning models and re-evaluates these models. Takeaway 1: Most RL-trained variants of the DeepSeek R1-Distill model do not yield meaningful performance improvements (except DeepscaleR), suggesting that reliable and scalable RL training recipes are still lacking. Takeaway 2: While RL-trained methods can often substantially improve base model performance, instruction tuning remains superior (except Open Reasoner Zero), suggesting again that reliable and scalable RL training recipes are still lacking. They propose to maintain a third-party evaluation of math reasoning models at bethgelab.github.io/sober-re…. This effort is really commendable.

372

82,662

Wenhu Chen · Sep 13, 2024 · 2:49 PM UTC

Wenhu Chen @WenhuChen

13 Sep 2024

Crazy bump of o1-preview on MMLU-Pro math subtask! It brings the previous highest score from 79% to 91%. I am still waiting on the other tasks as my API quota for o1 is pretty low. This result also confirms the annotation quality of our MMLU-Pro dataset😃

359

134,146

Wenhu Chen · Feb 9, 2025 · 8:48 PM UTC

Wenhu Chen @WenhuChen

9 Feb 2025

I spent some time evaluating the frontier math models on AIME24 and AIME25 to see how they "Generalize". An interesting trend I found is that SFT on minimal data can also generalize quite well if you pick the right data. See LIMO-32B. Training with RL does not necessarily lead to better generalization than distillation. See the last two rows.

352

74,789

Wenhu Chen · Oct 2, 2025 · 2:56 PM UTC

Wenhu Chen @WenhuChen

2 Oct 2025

What’s preventing us from training open-source image editing models like Nano-Banana or Seedream? The main barrier is the lack of high-quality training data for image editing. Most existing image editing datasets are synthesized using weak reward models or poor quality filters—for example, by prompting GPT-4o or other VLMs. To address this problem, we built the most powerful image editing reward model available. We first curated a large-scale image editing preference dataset and then trained EditReward on top of it. Our best EditReward model, trained from Mimo-7B, achieves the highest agreement with human experts. We also applied EditReward to filter existing noisy datasets and demonstrated significant improvements. Paper: arxiv.org/abs/2509.26346 Website: tiger-ai-lab.github.io/EditR… Code: github.com/TIGER-AI-Lab/Edit… HF: huggingface.co/collections/T…

Keming (Charles) Wu @Keming_Charles

2 Oct 2025

Why do open-source image editing models lag behind closed-source giants like GPT-Image-1, Seedream, & Google-Nano-Banana? 🤔 It’s mainly due to the quality of the training reward signal. We’re bridging the gap. Meet EditReward! 🏆

362

36,419

Wenhu Chen · Nov 30, 2023 · 8:44 PM UTC

Wenhu Chen @WenhuChen

30 Nov 2023

Thrilled to introduce UniIR, the first unified retriever to handle all types of information seeking needs: 1. text -> text 2. text -> image 3. text -> image + text 4. image -> image 5. image -> text 6. image + text -> text 7. image + text -> image 8. image + text -> image + text.

Cong Wei

@CongWei1230

30 Nov 2023

🚀 Introduce UniIR, a unified instruction-guided multimodal retriever handles diverse tasks. - 1️⃣model for 8️⃣ retrieval tasks (SoTA w/ Instruction-tuning) - Generalizes to unseen retrieval tasks. - M-BEIR: multimodal retrieval benchmark w/ 10 datasets, 1.1M queries, 5.6M cands.

351

64,154

Wenhu Chen · Mar 23, 2025 · 3:54 AM UTC

Wenhu Chen @WenhuChen

23 Mar 2025

I've seen impressive recent results from hybrid Mamba-Transformer architectures, which show significant progress compared to earlier efforts. These hybrid models excel at handling long-context inputs and enable higher throughput. Generally, there are two effective approaches to integrating these architectures: 1. Layer-wise Mixing: Alternating Transformer and Mamba layers within the architecture. 2. Sequence-wise Mixing: Using Mamba to encode the long input sequence and feeding the encoded states to cross-attention layers. Both strategies have demonstrated strong performance and efficiency, particularly in tasks involving extensive context. They basically

318

31,109

Wenhu Chen · Apr 15, 2025 · 8:30 PM UTC

Wenhu Chen @WenhuChen

15 Apr 2025

🚀 General-Reasoner: Generalizing LLM Reasoning Across All Domains (Beyond Math) Most recent RL/R1 works focus on math reasoning—but math-only tuning doesn't generalize to general reasoning (e.g. drop on MMLU-Pro and SuperGPQA). Why are we limited to math reasoning? 1. Existing rule-based verifiers work only for numeric/math answers—they can’t verify LaTeX expressions, matrices, arrays, and short statements. 2. No high-quality verifiable data outside math. 📢 We're excited to introduce General-Reasoner, a novel framework that expands LLM reasoning to math, physics, chemistry, finance, business, and more! ✨ Key ideas: - A new dataset **WebInstruct-verified** of verifiable reasoning data across many disciplines. - A model-based generative verifier that can verify short answers like LaTeX expressions, matrices, arrays, and short statements very accurately. 📈 Big gains across science and math benchmarks: +11–13% on MMLU-Pro (30+ domains) +8–9% on SuperGPQA (285+ domains) +9–11% on GPQA slight gains even on MATH, AMC, AIME vs math-RL models like SimpleRL-Zoo. Now we are releasing the preview version! - Github: github.com/TIGER-AI-Lab/Gene…, with all the pointers to models and verifier. - Data: huggingface.co/datasets/TIGE… - Tech Report: github.com/TIGER-AI-Lab/Gene…
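The limitation of rule-based verifiers can be illustrated with a toy example. The verifier below is a hypothetical numeric-only checker written for this sketch, not the verifier used by any specific project: it accepts answers that parse as numbers and rejects everything else, including an equivalent LaTeX fraction.

```python
def numeric_rule_verifier(prediction: str, reference: str, tol: float = 1e-6) -> bool:
    """Hypothetical rule-based check: both answers must parse as floats and agree."""
    try:
        return abs(float(prediction) - float(reference)) <= tol
    except ValueError:
        # Anything that is not a plain number cannot be verified by this rule.
        return False


plain_ok = numeric_rule_verifier("0.50", "0.5")
latex_ok = numeric_rule_verifier(r"\frac{1}{2}", "0.5")
text_ok = numeric_rule_verifier("demand is elastic", "elastic")

print(plain_ok, latex_ok, text_ok)
```

Answers such as LaTeX expressions or short statements fall through this kind of rule even when they are correct, which is the gap a model-based verifier is meant to close.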

329

44,855

Wenhu Chen · Jun 5, 2025 · 2:11 AM UTC

Wenhu Chen @WenhuChen

5 Jun 2025

🚨 New Paper Alert 🚨 We found that Supervised Fine-tuning on ONE problem can achieve similar performance gain as RL on ONE problem with 20x less compute! Paper: arxiv.org/abs/2506.03295 Recently, people have shown that RL can work even with ONE example. This indicates that the strong reasoning capabilities were obtained during the pre-training stage, and RL can serve as an effective approach to unleash this reasoning potential. ⚠️ However, RL is expensive and unstable. Even RL on one example can consume more than 100 A100 GPU hours. RL also suffers from various stability problems. 🧠 Is there an easier approach to unleash the (general) reasoning potential from strong pre-trained LLMs? It turns out that simple SFT (as it is) on minimal data doesn't work and causes severe overfitting issues. ✅ But critique fine-tuning (CFT) on ONE problem can work! With 20x less compute (5 GPU hours), it matches and even surpasses the performance of one-shot-RLVR (arxiv.org/abs/2504.20571). In the paper, we show that CFT on 1 problem can boost the average accuracy of six mathematical benchmarks (MATH-500, AMC, OlympiadBench, etc) by 5-15% across different-sized models. We further test on logic reasoning tasks from BBEH like causal reasoning, disambiguation, etc and show similar performance gain of 15%. This shows the generalization of one-shot CFT beyond math. 🎯 Therefore, we believe CFT works as a more efficient approach to unleash the hidden reasoning capabilities of the pre-trained LLMs! Website: tiger-ai-lab.github.io/One-S… HF collections: huggingface.co/collections/T… Everything is open-sourced.

314

44,398

Wenhu Chen · Feb 4, 2025 · 6:51 PM UTC

Wenhu Chen @WenhuChen

4 Feb 2025

💡 RL/R1 training for Math is taking its turn now. But no RL/R1 for Code Generation. Why? There is very little verifiable training data, almost no reward model. But we are here to ace it! 🚀 Very happy to introduce AceCoder! 1️⃣ We propose a pipeline to automatically create high-quality scalable verifiable code training data in the form of (instruction, [test cases]). You can run the generated program against the test cases to obtain the pass rate, which is our rule-based reward. 2️⃣ We train AceCode-RM (7B & 32B reward models), boosting Llama-3.1 by 10% via Best-of-N sampling. It even lifts Qwen2.5-coder-7B to DeepSeek-V2.5 level on HumanEval, MBPP, BigCodeBench, etc! 3️⃣ RL training with AceCode-RM & rule-based rewards significantly improves Qwen2.5 series models. 4️⃣ R1-style training? ✅ Just 80 steps from Qwen2.5-coder-base → 25% boost on HumanEval-plus & 6% on MBPP-plus. It verifies the possibility to skip SFT training for code models! Our data: huggingface.co/datasets/TIGE… Our reward models: huggingface.co/TIGER-Lab/Ace… Our RL models: huggingface.co/collections/T… Github: github.com/TIGER-AI-Lab/AceC… Temporary Paper Link: github.com/TIGER-AI-Lab/AceC…
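The rule-based reward described in point 1 (run the generated program against the test cases and take the pass rate) can be sketched as follows. This is a minimal illustration, assuming test cases are given as assert statements; the function name and data format are hypothetical, and a real pipeline would run untrusted code in a sandbox with timeouts rather than calling `exec` directly.

```python
def pass_rate(program: str, test_cases: list) -> float:
    """Fraction of test cases (assert statements) that the program passes.

    WARNING: exec on untrusted code is unsafe; this is for illustration only.
    """
    if not test_cases:
        return 0.0
    passed = 0
    for test in test_cases:
        namespace = {}
        try:
            exec(program, namespace)  # define the candidate function(s)
            exec(test, namespace)     # run one assertion against them
            passed += 1
        except Exception:
            pass  # any failure (assertion, runtime, syntax) counts as not passed
    return passed / len(test_cases)


correct = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0", "assert add(-1, 1) == 0"]

print(pass_rate(correct, tests))
print(round(pass_rate(buggy, tests), 3))
```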

305

25,497

Wenhu Chen · May 8, 2025 · 2:47 PM UTC

Wenhu Chen @WenhuChen

8 May 2025

Dear {{full_name}}!

298

28,850

Wenhu Chen · Oct 12, 2023 · 5:41 AM UTC

Wenhu Chen @WenhuChen

12 Oct 2023

Assistant Professor at US university. Final annual compensation: 120K-160K 🫣🫣🫣

This Post is from an account that no longer exists.

282

225,148

Wenhu Chen · Jan 11, 2025 · 6:01 AM UTC

Wenhu Chen @WenhuChen

11 Jan 2025

Really confused by why people like to re-invent new terms to rebrand old concepts. CAG is basically the memory-based transformers back in 2021. Check out some nice ones from my friends: arxiv.org/abs/2006.11527 arxiv.org/abs/2110.06176 arxiv.org/abs/2004.07202 Also my own paper: arxiv.org/abs/2204.04581

Memory Transformer

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows transformer to combine information from all...

arxiv.org

Aurimas Griciūnas

@Aurimas_Gr

10 Jan 2025

𝗥𝗔𝗚 (Retrieval Augmented Generation) vs. 𝗖𝗔𝗚 (Cache Augmented Generation). There has been a lot of buzz surrounding CAG lately. Let’s see what the differences are between RAG and CAG: 𝘙𝘈𝘎 These are the steps for implementing generation for naive RAG: 𝟭. Embed a user query to be used for contextual search via vector DBs or move straight to the step 2 if no contextual search is required. 𝟮. If Contextual search is required, query the context store to retrieve relevant context. If it is not required, use other means to search for relevant data 𝟯. Combine original user query with the system prompt that instructs the final answer construction. 𝟰. Enrich the final prompt with external context retrieved in step 2. 𝟱. Return the final answer to the user. 𝘊𝘈𝘎 𝟭. Pre-compute all of the external context into a KV Cache of the LLM. Cache it in memory. This only needs to be done once, the following steps can be run multiple times without recomputing the initial cache. 𝟮. Pass the system prompt including user query and the system prompt with instructions on how cached context should be used by the LLM. 𝟯. Return the generated answer to the user. After this, clear any generations from the cache and keep only the initially cached context. This makes the LLM ready for next generations. 𝘔𝘺 𝘵𝘩𝘰𝘶𝘨𝘩𝘵𝘴: ➡️ While it has only been described in a white paper for the first time, it is not a novel concept. We have been using different variations of CAG since Anthropic and OpenAI introduced Prompt Caching. ❌ While it might sound strong on paper, LLMs continue to suffer in accuracy while working with extensively long context. ❌ In real use cases, especially enterprise, CAG would cause a lot of security issues due to inability to isolate data. ❌ CAG does not work with constantly changing data as KV Cache would need to be continuously recomputed. ✅ CAG is strong when you need to cache reasonable amount of static data that is not sensitive. 
✅ Real magic happens when you combine RAG and CAG into a single system. More on it in future posts, stay tuned in! Have you played with CAG already? Let me know in the comments 👇 #LLM #AI #MachineLearning Want to learn how to build an Agent from scratch without using any LLM Orchestration framework? Check out my article here: newsletter.swirlai.com/p/bui…

289

41,635

Wenhu Chen · Apr 15, 2025 · 3:18 AM UTC

Wenhu Chen @WenhuChen

15 Apr 2025

🔥 How do you build a state-of-the-art Vision-Language Model with direct RL? We’re excited to introduce VL-Rethinker, a new paradigm for multimodal reasoning trained directly with Reinforcement Learning. 📈 It sets new SOTA on key math+vision benchmarks: - MathVista: 80.3 → 🥇 (+6.4 vs GPT-o1 73.9) - MathVerse: 61.7 → 🥇 (+4.7 vs GPT-o1 57.0) - MathVision: 43.9 → 🥇 (+1.7 vs GPT-o1 42.2) 💡 How did we do it? We adapt the GRPO algorithm and introduce two key innovations: - Selective Sample Replay (SSR): A novel value-based replay strategy that addresses vanishing advantages in long-horizon reasoning by reusing high-quality rollouts across iterations. This significantly stabilizes policy updates in direct RL without relying on supervised warm-starting. - Forced Rethinking: To combat the lack of self-reflection in purely RL-trained models, we introduce a reasoning trigger appended to early rollouts. This explicitly encourages the model to "think again" before finalizing its answer—leading to stronger consistency and higher success rates in multi-step reasoning. Together, these two techniques make VL-Rethinker-72B the first VLM to surpass GPT-o1 significantly. This work opens the door for future slow-thinking multimodal agents that can perform effective self-reflection. Paper: arxiv.org/abs/2504.08837 Code: github.com/TIGER-AI-Lab/VL-R… Website: tiger-ai-lab.github.io/VL-Re…

288

24,790

Wenhu Chen · Dec 2, 2021 · 3:36 PM UTC

Wenhu Chen @WenhuChen

2 Dec 2021

I am looking for 2-3 students for my group in CS department, University of Waterloo. I am specifically interested in 1) making NLP models more grounded on external world knowledge, 2) integrating knowledge of different forms like tables/graph/text/images during machine reasoning

280

Wenhu Chen · Sep 29, 2025 · 3:27 AM UTC

Wenhu Chen @WenhuChen

29 Sep 2025

Can someone from frontier labs show some ablation studies for "internal noble RL vs. GRPO"? I am eager to know how much behind we are!

285

37,749

Wenhu Chen · Nov 10, 2024 · 3:34 PM UTC

Wenhu Chen @WenhuChen

10 Nov 2024

Am I the only one who thinks professors should spend more than 20% of their time coding?

274

47,457

Wenhu Chen · Jul 17, 2025 · 6:12 AM UTC

Wenhu Chen @WenhuChen

17 Jul 2025

Shouldn't that be placed in China or at least Asia, given that the majority of attendees with visa issues are from China or other Asian countries?

NeurIPS Conference

@NeurIPSConf

16 Jul 2025

We're excited to announce a second physical location for NeurIPS 2025, in Mexico City. By expanding our physical locations, we hope to address concerns around skyrocketing attendance and difficulties in obtaining travel visas that some attendees have experienced in the past few years when only one location was available. Read more in our blog post: blog.neurips.cc/2025/07/16/n…

280

51,271

Wenhu Chen · Aug 6, 2025 · 2:27 PM UTC

Wenhu Chen @WenhuChen

6 Aug 2025

I have been testing GPT-oss-120b for a while. My initial feeling is that the model hallucinates a lot! It's definitely way worse than gpt-o4-mini. My hunch is that the model is completely distilled from GPT-5 or GPT-o4 with massive synthetic reasoning tokens, which contain too much hallucination.

285

36,216

Wenhu Chen · Oct 18, 2022 · 5:01 PM UTC

Wenhu Chen @WenhuChen

18 Oct 2022

Happy to share our recent paper "Re-Imagen": arxiv.org/abs/2209.14491 The existing text-to-image generation models are not particularly good at generating very specific entities like a specific person, a specific film character, or a specific dog, especially when the entity is infrequent.

264

Wenhu Chen · Sep 30, 2025 · 10:13 PM UTC

Wenhu Chen @WenhuChen

30 Sep 2025

Tired of RLVR and RLHF? Want to explore new possible RL algorithms? 🔥 Introducing our new RL algorithm: Critique Reinforcement Learning (CRL)! CRL can train a 4B coder model to reach 62% on LiveCodeBench-V5, surpassing the 14B DeepCoder model. 🧠 Critique-RL (CRL) is fundamentally different from RLVR/RLHF: Traditional RL trains models to produce answers. CRL trains models not to produce answers, but to critique a given solution, i.e., think step by step to judge whether it is right or wrong. LLMs are rewarded for generating a “good” critique that leads to the correct final judgment (True/False). 🚀 We adopt CRL to train Critique-Coder on the rStar-Coder dataset with GRPO. Our 4B and 8B both reach the highest performance of their size. Arxiv: arxiv.org/abs/2509.22824 Website: tiger-ai-lab.github.io/Criti… HF Release: huggingface.co/collections/T…
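The reward described above (a critique is "good" if its final True/False judgment matches whether the given solution is actually correct) can be sketched as below. The "Judgment:" output convention and the function names are assumptions made for this illustration; they are not taken from the paper.

```python
from typing import Optional


def parse_judgment(critique: str) -> Optional[bool]:
    """Hypothetical convention: the critique ends with 'Judgment: True' or 'Judgment: False'."""
    parts = critique.rsplit("Judgment:", 1)
    if len(parts) != 2:
        return None
    verdict = parts[1].strip().lower()
    if verdict.startswith("true"):
        return True
    if verdict.startswith("false"):
        return False
    return None


def critique_reward(critique: str, solution_is_correct: bool) -> float:
    """Reward 1.0 when the critique's final judgment matches the ground truth, else 0.0."""
    judgment = parse_judgment(critique)
    return 1.0 if judgment is not None and judgment == solution_is_correct else 0.0


good = "The loop skips the last element, so the output is wrong. Judgment: False"
bad = "Looks fine to me. Judgment: True"
print(critique_reward(good, solution_is_correct=False))
print(critique_reward(bad, solution_is_correct=False))
```

Note that the ground-truth label for the solution still has to come from somewhere verifiable, for example from running the solution against test cases.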

269

21,567
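
The reward described in the CRL tweet above (a critique is rewarded when its final True/False judgment is correct) can be sketched roughly as follows. This is only an illustration: the `Judgment: True/False` output format, the function names, and the 0/1 reward values are assumptions made for the example, not details taken from the paper.

```python
import re
from typing import Optional


def parse_judgment(critique: str) -> Optional[bool]:
    """Extract the final True/False verdict from a critique.

    Assumes (hypothetically) that the critique ends with a marker such as
    'Judgment: True'. Returns None when no marker is found.
    """
    matches = re.findall(r"judgment:\s*(true|false)", critique, flags=re.IGNORECASE)
    if not matches:
        return None
    return matches[-1].lower() == "true"


def critique_reward(critique: str, solution_is_correct: bool) -> float:
    """Give reward 1.0 when the critique's verdict matches the ground truth.

    The critique text itself is not graded here; only its final judgment is
    compared with whether the given solution is actually correct.
    """
    judgment = parse_judgment(critique)
    if judgment is None:
        return 0.0  # an unparseable critique earns nothing in this sketch
    return 1.0 if judgment == solution_is_correct else 0.0
```

In a GRPO-style setup, a scalar like this would be computed for each sampled critique of the same (problem, solution) pair.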

Wenhu Chen · Jun 10, 2022 · 3:39 AM UTC

Wenhu Chen @WenhuChen

10 Jun 2022

I have a feeling that NLP is probably not the most suitable research direction for academia any more. I am eager to know what academia could do better than the big tech companies in terms of "impactful" NLP research.

261

Wenhu Chen · Jun 4, 2024 · 4:29 AM UTC

Wenhu Chen @WenhuChen

4 Jun 2024

Our MMLU-Pro paper is out. It's a more difficult, robust and reasoning-driven benchmark to measure expert-level intelligence. We have gradually included 50+ models in our leaderboard: huggingface.co/spaces/TIGER-…. GPT-4o, Gemini-1.5-Pro, Claude-3-Opus are the current top-3 models. Great work led by @YuboWang726 and @xueguang_ma, and in collaboration with other awesome contributors.

@_akhaliq

4 Jun 2024

MMLU-Pro A More Robust and Challenging Multi-Task Language Understanding Benchmark In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve

257

59,465

Wenhu Chen · Oct 12, 2025 · 4:03 AM UTC

Wenhu Chen @WenhuChen

12 Oct 2025

Our general-reasoner (arxiv.org/abs/2505.14652) came out in March this year and has been accepted by NeurIPS. We are among the first few works to extract QA from pre-training data for RL. No comparison, no citation to our paper at all 😂

Zhepeng Cen @ZhepengCen

10 Oct 2025

🚀 Scaling RL to Pretraining Levels with Webscale-RL RL for LLMs has been bottlenecked by tiny datasets (<10B tokens) vs pretraining (>1T). Our Webscale-RL pipeline converts pretraining text into diverse RL-ready QA data — scaling RL to pretraining levels! All codes and datasets are open-source! Paper: arxiv.org/abs/2510.06499 ✨ Key features: • Converts web-scale corpus into millions of verifiable QA pairs • Preserves pretraining-level diversity across 9 domains • Trains up to 100× more token-efficient than continual pretraining • Powers the Webscale-RL dataset (1.2 M examples) for scalable RL Also special thanks to my colleagues in Salesforce AI Research @SFResearch! @HaolinChen11, Shiyu, @LiuZuxin, @huan__wang, @CaimingXiong, @iscreamnearby

250

46,557

Wenhu Chen · Jun 3, 2024 · 4:03 AM UTC

Wenhu Chen @WenhuChen

3 Jun 2024

Doesn't look good. Llama3-V faces plagiarism charges. It's astonishing to see this happening even at Stanford.

PrimerYang

@yangzhizheng1

2 Jun 2024

Shocked! Llama3-V project from a Stanford team plagiarized a lot from MiniCPM-Llama3-V 2.5! its code is a reformatting of MiniCPM-Llama3-V 2.5, and the model's behavior is highly similar to a noised version of MiniCPM-Llama3-V 2.5 checkpoint. Evidence: github.com/OpenBMB/MiniCPM-V…

254

131,839

Wenhu Chen · Feb 4, 2023 · 3:37 PM UTC

Wenhu Chen @WenhuChen

4 Feb 2023

# RA/Internship success rate I have received lots of emails from different people saying that they want to do (remote) research/internship (not phd/master) with me. They all seemed enthusiastic, so I tried mentoring a few of them. However, almost none of them worked out, maybe 2/20.

248

108,978

Wenhu Chen · Sep 12, 2023 · 2:40 AM UTC

Wenhu Chen @WenhuChen

12 Sep 2023

Excited to introduce our latest math generalist model MAmmoTH 🦣, built through instruction tuning. We proposed hybrid "chain-of-thought" & "program-of-thought" training to supercharge LLMs' math reasoning capabilities. 🦣 beats the open SoTA by 20+% on many datasets like MATH.

245

47,051

Wenhu Chen · Apr 8, 2024 · 6:02 PM UTC

Wenhu Chen @WenhuChen

8 Apr 2024

All the slides and recorded videos are now uploaded to the course website: cs.uwaterloo.ca/~wenhuche/te… Kudos to all the great students taking the course!

Wenhu Chen @WenhuChen

23 Feb 2024

251

39,548

Wenhu Chen · Mar 14, 2025 · 4:39 PM UTC

Wenhu Chen @WenhuChen

14 Mar 2025

We have made huge progress in language model reasoning. But our progress in multimodal reasoning (like MMMU) is very limited. Why? It's due to the lack of diverse, difficult and high-quality multimodal reasoning datasets! 🚀 New Paper Alert! 📢 We introduce VisualWebInstruct, a novel approach to scale up multimodal reasoning datasets from the Internet using Google Image Search! 🔍 How? - We meticulously selected 30K seed images and then leveraged search engines (Google Image Search) to locate websites with plenty of multimodal reasoning data, like forums or exam websites. - We performed comprehensive extraction, filtering and LLM-based cleaning and refining to harvest around 900K QA pairs from over 700K unique URLs, with 40% as visual QA pairs. 🔥 Results? - Fine-tuning on Llava-OV-mid: +10-20% absolute gains - Fine-tuning on MAmmoTH-VL: +5% absolute gain - MAmmoTH-VL2 achieves SoTA on: 📊 MMMU-Pro-std: 40.7% 🔢 MathVerse: 42.6% 🧮 DynaMath: 55.7% Our work highlights the power of web-scale multimodal data mining for enhancing VLMs' reasoning abilities! Paper: arxiv.org/abs/2503.10582 Website: tiger-ai-lab.github.io/Visua… Dataset: huggingface.co/datasets/TIGE… MAmmoTH-VL2: huggingface.co/TIGER-Lab/MAm… Github: github.com/TIGER-AI-Lab/Visu…

248

43,384

Wenhu Chen · Oct 24, 2025 · 5:15 AM UTC

Wenhu Chen @WenhuChen

24 Oct 2025

Had some really interesting discoveries recently: Suppose a model's performance is extremely stable on one benchmark. Let's say the model always gets 62% on SWEBench no matter what prompts or scaffold you use. It DOES NOT mean that the model is robust. It actually means that the model is CONTAMINATED on SWEBench, i.e. directly trained on the test set or a paraphrase of the test set. This could possibly become a good metric for detecting contamination. We will provide more empirical results later on.

251

39,119
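
As a rough illustration of the stability heuristic in the tweet above, one could measure how much a model's benchmark score moves across prompts or scaffolds and flag runs that barely move. The sketch below is an assumption-laden toy: the function name, the use of population standard deviation, and the `max_spread` threshold are all hypothetical, and the tweet itself notes that empirical validation is still pending.

```python
from statistics import pstdev
from typing import Dict


def flag_suspiciously_stable(scores_by_setup: Dict[str, float],
                             max_spread: float = 1.0) -> bool:
    """Return True when scores barely change across prompts/scaffolds.

    scores_by_setup maps a setup name (prompt or scaffold variant) to the
    benchmark score in percentage points. max_spread is a hypothetical
    threshold on the standard deviation, not a validated cutoff.
    """
    scores = list(scores_by_setup.values())
    if len(scores) < 2:
        raise ValueError("need at least two setups to measure spread")
    return pstdev(scores) <= max_spread


# Toy numbers, invented for illustration only.
stable = {"prompt_a": 62.0, "prompt_b": 62.1, "scaffold_c": 61.9}
varied = {"prompt_a": 40.0, "prompt_b": 55.0, "scaffold_c": 62.0}
print(flag_suspiciously_stable(stable), flag_suspiciously_stable(varied))
```

A low spread is only a signal worth investigating under this heuristic; it would not by itself prove that a model was trained on the test set.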

Wenhu Chen · Jan 6, 2025 · 9:57 PM UTC

Wenhu Chen @WenhuChen

6 Jan 2025

Gemini-2.0 makes a huge leap on our MEGA-Bench leaderboard to beat all the competitors! With the other benchmarks being either overfitted or leaked, I believe MEGA-Bench serves as a more reliable indicator of multimodal models' true ability to generalize to 505 real-world tasks. Leaderboard Link: huggingface.co/spaces/TIGER-… Congrats to Gemini team @OfficialLoganK @JeffDean

247

28,275

Wenhu Chen · Jun 2, 2023 · 5:24 PM UTC

Wenhu Chen @WenhuChen

2 Jun 2023

Finally, music has reached its BERT moment. In this paper, we propose a self-supervised music understanding model, which achieves SOTA performance on 14 music related tasks. arxiv.org/abs/2306.00107.

Yizhi Li @yizhilll

2 Jun 2023

1/ Excited to announce the release of our new paper "MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training"! We propose a self-supervised music understanding model, attaining overall SOTA performance on 14 MIR tasks. arxiv.org/abs/2306.00107

239

48,793

Wenhu Chen · Jul 30, 2025 · 2:55 AM UTC

Wenhu Chen @WenhuChen

30 Jul 2025

If Zuck wants to increase success rate, he should go for academics who earn 150K/year.

will brown

@willccbb

29 Jul 2025

the people at Thinking Machines who Zuck would want to hire for that much money already have generational wealth from OpenAI/Anthropic stock. they’re scientists who don’t care about yachts and jets. there’s no number you can buy them for. you have to sell them on the mission

240

29,539

Wenhu Chen · Feb 20, 2023 · 2:28 AM UTC

Wenhu Chen @WenhuChen

20 Feb 2023

Yang's blog yang-song.net/blog/2021/scor… is really a gem. It starts from the very basic score-based matching, then moves to stochastic differential equations, and finally to diffusion models. It helped a lot in understanding the foundations of these generative models. Highly Recommended!

236

21,988

Wenhu Chen · Nov 10, 2023 · 8:57 PM UTC

Wenhu Chen @WenhuChen

10 Nov 2023

ICLR reviews are out. I found some interesting trends for paper scores. 1. Math and code LLM papers are favored by most reviewers. 2. Video generation papers also received high scores. 3. Agent LLM papers are mixed. 4. Some internet-famous papers are getting pretty low scores.

230

48,033

Wenhu Chen · May 4, 2024 · 2:43 AM UTC

Wenhu Chen @WenhuChen

4 May 2024

New Arxiv Alert! arxiv.org/abs/2405.01483 We propose Mantis: interleaved instruction tuning to enable large-multimodal language models to reason over multiple pieces of images and text. By only training on our high quality 720K instruction data, we can achieve SoTA on five multi-image benchmarks! It's even beating idefics2 by an average 9 points. It's supported by Llama3 backbone and SigLIP encoder! All code, data, and eval are released! Homepage: tiger-ai-lab.github.io/Manti… Instruction Data: huggingface.co/datasets/TIGE… Demo: huggingface.co/spaces/TIGER-… Work led by my awesome student @DongfuJiang, in collaboration with @sivil_taram and other folks!

245

30,125

Wenhu Chen · Jul 25, 2024 · 9:06 PM UTC

Wenhu Chen @WenhuChen

25 Jul 2024

EMNLP reviewer 2: why don't you compare with Llama 3.1? Llama 3.1 also has similar MCTS components for post-training.

232

39,497

Wenhu Chen · Jul 2, 2023 · 4:18 PM UTC

Wenhu Chen @WenhuChen

2 Jul 2023

I didn't expect this tweet would spark so much discussion. I believe our university curriculum should upgrade to cater to the current trend. Undergrads lost interest in linear algebra because they don't know what this knowledge is meant for. Deep learning is actually a very ...

Wenhu Chen @WenhuChen

2 Jul 2023

216

93,227

Wenhu Chen · May 16, 2025 · 7:04 PM UTC

Wenhu Chen @WenhuChen

16 May 2025

Very interesting analysis! However, we found that you can actually achieve the same performance on the same ONE example with a variant of SFT -> CFT (critique fine-tuning) arxiv.org/abs/2501.17703. It's much much much faster than RL on ONE example! Here is a teaser for our early results. We will release the One-Shot CFT paper in the coming days.

Alex Dimakis

@AlexGDimakis

10 May 2025

"RL with only one training example" and "Test-Time RL" are two recent papers that I found fascinating. In the "One Training example" paper the authors find one question and ask the model to solve it again and again. Every time, the model tries 8 times (the Group in GRPO), and a gradient step is performed, to increase the reward which is a very simple verification of the correct answers, repeated thousands of times on the same problem. The shocking finding is that the model does not overfit to this one question: RL on one example, makes the model better in MATH500 and other benchmarks. (If instead you did SFT repeating one training question-solution finetuning, the model would quickly memorize this answer and overfit). But with RL, the model has to solve the problem itself, since it only sees the question, not the answer. Every time it produces different answers, and this seems to prevent overfitting. The other papers are relying on the same phenomenon: you can have a small number of training questions and re-solve them thousands of times. You can do this for the test set (as test-time RL does) and still not overfit. We also independently saw this by doing RL training on half the test set and seeing benefits in the other half for BFCL agents. My thought now is that this shows our RL learning algorithm must be extremely inefficient. When a human is learning by solving a math puzzle, they immediately learn what they can learn by solving it once (or twice). No further benefit would come by assigning the same homework problem to students a tenth time. But in RL, we keep asking the model to re-solve the same question thousands of times, and the model slowly gets better. We should be able to have much better RL learning algorithms since the information is there. (1/2)

228

32,385
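
The quoted thread above mentions that the model tries each question 8 times ("the Group in GRPO") and is rewarded by a simple answer check. A minimal sketch of the group-relative advantage, as GRPO is commonly described, is below. The function name and the `eps` constant are assumptions for the example; a real trainer would apply these advantages to token log-probabilities, which is omitted here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group.

    Every rollout in the group answers the same question. Rewards above the
    group mean get a positive advantage, rewards below it a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight attempts at one question, reward 1.0 when the final answer verifies.
rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

One consequence of this normalization: when all attempts in a group receive the same reward (all right or all wrong), every advantage is zero and that group contributes no learning signal.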

Wenhu Chen · Dec 7, 2024 · 3:11 PM UTC

Wenhu Chen @WenhuChen

7 Dec 2024

It seems that lots of people don't know the backstory of this. The very brief version is that: A large amount (probably more than half) of Chinese-nationality STEM MS/PHD applicants accept offers from Canadian universities but couldn't get their visa on time. The Canadian immigration office does background checks for these students, which take at least six months; most of them take longer than 1-2 years. This prevents the students from enrolling in the school on time. The background check is quite random. But it happens more often if you have already published some AI papers. So the stronger you are, the less likely you are to get the visa! The direct consequence is that these students will go elsewhere after knowing they have been background checked. Even worse, good students will just avoid applying to Canada at all. I personally lost 4 very talented PhD students due to this visa delay. Ironically enough, most of them applied in the next cycle to US schools and got their US visas immediately.

Wenhu Chen @WenhuChen

7 Dec 2024

221

50,197

Wenhu Chen · Aug 1, 2024 · 12:37 AM UTC

Wenhu Chen @WenhuChen

1 Aug 2024

I love simple yet effective things. However, reviewers never agree with me on that.

218

23,555

Wenhu Chen · Jul 10, 2025 · 6:33 AM UTC

Wenhu Chen @WenhuChen

10 Jul 2025

I realized that many of those "incoming faculty" finally joined industry after a gap year.

222

32,828

Wenhu Chen · Jan 20, 2025 · 7:25 PM UTC

Wenhu Chen @WenhuChen

20 Jan 2025

The Chinese "Open"AI companies are turning the Chinese New Year into a celebration for the entire global AI community. 1. Deepseek-R1: nitter.app/deepseek_ai/status/188… 2. Kimi-k1.5 nitter.app/Kimi_ai_/status/188133… Now the secret of o1 (a lot of people knew it already) is out. No PRM, no MCTS, no complex recipe. Large-scale verifiable data will let reasoning and self-reflection emerge with any RL algorithm!

Kimi.ai

@Kimi_Moonshot

20 Jan 2025

🚀 Introducing Kimi k1.5 --- an o1-level multi-modal model -Sota short-CoT performance, outperforming GPT-4o and Claude Sonnet 3.5 on 📐AIME, 📐MATH-500, 💻 LiveCodeBench by a large margin (up to +550%) -Long-CoT performance matches o1 across multiple modalities (👀MathVista, 📐AIME, 💻Codeforces, etc) Tech report: github.com/MoonshotAI/Kimi-k… Key ingredients of k1.5 -Long context scaling. Up to 128k tokens for RL generation. Efficient training with partial rollouts. -Improved policy optimization: online mirror descent, sampling strategies, length penalty, and others. -Multi modalities. Joint reasoning over text and vision.

207

19,889

Wenhu Chen · Apr 4, 2023 · 2:36 AM UTC

Wenhu Chen @WenhuChen

4 Apr 2023

Tired of fine-tuning image generation models on each subject you care to generate? Today, we release SuTI, a zero-shot subject-driven text-to-image generator that operates fully in-context without tuning. One SuTI model is all you need! Website: open-vision-language.github.…

213

45,542

Wenhu Chen · Sep 8, 2024 · 3:27 AM UTC

Wenhu Chen @WenhuChen

8 Sep 2024

We updated MMLU-Pro leaderboard with some recent models like Reflection, GPT-4o (0806) and Arx-0.3 (A startup by Thomas Baker).

200

86,441

Wenhu Chen · May 7, 2022 · 2:03 AM UTC

Wenhu Chen @WenhuChen

7 May 2022

Yay, celebrating the equilibrium.

205

Wenhu Chen · Jun 24, 2024 · 4:36 PM UTC

Wenhu Chen @WenhuChen

24 Jun 2024

How to combine long-context LLMs with RAG? We are happy to introduce LongRAG, a new approach to boost RAG with long-context LLMs. 1. We build larger retrieval units of 4K tokens, which is 30x longer than in traditional RAG systems like DPR, RAG, FiD, Atlas, etc. 2. Retrieval becomes much easier with larger units! Recall increases significantly from 52% -> 71%. 3. The reader's job becomes harder, but we have really good long-context readers like GPT-4o. 4. Without any training, we reach 62.7% on NQ and 64.3% on HotpotQA (full-wiki). This is on par with the SoTA fully-trained RAG models like Atlas and IRRR+. 5. Using larger retrieval units can turn multi-hop questions into single-hop questions. This removes the need to perform iterative retrieval for HotpotQA. Our approach is very easy to use. No training is needed! No iterative retrieval is needed! Paper: arxiv.org/pdf/2406.15319 Everything is released, all the pointers are listed in tiger-ai-lab.github.io/LongR… Work led by @Ernestzyj and @xueguang_ma from TIGER-Lab.

elvis

@omarsar0

24 Jun 2024

Enhancing RAG with Long-context LLMs Proposes LongRAG, which combines RAG with long-context LLMs to enhance performance. Uses a long retriever to significantly reduce the number of extracted units by operating on longer retrieval units. The long reader takes in the long retrieval units and leverages the zero-shot answer extraction capability of long-context LLMs to improve performance of the overall system. Claims to achieve 64.3% on HotpotQA (full-wiki), which is on par with the state-of-the-art model. Quote from the paper: "The improvement in retriever can significantly benefit the reader model. By exploiting the long-context understanding ability of GPT-4o, LongRAG can achieve an EM of 62% on NQ and 64% on HotpotQA. These results could be comparable to the strongest finetuned RAG models like Atlas and MDR." What's impressive with this work is that they can significantly reduce retrieval units and increase overall recall on various benchmarks using long-context retrieval. Lots of people are quick to dismiss RAG or long-context LLMs but this work shows the opportunity to mix what looks like competing ideas to achieve even better results.

203

38,516
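
The first LongRAG point above (retrieval units of roughly 4K tokens instead of short passages) can be illustrated with a toy packing routine. This is a simplified sketch, not the paper's procedure: the tweet does not say how documents are grouped, so greedy packing in input order is an assumption, and whitespace splitting is a crude stand-in for a real tokenizer.

```python
from typing import List


def build_long_units(docs: List[str], max_tokens: int = 4096) -> List[List[str]]:
    """Greedily pack consecutive documents into units of at most max_tokens.

    Token counts are approximated by whitespace splitting (an assumption).
    A single document longer than max_tokens becomes its own unit.
    """
    units: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for doc in docs:
        n_tokens = len(doc.split())
        # Close the current unit if adding this document would overflow it.
        if current and current_len + n_tokens > max_tokens:
            units.append(current)
            current, current_len = [], 0
        current.append(doc)
        current_len += n_tokens
    if current:
        units.append(current)
    return units
```

With fewer, longer units the retriever has far fewer candidates to rank, which is consistent with the recall gain the tweet reports; the harder reading step is then handed to a long-context model.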

Wenhu Chen · Sep 9, 2023 · 11:47 PM UTC

Wenhu Chen @WenhuChen

9 Sep 2023

Now I realize the real benefits of being in academia. No coding interviews!

190

54,775

Wenhu Chen · Mar 20, 2023 · 3:33 PM UTC

Wenhu Chen @WenhuChen

20 Mar 2023

Genuinely curious: is it a good timing for academic NLP people to switch the gear a little bit and work on more interdisciplinary stuff? If so, what interdisciplinary direction "NLP+X" would you recommend?

197

65,458

Wenhu Chen · Jul 9, 2024 · 1:49 AM UTC

Wenhu Chen @WenhuChen

9 Jul 2024

A sad truth about evaluation is that: If you make a private test set for your benchmark, people just won't adopt it. We have our official MMMU private test set hosted in EvalAI (eval.ai/web/challenges/chall…), but everyone is still reporting validation score. I found it's similar for MathVista, where everyone is just reporting testmini score.

196

83,239

Wenhu Chen · May 12, 2024 · 3:52 PM UTC

Wenhu Chen @WenhuChen

12 May 2024

Big News! Meet our strongest fully open-source 7B-LLM Neo. We release its 4.7T pre-training data Matrix and entire codebase at MAP-Neo! 1. Neo-7B significantly beats the existing fully open-source models like OLMo and Amber across the board. 2. Neo-7B remarkably surpasses Llama-2-7B and approaches Mistral in several dimensions like reasoning, coding and math. 3. The remarkable performance comes from our unique ways to recall high-quality data from CC. The improved Megatron-LM training framework is also critical to the success. Data processing pipeline and improved Megatron-LM codebase: github.com/multimodal-art-pr… Dataset: huggingface.co/datasets/m-a-… Model: huggingface.co/m-a-p/neo_7b Kudos to all the MAP team members! I take very little credit for this. We will have the Neo-Instruct and paper coming out soon.

Ge Zhang @GeZhang86038849

10 May 2024

I'm extremely excited to announce "the big bomb"!: Neo and Matrix, that we're working on with colleagues and friends from open-source community, M-A-P.ai, wuhan ai, and 01.ai. Neo is the first fully-transparent bilingual large language model, with fully open-sourced pretrain corpus, data processing pipeline, training framework manipulated from Megatron-LM, intermediate ckpts, and relatively smaller ckpts for investigating scaling law. Matrix is a 4.7 trillion tokens directly adoptable pretrain corpus, which has gone through strict heuristic rules-based filtering and deduplication. The computational resource is supported by 01.ai and wuhan.ai. Kudos to my colleagues! @01AI_Yi @MM_Art_Project Neo Model Series: huggingface.co/collections/m… Matrix: huggingface.co/datasets/m-a-… We name the series as Neo and Matrix as a salute to the movie, the MATRIX! Neo has notably better performance on the metrics of reasoning, math, code, and Chinese, as shown the following!

193

44,793

Wenhu Chen · Jul 14, 2024 · 6:57 PM UTC

Wenhu Chen @WenhuChen

14 Jul 2024

Mamba is accepted to COLM 2024! Should I congratulate Albert/Tri or COLM?

193

21,596

Wenhu Chen · Oct 15, 2025 · 3:51 PM UTC

Wenhu Chen @WenhuChen

15 Oct 2025

I overheard that it's really tough for new grads to find jobs in frontier labs. Is that true? Are there any statistics regarding this trend?

195

38,206

Wenhu Chen · Jan 22, 2023 · 3:16 AM UTC

Wenhu Chen @WenhuChen

22 Jan 2023

Happy to share that Re-Imagen is accepted to #ICLR2023. Arxiv: arxiv.org/abs/2209.14491 In Re-Imagen, we are able to generate novel images about specific entities/objects without any tuning within 30 secs. Some generated examples are shown here:

190

27,806

Wenhu Chen · Feb 18, 2022 · 3:11 PM UTC

Wenhu Chen @WenhuChen

18 Feb 2022

Some advice for students who are preparing for PhD application interviews. - You don't need to demonstrate 10 projects you worked on and spend only 2 minutes on each of them, only scratching the surface. - You only need to dive deep into one single project and explain it well.

192

Wenhu Chen · Sep 30, 2024 · 3:02 AM UTC

Wenhu Chen @WenhuChen

30 Sep 2024

I just realized that, in addition to MMLU and MATH, Dan Hendrycks was also the **first author** of ImageNet-R, ImageNet-A and Outlier Exposure. How can someone be so impactful? Much respect!

182

25,223

Wenhu Chen · Oct 19, 2023 · 6:40 AM UTC

Wenhu Chen @WenhuChen

19 Oct 2023

Somehow people doing LLMs have started to call everything related to retrieval augmentation RAG and only cite the Facebook RAG 2020 paper. It kind of obliterates a lot of efforts done in this field 😞, especially the work done by my colleagues at Google Research.

183

83,523

Wenhu Chen · May 4, 2023 · 1:36 AM UTC

Wenhu Chen @WenhuChen

4 May 2023

#ACL2023NLP Can large language models reason over a large-scale knowledge graph (like Freebase) to answer complex multi-hop questions with only a few demonstrations? The answer is yes! Our recent paper (arxiv.org/abs/2305.01750) proposes the first in-context KBQA framework.

Few-shot In-context Learning for Knowledge Base Question Answering

Question answering over knowledge bases is considered a difficult problem due to the challenge of generalizing to a wide variety of possible natural language questions. Additionally, the...

arxiv.org

179

22,479

Wenhu Chen · Jun 1, 2025 · 3:42 PM UTC

Wenhu Chen @WenhuChen

1 Jun 2025

We are super excited to announce Verl-Tool, which is a user-friendly framework to support diverse types of agentic training with RL. github.com/TIGER-AI-Lab/verl… We now support Code-Interpreter, Pixel Operations, Browser, and Bash. If you need to support your tool or environment, the process is very easy: ``` Go to the ./verl_tool/agent_workers/reward_manager directory and add your new reward manager. Then, make sure to update the verl_tool/trainer/main_ppo.py file to include your new reward manager. ``` With verl-tool, you can easily train the Qwen-math-7B model to achieve 40+ on AIME24. We will release a technical report soon to introduce our results across a wide range of agentic tasks.

GitHub - TIGER-AI-Lab/verl-tool: A version of verl to support diverse tool use [TMLR 2026]

A version of verl to support diverse tool use [TMLR 2026] - TIGER-AI-Lab/verl-tool

github.com

Dongfu Jiang

@DongfuJiang

1 Jun 2025

Introducing VerlTool - a unified and easy-to-extend tool agent training framework based on verl. Recently, there's been a growing trend toward training tool agents with reinforcement learning algorithms like GRPO and PPO. Representative works include SearchR1, ToRL, ReTool, and ToolRL. While these achieve impressive performance, their training codes are either not fully open-sourced or too difficult to modify and customize with new tools, creating unexpectedly high engineering costs for the community when exploring new ideas. To address these issues and reduce engineering overhead, we propose verl-tool. Key Features: 1. 🔧 Complete decoupling of actor rollout and environment interaction - We use verl as a submodule to benefit from ongoing verl repo updates. All tool calling is integrated via a unified API, allowing you to easily add new tools by simply adding a Python file and testing independently. 2. 🌍 Tool-as-environment paradigm - Each tool interaction can modify the environment state. We store and reload environment states for each trajectory. For each training, you can launch 3. ⚡ Native RL framework for tool-calling agents - verl-tool natively supports multi-turn interactive loops between agents and their tool environments. 4. 📊 User-friendly evaluation suite - Launch your trained model with OpenAI API alongside the tool server. Simply send questions and get final outputs with all interactions handled internally. We've successfully reproduced ToRL results using our verl-tool framework, demonstrating its correctness and demonstrating comparable performance on mathematical benchmarks. VerlTool is an active ongoing project! We aim to incorporate more tools covering a wide range of use cases and expect they can be trained together in a single framework. Suggestions and contributions are highly welcomed! Check out our GitHub: github.com/TIGER-AI-Lab/verl… More details: 👇 (0/4)

188

20,710

Wenhu Chen · Oct 21, 2025 · 12:56 PM UTC

Wenhu Chen @WenhuChen

21 Oct 2025

Totally agree. We experimented with image-only input for every task. The results are quite good. Check out our earlier paper PixelWorld: arxiv.org/abs/2501.19339

PixelWorld: How Far Are We from Perceiving Everything as Pixels?

Recent agentic language models increasingly need to interact with real-world environments that contain tightly intertwined visual and textual information, often through raw camera pixels rather...

arxiv.org

Andrej Karpathy

@karpathy

20 Oct 2025

I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter. The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input. Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in: - more information compression (see paper) => shorter context windows, more efficiency - significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images. - input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful. - delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. Tokenizers are ugly, separate, not end-to-end stage. It "imports" all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go. OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision ->text tasks. Not vice versa. So many the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to. Now I have to also fight the urge to side quest an image-input-only version of nanochat...

184

40,482

Wenhu Chen · Jan 26, 2025 · 1:57 PM UTC

Wenhu Chen @WenhuChen

26 Jan 2025

Replying to @alexandr_wang

Come on! Your last name is Wang.

177

10,694

Wenhu Chen · Feb 9, 2024 · 7:27 PM UTC

Wenhu Chen @WenhuChen

9 Feb 2024

Wishing everyone a happy Chinese New Year!

178

11,713

Wenhu Chen · Feb 8, 2025 · 3:36 AM UTC

Wenhu Chen @WenhuChen

8 Feb 2025

Many academic labs (including mine) couldn't even afford a single H100 server. There are much better ways to spend money than 500B mostly on GPUs for one company, which already has plenty of them. That money could lead to extraordinary innovation in academia.

Andrew Gordon Wilson

@andrewgwils

7 Feb 2025

Imagine if academia were given 500B for AI research... it would be absolutely revolutionary compared to one company stockpiling GPUs. That's like 100 CMUs. For 0.01% of that money the right lab could profoundly advance the field.

169

38,460

Wenhu Chen · Dec 20, 2024 · 7:14 PM UTC

Wenhu Chen @WenhuChen

20 Dec 2024

The gap between open-sourced models and closed-source models is getting larger and larger. What should academia do to catch up?

173

134,266

Wenhu Chen · Jun 2, 2021 · 12:24 AM UTC

Wenhu Chen @WenhuChen

2 Jun 2021

Finally defended my thesis and became Dr. Chen! I want to express my deepest gratitude to my committee members, my family, my friends who have supported me throughout my Ph.D. journey.

171

Wenhu Chen · Oct 14, 2022 · 1:28 AM UTC

Wenhu Chen @WenhuChen

14 Oct 2022

New Preprint: arxiv.org/abs/2210.06710 Large Language Models (GPT-3) are 1-shot table reasoners. Though not specifically trained or optimized for table understanding, we found that the large language models are quite competent at complex table reasoning. With only 1 demonstration

171

Wenhu Chen · May 8, 2024 · 2:47 AM UTC

Wenhu Chen @WenhuChen

8 May 2024

Announcing MAmmoTH2: tiger-ai-lab.github.io/MAmmo… Let's scale up instruction tuning! We believe that the web corpus contains massive naturally existing high-quality instruction tuning data to enhance LLM reasoning. We propose a pipeline to discover them. We managed to harvest 10M instruction data (named WebInstruct), which has the exact same size as Llama3's instruction data! So we train from Llama-3-base to do an apples-to-apples comparison. We are able to outperform Llama-3-Instruct on all the reasoning benchmarks. Also, we can match it on the general chatbot benchmark MT-bench. We think this result is quite encouraging and demonstrates the quality of our web-mined instruction data! Our best model is based on Mixtral-8x7B. We built a demo at huggingface.co/spaces/TIGER-…. All of our models are released under huggingface.co/TIGER-Lab. Llama3-70B version is on the way! Hopefully, we can beat the official Llama3-70B-instruct again!

Aran Komatsuzaki

@arankomatsuzaki

7 May 2024

MAmmoTH2: Scaling Instructions from the Web - Proposes a paradigm to efficiently harvest 10M instruction data from web corpus to enhance LLM reasoning - 11% -> 34% on MATH and 36% -> 67% on GSM8K proj: tiger-ai-lab.github.io/MAmmo… abs: arxiv.org/abs/2405.03548

168

72,419

Wenhu Chen · Sep 9, 2025 · 10:07 PM UTC

Wenhu Chen @WenhuChen

9 Sep 2025

Super thrilled to share WebExplorer, which is a simple yet effective approach to train long-horizon web agents. Instead of depending heavily on rigid pre-defined graph structures, WebExplorer utilizes a model-based exploration strategy to synthesize high-quality agentic data. Our 8B model is able to outperform most 32B or even 72B models on BrowseComp and HLE. Check out our paper at arxiv.org/abs/2509.06501.

@_akhaliq

9 Sep 2025

WebExplorer Explore and Evolve for Training Long-Horizon Web Agents

170

32,163

Wenhu Chen · Dec 5, 2024 · 12:58 AM UTC

Wenhu Chen @WenhuChen

5 Dec 2024

I'm super excited to share our recent work OmniEdit, an omnipotent editing model to handle all different types of editing requests including addition, removal, swapping, environment, background, style, etc. The best part is the **highest-quality** 1.2M high-resolution image editing dataset in huggingface.co/datasets/TIGE…. The biggest blocker in image editing is the lack of high-quality editing pairs. Most existing released datasets are highly noisy, low-resolution, with strong artifacts. This basically blocks progress in this area. We spent **8 months** experimenting with many approaches to synthesize and filter clean image editing pairs. Eventually, we built seven specialized pipelines to propose a massive number of candidates and then prompted GPT-4o to assign quality scores to these candidates. We took the highest-ranked candidates as our 1.2M training data.

172

19,272

Wenhu Chen · Dec 7, 2023 · 5:42 PM UTC

Wenhu Chen @WenhuChen

7 Dec 2023

Looking for the best open-source (small) Math model? I'm happy to release MAmmoTH-7B-Mistral (huggingface.co/TIGER-Lab/MAm…), which achieves 40% on MATH and 52% on MMLU-Math. Nothing fancy, I just fine-tuned Mistral-7B on our previous MathInstruct dataset (huggingface.co/datasets/TIGE…).

169

29,150

Wenhu Chen · Jun 25, 2025 · 3:16 PM UTC

Wenhu Chen @WenhuChen

25 Jun 2025

How many people got their ICCV paper rejected due to co-authors being identified as irresponsible reviewers? This is indeed a harsh policy for the (responsible) first author, who has no control over the behavior of their co-authors.

172

45,271

Wenhu Chen · Aug 24, 2024 · 12:09 AM UTC

Wenhu Chen @WenhuChen

24 Aug 2024

It seems that a lot of people don't see LLMs as a part of NLP. They see them as a totally standalone interdisciplinary research area.

158

30,974

Wenhu Chen · Feb 16, 2024 · 1:54 AM UTC

Wenhu Chen @WenhuChen

16 Feb 2024

Replying to @jbhuang0604

R3: this paper doesn't release its code and data. It has no contribution to the community. Strong reject!

163

12,284

Wenhu Chen · Dec 15, 2024 · 11:23 PM UTC

Wenhu Chen @WenhuChen

15 Dec 2024

NeurIPS has been incredibly well-organized this year. It’s truly amazing to see so many brilliant minds working together to push the AI boundaries. While it’s disheartening to witness instances of racism, I’m deeply encouraged by the solidarity shown by many non-Chinese colleagues who are speaking up for fairness and inclusivity on social media. I deeply believe inclusiveness is the core of our research community!

167

14,608

Wenhu Chen · Jun 1, 2024 · 2:02 AM UTC

Wenhu Chen @WenhuChen

1 Jun 2024

Thrilled to work with @JiachenLi11 to release T2V-Turbo, which is a very fast yet high-quality consistency model. With only 4 diffusion steps (5 seconds), it can obtain high-quality video. T2V-Turbo currently ranks first on VBench (huggingface.co/spaces/Vchite…), beating other competitors like Pika and Runway Gen-2. We created a demo at: huggingface.co/spaces/TIGER-… T2V-Turbo website: t2v-turbo.github.io/.

@_akhaliq

30 May 2024

T2V-Turbo Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To

158

60,753