MSL@Meta. I led PoT, MMMU, MMLU-Pro, MAmmoTH, General-Reasoner, VL-Rethinker, Pixel-Reasoner. I contributed to Gemini-2.5. Prev @GoogleDeepMind.

United States
If your model is weak, your paper might end up getting more citations because people are always happy to include your model as a baseline.
46
144
2,185
184,578
Lots of my Chinese friends came to the US due to the fear of the 996 culture in Chinese tech companies. Now it seems that US companies (especially in the Bay Area) are gradually adopting the 996 culture.
59
39
1,378
211,087
Another Microsoft paper revealing the size of GPT-4, GPT-o1 and Claude Sonnet. I'm not sure how trustworthy these numbers are, but they do make a lot of sense to me. Source: arxiv.org/pdf/2412.19260
36
131
1,125
207,840
I honestly think undergrads should spend more time on understanding the fundamentals of math and science instead of chasing all these AI hypes. Knowing how Vicuna differs from Alpaca probably won't matter much in a year, but knowing how to do SVD will always help you ...
Mathematics is the art of giving the same name to different things (Henri Poincaré). Machine learning is the art of giving different names to the same thing.
37
130
1,068
364,896
I spent the weekend reading some great recent math+reasoning papers: 1. AceMath (arxiv.org/abs/2412.15084) 2. rStar-Math (arxiv.org/pdf/2501.04519) 3. PRIME (arxiv.org/abs/2412.01981) Here are some of my naive thoughts! They could be wrong. All of these papers show possible ways to reach o1. The secret sauce is pretty much the same thing: **high-quality/difficult prompts with verifiable answers** 1. AceMath takes a simple approach (rejection fine-tuning, RFT) to scale up the SFT dataset to massive size based on verifiable answer matching. No RM is necessary, but you can still use an outcome RM to help boost the performance. 2. rStar-Math uses a self-evolving SFT approach to gradually boost the data quality and the process preference model (PPM) performance. rStar-Math is still RFT, where the samples come from MCTS guided by the PPM. Still, it requires strong supervision from the verifiable reward in the end. rStar-Math also scales up inference compute by utilizing the PPM at each step. 3. PRIME takes a very different angle! PRIME actually uses PPO to train the model, but the major contribution is how to assign the outcome's reward to each intermediate step. It also relies heavily on using the verifiable answer to obtain the "correct" on-policy model outputs. The results are quite interesting. It seems that all these approaches reach similar results. Eurus-2 might seem weaker due to its smaller training set size. These results are all somewhat on par with o1-mini already. Given some leaks that o1-mini is ~20B, it basically says that there is no gap with o1, at least on math problems, now. However, o1-mini might win significantly in other broader reasoning tasks, like physics, puzzles, etc. These results might reveal that reaching o1 is more of a data or infra problem than an algorithm problem. 
As we find great ways to scale up the (good and difficult prompt, verifiable answer) pairs from different domains, the actual algorithm might not matter too much. Some algorithms are more data efficient than others, but many of them will take us to o1 or even o3.
18
156
917
81,667
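The common ingredient in the post above, keeping only sampled responses whose final answer matches a verifiable reference, can be sketched roughly as follows. This is a minimal illustration, not the pipeline from any of the three papers: `rejection_filter`, `toy_sampler`, and `last_token` are hypothetical names, and the sampler is a stand-in for a real model call.

```python
from typing import Callable, Dict, List


def rejection_filter(
    prompts: List[Dict[str, str]],
    sample_fn: Callable[[str, int], List[str]],
    extract_answer: Callable[[str], str],
    n_samples: int = 4,
) -> List[Dict[str, str]]:
    """Keep only sampled responses whose extracted answer matches the reference."""
    kept = []
    for item in prompts:
        for response in sample_fn(item["question"], n_samples):
            if extract_answer(response).strip() == item["answer"].strip():
                kept.append({"prompt": item["question"], "response": response})
    return kept


# Hypothetical stand-in for sampling from a language model.
def toy_sampler(question: str, n: int) -> List[str]:
    candidates = ["... so the answer is 4", "... so the answer is 5"]
    return [candidates[i % 2] for i in range(n)]


# Hypothetical answer extractor: treat the last token as the final answer.
def last_token(response: str) -> str:
    return response.split()[-1]


data = [{"question": "What is 2 + 2?", "answer": "4"}]
sft_set = rejection_filter(data, toy_sampler, last_token)
```

Only the responses ending in the reference answer survive, so the filtered set can be used as SFT data without any reward model.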
The Internet latency is no joke. It took three years to open an Arxiv link.
26
23
864
93,888
My course "Recent Advances on Foundation Models" at Waterloo is public. Check out cs.uwaterloo.ca/~wenhuche/te…. In the course, we cover lots of interesting topics including transformers, LLM, pre-training, quantization, sparse attention, instruction tuning, RLHF, prompting, Vision transformers, diffusion models, multimodal models, agents, RAG, etc. I will continue to upload the slides (ppt) to the website. Some of them will also have recorded videos soon. There are already 12 lecture slides available now. These slides are made by the awesome attendees of the course.
10
157
730
104,045
Canada is losing its leadership in AI because your immigration office blocked almost all the talented grad students from China and other places. Cohere would have been much stronger if they could hire these talented grad students.
Canada is a leader in AI because of companies like @Cohere. We are working with Cohere to build a cutting-edge AI data centre here at home — essential infrastructure for powering AI.
24
37
672
175,356
Everyone is talking about RL these days. But are we done with SFT? The answer is NO. If we revive SFT in another form, it can even beat RL! Very happy to introduce Critique Fine-Tuning, a new form of SFT, which can more efficiently activate language models' reasoning capabilities. The basic idea is simple: training the base language models to critique given noisy responses instead of imitating the correct responses. This is inspired by the human learning process, which emphasizes deep analysis and critical thinking. By training Qwen2.5-Math-base with CFT for 8 GPU hours (1 hour on 8xH100) on 50K examples, we can improve it by 22 absolute points on six math reasoning datasets! It significantly outperforms standard SFT. It can match or outperform Qwen2.5-Math-Instruct trained with 2M+ examples. We tried CFT on other backbones like Qwen2.5-base, where the gain of CFT over SFT is even larger. What's more, CFT training can even match SimpleRL (github.com/hkust-nlp/simpleR…), which is the open replication of R1 training. Note that CFT training only requires 8 GPU hours while SimpleRL requires ~1152 GPU hours. We believe there are a lot of variants of SFT we should explore. CFT is only one of them. Paper: arxiv.org/abs/2501.17703 Code: github.com/TIGER-AI-Lab/Crit… Website & Models & Data Link: tiger-ai-lab.github.io/Criti…
23
94
692
73,982
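To make the difference between standard SFT and critique fine-tuning concrete, here is a rough sketch of how the two kinds of training examples might be assembled. The field names and the prompt template are assumptions for illustration only; the released CFT code may format its data differently.

```python
from typing import Dict


def build_sft_example(question: str, correct_response: str) -> Dict[str, str]:
    """Standard SFT: the model imitates a correct response."""
    return {"input": question, "target": correct_response}


def build_cft_example(question: str, noisy_response: str, critique: str) -> Dict[str, str]:
    """CFT: the model sees a possibly wrong response and learns to critique it."""
    prompt = (
        "Question:\n" + question + "\n\n"
        "Candidate solution:\n" + noisy_response + "\n\n"
        "Critique the candidate solution."  # hypothetical instruction wording
    )
    return {"input": prompt, "target": critique}


sft_example = build_sft_example("What is 3 * 4?", "3 * 4 = 12.")
cft_example = build_cft_example(
    "What is 3 * 4?",
    "3 * 4 = 7.",
    "The solution adds instead of multiplying. 3 * 4 = 12, so it is incorrect.",
)
```

The training loss is still ordinary next-token prediction in both cases; only the target changes from a solution to a critique.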
I have very mixed feelings when reading these recent search + RL papers. I understand the LLM folks are too young to have read the 2020 RAG papers. But claims like "RAG lacks the flexibility for multi-turn, multi-query retrieval" and "We are the SoTA by achieving NQ = 41%, HotpotQA=38%" really baffle me. 1. Iterative RAG has been studied for quite a long time. There is plenty of great work done in this space. 2. The SoTA for NQ and HotpotQA are both around 70%! There are some good curated RAG paper lists at github.com/coree/awesome-rag and github.com/hymie122/RAG-Surv….
9
62
667
133,380
Tired of MMLU? The current models already hit the ceiling? It's time to upgrade MMLU! Introducing our new benchmark MMLU-Pro, a more robust and challenging massive multi-task language understanding benchmark with 12K questions. What's New? 1. MMLU-Pro uses 10 options instead of 4 options. So there is less room for random guessing. 2. MMLU-Pro significantly increases the complexity level by adding more college-level problems across different disciplines. 3. MMLU-Pro is also more robust and less sensitive to different prompts. We show our preview evaluation results in: huggingface.co/datasets/TIGE… We found that GPT-4o (71%) actually improves on GPT-4-turbo (62%) by 9 points! On the original MMLU, the improvement is only around 2 points.
43
124
654
173,378
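The "less room for random guessing" point is simple arithmetic, sketched below under the assumption that every question has exactly the stated number of options and that a guesser picks uniformly at random.

```python
def random_guess_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice question."""
    return 1.0 / num_options


mmlu_floor = random_guess_accuracy(4)       # 4 options per question
mmlu_pro_floor = random_guess_accuracy(10)  # 10 options per question
```

Going from 4 to 10 options lowers the chance floor from 25% to 10%, so a given score reflects less luck.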
Working on LLMs is really stressful. I have seen my research ideas being scooped almost on a weekly basis 🥲.
34
28
609
104,664
Yay! Something to celebrate.
33
4
613
54,906
For people without a green card, being laid off is likely one of the most devastating experiences. I can certainly feel the pain.
18
14
567
49,782
Just updated my profile.
14
13
553
51,176
This paper provides some really interesting insights: 1. Previously, people found that Qwen base models are particularly good at R1 training and show strong exploration skills. - This paper shows that there is no magic in Qwen base models. They were likely pre-trained with concatenated Q+A data. Therefore, the base models will automatically answer questions instead of completing them. Accordingly, pre-training Llama-3.2 a bit on similar concatenated Q+A data can also trigger it to explore and achieve better performance. 2. Previously, there was a common belief that the "Aha Moment" is a result of RL training. - This paper shows that some base models already exhibit some amount of self-reflection. RL is simply enhancing this behavior. 3. Previously, the increased output length was believed to be the key to performance improvement. - This paper argues that this is not the case. The responses with self-reflection get lower accuracy than the ones without self-reflection. 4. Previously, people were obsessed with the length increase under the GRPO algorithm. - This paper argues that this phenomenon is simply due to the length bias in GRPO. Basically, because the advantage is divided by the response's total length, longer wrong responses are penalized less per token than shorter wrong responses in GRPO. By removing the length normalization term, the length won't increase dramatically while the performance even increases slightly. In a nutshell, this paper provides a critical perspective on the "obsession with long CoT".
🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full details: github.com/sail-sg/understan… 🛠️Code: github.com/sail-sg/understan…
10
84
568
70,354
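The length-bias argument in point 4 can be illustrated with a toy calculation. This is a deliberately simplified sketch: it ignores clipping, the KL term, and group-level advantage normalization, and only shows how dividing by response length changes the per-token weight. The function name and numbers are illustrative assumptions.

```python
from typing import List


def per_token_weights(
    advantages: List[float], lengths: List[int], length_normalized: bool
) -> List[float]:
    """Weight applied to each token's log-prob gradient, one value per response."""
    if length_normalized:
        # Each response's advantage is spread over its own length.
        return [a / n for a, n in zip(advantages, lengths)]
    # Without the 1/length term every token carries the full advantage.
    return list(advantages)


# Two wrong responses with the same negative advantage but different lengths.
advantages = [-1.0, -1.0]
lengths = [100, 1000]

with_norm = per_token_weights(advantages, lengths, length_normalized=True)
without_norm = per_token_weights(advantages, lengths, length_normalized=False)
```

With length normalization the 1000-token wrong response receives a per-token penalty ten times smaller than the 100-token one, which is the incentive toward longer wrong answers that the post describes. Without it, both are penalized equally per token.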
As everyone is celebrating their ICML acceptance, I am here to celebrate the birth of my first baby. Highest respect to all the moms.
25
4
510
Improving LLMs' math reasoning ability is actually not as easy as what people anticipated. There are some interesting observations: 1. Using textbooks does not improve models' reasoning a lot. We tried to use the 800 high-quality math textbooks from Mathpile to continue training
44
55
492
110,715
New Arxiv: arxiv.org/abs/2305.12524 GPT-4/PaLM-2 have both shown almost perfect performance on existing grade school math datasets. What about more challenging STEM questions, especially the ones which require specific theorems, like Stokes' theorem, Wiener Process, etc?
13
111
475
152,713
As a researcher, it's easy to get distracted by what others are working on. I've seen many people conducting research on problems they don't genuinely care about—just because the community values them (e.g., solving Math Olympiad problems). It's important to focus on research that truly matters to you and aligns with what you genuinely believe in.
18
42
434
40,105
Thrilled to announce that TIGER-Lab has 5 papers accepted to NeurIPS main track, with THREE spotlights! Congrats to all the lead student authors and collaborators! A (late) birthday gift to myself!
13
15
439
40,806
Personal news: I'm thrilled to be joining the Waterloo @UWCheritonCS and Vector @VectorInst as an Assistant Professor in 2022 Fall! Before that, I will work in @GoogleAI as a researcher in the gap year. Ping me if you are interested in working on NLP, Deep Learning.
34
22
401
The GPU crisis in academia is becoming more serious than ever. At NeurIPS, the most popular conversation between different faculty is probably "how did you get your GPUs?". Shameless plug: if you have free gpu (A100/H100) cycles and you need (good) papers, please contact me 😃
17
30
380
79,592
Ever wonder what's really happening when we use RL to teach LLMs to reason? 🤔 The process is full of mysteries. 🤯 What causes those sudden "aha moments" in training? 📏 Why does better reasoning often lead to longer answers ("length-scaling")? 📉 Why does token entropy often drop, even as the model gets smarter? These aren't random quirks. Our new paper reveals they're all signs of a single, coherent process: RL forges an emergent, human-like reasoning hierarchy in LLMs. 🧠 🌐 Project Page: tiger-ai-lab.github.io/Hiera… 📝 Paper: arxiv.org/abs/2509.03646
Emergent Hierarchical Reasoning in LLMs The paper argues that RL improves LLM reasoning via an emergent two-phase hierarchy. First, the model firms up low-level execution, then progress hinges on exploring high-level planning. More on this interesting analysis:
9
65
398
40,200
What the heck! This is the BEST paper I have seen in 2024 so far. Highly recommend it.
🚀 DeepSeekMath: Approaching Mathematical Reasoning Capability of GPT-4 with a 7B Model. Highlights: - Continue pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math tokens from Common Crawl. - Introduce GRPO, a variant of PPO, that enhances mathematical reasoning and reduces training resources. More Details:arxiv.org/abs/2402.03300 Model Download:huggingface.co/deepseek-ai GitHub Repo:github.com/deepseek-ai/DeepS… #DeepSeek #DeepSeekMath
3
53
385
58,839
🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement learning. Current VLMs reason only in text — even when grounded in rich images or videos, their logical steps are verbalized in natural language. This restricts their ability to interrogate visual evidence and demonstrate how conclusions are drawn. 🔍 So we ask: What if we could make VLMs "show their work" by reasoning directly in the pixel space? Inspired by GPT-o3’s "think-in-image" ability, we propose a framework where VLMs use interactive visual operations — zoom, select-frame, highlight — to reason through complex visual inputs. To do this, we design a two-stage training process: Instruction tuning with synthesized visual reasoning traces. Reinforcement learning with curiosity-driven reward to balance exploration between pixel and text reasoning ✨ With this, Pixel Reasoner achieves near-SoTA performance on many information-rich multimodal benchmarks: 📊 84% on InfographicsVQA 🧠 84% on V* benchmark 🧩 74% on TallyQA-Complex It also achieves strong accuracy of 68% on MVBench (a video benchmark). Website: tiger-ai-lab.github.io/Pixel… Paper: arxiv.org/abs/2505.15966 Code: github.com/TIGER-AI-Lab/Pixe… Demo: huggingface.co/spaces/TIGER-… (coming soon)
10
64
391
82,845
arxiv.org/pdf/2504.07086 is quite interesting. It standardizes the evaluation of all the existing math reasoning models and re-evaluates these models. Takeaway 1: Most RL-trained variants of the DeepSeek R1-Distill model do not yield meaningful performance improvements (except DeepscaleR), suggesting that reliable and scalable RL training recipes are still lacking. Takeaway 2: While RL-trained methods can often substantially improve base model performance, instruction tuning remains superior (except Open Reasoner Zero), suggesting again that reliable and scalable RL training recipes are still lacking. They propose to maintain a third-party evaluation of math reasoning models at bethgelab.github.io/sober-re…. This effort is really commendable.
9
67
372
82,662
Crazy bump of o1-preview on MMLU-Pro math subtask! It brings the previous highest score from 79% to 91%. I am still waiting for the other tasks as my API quota for o1 is pretty low. This result also confirms the annotation quality of our MMLU-Pro dataset😃
13
36
359
134,146
I spent some time evaluating the frontier math models on AIME24 and AIME25 to see how they "Generalize". An interesting trend I found is that SFT on minimal data can also generalize quite well if you pick the right data. See LIMO-32B. Training with RL does not necessarily lead to better generalization than distillation. See the last two rows.
19
56
352
74,789
What’s preventing us from training open-source image editing models like Nano-Banana or Seedream? The main barrier is the lack of high-quality training data for image editing. Most existing image editing datasets are synthesized using weak reward models or poor quality filters—for example, by prompting GPT-4o or other VLMs. To address this problem, we built the most powerful image editing reward model available. We first curated a large-scale image editing preference dataset and then trained EditReward on top of it. Our best EditReward model, trained from Mimo-7B, achieves the highest agreement with human experts. We also applied EditReward to filter existing noisy datasets and demonstrated significant improvements. Paper: arxiv.org/abs/2509.26346 Website: tiger-ai-lab.github.io/EditR… Code: github.com/TIGER-AI-Lab/Edit… HF: huggingface.co/collections/T…
Why do open-source image editing models lag behind closed-source giants like GPT-Image-1, Seedream, & Google-Nano-Banana? 🤔 It’s mainly due to the quality of the training reward signal. We’re bridging the gap. Meet EditReward! 🏆
5
51
362
36,419
Thrilled to introduce UniIR, the first unified retriever to handle all types of information seeking needs: 1. text -> text 2. text -> image 3. text -> image + text 4. image -> image 5. image -> text 6. image + text -> text 7. image + text -> image 8. image + text -> image + text.
🚀 Introduce UniIR, a unified instruction-guided multimodal retriever that handles diverse tasks. - 1️⃣model for 8️⃣ retrieval tasks (SoTA w/ Instruction-tuning) - Generalizes to unseen retrieval tasks. - M-BEIR: multimodal retrieval benchmark w/ 10 datasets, 1.1M queries, 5.6M cands.
4
58
351
64,154
I've seen impressive recent results from hybrid Mamba-Transformer architectures, which show significant progress compared to earlier efforts. These hybrid models excel at handling long-context inputs and enable higher throughput. Generally, there are two effective approaches to integrating these architectures: 1. Layer-wise Mixing: Alternating Transformer and Mamba layers within the architecture. 2. Sequence-wise Mixing: Using Mamba to encode the long input sequence and feeding the encoded states to cross-attention layers. Both strategies have demonstrated strong performance and efficiency, particularly in tasks involving extensive context. They basically
5
85
318
31,109
🚀 General-Reasoner: Generalizing LLM Reasoning Across All Domains (Beyond Math) Most recent RL/R1 works focus on math reasoning, but math-only tuning doesn't generalize to general reasoning (e.g. it drops on MMLU-Pro and SuperGPQA). Why are we limited to math reasoning? 1. Existing rule-based verifiers work only for numeric/math answers; they can't verify LaTeX expressions, matrices, arrays, and short statements. 2. No high-quality verifiable data outside math. 📢 We're excited to introduce General-Reasoner, a novel framework that expands LLM reasoning to math, physics, chemistry, finance, business, and more! ✨ Key ideas: - A new dataset **WebInstruct-verified** of verifiable reasoning data across many disciplines. - A model-based generative verifier that can verify short answers like LaTeX expressions, matrices, arrays, and short statements very accurately. 📈 Big gains across science and math benchmarks: +11–13% on MMLU-Pro (30+ domains) +8–9% on SuperGPQA (285+ domains) +9–11% on GPQA slight gains even on MATH, AMC, AIME vs math-RL models like SimpleRL-Zoo. Now we are releasing the preview version! - Github: github.com/TIGER-AI-Lab/Gene…, with all the pointers to models and verifier. - Data: huggingface.co/datasets/TIGE… - Tech Report: github.com/TIGER-AI-Lab/Gene…
8
77
329
44,855
🚨 New Paper Alert 🚨 We found that Supervised Fine-tuning on ONE problem can achieve a similar performance gain as RL on ONE problem with 20x less compute! Paper: arxiv.org/abs/2506.03295 Recently, people have shown that RL can work even with ONE example. This indicates that the strong reasoning capabilities were obtained during the pre-training stage, and RL can serve as an effective approach to unleash this reasoning potential. ⚠️ However, RL is expensive and unstable. Even RL on one example can consume more than 100 A100 GPU hours. RL also suffers from various stability problems. 🧠 Is there an easier approach to unleash the (general) reasoning potential from strong pre-trained LLMs? It turns out that simple SFT (as it is) on minimal data doesn't work and causes severe overfitting issues. ✅ But critique fine-tuning (CFT) on ONE problem can work! With 20x less compute (5 GPU hours), it matches and even surpasses the performance of one-shot-RLVR (arxiv.org/abs/2504.20571). In the paper, we show that CFT on 1 problem can boost the average accuracy of six mathematical benchmarks (MATH-500, AMC, OlympiadBench, etc) by 5-15% across different-sized models. We further test on logic reasoning tasks from BBEH like causal reasoning, disambiguation, etc and show a similar performance gain of 15%. This shows the generalization of one-shot CFT beyond math. 🎯 Therefore, we believe CFT works as a more efficient approach to unleash the hidden reasoning capabilities of the pre-trained LLMs! Website: tiger-ai-lab.github.io/One-S… HF collections: huggingface.co/collections/T… Everything is open-sourced.
11
60
314
44,398
💡 RL/R1 training for Math is taking its turn now. But no RL/R1 for Code Generation. Why? There is very little verifiable training data, almost no reward model. But we are here to ace it! 🚀 Very happy to introduce AceCoder! 1️⃣ We propose a pipeline to automatically create high-quality scalable verifiable code training data in the form of (instruction, [test cases]). You can run the generated program against the test cases to obtain the pass rate, which is our rule-based reward. 2️⃣ We train AceCode-RM (7B & 32B reward models), boosting Llama-3.1 by 10% via Best-of-N sampling. It even lifts Qwen2.5-coder-7B to DeepSeek-V2.5 level on HumanEval, MBPP, BigCodeBench, etc! 3️⃣ RL training with AceCode-RM & rule-based rewards significantly improves Qwen2.5 series models. 4️⃣ R1-style training? ✅ Just 80 steps from Qwen2.5-coder-base → 25% boost on HumanEval-plus & 6% on MBPP-plus. It verifies the possibility to skip SFT training for code models! Our data: huggingface.co/datasets/TIGE… Our reward models: huggingface.co/TIGER-Lab/Ace… Our RL models: huggingface.co/collections/T… Github: github.com/TIGER-AI-Lab/AceC… Temporary Paper Link: github.com/TIGER-AI-Lab/AceC…
8
51
305
25,497
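The rule-based reward described in the AceCoder post, running a generated program against its test cases and using the pass rate, can be sketched as below. This is only an illustration under simplifying assumptions: test cases are plain `assert` statements, `pass_rate` is a hypothetical helper, and `exec` is used without any sandboxing or timeout, which a real pipeline would need.

```python
from typing import List


def pass_rate(program: str, test_cases: List[str]) -> float:
    """Fraction of assert-style test cases that the program passes."""
    if not test_cases:
        return 0.0
    passed = 0
    for test in test_cases:
        namespace: dict = {}
        try:
            exec(program, namespace)  # define the candidate solution (unsafe outside a sandbox)
            exec(test, namespace)     # run a single assert statement against it
            passed += 1
        except Exception:
            # Any failure (assertion, syntax error, runtime error) counts as not passed.
            pass
    return passed / len(test_cases)


program = "def add(a, b):\n    return a + b\n"
tests = [
    "assert add(1, 2) == 3",
    "assert add(0, 0) == 0",
    "assert add(2, 2) == 5",  # deliberately wrong expectation, so this case fails
]
reward = pass_rate(program, tests)
```

Here two of the three test cases pass, so the reward is 2/3. The same number can serve as a Best-of-N ranking signal or as an RL reward.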
Dear {{full_name}}!
20
3
298
28,850
Assistant Professor at US university. Final annual compensation: 120K-160K 🫣🫣🫣
17
19
282
225,148
Really confused by why people like to re-invent new terms to rebrand old concepts. CAG is basically the memory-based transformers back in 2021. Check out some nice ones from my friends: arxiv.org/abs/2006.11527 arxiv.org/abs/2110.06176 arxiv.org/abs/2004.07202 Also my own paper: arxiv.org/abs/2204.04581
𝗥𝗔𝗚 (Retrieval Augmented Generation) vs. 𝗖𝗔𝗚 (Cache Augmented Generation). There has been a lot of buzz surrounding CAG lately. Let’s see what the differences are between RAG and CAG: 𝘙𝘈𝘎 These are the steps for implementing generation for naive RAG: 𝟭. Embed a user query to be used for contextual search via vector DBs or move straight to the step 2 if no contextual search is required. 𝟮. If Contextual search is required, query the context store to retrieve relevant context. If it is not required, use other means to search for relevant data 𝟯. Combine original user query with the system prompt that instructs the final answer construction. 𝟰. Enrich the final prompt with external context retrieved in step 2. 𝟱. Return the final answer to the user. 𝘊𝘈𝘎 𝟭. Pre-compute all of the external context into a KV Cache of the LLM. Cache it in memory. This only needs to be done once, the following steps can be run multiple times without recomputing the initial cache. 𝟮. Pass the system prompt including user query and the system prompt with instructions on how cached context should be used by the LLM. 𝟯. Return the generated answer to the user. After this, clear any generations from the cache and keep only the initially cached context. This makes the LLM ready for next generations. 𝘔𝘺 𝘵𝘩𝘰𝘶𝘨𝘩𝘵𝘴: ➡️ While it has only been described in a white paper for the first time, it is not a novel concept. We have been using different variations of CAG since Anthropic and OpenAI introduced Prompt Caching. ❌ While it might sound strong on paper, LLMs continue to suffer in accuracy while working with extensively long context. ❌ In real use cases, especially enterprise, CAG would cause a lot of security issues due to inability to isolate data. ❌ CAG does not work with constantly changing data as KV Cache would need to be continuously recomputed. ✅ CAG is strong when you need to cache a reasonable amount of static data that is not sensitive. 
✅ Real magic happens when you combine RAG and CAG into a single system. More on it in future posts, stay tuned in! Have you played with CAG already? Let me know in the comments 👇 #LLM #AI #MachineLearning Want to learn how to build an Agent from scratch without using any LLM Orchestration framework? Check out my article here: newsletter.swirlai.com/p/bui…
4
38
289
41,635
🔥 How do you build a state-of-the-art Vision-Language Model with direct RL? We’re excited to introduce VL-Rethinker, a new paradigm for multimodal reasoning trained directly with Reinforcement Learning. 📈 It sets new SOTA on key math+vision benchmarks: - MathVista: 80.3 → 🥇 (+6.4 vs GPT-o1 73.9) - MathVerse: 61.7 → 🥇 (+4.7 vs GPT-o1 57.0) - MathVision: 43.9 → 🥇 (+1.7 vs GPT-o1 42.2) 💡 How did we do it? We adapt the GRPO algorithm and introduce two key innovations: - Selective Sample Replay (SSR): A novel value-based replay strategy that addresses vanishing advantages in long-horizon reasoning by reusing high-quality rollouts across iterations. This significantly stabilizes policy updates in direct RL without relying on supervised warm-starting. - Forced Rethinking: To combat the lack of self-reflection in purely RL-trained models, we introduce a reasoning trigger appended to early rollouts. This explicitly encourages the model to "think again" before finalizing its answer—leading to stronger consistency and higher success rates in multi-step reasoning. Together, these two techniques make VL-Rethinker-72B the first VLM to surpass GPT-o1 significantly. This work opens the door for future slow-thinking multimodal agents that can perform effective self-reflection. Paper: arxiv.org/abs/2504.08837 Code: github.com/TIGER-AI-Lab/VL-R… Website: tiger-ai-lab.github.io/VL-Re…
9
62
288
24,790
I am looking for 2-3 students for my group in CS department, University of Waterloo. I am specifically interested in 1) making NLP models more grounded on external world knowledge, 2) integrating knowledge of different forms like tables/graph/text/images during machine reasoning
13
69
280
Can someone from frontier labs show some ablation studies for "internal noble RL vs. GRPO"? I am eager to know how far behind we are!
12
6
285
37,749
Am I the only one who thinks professors should spend more than 20% of their time coding?
25
4
274
47,457
Shouldn't that be placed in China or at least Asia, given that the majority of attendees with visa issues are from China or other Asian countries?
We're excited to announce a second physical location for NeurIPS 2025, in Mexico City. By expanding our physical locations, we hope to address concerns around skyrocketing attendance and difficulties in obtaining travel visas that some attendees have experienced in the past few years when only one location was available. Read more in our blog post: blog.neurips.cc/2025/07/16/n…
14
8
280
51,271
I have been testing GPT-oss-120b for a while. My initial feeling is that the model hallucinates a lot! It's definitely way worse than gpt-o4-mini. My hunch is that the model is completely distilled from GPT-5 or GPT-o4 with massive synthetic reasoning tokens, which contain too much hallucination.
16
21
285
36,216
Happy to share our recent paper "Re-Imagen": arxiv.org/abs/2209.14491 The existing text-to-image generation models are not particularly good at generating very specific entities like a specific person, a specific film character, a specific dog, especially when they are infrequent.
4
51
264
Tired of RLVR and RLHF? Want to explore new possible RL algorithms? 🔥 Introducing our new RL algorithm: Critique Reinforcement Learning (CRL)! CRL can train a 4B coder model to reach 62% on LiveCodeBench-V5, surpassing the 14B DeepCoder model. 🧠 Critique-RL (CRL) is fundamentally different from RLVR/RLHF: Traditional RL trains models to produce answers. CRL trains models not to produce answers, but to critique a given solution, i.e., think step by step to judge whether it is right or wrong. LLMs are rewarded for generating a “good” critique that leads to the correct final judgment (True/False). 🚀 We adopt CRL to train Critique-Coder on the rStar-Coder dataset with GRPO. Our 4B and 8B both reach the highest performance of their size. Arxiv: arxiv.org/abs/2509.22824 Website: tiger-ai-lab.github.io/Criti… HF Release: huggingface.co/collections/T…
6
52
269
21,567
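The reward described above, crediting a critique only when its final True/False judgment matches whether the solution is actually correct, might look roughly like this. The verdict-parsing rule (take the last "true" or "false" in the text) is an assumption made for illustration and is not taken from the paper.

```python
import re


def critique_reward(critique: str, solution_is_correct: bool) -> float:
    """Return 1.0 if the critique's final True/False verdict matches the ground truth."""
    verdicts = re.findall(r"\b(true|false)\b", critique.lower())
    if not verdicts:
        return 0.0  # no parsable judgment, so no reward
    predicted_correct = verdicts[-1] == "true"
    return 1.0 if predicted_correct == solution_is_correct else 0.0


good = critique_reward("The loop is off by one, so the output is wrong. Judgment: False", False)
bad = critique_reward("Everything looks fine. Judgment: True", False)
missing = critique_reward("I am not sure about this one.", True)
```

The model is never rewarded for producing the solution itself, only for judging a given one correctly.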
I have a feeling that NLP is probably not the most suitable research direction for academia any more. I am eager to know what academia could do better than the big tech companies in terms of "impactful" NLP research.
37
22
261
Our MMLU-Pro paper is out. It's a more difficult, robust and reasoning-driven benchmark to measure expert-level intelligence. We have gradually included 50+ models in our leaderboard: huggingface.co/spaces/TIGER-…. GPT-4o, Gemini-1.5-Pro, Claude-3-Opus are the current top-3 models. Great work led by @YuboWang726 and @xueguang_ma, and in collaboration with other awesome contributors.
MMLU-Pro A More Robust and Challenging Multi-Task Language Understanding Benchmark In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve
6
44
257
59,465
Our general-reasoner (arxiv.org/abs/2505.14652) came out in March this year and has been accepted by NeurIPS. We are among the first few works to extract QA from pre-training data for RL. No comparison, no citation to our paper at all 😂
🚀 Scaling RL to Pretraining Levels with Webscale-RL RL for LLMs has been bottlenecked by tiny datasets (<10B tokens) vs pretraining (>1T). Our Webscale-RL pipeline converts pretraining text into diverse RL-ready QA data — scaling RL to pretraining levels! All codes and datasets are open-source! Paper: arxiv.org/abs/2510.06499 ✨ Key features: • Converts web-scale corpus into millions of verifiable QA pairs • Preserves pretraining-level diversity across 9 domains • Trains up to 100× more token-efficient than continual pretraining • Powers the Webscale-RL dataset (1.2 M examples) for scalable RL Also special thanks to my colleagues in Salesforce AI Research @SFResearch! @HaolinChen11, Shiyu, @LiuZuxin, @huan__wang, @CaimingXiong, @iscreamnearby
7
23
250
46,557
Doesn't look good. Llama3-V faces plagiarism charges. It's astonishing to see this happening even at Stanford.
Shocked! Llama3-V project from a Stanford team plagiarized a lot from MiniCPM-Llama3-V 2.5! its code is a reformatting of MiniCPM-Llama3-V 2.5, and the model's behavior is highly similar to a noised version of MiniCPM-Llama3-V 2.5 checkpoint. Evidence: github.com/OpenBMB/MiniCPM-V…
9
22
254
131,839
# RA/Internship success rate I have received lots of emails from different people saying that they want to do (remote) research/internship (not PhD/master's) with me. They all seemed enthusiastic, so I tried mentoring a few of them. However, almost none of them worked out, maybe 2/20.
53
19
248
108,978
Excited to introduce our latest math generalist model MAmmoTH 🦣, built through instruction tuning. We proposed hybrid "chain-of-thought" & "program-of-thought" training to supercharge LLMs' math reasoning capabilities. 🦣 beats the open SoTA by 20+% on many datasets like MATH.
8
39
245
47,051
All the slides and recorded videos are now uploaded to the course website: cs.uwaterloo.ca/~wenhuche/te… Kudos to all the great students taking the course!
My course "Recent Advances on Foundation Models" at Waterloo is public. Check out cs.uwaterloo.ca/~wenhuche/te…. In the course, we cover lots of interesting topics including transformers, LLM, pre-training, quantization, sparse attention, instruction tuning, RLHF, prompting, Vision transformers, diffusion models, multimodal models, agents, RAG, etc. I will continue to upload the slides (ppt) to the website. Some of them will also have recorded videos soon. There are already 12 lecture slides available now. These slides are made by the awesome attendees of the course.
58
251
39,548
We have made huge progress in language model reasoning. But our progress in multimodal reasoning (like MMMU) is very limited. Why? It's due to the lack of diverse, difficult and high-quality multimodal reasoning datasets! 🚀 New Paper Alert! 📢 We introduce VisualWebInstruct, a novel approach to scale up multimodal reasoning datasets from the Internet using Google Image Search! 🔍 How? - We meticulously selected 30K seed images and then leveraged search engines (Google Image Search) to locate websites with plenty of multimodal reasoning data, like forums or exam websites. - We performed comprehensive extraction, filtering and LLM-based cleaning and refining to harvest around 900K QA pairs from over 700K unique URLs, with 40% as visual QA pairs. 🔥 Results? - Fine-tuning on Llava-OV-mid: +10-20% absolute gains - Fine-tuning on MAmmoTH-VL: +5% absolute gain - MAmmoTH-VL2 achieves SoTA on: 📊 MMMU-Pro-std: 40.7% 🔢 MathVerse: 42.6% 🧮 DynaMath: 55.7% Our work highlights the power of web-scale multimodal data mining for enhancing VLMs' reasoning abilities! Paper: arxiv.org/abs/2503.10582 Website: tiger-ai-lab.github.io/Visua… Dataset: huggingface.co/datasets/TIGE… MAmmoTH-VL2: huggingface.co/TIGER-Lab/MAm… Github: github.com/TIGER-AI-Lab/Visu…
3
52
248
43,384
Had some really interesting discoveries recently: suppose a model's score on one benchmark is extremely stable. Let's say a model always gets 62% on SWEBench no matter what prompt or scaffold you use. It DOES NOT mean that the model is robust. It actually means that the model is CONTAMINATED on SWEBench, i.e. directly trained on the test set or a paraphrase of the test set. This could possibly become a good metric for detecting contamination. We will provide more empirical results later on.
12
13
251
39,119
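A rough sketch of how the stability signal in the SWEBench post above could be measured: run the same model under several prompts or scaffolds and look at the spread of its scores. The function names, the example scores, and the `max_std` threshold are all hypothetical; the post does not give a formula, and a low spread is only a hint, not proof of contamination.

```python
from statistics import mean, pstdev


def score_spread(scores):
    """Return (mean, population std dev) of accuracy across prompt/scaffold variants."""
    if len(scores) < 2:
        raise ValueError("need at least two prompt/scaffold variants")
    return mean(scores), pstdev(scores)


def looks_suspiciously_stable(scores, max_std=0.5):
    """Flag a model whose score barely moves across variants.

    max_std is an arbitrary illustrative threshold (in accuracy points),
    not a calibrated value.
    """
    _, std = score_spread(scores)
    return std <= max_std


# Made-up accuracies (%) for one benchmark under four different scaffolds.
stable_model = [62.0, 62.1, 61.9, 62.0]
varied_model = [48.0, 62.0, 55.5, 70.0]

print(looks_suspiciously_stable(stable_model))
print(looks_suspiciously_stable(varied_model))
```

In practice the threshold would have to be calibrated against models known to be clean on the benchmark.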
Gemini-2.0 makes a huge leap on our MEGA-Bench leaderboard to beat all the competitors! With the other benchmarks being either overfitted or leaked, I believe MEGA-Bench serves as a more reliable indicator of multimodal models' true ability to generalize to 505 real-world tasks. Leaderboard Link: huggingface.co/spaces/TIGER-… Congrats to Gemini team @OfficialLoganK @JeffDean
9
29
247
28,275
Finally, music has reached its BERT moment. In this paper, we propose a self-supervised music understanding model, which achieves SOTA performance on 14 music related tasks. arxiv.org/abs/2306.00107.
1/ Excited to announce the release of our new paper "MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training"! We propose a self-supervised music understanding model, attaining overall SOTA performance on 14 MIR tasks. arxiv.org/abs/2306.00107
3
35
239
48,793
If Zuck wants to increase success rate, he should go for academics who earn 150K/year.
the people at Thinking Machines who Zuck would want to hire for that much money already have generational wealth from OpenAI/Anthropic stock. they’re scientists who don’t care about yachts and jets. there’s no number you can buy them for. you have to sell them on the mission
7
4
240
29,539
Yang's blog yang-song.net/blog/2021/scor… is really a gem. It starts from the very basic score-based matching, and then to stochastic differential equations, and finally to diffusion model. It helped a lot in understanding the foundations of these generative models. Highly Recommended!
46
236
21,988
ICLR reviews are out. I found some interesting trends for paper scores. 1. Math and code LLM papers are favored by most reviewers. 2. Video generation papers also received high scores. 3. Agent LLM papers are mixed. 4. Some internet-famous papers are getting pretty low scores.
4
11
230
48,033
New Arxiv Alert! arxiv.org/abs/2405.01483 We propose Mantis: interleaved instruction tuning to enable large-multimodal language models to reason over multiple pieces of images and text. By only training on our high quality 720K instruction data, we can achieve SoTA on five multi-image benchmarks! It's even beating idefics2 by an average 9 points. It's supported by Llama3 backbone and SigLIP encoder! All code, data, and eval are released! Homepage: tiger-ai-lab.github.io/Manti… Instruction Data: huggingface.co/datasets/TIGE… Demo: huggingface.co/spaces/TIGER-… Work led by my awesome student @DongfuJiang, in collaboration with @sivil_taram and other folks!
8
39
245
30,125
EMNLP reviewer 2: why don't you compare with Llama 3.1? Llama 3.1 also has similar MCTS components for post-training.
16
12
232
39,497
I didn't expect this tweet would spark so much discussion. I believe our university curriculum should be upgraded to cater to the current trend. Undergrads lose interest in linear algebra because they don't know what this knowledge is meant for. Deep learning is actually a very ...
I honestly think the undergrads should spend more time on understanding the fundamentals of math and science instead of chasing all these AI hypes. Knowing how Vicuna differs from Alpaca probably won't matter much in a year, but knowing how to do SVD will always help you ...
11
24
216
93,227
Very interesting analysis! However, we found that you can actually achieve the same performance on the same ONE example with a variant of SFT -> CFT (critique fine-tuning) arxiv.org/abs/2501.17703. It's much much much faster than RL on ONE example! Here is a teaser for our early results. We will release the One-Shot CFT paper in the coming days.
"RL with only one training example" and "Test-Time RL" are two recent papers that I found fascinating. In the "One Training example" paper the authors find one question and ask the model to solve it again and again. Every time, the model tries 8 times (the Group in GRPO), and a gradient step is performed, to increase the reward which is a very simple verification of the correct answers, repeated thousands of times on the same problem. The shocking finding is that the model does not overfit to this one question: RL on one example, makes the model better in MATH500 and other benchmarks. (If instead you did SFT repeating one training question-solution finetuning, the model would quickly memorize this answer and overfit). But with RL, the model has to solve the problem itself, since it only sees the question, not the answer. Every time it produces different answers, and this seems to prevent overfitting. The other papers are relying on the same phenomenon: you can have a small number of training questions and re-solve them thousands of times. You can do this for the test set (as test-time RL does) and still not overfit. We also independently saw this by doing RL training on half the test set and seeing benefits in the other half for BFCL agents. My thought now is that this shows our RL learning algorithm must be extremely inefficient. When a human is learning by solving a math puzzle, they immediately learn what they can learn by solving it once (or twice). No further benefit would come by assigning the same homework problem to students a tenth time. But in RL, we keep asking the model to re-solve the same question thousands of times, and the model slowly gets better. We should be able to have much better RL learning algorithms since the information is there. (1/2)
12
29
228
32,385
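To make the "tries 8 times (the Group in GRPO)" step in the quoted post concrete, here is a minimal sketch of the group-relative advantage: each sampled answer is scored by a simple verifier, and its reward is normalized against the other samples in the same group. The sample answers and helper names are made up for illustration, and real implementations differ in details such as which standard deviation they use.

```python
from statistics import mean, pstdev


def verify(answer, gold):
    """Very simple verifiable reward: 1.0 for an exact match, else 0.0."""
    return 1.0 if answer.strip() == gold.strip() else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (mean and population std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight hypothetical samples for the same question, whose gold answer is "12".
samples = ["12", "13", "12", "7", "12", "11", "12", "9"]
rewards = [verify(s, "12") for s in samples]
advantages = group_advantages(rewards)

for s, a in zip(samples, advantages):
    print(s, round(a, 3))
```

When every sample in the group gets the same reward, all advantages are zero, so a group that is entirely right or entirely wrong contributes no gradient signal.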
It seems that lots of people don't know the backstory of this. The very brief version is that: A large share (probably more than half) of Chinese-nationality STEM MS/PhD applicants accept offers from Canadian universities but can't get their visas on time. The Canadian immigration office does background checks on these students, which take at least six months, and most of them take longer than 1-2 years. This prevents the students from enrolling on time. The background check is quite random. But it happens more often if you have already published some AI papers. So the stronger you are, the less likely you are to get the visa! The direct consequence is that these students will go elsewhere after learning they are being background checked. Even worse, good students will just avoid applying to Canada at all. I personally lost 4 very talented PhD students due to this visa delay. Ironically enough, most of them applied in the next cycle to US schools and got their US visas immediately.
Canada is losing its leadership in AI because your immigration office blocked almost all the talented grad students from China and other places. Cohere would have been much stronger if they could hire these talented grad students.
9
20
221
50,197
I love simple yet effective things. However, reviewers never agree with me on that.
15
16
218
23,555
I realized that many of those "incoming faculty" finally joined industry after a gap year.
15
5
222
32,828
The Chinese "Open"AI companies are turning the Chinese New Year into a celebration for the entire global AI community. 1. Deepseek-R1: nitter.app/deepseek_ai/status/188… 2. Kimi-k1.5 nitter.app/Kimi_ai_/status/188133… Now the secret of o1 (a lot of people knew it already) is out. No PRM, no MCTS, no complex recipe. Large-scale verifiable data will let the reasoning and self-reflection emerge with any RL algorithms!
🚀 Introducing Kimi k1.5 --- an o1-level multi-modal model -Sota short-CoT performance, outperforming GPT-4o and Claude Sonnet 3.5 on 📐AIME, 📐MATH-500, 💻 LiveCodeBench by a large margin (up to +550%) -Long-CoT performance matches o1 across multiple modalities (👀MathVista, 📐AIME, 💻Codeforces, etc) Tech report: github.com/MoonshotAI/Kimi-k… Key ingredients of k1.5 -Long context scaling. Up to 128k tokens for RL generation. Efficient training with partial rollouts. -Improved policy optimization: online mirror descent, sampling strategies, length penalty, and others. -Multi modalities. Joint reasoning over text and vision.
9
18
207
19,889
Tired of fine-tuning image generation models on each subject you care to generate? Today, we release SuTI, a zero-shot subject-driven text-to-image generator that operates fully in-context without tuning. One SuTI model is all you need! Website: open-vision-language.github.…
13
46
213
45,542
We updated MMLU-Pro leaderboard with some recent models like Reflection, GPT-4o (0806) and Arx-0.3 (A startup by Thomas Baker).
12
27
200
86,441
Yay, celebrating the equilibrium.
6
2
205
How to combine long-context LLMs with RAG? We are happy to introduce LongRAG, a new approach to boost RAG with long-context LLMs. 1. We enlarge the retrieval units to 4K tokens, which is 30x longer than in traditional RAG systems like DPR, RAG, FiD, Atlas, etc. 2. Retrieval becomes much easier with larger units! Recall increases significantly from 52% -> 71%. 3. The reader's job becomes harder, but we have really good long-context readers like GPT-4o. 4. Without any training, we reach 62.7% on NQ and 64.3% on HotpotQA (full-wiki). This is on par with the SoTA fully-trained RAG models like Atlas and IRRR+. 5. Using larger retrieval units can turn multi-hop questions into single-hop questions. This removes the need for iterative retrieval on HotpotQA. Our approach is very easy to use. No training is needed! No iterative retrieval is needed! Paper: arxiv.org/pdf/2406.15319 Everything is released, all the pointers are listed in tiger-ai-lab.github.io/LongR… Work led by @Ernestzyj and @xueguang_ma from TIGER-Lab.
Enhancing RAG with Long-context LLMs Proposes LongRAG, which combines RAG with long-context LLMs to enhance performance. Uses a long retriever to significantly reduce the number of extracted units by operating on longer retrieval units. The long reader takes in the long retrieval units and leverages the zero-shot answer extraction capability of long-context LLMs to improve performance of the overall system. Claims to achieve 64.3% on HotpotQA (full-wiki), which is on par with the state-of-the-art model. Quote from the paper: "The improvement in retriever can significantly benefit the reader model. By exploiting the long-context understanding ability of GPT-4o, LongRAG can achieve an EM of 62% on NQ and 64% on HotpotQA. These results could be comparable to the strongest finetuned RAG models like Atlas and MDR." What's impressive with this work is that they can significantly reduce retrieval units and increase overall recall on various benchmarks using long-context retrieval. Lots of people are quick to dismiss RAG or long-context LLMs but this work shows the opportunity to mix what looks like competing ideas to achieve even better results.
2
63
203
38,516
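A toy sketch of the "larger retrieval units" idea from the LongRAG post above: pack short passages into units of up to roughly 4K tokens before indexing, so the retriever searches far fewer, longer units. This is plain greedy packing with whitespace token counts, both assumptions made for illustration; it is not the grouping procedure from the paper, and `pack_into_units` is a hypothetical helper.

```python
def pack_into_units(passages, max_tokens=4000):
    """Greedily concatenate consecutive passages into units of at most max_tokens.

    Tokens are approximated by whitespace splitting (an assumption; a real
    system would use the reader model's tokenizer).
    """
    units, current, used = [], [], 0
    for passage in passages:
        n = len(passage.split())
        # Close the current unit if this passage would overflow the budget.
        if current and used + n > max_tokens:
            units.append(" ".join(current))
            current, used = [], 0
        current.append(passage)
        used += n
    if current:
        units.append(" ".join(current))
    return units


# 90 synthetic passages of 100 tokens each.
passages = ["word " * 100 for _ in range(90)]
units = pack_into_units(passages)

print(len(passages), "passages ->", len(units), "retrieval units")
```

With fewer, longer units the retriever has an easier job, and the burden shifts to the long-context reader, which is the trade-off the post describes.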
Now I realize the real benefits of being in academia. No coding interviews!
6
8
190
54,775
Genuinely curious: is it a good time for academic NLP people to switch gears a little bit and work on more interdisciplinary stuff? If so, what interdisciplinary direction "NLP+X" would you recommend?
39
20
197
65,458
A sad truth about evaluation is that: If you make a private test set for your benchmark, people just won't adopt it. We have our official MMMU private test set hosted in EvalAI (eval.ai/web/challenges/chall…), but everyone is still reporting validation score. I found it's similar for MathVista, where everyone is just reporting testmini score.
9
11
196
83,239
Big News! Meet our strongest fully open-source 7B-LLM Neo. We release its 4.7T pre-training data Matrix and entire codebase at MAP-Neo! 1. Neo-7B significantly beats the existing fully open-source models like OLMo and Amber across the board. 2. Neo-7B remarkably surpasses Llama-2-7B and approaches Mistral in several dimensions like reasoning, coding and math. 3. The remarkable performance comes from our unique ways of recalling high-quality data from CC. The improved Megatron-LM training framework is also critical to the success. Data processing pipeline and improved Megatron-LM codebase: github.com/multimodal-art-pr… Dataset: huggingface.co/datasets/m-a-… Model: huggingface.co/m-a-p/neo_7b Kudos to all the MAP team members! I take very little credit for this. We will have the Neo-Instruct and paper coming out soon.
I'm extremely excited to announce "the big bomb"!: Neo and Matrix, that we're working on with colleagues and friends from open-source community, M-A-P.ai, wuhan ai, and 01.ai. Neo is the first fully-transparent bilingual large language model, with fully open-sourced pretrain corpus, data processing pipeline, training framework manipulated from Megatron-LM, intermediate ckpts, and relatively smaller ckpts for investigating scaling law. Matrix is a 4.7 trillion tokens directly adoptable pretrain corpus, which has gone through strict heuristic rules-based filtering and deduplication. The computational resource is supported by 01.ai and wuhan.ai. Kudos to my colleagues! @01AI_Yi @MM_Art_Project Neo Model Series: huggingface.co/collections/m… Matrix: huggingface.co/datasets/m-a-… We name the series as Neo and Matrix as a salute to the movie, the MATRIX! Neo has notably better performance on the metrics of reasoning, math, code, and Chinese, as shown the following!
2
32
193
44,793
Mamba is accepted to COLM 2024! Should I congratulate Albert/Tri or COLM?
3
4
193
21,596
I overheard that it's really tough for new grads to find jobs in frontier labs. Is that true? Are there any statistics regarding this trend?
10
3
195
38,206
Happy to share that Re-Imagen is accepted to #ICLR2023. Arxiv: arxiv.org/abs/2209.14491 In Re-Imagen, we are able to generate novel images about specific entities/objects without any tuning within 30 secs. Some generated examples are shown here:
4
20
190
27,806
A piece of advice for students who are preparing for PhD application interviews. - You don't need to present 10 projects you worked on and spend only 2 minutes on each of them, only scratching the surface. - You only need to dive deep into one single project and explain it well.
5
10
192
I just realized that, in addition to MMLU and MATH, Dan Hendrycks was also the **first author** of ImageNet-R, ImageNet-A and Outlier Exposure. How can someone be so impactful? Much respect!
9
5
182
25,223
Somehow people doing LLMs have started to call everything related to retrieval augmentation RAG and only cite the Facebook RAG 2020 paper. It kind of obliterates a lot of the effort in this field 😞, especially the work done by my colleagues at Google Research.
10
10
183
83,523
#ACL2023NLP Can large language models reason over a large-scale knowledge graph (like Freebase) to answer complex multi-hop questions with only a few demonstrations? The answer is yes! Our recent paper (arxiv.org/abs/2305.01750) proposes the first in-context KBQA framework.
6
42
179
22,479
We are super excited to announce Verl-Tool, which is a user-friendly framework to support diverse types of agentic training with RL. github.com/TIGER-AI-Lab/verl… We now support Code-Interpreter, Pixel Operations, Browser, and Bash. If you need to support your own tool or environment, the process is very easy: ``` Go to the ./verl_tool/agent_workers/reward_manager directory and add your new reward manager. Then, make sure to update the verl_tool/trainer/main_ppo.py file to include your new reward manager. ``` With verl-tool, you can easily train the Qwen-math-7B model to achieve 40+ on AIME24. We will release a technical report soon to introduce our results across a wide range of agentic tasks.
Introducing VerlTool - a unified and easy-to-extend tool agent training framework based on verl. Recently, there's been a growing trend toward training tool agents with reinforcement learning algorithms like GRPO and PPO. Representative works include SearchR1, ToRL, ReTool, and ToolRL. While these achieve impressive performance, their training codes are either not fully open-sourced or too difficult to modify and customize with new tools, creating unexpectedly high engineering costs for the community when exploring new ideas. To address these issues and reduce engineering overhead, we propose verl-tool. Key Features: 1. 🔧 Complete decoupling of actor rollout and environment interaction - We use verl as a submodule to benefit from ongoing verl repo updates. All tool calling is integrated via a unified API, allowing you to easily add new tools by simply adding a Python file and testing independently. 2. 🌍 Tool-as-environment paradigm - Each tool interaction can modify the environment state. We store and reload environment states for each trajectory. For each training, you can launch 3. ⚡ Native RL framework for tool-calling agents - verl-tool natively supports multi-turn interactive loops between agents and their tool environments. 4. 📊 User-friendly evaluation suite - Launch your trained model with OpenAI API alongside the tool server. Simply send questions and get final outputs with all interactions handled internally. We've successfully reproduced ToRL results using our verl-tool framework, demonstrating its correctness and demonstrating comparable performance on mathematical benchmarks. VerlTool is an active ongoing project! We aim to incorporate more tools covering a wide range of use cases and expect they can be trained together in a single framework. Suggestions and contributions are highly welcomed! Check out our GitHub: github.com/TIGER-AI-Lab/verl… More details: 👇 (0/4)
2
21
188
20,710
Totally agree. We experimented with image-only input for every task. The results are quite good. Check out our early paper PixelWorld: arxiv.org/abs/2501.19339
I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter. The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input. Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in: - more information compression (see paper) => shorter context windows, more efficiency - significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images. - input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful. - delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. Tokenizers are ugly, separate, not end-to-end stage. It "imports" all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go. OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa. So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to. Now I have to also fight the urge to side quest an image-input-only version of nanochat...
5
11
184
40,482
Replying to @alexandr_wang
Come on! Your last name is Wang.
12
177
10,694
Wishing everyone a happy Chinese New Year!
3
21
178
11,713
Many academic labs (including mine) can't even afford a single H100 server. There are much better ways to spend money than 500B mostly on GPUs for one company, which already has plenty of them. That money could lead to extraordinary innovation in academia.
Imagine if academia were given 500B for AI research... it would be absolutely revolutionary compared to one company stockpiling GPUs. That's like 100 CMUs. For 0.01% of that money the right lab could profoundly advance the field.
8
12
169
38,460
The gap between open-sourced models and closed-source models is getting larger and larger. What should academia do to catch up?
37
10
173
134,266
Finally defended my thesis and became Dr. Chen! I want to express my deepest gratitude to my committee members, my family, my friends who have supported me throughout my Ph.D. journey.
15
3
171
New Preprint: arxiv.org/abs/2210.06710 Large Language Models (GPT-3) are 1-shot table reasoners. Though not specifically trained or optimized for table understanding, we found that the large language models are quite competent at complex table reasoning. With only 1 demonstration
7
25
171
Announcing MAmmoTH2: tiger-ai-lab.github.io/MAmmo… Let's scale up instruction tuning! We believe that the web corpus contains massive amounts of naturally existing high-quality instruction tuning data to enhance LLM reasoning. We propose a pipeline to discover it. We managed to harvest 10M instruction data (named WebInstruct), which is exactly the same size as Llama3's instruction data! So we train from Llama-3-base to do an apples-to-apples comparison. We are able to outperform Llama-3-Instruct on all the reasoning benchmarks. Also, we can match it on the general chatbot benchmark MT-bench. We think this result is quite encouraging and demonstrates the quality of our web-mined instruction data! Our best model is based on Mixtral-8x7B. We built a demo at huggingface.co/spaces/TIGER-…. All of our models are released under huggingface.co/TIGER-Lab. Llama3-70B version is on the way! Hopefully, we can beat the official Llama3-70B-instruct again!
MAmmoTH2: Scaling Instructions from the Web - Proposes a paradigm to efficiently harvest 10M instruction data from web corpus to enhance LLM reasoning - 11% -> 34% on MATH and 36% -> 67% on GSM8K proj: tiger-ai-lab.github.io/MAmmo… abs: arxiv.org/abs/2405.03548
6
31
168
72,419
Super thrilled to share WebExplorer, which is a simple yet effective approach to train long-horizon web agents. Instead of depending heavily on rigid pre-defined graph structures, WebExplorer utilizes a model-based exploration strategy to synthesize high-quality agentic data. Our 8B model is able to outperform most 32B or even 72B models on BrowseComp and HLE. Check out our paper at arxiv.org/abs/2509.06501.
WebExplorer Explore and Evolve for Training Long-Horizon Web Agents
4
15
170
32,163
I'm super excited to share our recent work OmniEdit, an omnipotent editing model to handle all different types of editing requests including addition, removal, swapping, environment, background, style, etc. The best part is the **highest-quality** 1.2M high-resolution image editing dataset in huggingface.co/datasets/TIGE…. The biggest blocker in image editing is the lack of high-quality editing pairs. Most existing released datasets are highly noisy and low-resolution, with strong artifacts. This basically blocks progress in this area. We spent **8 months** experimenting with many approaches to synthesize and filter clean image editing pairs. Eventually, we built seven specialized pipelines to propose a massive number of candidates and then prompted GPT-4o to assign quality scores to these candidates. We took the highest-ranked candidates as our 1.2M training data.
6
24
172
19,272
Looking for the best open-source (small) Math model? I'm happy to release MAmmoTH-7B-Mistral (huggingface.co/TIGER-Lab/MAm…), which achieves 40% on MATH and 52% on MMLU-Math. Nothing fancy, I just fine-tuned Mistral-7B on our previous MathInstruct dataset (huggingface.co/datasets/TIGE…).
5
31
169
29,150
How many people got their ICCV paper rejected due to co-authors being identified as irresponsible reviewers? This is indeed a harsh policy for the (responsible) first author, who has no control over the behavior of their co-authors.
13
3
172
45,271
It seems that a lot of people don't see LLM as a part of NLP. They see it as a totally standalone interdisciplinary research area.
19
11
158
30,974
Replying to @jbhuang0604
R3: this paper doesn't release its code and data. It has no contribution to the community. Strong reject!
3
3
163
12,284
NeurIPS has been incredibly well-organized this year. It’s truly amazing to see so many brilliant minds working together to push the AI boundaries. While it’s disheartening to witness instances of racism, I’m deeply encouraged by the solidarity shown by many non-Chinese colleagues who are speaking up for fairness and inclusivity on social media. I deeply believe inclusiveness is the core of our research community!
1
4
167
14,608
Thrilled to work with @JiachenLi11 to release T2V-Turbo, which is a very fast yet high-quality consistency model. With only 4 diffusion steps (5 seconds), it can produce high-quality video. T2V-Turbo currently ranks first on VBench (huggingface.co/spaces/Vchite…), beating other competitors like Pika and Runway Gen-2. We created a demo at: huggingface.co/spaces/TIGER-… T2V-turbo Website: t2v-turbo.github.io/.
T2V-Turbo Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To
2
40
158
60,753