Sharing AI research. Early work on AI (GPT-J, scaling, MoE). Ex ML PhD (GT) & Google.

OpenAI did what used to be considered impossible. They made people want to use Bing.
195
1,541
16,642
1,112,000
The leap from o1 to o3 is exponential, completely bypassing o2. If this pattern holds, o3 won’t lead to o4—it’ll jump straight to o9.
349
259
6,729
371,400
Don't forget to check your DMs
68
48
3,133
243,839
Google presents an AI system to write expert-level scientific software. Using LLMs + tree search, it invented novel methods in bioinformatics, epidemiology, geospatial analysis & more, often surpassing human SOTA. (1/4)
62
504
3,097
534,825
When you generate images with VQGAN + CLIP, the image quality dramatically improves if you add "unreal engine" to your prompt. People are now calling this "unreal engine trick" lol e.g. "the angel of air. unreal engine"
47
354
2,446
BloombergGPT: A Large Language Model for Finance Presents BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. arxiv.org/abs/2303.17564
48
429
2,393
503,698
Large Language Models are Zero-Shot Reasoners Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3. arxiv.org/abs/2205.11916
56
534
2,407
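The prompt trick in the post above can be sketched as a two-stage template: first elicit reasoning, then ask for the final answer. This is only a rough sketch. `complete` is a hypothetical stand-in for any text-completion call, and the answer cue wording is an assumption, not something the post specifies.

```python
from typing import Callable

TRIGGER = "Let's think step by step."


def build_reasoning_prompt(question: str) -> str:
    """Put the trigger phrase right where the answer would start."""
    return f"Q: {question}\nA: {TRIGGER}"


def zero_shot_cot(
    question: str,
    complete: Callable[[str], str],  # hypothetical completion function
    answer_cue: str = "Therefore, the answer is",  # assumed wording
) -> str:
    """Stage 1 elicits reasoning; stage 2 asks for the final answer."""
    reasoning_prompt = build_reasoning_prompt(question)
    reasoning = complete(reasoning_prompt)
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{answer_cue}"
    return complete(answer_prompt).strip()
```

Any model client can be plugged in as `complete`, as long as it maps a prompt string to a completion string.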
We've released the weights (1.3B and 2.7B) of our replication of GPT-3 🥳 Using the updated Colab notebook in the repo you should be able to finetune the models on your own data as well as run inference. github.com/EleutherAI/gpt-ne…
30
517
2,349
Pattern of my 20s: “This idea’s great, but others are better positioned. I’m late and lack domain expertise. I’ll find something new.” → Later: someone even less qualified makes it work.
37
89
2,036
93,928
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation proj: humanaigc.github.io/animate-… abs: arxiv.org/abs/2311.17117
12
589
2,026
771,738
OpenAI presents: Competitive Programming with Large Reasoning Models - Competed live at IOI 2024 - o3 achieved gold - General-purpose o3 surpasses o1 w/ hand-crafted pipelines specialized for coding results
51
197
1,869
624,612
If you think you have regrets, here are mine: - I turned down an early invitation from Noam to join CharacterAI—and similarly from Igor to join XAI. - I stumbled through a coding interview for OpenAI when Wojciech asked if I wanted to work on GPT-4. - I was once trying to train a large image diffusion model on the LAION dataset for Emad (Stability) before Rombach. I switched projects because I wasn't patient enough for compute being delivered. I also couldn’t join StabilityAI due to my student visa. - I was so absorbed in research that I was very late to startup world, which led to my current struggle because many AI startup ideas have already been taken. I also never took full advantage of my network (esp. Twitter) for my career. It's honestly a goldmine that is beyond what I can handle. I may be good at spotting promising research ideas early, but when it comes to making career decisions—I’m pretty terrible at that lol
93
31
1,673
220,385
Google presents Transformer 2 - Unifies attention, recurrence, retrieval, FFN into a single module - Performs on par with Transformer w/ 20x better compute efficiency - Efficiently processes 100M context length proj: tinyurl.com/59upc7v6 abs: tinyurl.com/3nw25nz2
52
248
1,558
274,797
Gradients without Backpropagation Presents a method to compute gradients based solely on the directional derivative that one can compute exactly and efficiently via the forward mode, entirely eliminating the need for backpropagation in gradient descent. arxiv.org/abs/2202.08587
22
211
1,442
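The idea in the post above, a gradient estimate built only from a directional derivative, might look roughly like this. The sketch assumes the perturbation direction is drawn from a standard normal distribution, and the toy function and its hand-written directional derivative `toy_jvp` are illustrative stand-ins for what forward-mode autodiff would compute.

```python
import random


def forward_gradient(jvp, x, rng):
    """One forward-gradient sample: (directional derivative along v) * v."""
    v = [rng.gauss(0.0, 1.0) for _ in x]  # random direction (assumed N(0, I))
    directional = jvp(x, v)               # scalar: grad f(x) . v
    return [directional * vi for vi in v]


def toy_jvp(x, v):
    """Directional derivative of the toy f(x) = x0^2 + 3*x1 (hand-derived)."""
    return 2.0 * x[0] * v[0] + 3.0 * v[1]


def average_forward_gradient(jvp, x, n_samples, seed=0):
    """Average many samples; the mean approaches the true gradient."""
    rng = random.Random(seed)
    total = [0.0 for _ in x]
    for _ in range(n_samples):
        sample = forward_gradient(jvp, x, rng)
        total = [t + s for t, s in zip(total, sample)]
    return [t / n_samples for t in total]
```

A single sample is noisy, so the averaging helper is there only to show that the estimate is unbiased. In an optimizer each step would use one sample.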
Toolformer: Language Models Can Teach Themselves to Use Tools Presents Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. abs: arxiv.org/abs/2302.04761
23
230
1,331
206,938
Apple presents: Distillation Scaling Laws Presents a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher
16
199
1,327
134,781
LongNet: Scaling Transformers to 1,000,000,000 Tokens Presents LONGNET, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences abs: arxiv.org/abs/2307.02486 repo: github.com/microsoft/torchsc…
28
262
1,268
752,797
On behalf of arXiv CV dataset and evaluation committee, I'd like to announce that we will ask authors to discontinue the use of the Lena Forsén image. Instead, we encourage the use of the image of Frieren eating a gigantic hamburger. Thank you for your understanding.
34
242
1,267
138,215
Stanford presents: s1: Simple test-time scaling - Seeks the simplest approach to achieve test-time scaling and strong reasoning performance - Exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24) - Model, data, and code are open-source
12
157
1,280
135,633
Millennials vs. Gen Z
20
143
1,219
162,775
The False Promise of Imitating Proprietary LLMs Open-sourced LLMs are adept at mimicking ChatGPT’s style but not its factuality. There exists a substantial capabilities gap, which requires better base LM. arxiv.org/abs/2305.15717
47
245
1,241
728,110
Nvidia just opensourced Describe Anything! It can generate detailed descriptions for user-specified regions in images and videos, marked by points, boxes, scribbles, or masks
11
145
1,211
76,564
GPT series still hasn't shown any signs of saturation 😲
45
72
1,174
119,590
Dreamix: Video Diffusion Models are General Video Editors proj: dreamix-video-editing.github… abs: arxiv.org/abs/2302.01329
15
226
1,143
207,610
Google presents Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention 1B model that was fine-tuned on up to 5K sequence length passkey instances solves the 1M length problem arxiv.org/abs/2404.07143
26
239
1,131
205,677
GPT-3: Money is All You Need
6
112
1,088
Ben and I have released GPT-J, 6B JAX-based Transformer LM 🥳 - Performs on par with 6.7B GPT-3 - Performs better and decodes faster than GPT-Neo - repo + colab + free web demo article: bit.ly/2TH8yl0 repo: bit.ly/3eszQ6C
26
232
1,043
A Survey of Large Language Models arxiv.org/abs/2303.18223
10
207
993
194,387
MusicLM: Generating Music From Text Presents MusicLM, a model for generating high-fidelity music from text. MusicLM generates music at 24 kHz that remains consistent over several minutes. proj: google-research.github.io/se… abs: arxiv.org/abs/2301.11325 data: kaggle.com/datasets/googleai…
19
214
963
163,084
Agent Laboratory: Using LLM Agents as Research Assistants Enables you to focus on ideation and critical thinking while automating repetitive and time-intensive tasks like coding and documentation
26
197
985
88,981
Imagic: Text-Based Real Image Editing with Diffusion Models Demonstrates, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image using Imagen. arxiv.org/abs/2210.09276
12
179
969
Retentive Network: A Successor to Transformer for Large Language Models Proposes RetNet as a foundation architecture for LLMs, simultaneously achieving training parallelism, low-cost inference, and good performance. arxiv.org/abs/2307.08621
22
208
950
199,868
Are Emergent Abilities of Large Language Models a Mirage? Presents an alternative explanation for emergent abilities: one can choose a metric which leads to the inference of an emergent ability or another metric which does not. arxiv.org/abs/2304.15004
25
176
915
626,767
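A toy calculation can illustrate the metric argument in the post above. The sketch assumes each answer token is correct independently with the same probability, which is a simplification made here, not a claim from the paper. Under that assumption, exact-match accuracy on an answer of length L is the per-token accuracy raised to the power L.

```python
def exact_match_accuracy(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that all tokens are right, assuming independent tokens."""
    return per_token_accuracy ** answer_length


def compare_metrics(per_token_accuracies, answer_length):
    """Pair each smooth per-token score with its all-or-nothing counterpart."""
    return [
        (p, exact_match_accuracy(p, answer_length))
        for p in per_token_accuracies
    ]


if __name__ == "__main__":
    # Per-token accuracy rises in even steps; exact match stays near zero
    # for most of the range and then climbs steeply.
    for p, em in compare_metrics([0.5, 0.6, 0.7, 0.8, 0.9, 0.99], 10):
        print(f"per-token={p:.2f}  exact-match={em:.4f}")
```

The same underlying improvement looks gradual under one metric and abrupt under the other, which is the alternative explanation the post describes.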
CogVLM: Visual Expert for Pretrained Language Models Presents CogVLM, a powerful open-source visual language foundation model that achieves SotA perf on 10 classic cross-modal benchmarks repo: github.com/THUDM/CogVLM abs: arxiv.org/abs/2311.03079
17
163
909
263,246
Google presents: Stealing Part of a Production Language Model - Extracts the projection matrix of OpenAI’s ada and babbage LMs for <$20 - Confirms that their hidden dim is 1024 and 2048, respectively - Also recovers the exact hidden dim size of gpt-3.5-turbo arxiv.org/abs/2403.06634
13
140
922
244,153
Why are some people still using 4o when we have o3?
403
11
911
252,742
Apple presents MM1, a family of multimodal LLMs up to 30B parameters, that are SoTA in pre-training metrics and perform competitively after fine-tuning arxiv.org/abs/2403.09611
15
176
910
222,109
Microsoft presents: Optimizing Large Language Model Training Using FP4 Quantization - Presents the first FP4 training framework for LLMs - Achieves accuracy comparable to BF16 with minimal degradation - Scales effectively to 13B LLMs trained on 100B tokens
20
121
920
220,850
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale When pre-trained and transferred to CV tasks, Vision Transformer attains excellent results compared to SOTA CNNs while requiring much fewer computational resources to train. openreview.net/forum?id=Yicb…
5
197
916
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training Shows that: - RL generalizes in rule-based envs, esp. when trained with an outcome-based reward - SFT tends to memorize the training data and struggles to generalize OOD
11
143
920
76,349
Think before you speak: Training Language Models With Pause Tokens - Performing training and inference on LMs with a learnable pause token appended to the input prefix - Gains on 8 tasks, e.g., +18% on SQuAD arxiv.org/abs/2310.02226
16
165
890
369,951
Before we can get to "God-like AI" we'll need to get through "Dog-like AI".
14
49
855
123,363
Track Anything: Segment Anything Meets Videos repo: github.com/gaomingqi/Track-A… abs: arxiv.org/abs/2304.11968
15
160
862
125,698
Meta presents Better & Faster Large Language Models via Multi-token Prediction - training language models to predict multiple future tokens at once results in higher sample efficiency - up to 3x faster at inference arxiv.org/abs/2404.19737
12
125
866
182,602
Life of a paper:
1. Appears on arXiv
1.001. @ak92501 and I tweet
1.002. @lucidrains makes a repo
2. The author tweets
3. Appears on ML subreddit
4. @hardmaru tweets
5. @ykilcher makes a video
Aleph-0. Rejected by reviewers for "lack of novelty"
0. Conceived by Jurgen in 90s
13
87
862
NVIDIA and CMU present ASAP, which enables highly agile motions that were previously difficult to achieve! @Cristiano Siuuuuuuu!
32
116
842
104,040
Actually, gradient descent can be seen as attention that applies beyond the model's context length! Let me explain why 🧵 👇 (1/N) Ref: arxiv.org/abs/2202.05798 arxiv.org/abs/2212.10559
10
131
831
171,113
My first blog post was released 🥳 I have aggregated ~20 notable recent ML papers, esp. from ICLR 2021, with summaries, visualizations and my comments! The development in each field is summarized, and the future trends are speculated. arankomatsuzaki.wordpress.co…
10
161
832
Damn, the AI is hitting a wall again.
28
40
817
39,679
Anyone interested in working on an open-source project for Alpha Evolve with us?
205
49
822
181,336
o3 achieved 99.8th percentile on Codeforces
32
68
773
454,474
Scaling Transformer to 1M tokens and beyond with RMT By leveraging the Recurrent Memory Transformer architecture, they have successfully increased the model’s effective context length to an unprecedented two million tokens. arxiv.org/abs/2304.11062
22
162
780
209,642
Microsoft just released Phi-3 - phi-3-mini: 3.8B model trained on 3.3T tokens rivals Mixtral 8x7B and GPT-3.5 - phi-3-medium: 14B model trained on 4.8T tokens w/ 78% on MMLU and 8.9 on MT-bench arxiv.org/abs/2404.14219
27
135
761
339,321
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback Shows that RLAIF can produce comparable improvements to RLHF without depending on human annotators arxiv.org/abs/2309.00267
9
164
761
235,314
ByteDance presents MagicVideo-V2 Outperforms SotA video models such as Pika 1.0, SVD-XT according to human evaluation abs: arxiv.org/abs/2401.04468 proj: magicvideov2.github.io/
28
155
746
228,118
Fine-Tuning Language Models with Just Forward Passes Proposes a memory-efficient zeroth-order optimizer, MeZO, adapting the classical ZO-SGD to operate in place, thereby fine-tuning LMs with the same memory footprint as inference. - With a single A100 80GB GPU, MeZO can train a 30-billion parameter model - MeZO significantly outperforms in-context learning and linear probing - MeZO achieves comparable performance to fine-tuning with backprop across multiple tasks, with up to 12x memory reduction - MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1) repo: github.com/princeton-nlp/MeZ… abs: arxiv.org/abs/2305.17333
13
163
735
139,339
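The in-place zeroth-order update described in the post above can be sketched on a plain list of parameters. This is a simplified illustration, not the authors' implementation: the function names, the toy loss, and the hyperparameters are assumptions. The perturbation is regenerated from a seed each time instead of being stored, so the step needs no extra memory for the random direction.

```python
import random


def perturb_in_place(params, scale, seed):
    """Add scale * z to params, where z is regenerated from the seed."""
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] += scale * rng.gauss(0.0, 1.0)


def zo_step(params, loss_fn, lr, eps, seed):
    """One zeroth-order SGD step using two forward passes and no backprop."""
    perturb_in_place(params, eps, seed)          # theta + eps * z
    loss_plus = loss_fn(params)
    perturb_in_place(params, -2.0 * eps, seed)   # theta - eps * z
    loss_minus = loss_fn(params)
    perturb_in_place(params, eps, seed)          # back to theta (up to rounding)
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    perturb_in_place(params, -lr * projected_grad, seed)  # theta -= lr * g * z
    return projected_grad


def toy_loss(params):
    """Illustrative objective: sum of squares, minimized at zero."""
    return sum(p * p for p in params)
```

Because only scalar losses are compared, `loss_fn` does not have to be differentiable, which matches the post's point about optimizing metrics such as accuracy or F1.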
Detecting Pretraining Data from Large Language Models We propose Min-K% Prob, a simple and effective method that can detect whether an LLM was pretrained on the provided text without knowing the pretraining data. proj: swj0419.github.io/detect-pre… abs: arxiv.org/abs/2310.16789
5
156
733
77,982
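As a rough sketch of the scoring idea named in the post above: take the per-token log-probabilities a model assigns to a text, keep the lowest k percent, and average them. The function name and the default `k` are assumptions for illustration, and the log-probabilities are expected to come from whatever model is being probed.

```python
import math
from typing import Sequence


def min_k_percent_score(token_log_probs: Sequence[float], k: float = 0.2) -> float:
    """Average log-probability of the k fraction of least likely tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    n_lowest = max(1, math.ceil(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n_lowest]
    return sum(lowest) / n_lowest
```

A higher score means even the least likely tokens were fairly probable, which hints that the text may have been seen in pretraining. Any decision threshold would need to be calibrated on texts with known status.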
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning With only four lines of code modification, the proposed method can effortlessly extend existing LLMs’ context window without any fine-tuning. arxiv.org/abs/2401.01325
18
141
737
78,990
Here's the leaderboard of prompts to add to GPT-3. Can you guys come up with anything better?
Large Language Models are Zero-Shot Reasoners Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3. arxiv.org/abs/2205.11916
84
122
733
Gorilla: Large Language Model Connected with Massive APIs Releases Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. proj: gorilla.cs.berkeley.edu/ abs: arxiv.org/abs/2305.15334
15
179
714
255,655
I spend roughly 30 minutes every weekday skimming arXiv papers and tweeting them as bookmarks for myself. Now I've got 30k brilliant minds following me. I may not be the most popular ML account, but this seems like a great return on investment.
9
23
716
140,728
ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks ChatGPT outperforms crowd-workers for several annotation tasks, including relevance, stance, topics, and frames detection, w/ 20x less cost. arxiv.org/abs/2303.15056
19
141
708
254,055
Deepseek-V3-Base was just opensourced! - 685B MoE w/ 256 experts topk=8 with sigmoid routing - Outperforms Sonnet 3.5 on Aider benchmark huggingface.co/deepseek-ai/D…
10
123
698
94,036
KAN: Kolmogorov–Arnold Networks Proposes an alternative to MLP that outperforms in terms of accuracy and interpretability arxiv.org/abs/2404.19756
8
134
686
99,050
The Jamba paper was just dropped arxiv.org/abs/2403.19887
5
126
691
108,375
InstructPix2Pix: Learning to Follow Image Editing Instructions Proposes a method for editing images from human instructions, in the forward pass w/o requiring per-example fine-tuning or inversion, in a matter of seconds. timothybrooks.com/instruct-p… arxiv.org/abs/2211.09800
22
124
673
Solving Mixed Integer Programs Using Neural Networks The first learning-based method to substantially outperform SCIP (a mixed integer program solver) on various large-scale real-world application datasets. arxiv.org/pdf/2012.13349
5
144
658
Google presents Best Practices and Lessons Learned on Synthetic Data for Language Models Provides an overview of synthetic data research, discussing its applications, challenges, and future directions arxiv.org/abs/2404.07503
6
132
683
158,129
Scaling Synthetic Data Creation with 1,000,000,000 Personas - Presents a collection of 1B diverse personas automatically curated from web data - Massive gains on MATH: 49.6 ->64.9 repo: github.com/tencent-ailab/per… abs: arxiv.org/abs/2406.20094
14
112
674
142,578
StyleGAN + CLIP "Satoshi Nakamoto"
9
58
669
Sparks of Artificial General Intelligence: Early experiments with GPT-4 Reports on their investigation of an early version of GPT-4, when it was still in active development by OpenAI. arxiv.org/abs/2303.12712
15
114
662
302,290
Lost in the Middle: How Language Models Use Long Contexts Finds that performance of LMs is often highest when relevant info occurs at the beginning or end of the input context, and significantly degrades otherwise arxiv.org/abs/2307.03172
18
123
669
146,422
MambaByte: Token-free Selective State Space Model Outperforms SotA subword Transformers while being tokenizer agnostic and achieving fast inference thanks to linear inference cost arxiv.org/abs/2401.13660
15
120
636
141,396
CoLT5: Faster Long-Range Transformers with Conditional Computation Achieves: - stronger performance than LongT5 with much faster training and inference - SOTA on the SCROLLS benchmark - strong gains up to 64k input length arxiv.org/abs/2303.09752
6
108
636
210,199
Google presents To Believe or Not to Believe Your LLM arxiv.org/abs/2406.02543
8
129
644
76,011
Extracting Training Data from Diffusion Models Extracts over a thousand training examples from SotA models (e.g. Stable Diffusion), ranging from photographs of individual people to trademarked company logos. arxiv.org/abs/2301.13188
14
127
631
162,719
Google presents: Matryoshka Quantization Presents a novel multi-scale quantization technique that allows training and maintaining just one model, which can then be served at different precision levels
11
107
637
54,877
SynthLabs + Stanford present: Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought Proposes Meta-CoT, which extends CoT by explicitly modeling the underlying reasoning required to arrive at a particular CoT
17
134
641
82,207
Large Language Models as Tool Makers Attempts to remove the dependency on external tools by proposing a closed-loop framework, where LLMs create their own reusable tools for problem-solving. arxiv.org/abs/2305.17126
12
117
624
99,547
RL’s Razor: On-policy RL forgets less than SFT. Even at matched accuracy, RL shows less catastrophic forgetting Key factor: RL’s on-policy updates bias toward KL-minimal solutions Theory + LLM & toy experiments confirm RL stays closer to base model
9
97
628
111,919
Microsoft presents You Only Cache Once: Decoder-Decoder Architectures for Language Models Substantially reduces GPU memory demands, yet retains global attention capability repo: github.com/microsoft/unilm/t… abs: arxiv.org/abs/2405.05254
19
131
622
68,487
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs ToolLLaMA exhibits comparable performance to ChatGPT repo: github.com/OpenBMB/ToolBench abs: arxiv.org/abs/2307.16789
11
149
607
88,949
Big day for AI agents! Tongyi Lab (@Ali_TongyiLab) just dropped half a dozen new papers, most focused on Deep Research agents. I’ll walk you through the highlights in this thread. (1/N)
18
93
613
68,626
Google presents Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? Highlights the risk in introducing new factual knowledge through fine-tuning, which leads to hallucinations arxiv.org/abs/2405.05904
10
118
610
92,781
LIMO: Less is More for Reasoning Achieves 57.1% on AIME and 94.8% on MATH w/ only 817 training samples, i.e., only 1% of the training data required by previous approaches
20
73
601
118,718
FP8-LM: Training FP8 Large Language Models Trains GPT-175B with H100s 64% faster than BF16 without any performance degradation repo: github.com/Azure/MS-AMP abs: arxiv.org/abs/2310.18313
3
129
591
93,363
Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models Presents a comprehensive dataset of 4,550 questions and solutions from all MIT EECS courses required for obtaining a degree arxiv.org/abs/2306.08997
19
113
564
1,872,891
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback Overviews techniques to understand, improve, and complement RLHF in practice arxiv.org/abs/2307.15217
5
140
591
80,198
More GPUs are All I Need
11
48
593
Google presents Mixture-of-Depths Dynamically allocating compute in transformer-based language models Same performance w/ a fraction of the FLOPs per forward pass arxiv.org/abs/2404.02258
6
86
587
216,609
RGB no more: Minimally-decoded JPEG Vision Transformers Achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to the RGB counterpart. arxiv.org/abs/2211.16421
11
79
587
Diffusion Models Beat GANs on Image Synthesis Achieves 3.85 FID on ImageNet 512×512 and matches BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. arxiv.org/abs/2105.05233
8
106
581
AgentBench: Evaluating LLMs as Agents Presents a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM as Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. repo: github.com/THUDM/AgentBench abs: arxiv.org/abs/2308.03688
10
138
591
115,018
Apple presents AToken: A unified visual tokenizer • First tokenizer unifying images, videos & 3D • Shared 4D latent space (preserves both reconstruction & semantics) • Strong across gen & understanding tasks (ImageNet 82.2%, MSRVTT 32.6%, 3D acc 90.9%)
4
76
589
105,284
Muse: Text-To-Image Generation via Masked Generative Transformers Presents Muse, a text-to-image Transformer model that achieves SotA image generation perf while being far more efficient than diffusion or AR models. proj: muse-model.github.io/ abs: arxiv.org/abs/2301.00704
15
130
567
117,214
“Let’s think step by step” is all you need
6
48
568