Google presents an AI system to write expert-level scientific software.
Using LLMs + tree search, it invented novel methods in bioinformatics, epidemiology, geospatial analysis & more, often surpassing human SOTA. (1/4)
When you generate images with VQGAN + CLIP, the image quality dramatically improves if you add "unreal engine" to your prompt.
People are now calling this "unreal engine trick" lol
e.g. "the angel of air. unreal engine"
BloombergGPT: A Large Language Model for Finance
Presents BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data.
arxiv.org/abs/2303.17564
Large Language Models are Zero-Shot Reasoners
Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3.
arxiv.org/abs/2205.11916
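A minimal sketch of the prompt change described above. The `Q:`/`A:` template and the helper names are assumptions for illustration, not necessarily the paper's exact setup; only the trigger phrase comes from the post.

```python
# Hypothetical helpers: the Q/A template is an assumed format for illustration.
TRIGGER = "Let's think step by step."


def plain_prompt(question: str) -> str:
    """Baseline zero-shot prompt: the model answers directly."""
    return f"Q: {question.strip()}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    """Same prompt, but the answer slot starts with the reasoning trigger."""
    return f"Q: {question.strip()}\nA: {TRIGGER}"


if __name__ == "__main__":
    q = "A shop has 3 boxes with 4 pens each. How many pens in total?"
    print(plain_prompt(q))
    print("---")
    print(zero_shot_cot_prompt(q))
```

The only difference between the two prompts is the trigger phrase, so the comparison isolates the effect the post reports.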
We've released the weights (1.3B and 2.7B) of our replication of GPT-3 🥳
Using the updated Colab notebook in the repo you should be able to finetune the models on your own data as well as run inference.
github.com/EleutherAI/gpt-ne…
Pattern of my 20s:
“This idea’s great, but others are better positioned. I’m late and lack domain expertise. I’ll find something new.”
→ Later: someone even less qualified makes it work.
If you think you have regrets, here are mine:
- I turned down an early invitation from Noam to join CharacterAI—and similarly from Igor to join XAI.
- I stumbled through a coding interview for OpenAI when Wojciech asked if I wanted to work on GPT-4.
- I was once trying to train a large image diffusion model on the LAION dataset for Emad (Stability) before Rombach. I switched projects because I wasn't patient enough for compute being delivered. I also couldn’t join StabilityAI due to my student visa.
- I was so absorbed in research that I was very late to startup world, which led to my current struggle because many AI startup ideas have already been taken. I also never took full advantage of my network (esp. Twitter) for my career. It's honestly a goldmine that is beyond what I can handle.
I may be good at spotting promising research ideas early, but when it comes to making career decisions—I’m pretty terrible at that lol
Google presents Transformer 2
- Unifies attention, recurrence, retrieval, FFN into a single module
- Performs on par with Transformer w/ 20x better compute efficiency
- Efficiently processes 100M context length
proj: tinyurl.com/59upc7v6
abs: tinyurl.com/3nw25nz2
Gradients without Backpropagation
Presents a method to compute gradients based solely on the directional derivative that one can compute exactly and efficiently via the forward mode, entirely eliminating the need for backpropagation in gradient descent.
arxiv.org/abs/2202.08587
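A rough sketch of the idea, assuming the estimator is g = (∇f · v) v with a random direction v ~ N(0, I). Forward mode is implemented here with a tiny hand-rolled dual-number class; the `Dual` class, the toy `loss`, and the step size are all illustrative assumptions, not the authors' code.

```python
import random


class Dual:
    """Dual number: `val` is the value, `tan` the directional derivative."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, self.tan + o.tan)

    __radd__ = __add__

    def __sub__(self, other):
        o = self._lift(other)
        return Dual(self.val - o.val, self.tan - o.tan)

    def __mul__(self, other):
        o = self._lift(other)
        # Product rule carries the tangent forward.
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)

    __rmul__ = __mul__


def forward_gradient(f, theta, rng):
    """Return (f(theta), g) with g = (grad f . v) v, using one forward pass."""
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    out = f([Dual(t, vi) for t, vi in zip(theta, v)])
    return out.val, [out.tan * vi for vi in v]


def loss(x):
    # Toy quadratic with its minimum at (3, -1); assumed for the demo only.
    a = x[0] - 3.0
    b = x[1] + 1.0
    return a * a + 2.0 * (b * b)


def descend(steps=2000, lr=0.02, seed=0):
    """Plain gradient descent, but driven by forward gradients only."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        _, g = forward_gradient(loss, theta, rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


if __name__ == "__main__":
    print(descend())
```

No backward pass or stored activations appear anywhere: each step is one forward evaluation that carries the tangent along with the value.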
Toolformer: Language Models Can Teach Themselves to Use Tools
Presents Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.
abs: arxiv.org/abs/2302.04761
Apple presents:
Distillation Scaling Laws
Presents a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher
LongNet: Scaling Transformers to 1,000,000,000 Tokens
Presents LONGNET, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences
abs: arxiv.org/abs/2307.02486
repo: github.com/microsoft/torchsc…
On behalf of arXiv CV dataset and evaluation committee, I'd like to announce that we will ask authors to discontinue the use of the Lena Forsén image.
Instead, we encourage the use of the image of Frieren eating a gigantic hamburger.
Thank you for your understanding.
Stanford presents:
s1: Simple test-time scaling
- Seeks the simplest approach to achieve test-time scaling and strong reasoning performance
- Exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24)
- Model, data, and code are open-source
The False Promise of Imitating Proprietary LLMs
Open-sourced LLMs are adept at mimicking ChatGPT’s style but not its factuality. There exists a substantial capabilities gap, which requires a better base LM.
arxiv.org/abs/2305.15717
Nvidia just opensourced Describe Anything!
It can generate detailed descriptions for user-specified regions in images and videos, marked by points, boxes, scribbles, or masks
Google presents Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
1B model that was fine-tuned on up to 5K sequence length passkey instances solves the 1M length problem
arxiv.org/abs/2404.07143
Ben and I have released GPT-J, 6B JAX-based Transformer LM 🥳
- Performs on par with 6.7B GPT-3
- Performs better and decodes faster than GPT-Neo
- repo + colab + free web demo
article: bit.ly/2TH8yl0
repo: bit.ly/3eszQ6C
Agent Laboratory: Using LLM Agents as Research Assistants
Enables you to focus on ideation and critical thinking while automating repetitive and time-intensive tasks like coding and documentation
Imagic: Text-Based Real Image Editing with Diffusion Models
Demonstrates, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image using Imagen.
arxiv.org/abs/2210.09276
Retentive Network: A Successor to Transformer for Large Language Models
Proposes RetNet as a foundation architecture for LLMs, simultaneously achieving training parallelism, low-cost inference, and good performance.
arxiv.org/abs/2307.08621
Are Emergent Abilities of Large Language Models a Mirage?
Presents an alternative explanation for emergent abilities: one can choose a metric which leads to the inference of an emergent ability or another metric which does not.
arxiv.org/abs/2304.15004
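A toy illustration of how metric choice can change the picture. The per-token accuracies, the answer length, and the independence assumption are all made up for the sketch; none of them are results from the paper.

```python
def exact_match(per_token_acc: float, answer_len: int) -> float:
    """Chance that every token is right, assuming independent token errors."""
    return per_token_acc ** answer_len


if __name__ == "__main__":
    # Hypothetical per-token accuracies for five model sizes (made-up numbers).
    per_token = [0.50, 0.60, 0.70, 0.80, 0.90]
    answer_len = 10
    for p in per_token:
        em = exact_match(p, answer_len)
        print(f"per-token acc = {p:.2f}   exact match = {em:.4f}")
```

Under these assumptions the per-token metric rises in even steps, while the all-or-nothing metric stays near zero and then climbs steeply, even though the underlying model improves smoothly.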
CogVLM: Visual Expert for Pretrained Language Models
Presents CogVLM, a powerful open-source visual language foundation model that achieves SotA perf on 10 classic cross-modal benchmarks
repo: github.com/THUDM/CogVLM
abs: arxiv.org/abs/2311.03079
Google presents:
Stealing Part of a Production Language Model
- Extracts the projection matrix of OpenAI’s ada and babbage LMs for <$20
- Confirms that their hidden dim is 1024 and 2048, respectively
- Also recovers the exact hidden dim size of gpt-3.5-turbo
arxiv.org/abs/2403.06634
Apple presents MM1, a family of multimodal LLMs up to 30B parameters, that are SoTA in pre-training metrics and perform competitively after fine-tuning
arxiv.org/abs/2403.09611
Microsoft presents:
Optimizing Large Language Model Training Using FP4 Quantization
- Presents the first FP4 training framework for LLMs
- Achieves accuracy comparable to BF16 with minimal degradation
- Scales effectively to 13B LLMs trained on 100B tokens
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
When pre-trained and transferred to CV tasks, Vision Transformer attains excellent results compared to SOTA CNNs while requiring much fewer computational resources to train.
openreview.net/forum?id=Yicb…
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Shows that:
- RL generalizes in rule-based envs, esp. when trained with an outcome-based reward
- SFT tends to memorize the training data and struggles to generalize OOD
Think before you speak: Training Language Models With Pause Tokens
- Performing training and inference on LMs with a learnable pause token appended to the input prefix
- Gains on 8 tasks, e.g., +18% on SQuAD
arxiv.org/abs/2310.02226
Meta presents Better & Faster Large Language Models via Multi-token Prediction
- training language models to predict multiple future tokens at once results in higher sample efficiency
- up to 3x faster at inference
arxiv.org/abs/2404.19737
Life of a paper:
1. Appears on arXiv
1.001. @ak92501 and I tweet
1.002. @lucidrains makes a repo
2. The author tweets
3. Appears on ML subreddit
4. @hardmaru tweets
5. @ykilcher makes a video
Aleph-0. Rejected by reviewers for "lack of novelty"
0. Conceived by Jurgen in 90s
My first blog post was released 🥳
I have aggregated ~20 notable recent ML papers, esp. from ICLR 2021, with summaries, visualizations and my comments!
The development in each field is summarized, and the future trends are speculated.
arankomatsuzaki.wordpress.co…
Scaling Transformer to 1M tokens and beyond with RMT
By leveraging the Recurrent Memory Transformer architecture, they have successfully increased the model’s effective context length to an unprecedented two million tokens.
arxiv.org/abs/2304.11062
Microsoft just released Phi-3
- phi-3-mini: 3.8B model trained on 3.3T tokens rivals Mixtral 8x7B and GPT-3.5
- phi-3-medium: 14B model trained on 4.8T tokens w/ 78% on MMLU and 8.9 on MT-bench
arxiv.org/abs/2404.14219
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Shows that RLAIF can produce comparable improvements to RLHF without depending on human annotators
arxiv.org/abs/2309.00267
Fine-Tuning Language Models with Just Forward Passes
Proposes a memory-efficient zeroth-order optimizer, MeZO, adapting the classical ZO-SGD to operate inplace, thereby fine-tuning LMs with the same memory footprint as inference.
- With a single A100 80GB GPU, MeZO can train a 30-billion parameter model
- MeZO significantly outperforms in-context learning and linear probing
- MeZO achieves comparable performance to fine-tuning with backprop across multiple tasks, with up to 12x memory reduction
- MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1)
repo: github.com/princeton-nlp/MeZ…
abs: arxiv.org/abs/2305.17333
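A minimal sketch of an in-place zeroth-order step in the spirit of MeZO, assuming the perturbation is regenerated from a random seed instead of being stored. The function names, the toy loss, and the hyperparameters are illustrative assumptions, not the repo's API.

```python
import random


def perturb(params, scale, seed):
    """Add scale * z to params in place; z is regenerated from the seed."""
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] += scale * rng.gauss(0.0, 1.0)


def mezo_step(params, loss_fn, lr=0.01, eps=1e-3, seed=0):
    """One zeroth-order SGD step: two forward passes, no stored gradient or z."""
    perturb(params, +eps, seed)            # theta + eps * z
    loss_plus = loss_fn(params)
    perturb(params, -2.0 * eps, seed)      # theta - eps * z
    loss_minus = loss_fn(params)
    perturb(params, +eps, seed)            # back to theta
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    perturb(params, -lr * projected_grad, seed)  # theta -= lr * projected_grad * z
    return 0.5 * (loss_plus + loss_minus)


def toy_loss(p):
    # Toy quadratic with minimum at (1, -2); stands in for a model forward pass.
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


def train(steps=3000, lr=0.05):
    params = [0.0, 0.0]
    for step in range(steps):
        mezo_step(params, toy_loss, lr=lr, seed=step)
    return params


if __name__ == "__main__":
    print(train())
```

Because `loss_fn` is only ever evaluated, never differentiated, the same loop would also accept a non-differentiable objective such as accuracy or F1.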
Detecting Pretraining Data from Large Language Models
We propose Min-K% Prob, a simple and effective method that can detect whether an LLM was pretrained on the provided text without knowing the pretraining data.
proj: swj0419.github.io/detect-pre…
abs: arxiv.org/abs/2310.16789
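A rough sketch of the score, assuming it is the average log-probability of the k% least likely tokens in the text. The per-token log-probabilities would come from the model under test; the values below, the default `k`, and the `threshold` are hypothetical.

```python
def min_k_percent_prob(token_logprobs, k=0.2):
    """Average log-prob over the fraction k of tokens with the lowest log-prob."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def likely_seen(token_logprobs, k=0.2, threshold=-5.0):
    """Flag text as likely seen in pretraining when the score clears a threshold.

    The threshold is a hypothetical value; in practice it would be tuned.
    """
    return min_k_percent_prob(token_logprobs, k) > threshold


if __name__ == "__main__":
    # Made-up log-probs: unseen text tends to contain a few very unlikely tokens.
    seen_like = [-0.3, -0.8, -1.1, -0.5, -2.0, -0.9]
    unseen_like = [-0.3, -0.8, -9.5, -0.5, -12.0, -0.9]
    print(likely_seen(seen_like), likely_seen(unseen_like))
```

The intuition is that text seen during pretraining rarely contains tokens the model finds very surprising, so its lowest-probability tokens still score relatively high.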
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
With only four lines of code modification, the proposed method can effortlessly extend existing LLMs’ context window without any fine-tuning.
arxiv.org/abs/2401.01325
Gorilla: Large Language Model Connected with Massive APIs
Releases Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls.
proj: gorilla.cs.berkeley.edu/
abs: arxiv.org/abs/2305.15334
I spend about 30 minutes roughly every weekday skimming arXiv papers and tweeting them as bookmarks for myself.
Now I've got 30k brilliant minds following me. I may not be the most popular ML account, but this seems like a great return on investment.
ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks
ChatGPT outperforms crowd-workers for several annotation tasks, including relevance, stance, topics, and frames detection, w/ 20x less cost.
arxiv.org/abs/2303.15056
InstructPix2Pix: Learning to Follow Image Editing Instructions
Proposes a method for editing images from human instructions, in the forward pass w/o requiring per-example fine-tuning or inversion, in a matter of seconds.
timothybrooks.com/instruct-p…
arxiv.org/abs/2211.09800
Solving Mixed Integer Programs Using Neural Networks
The first learning-based method to substantially outperform SCIP (a mixed integer program solver) on various large-scale real-world application datasets.
arxiv.org/pdf/2012.13349
Google presents Best Practices and Lessons Learned on Synthetic Data for Language Models
Provides an overview of synthetic data research, discussing its applications, challenges, and future directions
arxiv.org/abs/2404.07503
Scaling Synthetic Data Creation with 1,000,000,000 Personas
- Presents a collection of 1B diverse personas automatically curated from web data
- Massive gains on MATH: 49.6 -> 64.9
repo: github.com/tencent-ailab/per…
abs: arxiv.org/abs/2406.20094
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Reports on their investigation of an early version of GPT-4, when it was still in active development by OpenAI.
arxiv.org/abs/2303.12712
Lost in the Middle: How Language Models Use Long Contexts
Finds that performance of LMs is often highest when relevant info occurs at the beginning or end of the input context, and significantly degrades otherwise
arxiv.org/abs/2307.03172
MambaByte: Token-free Selective State Space Model
Outperforms SotA subword Transformers while being tokenizer agnostic and achieving fast inference thanks to linear inference cost
arxiv.org/abs/2401.13660
CoLT5: Faster Long-Range Transformers with Conditional Computation
Achieves:
- stronger performance than LongT5 with much faster training and inference
- SOTA on the SCROLLS benchmark
- strong gains up to 64k input length
arxiv.org/abs/2303.09752
Extracting Training Data from Diffusion Models
Extracts over a thousand training examples from SotA models (e.g. Stable Diffusion), ranging from photographs of individual people to trademarked company logos.
arxiv.org/abs/2301.13188
Google presents:
Matryoshka Quantization
Presents a novel multi-scale quantization technique that allows training and maintaining just one model, which can then be served at different precision levels
SynthLabs + Stanford present:
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Proposes Meta-CoT, which extends CoT by explicitly modeling the underlying reasoning required to arrive at a particular CoT
Large Language Models as Tool Makers
Attempts to remove the dependency on external tools by proposing a closed-loop framework, where LLMs create their own reusable tools for problem-solving.
arxiv.org/abs/2305.17126
RL’s Razor: On-policy RL forgets less than SFT.
Even at matched accuracy, RL shows less catastrophic forgetting
Key factor: RL’s on-policy updates bias toward KL-minimal solutions
Theory + LLM & toy experiments confirm RL stays closer to base model
Microsoft presents You Only Cache Once: Decoder-Decoder Architectures for Language Models
Substantially reduces GPU memory demands, yet retains global attention capability
repo: github.com/microsoft/unilm/t…
abs: arxiv.org/abs/2405.05254
Big day for AI agents!
Tongyi Lab (@Ali_TongyiLab) just dropped half a dozen new papers, most focused on Deep Research agents.
I’ll walk you through the highlights in this thread. (1/N)
Google presents Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
Highlights the risk in introducing new factual knowledge through fine-tuning, which leads to hallucinations
arxiv.org/abs/2405.05904
LIMO: Less is More for Reasoning
Achieves 57.1% on AIME and 94.8% on MATH w/ only 817 training samples, i.e., only 1% of the training data required by previous approaches
Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models
Presents a comprehensive dataset of 4,550 questions and solutions from all MIT EECS courses required for obtaining a degree
arxiv.org/abs/2306.08997
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Overviews techniques to understand, improve, and complement RLHF in practice
arxiv.org/abs/2307.15217
Google presents Mixture-of-Depths
Dynamically allocating compute in transformer-based language models
Same performance w/ a fraction of the FLOPs per forward pass
arxiv.org/abs/2404.02258
RGB no more: Minimally-decoded JPEG Vision Transformers
Achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to the RGB counterpart.
arxiv.org/abs/2211.16421
Diffusion Models Beat GANs on Image Synthesis
Achieves 3.85 FID on ImageNet 512×512 and matches BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution.
arxiv.org/abs/2105.05233
AgentBench: Evaluating LLMs as Agents
Presents a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting.
repo: github.com/THUDM/AgentBench
abs: arxiv.org/abs/2308.03688
Apple presents AToken: A unified visual tokenizer
• First tokenizer unifying images, videos & 3D
• Shared 4D latent space (preserves both reconstruction & semantics)
• Strong across gen & understanding tasks (ImageNet 82.2%, MSRVTT 32.6%, 3D acc 90.9%)
Muse: Text-To-Image Generation via Masked Generative Transformers
Presents Muse, a text-to-image Transformer model that achieves SotA image generation perf while being far more efficient than diffusion or AR models.
proj: muse-model.github.io/
abs: arxiv.org/abs/2301.00704