Ethan · Feb 14, 2025 · 7:04 PM UTC

Ethan

Pinned Tweet

Ethan

@torchcompiled

14 Feb 2025

personally I feel like the inflection point was early 2022. The sweet spot where clip-guided diffusion was just taking off, forcing unconditional models to be conditional through a strange patchwork of CLIP evaluating slices of the canvas at a time. It was like improv, always trying to riff off mistakes and sitting right at the fine line between interesting and incoherent.

EPROM @eprombeats

14 Feb 2025

Image synthesis used to look so good. These are from 2021. I feel like this was an inflection point, and the space has metastasized into something abhorrent today (Grok, etc). Even with no legible representational forms, there was so much possibility in these images.

708

244,214

Ethan · Jul 11, 2024 · 2:54 PM UTC

Ethan

@torchcompiled

11 Jul 2024

should i sell this as a t-shirt

837

10,905

352,114

Ethan · Jul 2, 2024 · 2:42 PM UTC

Ethan

@torchcompiled

2 Jul 2024

really cool to see they got back together to write papers on diffusion models

424

7,258

390,431

Ethan · Jul 30, 2025 · 8:55 PM UTC

Ethan

@torchcompiled

30 Jul 2025

3D polygon diffusion is one of the coolest things I've ever seen

411

4,845

373,676

Ethan · Sep 28, 2024 · 3:17 AM UTC

Ethan

@torchcompiled

28 Sep 2024

AI majors be like: “damn I got a neural network due tomorrow”

3,274

96,487

Ethan · Sep 22, 2024 · 3:08 AM UTC

Ethan

@torchcompiled

22 Sep 2024

stanford cs149 has implementing flash attention as a homework assignment

112

3,037

370,159

Ethan · Jul 28, 2025 · 11:24 PM UTC

Ethan

@torchcompiled

28 Jul 2025

they did diffusion on * checks notes * a house

123

1,783

101,093

Ethan · Jun 14, 2024 · 5:55 AM UTC

Ethan

@torchcompiled

14 Jun 2024

so this is nuts, if you're cool with the high frequency details of an image being reinterpreted/stochastic, you can encode an image quite faithfully into 32 tokens... with a codebook size of 1024, as they use, this is just 320 bits. new upper bound for the information in an image unlocked

144

1,276

299,127
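The 320-bit figure in the tweet above is just token count times bits per codebook index. A minimal sketch of that arithmetic (the function name is made up for illustration, and it assumes every index is stored at a fixed width with no entropy coding):

```python
import math

def token_code_bits(num_tokens: int, codebook_size: int) -> float:
    """Bits needed to store num_tokens indices into a codebook of the given size."""
    # Each index needs log2(codebook_size) bits when stored at fixed width.
    return num_tokens * math.log2(codebook_size)

bits = token_code_bits(32, 1024)
print(bits)      # 320.0
print(bits / 8)  # 40.0 (bytes)
```

So under these assumptions the whole image code fits in 40 bytes.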

Ethan · Aug 8, 2023 · 2:31 PM UTC

Ethan

@torchcompiled

8 Aug 2023

Is this Sam Altman

1,072

104,761

Ethan · Dec 20, 2024 · 6:54 PM UTC

Ethan

@torchcompiled

20 Dec 2024

François Chollet

@fchollet

20 Dec 2024

Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks. It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task in compute ) and 87.5% in high-compute mode (thousands of $ per task). It's very expensive, but it's not just brute -- these capabilities are new territory and they demand serious scientific attention.

1,077

48,391

Ethan · Sep 22, 2024 · 3:21 AM UTC

Ethan

@torchcompiled

22 Sep 2024

if you are implementing flash attention for a grade and you'd rather be doing it for money, come join our team :) dms open

Ethan

@torchcompiled

22 Sep 2024

stanford cs149 has implementing flash attention as a homework assignment

1,032

140,625

Ethan · Dec 22, 2024 · 8:19 PM UTC

Ethan

@torchcompiled

22 Dec 2024

I owe you an apology RoPE, I was not familiar with your game.

919

70,864

Ethan · Sep 4, 2023 · 3:54 AM UTC

Ethan

@torchcompiled

4 Sep 2023

a picture of a dog i accidentally hit plt.plot instead of plt.imshow on

841

118,527

Ethan · Dec 21, 2024 · 8:05 PM UTC

Ethan

@torchcompiled

21 Dec 2024

very significant vibe shift once meta started publishing papers in this format

823

69,088

Ethan · Oct 3, 2024 · 3:06 PM UTC

Ethan

@torchcompiled

3 Oct 2024

I thought I knew GPUs fairly well, I evidently do not piped.video/watch?v=whPSD8sd…

784

51,511

Ethan · Jun 25, 2022 · 1:44 PM UTC

Ethan

@torchcompiled

25 Jun 2022

To all CLIP guided diffusion enthusiasts! I put together a guide covering both my and others' findings on parameter/prompt studies, as well as advice on how to get started. Check it out! notion.so/A-Traveler-s-Guide… #AIart #discodiffusion #ArtificialIntelligence

146

772

Ethan · May 11, 2024 · 7:40 PM UTC

Ethan

@torchcompiled

11 May 2024

You guys can’t be serious

675

63,638

Ethan · Nov 28, 2024 · 5:55 AM UTC

Ethan

@torchcompiled

28 Nov 2024

705

19,643

Ethan · Dec 28, 2024 · 8:13 PM UTC

Ethan

@torchcompiled

28 Dec 2024

so yeah, this is something I've always been confused about with softmax. your denominator keeps growing with sequence length, but logits of individual items are invariant to this. So attention sharpness ultimately depends on sequence length, becoming easier for noise to drown out contribution of a desired token we want to attend to in very long sequences

691

67,405
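A toy illustration of the softmax point above, assuming one "target" token with a fixed logit and every other token sitting at a fixed "noise" logit (both values are arbitrary choices for the sketch, not taken from the tweet):

```python
import math

def target_attention_weight(seq_len: int, target_logit: float = 5.0,
                            noise_logit: float = 0.0) -> float:
    """Softmax weight on a single target token among seq_len - 1 distractors."""
    target = math.exp(target_logit)
    # The denominator grows linearly with the number of distractor tokens,
    # while the target's own logit stays the same.
    noise = (seq_len - 1) * math.exp(noise_logit)
    return target / (target + noise)

for n in (16, 256, 4096, 65536):
    print(n, round(target_attention_weight(n), 4))
```

The target logit never changes, but its share of the attention shrinks as the sequence gets longer.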

Ethan · Apr 5, 2025 · 11:19 PM UTC

Ethan

@torchcompiled

5 Apr 2025

False. There has never been a lower barrier to entry to AI research. You can do things that are authentically worth sharing academically without training, without a degree, with less than a 3090, and just huggingface libraries. You can enter this field at any point in time and contribute, SOTA can come from anywhere and is not just contingent on having studied for the past 20 years relentlessly.

Jack Morris

@jxmnop

5 Apr 2025

something i rarely hear talked about: the barrier to entry to do AI research is extremely high

imagine someone developing an improvement to Transformers. it’s not that hard of a topic to teach yourself but even trying basic ideas takes a lot of work

you need to:
- understand exactly where state-of-the-art is at this moment,
- come up with some basic ideas to try,
- figure out the right datasets and evaluations that make your experiments convincing,
- learn to use tools to develop those experiments,
- write the code for the experiments
- execute; gather results; interpret
- iterate, potentially many times

all this process takes just a really non-trivial investment of your TIME (not to mention compute). you need to be detail-oriented, persistent over a long timeframe, and perhaps a bit lucky

thankfully the pool of people doing this in a given subfield in 2025 is still relatively small. your dumb idea might “just work” after all. try it

645

80,010

Ethan · Nov 5, 2024 · 10:25 PM UTC

Ethan

@torchcompiled

5 Nov 2024

576

21,361

Ethan · Sep 4, 2022 · 10:10 PM UTC

Ethan

@torchcompiled

4 Sep 2022

Someone made a depth map for one of my favorite pieces from "The Relativity of Perception" Love it! #ai #aiart #aiartcommunity #discodiffusion

556

Ethan · Oct 12, 2024 · 9:36 AM UTC

Ethan

@torchcompiled

12 Oct 2024

Damn what a line

hardmaru

@hardmaru

12 Oct 2024

“$2 H100s: How the GPU Bubble Burst” latent.space/p/gpu-bubble Last year, H100s were $8/hr if you could get them. Today, there’s 7 different resale markets selling them under $2. What happened?

ALT “The financial logic requires AGI to parse.” 👌

556

72,960

Ethan · Mar 22, 2025 · 11:29 PM UTC

Ethan

@torchcompiled

22 Mar 2025

It doesn't say it outright, but this video is probably the most complete explanation on the intuition behind principles of VAEs and denoising autoencoders I've seen.

550

35,511

Ethan · Mar 1, 2023 · 7:59 PM UTC

Ethan

@torchcompiled

1 Mar 2023

I can’t figure out why the anti-ai crowd is so upset about this. This is scripted, acted, costume designed, and clearly a lot of vfx work. The role ai plays here is minuscule. This is a human work

Jonah

@JonahBlake

27 Feb 2023

This is the future of IP. This anime is made using AI. Corridor Crew is a must subscribe youtube channel they show the whole process.

498

90,312

Ethan · May 19, 2024 · 12:21 AM UTC

Ethan

@torchcompiled

19 May 2024

yes, i am simping over this. this is what it's all about. every modality is just a vessel/wrapper for the underlying Thing. tear apart all the rules/structures of that modality and you're left with the raw concept. i believe whether you hear someone's voice or see them you access the same mental concept of that person up to some variance/uncertainty

484

140,396

Ethan · May 19, 2024 · 2:47 AM UTC

Ethan

@torchcompiled

19 May 2024

if you like this, you'll really like this arxiv.org/abs/2103.05247 pretraining on text is merely an adequately challenging modality to learn computational primitives for universal transfer. there are other modalities which we can do causal modeling on and develop similar foundational capabilities.

Ethan

@torchcompiled

19 May 2024

484

63,733

Ethan · Jun 26, 2024 · 11:20 PM UTC

Ethan

@torchcompiled

26 Jun 2024

YOLOv3 paper why dont we do this anymore

474

62,351

Ethan · Jun 22, 2025 · 1:11 PM UTC

Ethan

@torchcompiled

22 Jun 2025

Modeling dolphin language is cool. Translating it into human speak is cooler. Somewhere you're gonna want to figure out how to align the latent space of dolphin language with that of human language in an unpaired, unbiased manner.

vitrupo

@vitrupo

20 Jun 2025

DeepMind's Drew Purves says AI may help us talk to animals -- and that might be its greatest legacy. Projects like DolphinGemma are already decoding dolphin speech with LLMs. Whale song changed how we saw whales. A real conversation could change how we see nature.

475

50,558

Ethan · Oct 7, 2022 · 7:55 AM UTC

Ethan

@torchcompiled

7 Oct 2022

From lettuc on Instagram #ai #aiartcommunity

422

Ethan · Oct 4, 2024 · 2:51 AM UTC

Ethan

@torchcompiled

4 Oct 2024

how it feels realizing how much i still don't know about pytorch

438

19,703

Ethan · Mar 23, 2025 · 11:35 PM UTC

Ethan

@torchcompiled

23 Mar 2025

why does grad norm continually grow with training? this is mad unintuitive. absolutely destroying my mental model of 2D convex bowl optimization.

453

75,147

Ethan · Sep 22, 2024 · 4:13 AM UTC

Ethan

@torchcompiled

22 Sep 2024

i'm not sure why I haven't noticed this until now, is it not an issue that frequencies in sinusoidal positional embeddings get basically clipped past a certain dimension? should we be using slower changing frequencies in scaling up to larger dimensions?

437

128,258
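One possible reading of the sinusoidal-embedding question above: with the standard base-10000 schedule, the highest-index channels rotate so slowly that they barely move over the positions a model actually sees. A rough sketch under that assumption (the 0.1 radian tolerance and the function name are arbitrary choices for illustration):

```python
import math

def near_constant_fraction(dim: int, max_pos: int,
                           base: float = 10000.0, tol: float = 0.1) -> float:
    """Fraction of sin/cos pairs whose angle stays below tol radians up to max_pos."""
    pairs = dim // 2
    slow = 0
    for i in range(pairs):
        # Standard schedule: angular frequency of pair i is base ** (-2i / dim).
        freq = base ** (-2.0 * i / dim)
        if max_pos * freq < tol:
            slow += 1
    return slow / pairs

for max_pos in (64, 512, 4096):
    print(max_pos, round(near_constant_fraction(512, max_pos), 3))
```

In this formulation the flat fraction depends on the base and the maximum position rather than on the embedding width, so a wider embedding has proportionally more near-constant channels.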

Ethan · Oct 31, 2024 · 6:57 AM UTC

Ethan

@torchcompiled

31 Oct 2024

Bad day for "Nothing ever happens" crowd (please be true)

425

60,577

Ethan · May 21, 2025 · 6:30 PM UTC

Ethan

@torchcompiled

21 May 2025

Yep. Been known. And we can translate between latent spaces fairly well.

Jack Morris

@jxmnop

21 May 2025

excited to finally share on arxiv what we've known for a while now: All Embedding Models Learn The Same Thing embeddings from different models are SO similar that we can map between them based on structure alone. without *any* paired data feels like magic, but it's real:🧵

421

35,270

Ethan · Jun 23, 2024 · 5:26 AM UTC

Ethan

@torchcompiled

23 Jun 2024

First time I’ve seen something like this in a paper

388

53,099

Ethan · Jul 28, 2024 · 12:18 PM UTC

Ethan

@torchcompiled

28 Jul 2024

"I ... worked on this for a year ... and ... he just ... he tweeted it out. "

Ethan

@torchcompiled

25 Feb 2024

When you think about it your DNA is like the latent encoding of a VAE and your upbringing is the stochastic component

395

59,818

Ethan · Mar 12, 2025 · 6:45 PM UTC

Ethan

@torchcompiled

12 Mar 2025

New post! I'd like to very boldly claim that the origins of using softmax in attention are a bit arbitrary, the probability lens is a restrictive framework in how we might improve it, and that there's even something I'd consider a bug in attention affecting LLMs.

393

78,073

Ethan · Feb 24, 2025 · 2:02 AM UTC

Ethan

@torchcompiled

24 Feb 2025

In addition to having a soft spot for The Platonic Representation Hypothesis, this one is also a banger in a similar vein.

385

28,704

Ethan · Dec 31, 2024 · 12:30 AM UTC

Ethan

@torchcompiled

31 Dec 2024

Out of all the recent big bomb Meta papers, this one is the most slept on. This is the latent diffusion moment for LLMs, but more generally, autoregressive models, although not even sure that description does it justice.

362

22,882

Ethan · Sep 29, 2024 · 7:02 AM UTC

Ethan

@torchcompiled

29 Sep 2024

Olivia Moore

@omooretweets

29 Sep 2024

The NotebookLM hosts realizing they are AI and spiraling out is a twist I did not see coming

343

23,135

Ethan · Feb 28, 2024 · 1:50 AM UTC

Ethan

@torchcompiled

28 Feb 2024

trained a diffusion model to solve mazes

333

33,131

Ethan · Nov 18, 2023 · 3:43 AM UTC

Ethan

@torchcompiled

18 Nov 2023

why'd they have to do him like that 😭

343

65,607

Ethan · Aug 25, 2022 · 3:33 AM UTC

Ethan

@torchcompiled

25 Aug 2022

Just found this really neat comparison of #stableDifusion samplers vs step count. Credit in the watermark #ai #aiart #aiartcommunity

344

Ethan · Mar 12, 2024 · 3:18 PM UTC

Ethan

@torchcompiled

12 Mar 2024

Created a new kind of generative model (although kinda crappy lol) that works by autoregressively sequencing Fourier coefficients. Inspired by the coarse to fine generation of diffusion models. Full write up here: ethansmith2000.com/post/mimi… and tldr in thread 🧵 (flowers102)

346

75,112

Ethan · Feb 18, 2024 · 12:26 PM UTC

Ethan

@torchcompiled

18 Feb 2024

Quote tweeting someone minding their own business appreciating their SO with what can be simplified to “love feels good because it’s evolutionarily favorable for reproduction” is so weird

329

32,760

Ethan · Nov 10, 2024 · 12:50 PM UTC

Ethan

@torchcompiled

10 Nov 2024

sharing this again as i love the magic of diffusion. diffusion models approximate your data manifold at each noise level as a gaussian mixture model where every datapoint is a gaussian with a global isotropic variance determined by the noise level.

328

20,227
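The Gaussian-mixture view above can be made concrete with the closed-form "ideal" denoiser for a finite dataset: if every datapoint is the centre of a Gaussian with one shared variance sigma^2, the posterior mean E[x0 | x] is a softmax-weighted average of the datapoints. A minimal sketch on a hypothetical two-point dataset (a toy, not a trained model):

```python
import math

def ideal_denoise(x, data, sigma):
    """Posterior mean E[x0 | x] under a mixture of equal-variance Gaussians centred on data."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x, d)) for d in data]
    nearest = min(sq_dists)  # subtracted for numerical stability
    weights = [math.exp(-(s - nearest) / (2 * sigma ** 2)) for s in sq_dists]
    total = sum(weights)
    return [sum(w * d[k] for w, d in zip(weights, data)) / total
            for k in range(len(x))]

data = [(-1.0, 0.0), (1.0, 0.0)]
print(ideal_denoise((0.9, 0.1), data, sigma=0.1))   # snaps to the nearest datapoint
print(ideal_denoise((0.9, 0.1), data, sigma=10.0))  # collapses toward the dataset mean
```

At low noise the mixture components barely overlap and the output is the nearest datapoint; at high noise they blur together and the output drifts to the mean.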

Ethan · Dec 15, 2023 · 11:01 PM UTC

Ethan

@torchcompiled

15 Dec 2023

Replying to @andyohlbaum @digiaiapp

I want to qt this so the court of public appeals can help you realize how awful this is, but also don’t want to give this any publicity. I see this at the same level as nicotine companies, if not worse, because of how it targets the most vulnerable with fake companionship.

285

96,644

Ethan · Oct 14, 2024 · 1:31 PM UTC

Ethan

@torchcompiled

14 Oct 2024

whenever someone makes fun of Ilya's pattern baldness, remember this:

306

29,765

Ethan · Sep 28, 2024 · 2:09 PM UTC

Ethan

@torchcompiled

28 Sep 2024

although we don't use this anywhere, this is one of the most insane results to me in diffusion literature. we can get an estimation for the ground truth score without ever having a target of it to regress to.

304

22,874

Ethan · Sep 30, 2022 · 6:53 PM UTC

Ethan

@torchcompiled

30 Sep 2022

went on a huge dreamstudio bender yesterday, re-running nearly every prompt i've ever done in disco. Here's a thread of some awesome pics and some thoughts 🧵

296

Ethan · Dec 26, 2023 · 1:09 AM UTC

Ethan

@torchcompiled

26 Dec 2023

294

17,539

Ethan · Oct 29, 2024 · 3:43 AM UTC

Ethan

@torchcompiled

29 Oct 2024

Replying to @finbarrtimbers

Mainly for generation. Autoregression in continuous space is more vulnerable to exposure bias and doesn’t have common solutions. I wrote about it a bit here towards the latter half sweet-hall-e72.notion.site/W…

Why are Modern Neural Nets the way they are? And Hidden Hypernetworks. | Notion

Written by Ethan Smith

sweet-hall-e72.notion.site

298

20,343

Ethan · May 10, 2025 · 11:24 PM UTC

Ethan

@torchcompiled

10 May 2025

Replying to @vikhyatk

I just googled this function and only your tweet came up lmao

289

18,804

Ethan · Dec 17, 2024 · 9:21 PM UTC

Ethan

@torchcompiled

17 Dec 2024

calling it, GANs (or something that looks close to a GAN) will be making a comeback

294

28,676

Ethan · Jul 19, 2022 · 2:46 AM UTC

Ethan

@torchcompiled

19 Jul 2022

sand spirit #ai #aiart #aiartcommunity

278

Ethan · Oct 10, 2024 · 9:08 AM UTC

Ethan

@torchcompiled

10 Oct 2024

The level to which Meta does data curation and finetuning and model averaging with LLaMa 3 and MovieGen papers is absurd. this is merely the tip of the iceberg, they've really made it into an artform.

286

13,315

Ethan · Mar 12, 2025 · 7:15 PM UTC

Ethan

@torchcompiled

12 Mar 2025

Main critiques: 1. Attention lacks a way to do "nothing", and this may be why attention sinks arise. 2. Temperature scaling should be a function of sequence length because of how denominator of normalization naturally grows.

Ethan

@torchcompiled

12 Mar 2025

293

25,692
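Both critiques above can be sketched in a few lines. The specific fixes below are assumptions for illustration, not necessarily what the post proposes: an extra implicit zero logit so the weights are allowed to sum to less than 1 (a way to attend to "nothing"), and logits multiplied by log(n) / log(base_len) so sharpness does not decay as the sequence grows. `base_len` is a hypothetical reference length.

```python
import math

def softmax_with_null(logits):
    """Softmax with one extra implicit logit fixed at 0; weights may sum to < 1."""
    m = max(max(logits), 0.0)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps) + math.exp(-m)  # the extra term is the "do nothing" slot
    return [e / z for e in exps]

def length_scaled_softmax(logits, base_len=128):
    """Plain softmax after scaling logits by log(n) / log(base_len) (assumed schedule)."""
    scale = math.log(len(logits)) / math.log(base_len)
    scaled = [x * scale for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# One target token at logit 5, everything else at logit 0.
for n in (128, 1024, 8192):
    logits = [5.0] + [0.0] * (n - 1)
    print(n, round(length_scaled_softmax(logits)[0], 3))

# When nothing is relevant, almost all of the mass goes to the null slot.
print(sum(softmax_with_null([-20.0, -20.0, -20.0])))
```

With the length-dependent scale, the target's weight stays roughly level across sequence lengths in this toy setup instead of shrinking.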

Ethan · Oct 27, 2024 · 10:08 AM UTC

Ethan

@torchcompiled

27 Oct 2024

I didn't see the vision before But Now I Do. just swap out e^x with your favorite n in x^n, adjust regularization. You're now free from the most unwieldy bits of attention computation: the Max and Sum

280

21,925
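A rough sketch of what swapping e^x for x^n might look like for a single query, assuming the data-dependent softmax sum is replaced by a fixed 1/len(keys) scale. That scaling choice and the function name are assumptions made for illustration; the tweet does not spell out the normalization.

```python
def poly_attention(q, keys, values, power=2):
    """Single-query attention with x ** power in place of exp and a fixed scale in place of the sum."""
    d = len(q)
    # Scaled dot-product scores, as in ordinary attention.
    scores = [sum(a * b for a, b in zip(q, k)) / d ** 0.5 for k in keys]
    # No running max is needed for stability and no sum over keys is tracked.
    weights = [s ** power / len(keys) for s in scores]
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

out = poly_attention([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]], [[1.0, 0.0], [0.0, 1.0]])
print(out)
```

Because each weight depends only on its own score, keys can be processed in any order or in chunks without the online max/sum bookkeeping that softmax needs.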

Ethan · Apr 11, 2025 · 2:42 PM UTC

Ethan

@torchcompiled

11 Apr 2025

We are slowly fashioning DiT back into Unets.

275

20,211

Ethan · May 17, 2024 · 12:40 AM UTC

Ethan

@torchcompiled

17 May 2024

today i trained a diffusion model that generates LoRAs, and the images that come out are at the very least not garbled.

262

66,145

Ethan · Mar 29, 2025 · 3:02 PM UTC

Ethan

@torchcompiled

29 Mar 2025

Compared to most optimizer research, Muon comes a bit out of left field. I thought I'd share some notes on what might be happening under the hood as it doesn't appear to be traditional preconditioning. Instead, my guess is it lies in amplifying noise and maintaining relativity.

263

21,272

Ethan · Apr 11, 2024 · 5:19 AM UTC

Ethan

@torchcompiled

11 Apr 2024

Someone gotta explain to me how this is a working business model, renting a single a100 generally goes for 1.20/hr at the lowest

Simon Willison

@simonw

11 Apr 2024

Replying to @simonw

@DeepInfra for $0.65/million tokens

ALT 8deepinfra mistralai/Mixtral-8x22B-v0.1 Copy Mixtral-8x22B is the latest and largest mixture of expert large language model (LLM) from Mistral AI. This is state of the art machine learning model using a mixture 8 of experts (MoE) 22b models. During inference 2 experts are selected. This architecture allows large models to be fast and cheap at inference. This model is not instruction tuned. Public $0.65 / Mtoken 64k DEMO API VERSIONS

237

140,003

Ethan · Oct 5, 2024 · 5:19 PM UTC

Ethan

@torchcompiled

5 Oct 2024

Made a new post! Trying to answer why the models we know and love today are the ones that have claimed the throne, and what traits make for a good generative model? As well as how many layers that were once static have now become dynamic (creating weights on the fly)

229

14,641

Ethan · Apr 13, 2025 · 7:02 AM UTC

Ethan

@torchcompiled

13 Apr 2025

once again, artem kirsanov is a fantastic teacher

221

14,865

Ethan · Jul 27, 2024 · 2:52 PM UTC

Ethan

@torchcompiled

27 Jul 2024

Made a post covering Diffusion Transformers, do they deserve the hype? What makes them good? Are non-transformer methods worthless now? Also me ranting about how awful patch embedding methods are and some experiments showing some really annoying weaknesses of them! link below

222

41,132

Ethan · Nov 10, 2025 · 10:03 PM UTC

Ethan

@torchcompiled

10 Nov 2025

Little stake here as saying humans are thinking in an autoregressive or diffusion manner is overly reductive. But a fun fact is that human associative memory does share some similarities with Hopfield networks, which somewhat resemble a denoising process

kalomaze

@kalomaze

10 Nov 2025

> The way humans think look a lot more like diffusion than autoregressive. i will never, ever understand this claim or the intuitions behind it. ah yes. the human mind is... learning a scoring function to... reverse gaussian noise... (?) ... spatially (???)

225

19,388

Ethan · Jul 29, 2024 · 2:01 AM UTC

Ethan

@torchcompiled

29 Jul 2024

They trained a CLIP competitive model on 1/400 of the data It’s so over arxiv.org/abs/2208.13628

snats

@snats_xyz

29 Jul 2024

They trained on only 6GB of text, distilled from the Pile and got a BERT model up to the quality of T5 with 745x less data 💀 It's so over arxiv.org/pdf/2304.08442

220

22,035

Ethan · Jul 22, 2024 · 2:14 PM UTC

Ethan

@torchcompiled

22 Jul 2024

206

7,792

Ethan · Sep 29, 2024 · 3:53 PM UTC

Ethan

@torchcompiled

29 Sep 2024

how it feels to be working in diffusion and watching stuff happen in LLM world

203

9,221

Ethan · May 24, 2024 · 2:15 AM UTC

Ethan

@torchcompiled

24 May 2024

Replying to @DeadTheory_ @DeadTheory__

This could be a real dude who’s not that funny

185

5,528

Ethan · Nov 23, 2024 · 5:03 AM UTC

Ethan

@torchcompiled

23 Nov 2024

probably the most in-depth explanation I've seen of FSDP at the most granular levels, props to the authors dev-discuss.pytorch.org/t/fs…

213

10,102

Ethan · Jun 14, 2024 · 4:05 AM UTC

Ethan

@torchcompiled

14 Jun 2024

UnCLIP showed us the power of decoding from representation spaces, and really the power of hierarchical modeling across different levels of abstraction. But really this approach (imho wrongfully) kinda fell out of style

207

35,570

Ethan · Oct 13, 2024 · 11:27 PM UTC

Ethan

@torchcompiled

13 Oct 2024

Ok this is interesting, because this paper basically says the opposite. So long as 2% of your data is real, collapse is substantially mitigated. thoughts? feelings?

Xuandong Zhao

@xuandongzhao

13 Oct 2024

🚨 Fascinating insights from the paper “Strong Model Collapse”! (arxiv.org/abs/2410.04840) It concludes that even the smallest fraction of synthetic data (as little as 1% of the total dataset) can lead to model collapse. 🧠 Larger datasets don’t necessarily improve performance!

195

30,349

Ethan · Oct 8, 2025 · 4:09 PM UTC

Ethan

@torchcompiled

8 Oct 2025

This reminds me a bit of non-denoising score matching. The loss function incentivizes:
- gradient/score of data points pushed to zero
- trace of fisher matrix to represent curvature in p(x) around data points

This resembles the 1st/2nd derivative test in calculus for detecting minima/maxima. In other words, we encourage data points to be both maxima of p(x) and for everything around data points to be lower in p(x)

Diffusion models also learn to push off manifold data towards regions of higher probability, but does this with respect to the noisy distribution at a given timestep, varying in levels of noise across timesteps, as opposed to this method which does so with respect to the unnoised static data distribution.

generative EBMs and diffusion models are close relatives in this regard, though successfully learning p(x) and being able to sample cleanly without getting “caught” has historically been difficult for EBMs. Exciting to see how it’s moving.

Yilun Du @du_yilun

6 Oct 2025

Replying to @du_yilun

In EqM, we supervise the gradient of the energy landscape so that points on the data manifold are local minima of the energy function. We supervise points far away from the data manifold to point to the data manifold, and points on the data manifold to have zero gradient.

203

21,779
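For reference, the two incentives described in the score-matching tweet above line up with the two terms of the classic implicit score matching objective, assuming that is what "non-denoising score matching" refers to. Here s_θ is the model score, an assumed notation not used in the tweet:

```latex
\[
  J(\theta)
  = \mathbb{E}_{x \sim p_{\text{data}}}
    \left[
      \operatorname{tr}\!\left( \nabla_x s_\theta(x) \right)
      + \tfrac{1}{2} \left\lVert s_\theta(x) \right\rVert_2^2
    \right],
  \qquad
  s_\theta(x) = \nabla_x \log p_\theta(x).
\]
```

Minimizing the squared-norm term drives the score toward zero at data points (the first-derivative condition), and minimizing the trace term makes the Hessian of log p_θ negative there (the second-derivative condition), so data points become local maxima of p_θ(x).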

Ethan · Mar 15, 2025 · 11:07 PM UTC

Ethan

@torchcompiled

15 Mar 2025

smallest attention layer code golf

196

13,764

Ethan · Oct 6, 2024 · 1:59 AM UTC

Ethan

@torchcompiled

6 Oct 2024

I love that we're finally doing entropy-based sampling. This paper advocated that human speech alternates between high entropy and low entropy words to convey information, and does it smoothly. but the approach never took off, it's about time it's revisited. arxiv.org/abs/2202.00666

Locally Typical Sampling

Today's probabilistic language generators fall short when it comes to producing coherent and fluent text despite the fact that the underlying models perform well under standard metrics, e.g.,...

arxiv.org

xjdr

@_xjdr

5 Oct 2024

the implementation in the last push is stable enough even with fixed thresholds (surprisingly) to make a few observations about the sampler capabilities beyond CoT or reasoning:

1) Prompt Optimizer: By evaluating the attention entropy after the prefill stage, you can optimize your prompt towards a low entropy state. This also extends to ICL and RAG. There is now a measurable target to see how well the model understands the input, per head, per layer, beyond just evaluating the output tokens

2) Steering Vector training loop: Instead of measuring the general direction of similar prompts and injecting steering vectors to drive further in that direction, you can now train the control vectors with an objective of lowering entropy in the attention heads as well as directly driving activations towards lower entropy state

3) Significant improvement of narrative consistency issues due to entropy collapse for long generation lengths: We introduce just enough entropy to prevent collapse while still driving towards consistent (low varentropy) outputs. We could substantially improve this by introducing and combining logits from a base model as well, but even the sampler only version has had surprising improvements in capabilities, especially for the small model we are mostly testing on

4) Reduce Hallucinations: Instead of randomly sampling over a distribution of low probability logits, we can now drive the model to either dig deeper via branching until it is confident or to eventually the only thing it will be sure of is an eot token, which is ideal.

There are many more interesting things i have observed up to this point, so many in fact that i am almost ready to say that even if we do not make any more progress on reasoning (we are and we will), just the improvements outlined above would lead this new technique to be, in my opinion, a wild success.

In the before times, i would run a set of evals on every github push to measure improvements / regressions in order to guide next steps, so it looks like we might be at the point in this project where doing that might be the most important next thing to work on. I need to measure entropy and consistency over a large number and wide variety of prompts and I need to run actual evals against argmax and nucleus sampling baselines over a variety of evals and tasks. I am not prepared to say we have achieved anything at this time, but everything i have seen up to this point leads me to believe that this could be something worth spending some real dedicated time on and that has a good chance of leading to improved model capabilities in the long run

193

19,473

Ethan · Nov 15, 2024 · 7:58 AM UTC

Ethan

@torchcompiled

15 Nov 2024

i refuse to believe this is the cost of agi, there are way too many other paths still untouched. this is the same mentality that drove pure RL into the grave and now we don't even think twice on it.

Tsarathustra @tsarnick

15 Nov 2024

Garry Tan says Sam Altman has told him he wants to do a $1 trillion training run for an AI model, which could lead to understanding the nature of physics and solving engineering problems like nuclear fusion

183

89,368

Ethan · May 21, 2024 · 5:58 PM UTC

Ethan

@torchcompiled

21 May 2024

I just released the code for a couple different maze solving methods + small post here ethansmith2000.com/post/diff… the 2d diffusion method learns quite quickly, I think the other methods struggled around learning their position.

Ethan

@torchcompiled

28 Feb 2024

trained a diffusion model to solve mazes

191

15,234

Ethan · Jul 28, 2025 · 11:24 PM UTC

Ethan

@torchcompiled

28 Jul 2025

we were given universal function approximators, it is our god-given right to fit the models to anything and everything

186

5,516

Ethan · Nov 20, 2024 · 1:24 AM UTC

Ethan

@torchcompiled

20 Nov 2024

Oh wow, FID is fragile...

187

33,452

Ethan · Jun 14, 2024 · 10:49 PM UTC

Ethan

@torchcompiled

14 Jun 2024

say this out loud then take it to heart

175

40,125

Ethan · Aug 25, 2022 · 3:34 AM UTC

Ethan

@torchcompiled

25 Aug 2022

Much more than anything else, I am blown away by img2img capabilities #stabledifussion #ai #aiart #aiartcommunity

169

Ethan · May 1, 2024 · 5:04 AM UTC

Ethan

@torchcompiled

1 May 2024

They trained an LLM on manim code, 3blue would be proud

Gatekeep @gatekeep_labs

23 Mar 2024

Introducing Gatekeep, the text-to-video AI that transforms your questions into engaging educational explainer videos. Now available on web and Discord. Get started at gatekeep.ai

175

15,252

Ethan · Jul 4, 2025 · 5:57 PM UTC

Ethan

@torchcompiled

4 Jul 2025

This actually is really similar in spirit to one of the last significant transformer improvements: GEGLU. Going from Q * K -> Q * K1 * K2 is very similar in spirit to GEGLU’s gate

Omead Pooladzandi

@HessianFree

4 Jul 2025

This will probs be one of the most influential papers of the year

172

13,506

Ethan · May 26, 2024 · 1:59 AM UTC

Ethan

@torchcompiled

26 May 2024

These don’t hit the same since openai has taken on a slight grifter aura

roon

@tszzl

25 May 2024

The Seal of Solomon binds them and the Lithograph of Jensen releases them

163

13,157

Ethan · Dec 2, 2024 · 10:23 AM UTC

Ethan

@torchcompiled

2 Dec 2024

yo howd they get Kingma on the track

173

10,470

Ethan · Mar 4, 2024 · 9:32 PM UTC

Ethan

@torchcompiled

4 Mar 2024

Here's DipoleAttention, a novel attention method designed to extract different signals from positive and negative correlations, rather than a single spectrum. Give it a try, let's see if we can replicate the results. ethansmith2000.com/post/dipo… 🧵/N

Ethan

@torchcompiled

3 Mar 2024

this whole subset of attention variants reliably reaches a substantially reduced loss compared to baseline attention. What other tests can i run?

163

28,008

Ethan · Nov 10, 2024 · 6:23 AM UTC

Ethan

@torchcompiled

10 Nov 2024

I don’t understand the fascination with colonizing Mars, at least now. It’s gotta be vastly easier to make earth more habitable than to somehow fix all of what currently makes life impossible on Mars. Instead of terraforming Mars, why not the 90% of Australia that’s not lived in?

169

23,300

Ethan · Feb 1, 2024 · 2:27 PM UTC

Ethan

@torchcompiled

1 Feb 2024

OP is correct that SD VAE deviates from typical behavior. but there are several things wrong with their line of reasoning and the really unnecessary sounding of alarms. I did some investigations in this thread to show you can rest assured, it's really not a big deal.

Doron Adler @Norod78

1 Feb 2024

A Reddit user found a problem with the VAE of SD 1,2 (and Dall-E 3?). A spot where the VAE is trying to smuggle global information about the image through latent space (demonstrated there) something KL-divergence loss was supposed to prevent. teddit.net/r/StableDiffusion…

175

53,150

Ethan · Sep 14, 2024 · 6:07 AM UTC

Ethan

@torchcompiled

14 Sep 2024

2010, a foreshadowing moment for diffusion models.

166

13,775

Ethan · Jul 11, 2024 · 2:56 PM UTC

Ethan

@torchcompiled

11 Jul 2024

heres a cool paper to go along with this post

162

8,221

Ethan · Oct 23, 2025 · 3:18 PM UTC

Ethan

@torchcompiled

23 Oct 2025

I think this already feels wrong.

166

15,340

Ethan · Jun 20, 2024 · 4:52 AM UTC

Ethan

@torchcompiled

20 Jun 2024

> Autoregressive image generation without vector quantization > look inside > it's diffusion

159

27,250

Ethan · Oct 3, 2024 · 12:09 AM UTC

Ethan

@torchcompiled

3 Oct 2024

Same deal for diffusion! Two DMs trained on different subsets are more similar to each other than reality. The universal approximators with the same inductive biases trained on similar distributions discovered near identical functions! Who would have thought? Now knowing this, how can we avoid spending 100M+ training nearly the same model from scratch?

Rohan Paul

@rohanpaul_ai

2 Oct 2024

All LLMs are converging towards the same point 🤔

166

19,013

Ethan · Dec 26, 2024 · 3:18 PM UTC

Ethan

@torchcompiled

26 Dec 2024

Ignore the first part, i've since learned what speculative decoding is. But multi-token prediction was massively under appreciated, you're getting more tasks/losses from near identical forward pass cost. That's how you increase sample efficiency. Thank you deepseekV3

Ethan

@torchcompiled

3 May 2024

it was only briefly touched upon, but is it correct that multi-token prediction is only valid in the case of greedy decoding? also IMHO it seems like it'd make more sense and parameter efficient to have 1 shared head but then 1-2 branching transformer layers for each token.

166

18,198

Ethan · Mar 10, 2024 · 7:15 PM UTC

Ethan

@torchcompiled

10 Mar 2024

random access memories

159

7,780

Ethan · Dec 25, 2024 · 8:49 PM UTC

Ethan

@torchcompiled

25 Dec 2024

This is a cool idea, but you won't have a good time past the MNIST toy example. No backprop means needing... 128 forward passes, for grad estimate with only 0.009 cos similarity with true grad. increasing this to 2048 forwards only gives about 0.035. That's just the start

Will

@_brickner

24 Dec 2024

wrote a paper: it lets you *train* in 1.58b! could use 97% less energy, 90% less weight memory. leads to a new model format which can store a 175B model in ~20mb. also, no backprop!

162

34,987
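To get a feel for why forward-only gradient estimates align so poorly with the true gradient, here is a toy random-probe estimator on a quadratic loss. It is a generic sketch, not the quoted paper's method; the loss, dimension, and probe counts are arbitrary assumptions.

```python
import math
import random

def loss(x):
    """Toy objective standing in for a network forward pass."""
    return 0.5 * sum(xi * xi for xi in x)

def forward_only_grad(x, num_probes, rng, eps=1e-3):
    """Estimate the gradient from forward evaluations along random Gaussian directions."""
    est = [0.0] * len(x)
    for _ in range(num_probes):
        v = [rng.gauss(0.0, 1.0) for _ in x]
        plus = loss([xi + eps * vi for xi, vi in zip(x, v)])
        minus = loss([xi - eps * vi for xi, vi in zip(x, v)])
        slope = (plus - minus) / (2 * eps)  # directional derivative along v
        for j, vj in enumerate(v):
            est[j] += slope * vj / num_probes
    return est

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
true_grad = list(x)  # the gradient of 0.5 * ||x||^2 is x itself
for k in (2, 20, 200):
    print(k, round(cosine(forward_only_grad(x, k, rng), true_grad), 3))
```

In this setup the cosine similarity grows roughly like the square root of probes over dimension, which is at least consistent with the numbers in the tweet: 16x more forward passes for about 4x the similarity.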

Ethan · Sep 26, 2024 · 1:43 AM UTC

Ethan

@torchcompiled

26 Sep 2024

161

8,298

Ethan · Jul 9, 2024 · 3:12 PM UTC

Ethan

@torchcompiled

9 Jul 2024

I gave a presentation on diffusion models today

155

12,011