trying to feel the magic. global research lead at @canva | prev: cofounder at @leonardoai

sydney - florida - SF
personally I feel like the inflection point was early 2022. The sweet spot where clip-guided diffusion was just taking off, forcing unconditional models to be conditional through a strange patchwork of CLIP evaluating slices of the canvas at a time. It was like improv, always trying to riff off mistakes and sitting right at the fine line between interesting and incoherent.
Image synthesis used to look so good. These are from 2021. I feel like this was an inflection point, and the space has metastasized into something abhorrent today (Grok, etc). Even with no legible representational forms, there was so much possibility in these images.
29
40
708
244,214
should i sell this as a t-shirt
66
837
10,905
352,114
really cool to see they got back together to write papers on diffusion models
19
424
7,258
390,431
3D polygon diffusion is one of the coolest things I've ever seen
55
411
4,845
373,676
Ai majors be like: “damn I got a neural network due tomorrow”
19
91
3,274
96,487
stanford cs149 has implementing flash attention as a homework assignment
23
112
3,037
370,159
they did diffusion on * checks notes * a house
40
123
1,783
101,093
so this is nuts, if you're cool with the high frequency details of an image being reinterpreted/stochastic, you can encode an image quite faithfully into 32 tokens... with a codebook size of 1024 as they use this is just 320 bits, new upper bound for the information in an image unlocked
46
144
1,276
299,127
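The 320-bit figure above can be checked in a couple of lines. This is a minimal sketch: `token_budget_bits` is a hypothetical helper name, and it assumes each token is an independent index into the codebook, so each one costs log2(codebook size) bits.

```python
import math

def token_budget_bits(num_tokens: int, codebook_size: int) -> float:
    """Bits needed to store `num_tokens` indices into a codebook.

    Assumption: every token is a plain index, with no entropy coding on top.
    """
    return num_tokens * math.log2(codebook_size)

# 32 tokens, codebook of 1024 entries (10 bits per token)
print(token_budget_bits(32, 1024))
```

The count covers only the token indices. The decoder weights and the codebook itself are shared across all images and are not included.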
Is this Sam Altman
51
16
1,072
104,761
Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks. It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task in compute) and 87.5% in high-compute mode (thousands of $ per task). It's very expensive, but it's not just brute -- these capabilities are new territory and they demand serious scientific attention.
5
35
1,077
48,391
if you are implementing flash attention for a grade and you'd rather be doing it for money, come join our team :) dms open
stanford cs149 has implementing flash attention as a homework assignment
13
31
1,032
140,625
I owe you an apology RoPE, I was not familiar with your game.
9
56
919
70,864
a picture of a dog i accidentally hit plt.plot instead of plt.imshow on
22
48
841
118,527
very significant vibe shift once meta started publishing papers in this format
9
57
823
69,088
I thought I knew GPUs fairly well, I evidently do not piped.video/watch?v=whPSD8sd…
8
93
784
51,511
To all CLIP guided diffusion enthusiasts! I put together a guide covering both my and others' findings on parameter/prompt studies, as well as advice on how to get started. Check it out! notion.so/A-Traveler-s-Guide… #AIart #discodiffusion #ArtificialIntelligence
36
146
772
You guys can’t be serious
55
7
675
63,638
so yeah, this is something I've always been confused about with softmax. your denominator keeps growing with sequence length, but logits of individual items are invariant to this. So attention sharpness ultimately depends on sequence length, becoming easier for noise to drown out contribution of a desired token we want to attend to in very long sequences
10
69
691
67,405
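A toy illustration of the dilution effect described above. This is a sketch under made-up assumptions: the "noise" logits are i.i.d. standard normal, the one token we care about has a fixed logit of 5.0, and `target_attention_weight` is a hypothetical helper, not anything from a real attention implementation.

```python
import numpy as np

def target_attention_weight(seq_len, target_logit=5.0, noise_std=1.0, seed=0):
    """Softmax weight given to one fixed-logit token among random distractors."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(0.0, noise_std, size=seq_len)  # distractor logits (assumed)
    logits[0] = target_logit                           # the token we want to attend to
    logits = logits - logits.max()                     # subtract max for numerical stability
    weights = np.exp(logits)
    return float(weights[0] / weights.sum())

# The target logit never changes, but its share of attention shrinks
# as the denominator accumulates more distractor terms.
for n in (16, 256, 4096, 65536):
    print(n, target_attention_weight(n))
```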
False. There has never been a lower barrier to entry to AI research. You can do things that are authentically worth sharing academically without training, without a degree, with less than a 3090, and just huggingface libraries. You can enter this field at any point in time and contribute, SOTA can come from anywhere and is not just contingent on having studied for the past 20 years relentlessly.
something i rarely hear talked about: the barrier to entry to do AI research is extremely high
imagine someone developing an improvement to Transformers. it’s not that hard of a topic to teach yourself but even trying basic ideas takes a lot of work
you need to:
- understand exactly where state-of-the-art is at this moment,
- come up with some basic ideas to try,
- figure out the right datasets and evaluations that make your experiments convincing,
- learn to use tools to develop those experiments,
- write the code for the experiments
- execute; gather results; interpret
- iterate, potentially many times
all this process takes just a really non-trivial investment of your TIME (not to mention compute). you need to be detail-oriented, persistent over a long timeframe, and perhaps a bit lucky
thankfully the pool of people doing this in a given subfield in 2025 is still relatively small. your dumb idea might “just work” after all. try it
22
39
645
80,010
Someone made a depth map for one of my favorite pieces from "The Relativity of Perception" Love it! #ai #aiart #aiartcommunity #discodiffusion
8
66
556
Damn what a line
“$2 H100s: How the GPU Bubble Burst” latent.space/p/gpu-bubble Last year, H100s were $8/hr if you could get them. Today, there’s 7 different resale markets selling them under $2. What happened?
10
50
556
72,960
It doesn't say it outright, but this video is probably the most complete explanation on the intuition behind principles of VAEs and denoising autoencoders I've seen.
6
32
550
35,511
I can’t figure out why the anti-ai crowd is so upset about this. This is scripted, acted, costume designed, and clearly a lot of vfx work. The role ai plays here is minuscule. This is a human work
This is the future of IP. This anime is made using AI. Corridor Crew is a must subscribe youtube channel they show the whole process.
49
34
498
90,312
yes, i am simping over this. this is what its all about. every modality is just a vessel/wrapper for the underlying Thing. tear apart all the rules/structures of that modality and you're left with the raw concept. i believe whether you hear someone's voice or see them you access the same mental concept of that person up to some variance/uncertainty
33
40
484
140,396
if you like this, you'll really like this arxiv.org/abs/2103.05247 pretraining on text is merely an adequately challenging modality to learn computational primitives for universal transfer. there are other modalities which we can do causal modeling on and develop similar foundational capabilities.
yes, i am simping over this. this is what its all about. every modality is just a vessel/wrapper for the underlying Thing. tear apart all the rules/structures of that modality and you're left with the raw concept. i believe whether you hear someone's voice or see them you access the same mental concept of that person up to some variance/uncertainty
7
66
484
63,733
YOLOv3 paper why dont we do this anymore
23
32
474
62,351
Modeling dolphin language is cool. Translating it into human speak is cooler. Somewhere you're gonna want to figure out how to align the latent space of dolphin language with that of human language in an unpaired, unbiased manner.
DeepMind's Drew Purves says AI may help us talk to animals -- and that might be its greatest legacy. Projects like DolphinGemma are already decoding dolphin speech with LLMs. Whale song changed how we saw whales. A real conversation could change how we see nature.
12
55
475
50,558
From lettuc on Instagram #ai #aiartcommunity
6
80
422
how it feels realizing how much i still don't know about pytorch
9
24
438
19,703
why does grad norm continually grow with training? this is mad unintuitive. absolutely destroying my mental model of 2D convex bowl optimization.
45
14
453
75,147
i'm not sure why I haven't noticed this until now, is it not an issue that frequencies in sinusoidal positional embeddings get basically clipped past a certain dimension? should we be using slower changing frequencies in scaling up to larger dimensions?
29
30
437
128,258
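A small sketch for poking at the question above. It assumes the standard sinusoidal formula with base 10000, and one possible reading of "clipped": channels whose wavelength is longer than the context barely change across the sequence. Both helper names are hypothetical.

```python
import numpy as np

def sinusoidal_wavelengths(dim, base=10000.0):
    """Wavelength (in positions) of each sin/cos channel pair."""
    i = np.arange(dim // 2)
    inv_freq = base ** (-2.0 * i / dim)   # standard sinusoidal frequencies
    return 2.0 * np.pi / inv_freq

def fraction_slower_than_context(dim, context_len, base=10000.0):
    """Share of channel pairs that never complete a full cycle within the context."""
    return float(np.mean(sinusoidal_wavelengths(dim, base) > context_len))

# The shortest wavelength is always 2*pi and the longest approaches 2*pi*base,
# so a wider embedding samples the same frequency range more densely.
for dim in (128, 512, 1024):
    print(dim, fraction_slower_than_context(dim, context_len=512))
```

Under this formula the frequency range is set by `base`, not by `dim`, so getting slower frequencies at larger widths would mean changing `base` rather than adding dimensions.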
Bad day for "Nothing ever happens" crowd (please be true)
13
7
425
60,577
Yep. Been known. And we can translate between latent spaces fairly well.
excited to finally share on arxiv what we've known for a while now: All Embedding Models Learn The Same Thing embeddings from different models are SO similar that we can map between them based on structure alone. without *any* paired data feels like magic, but it's real:🧵
7
27
421
35,270
First time I’ve seen something like this in a paper
15
11
388
53,099
"I ... worked on this for a year ... and ... he just ... he tweeted it out. "
When you think about it your DNA is like the latent encoding of a VAE and your upbringing is the stochastic component
12
19
395
59,818
New post! I'd like to very boldly claim that the origins of using softmax in attention are a bit arbitrary, the probability lens is a restrictive framework for how we might improve it, and that there's even something I'd consider a bug in attention affecting LLMs.
27
33
393
78,073
In addition to having a soft spot for The Platonic Representation Hypothesis, this one is also a banger in a similar vein.
12
40
385
28,704
Out of all the recent big bomb Meta papers, this one is the most slept on. This is the latent diffusion moment for LLMs, but more generally, autoregressive models, although not even sure that description does it justice.
15
27
362
22,882
The NotebookLM hosts realizing they are AI and spiraling out is a twist I did not see coming
7
23
343
23,135
trained a diffusion model to solve mazes
22
13
333
33,131
why'd they have to do him like that 😭
11
10
343
65,607
Just found this really neat comparison of #stableDifusion samplers vs step count. Credit in the watermark #ai #aiart #aiartcommunity
11
51
344
Created a new kind of generative model (although kinda crappy lol) that works by autoregressively sequencing Fourier coefficients. Inspired by the coarse to fine generation by Diffusion models. Full write up here: ethansmith2000.com/post/mimi… and tldr in thread 🧵 (flowers102)
26
35
346
75,112
Quote tweeting someone minding their own business appreciating their SO with what can be simplified to “love feels good because it’s evolutionarily favorable for reproduction” is so weird
50
4
329
32,760
sharing this again as i love the magic of diffusion. diffusion models approximate your data manifold at each noise level as a gaussian mixture model where every datapoint is a gaussian with a global isotropic variance determined by the noise level.
3
31
328
20,227
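The gaussian mixture picture above has a closed form that is easy to sketch. The assumptions here: a finite dataset, and noising of the form x_t = x_0 + sigma * eps (the variance-exploding convention), so the noisy distribution is a mixture with one isotropic gaussian per datapoint. `gmm_score` is a hypothetical name, and a trained network only approximates this quantity.

```python
import numpy as np

def gmm_score(x, data, sigma):
    """Score (gradient of log density) of a mixture with one isotropic
    gaussian of std `sigma` centered on every row of `data`."""
    sq_dist = np.sum((data - x) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits = logits - logits.max()        # stable softmax over datapoints
    weights = np.exp(logits)
    weights = weights / weights.sum()
    denoised = weights @ data             # posterior mean of the clean point
    return (denoised - x) / sigma ** 2    # points from x toward the denoised estimate

data = np.array([[0.0, 0.0], [2.0, 0.0]])
x = np.array([0.0, 0.0])
print(gmm_score(x, data, sigma=0.1))    # low noise: x sits on a datapoint, score near zero
print(gmm_score(x, data, sigma=100.0))  # high noise: pulled toward the dataset mean
```

At high noise the softmax weights are nearly uniform, so everything is pulled toward the dataset mean. At low noise the nearest datapoint dominates.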
I want to qt this so the court of public appeals can help you realize how awful this is, but also don’t want to give this any publicity. I see this at the same level as nicotine companies, if not worse, because of how it targets the most vulnerable with fake companionship
3
6
285
96,644
whenever someone makes fun of Ilya's pattern baldness, remember this:
21
17
306
29,765
although we don't use this anywhere, this is one of the most insane results to me in diffusion literature. we can get an estimation for the ground truth score without ever having a target of it to regress to.
7
14
304
22,874
went on a huge dreamstudio bender yesterday, re-running nearly every prompt i've ever done in disco. Here's a thread of some awesome pics and some thoughts 🧵
3
26
296
Replying to @finbarrtimbers
Mainly for generation. Autoregression in continuous space is more vulnerable to exposure bias and doesn’t have common solutions. I wrote about it a bit here towards the latter half sweet-hall-e72.notion.site/W…
6
9
298
20,343
Replying to @vikhyatk
I just googled this function and only your tweet came up lmao
6
1
289
18,804
calling it, GANs (or something that looks close to a GAN) will be making a comeback
25
7
294
28,676
The level to which Meta does data curation and finetuning and model averaging with LLaMa 3 and MovieGen papers is absurd. this is merely the tip of the iceberg, they've really made it into an artform.
3
20
286
13,315
Main critiques: 1. Attention lacks a way to do "nothing", and this may be why attention sinks arise. 2. Temperature scaling should be a function of sequence length because of how denominator of normalization naturally grows.
New post! I'd like to very boldly claim that the origins of using softmax in attention are a bit arbitrary, the probability lens is a restrictive framework for how we might improve it, and that there's even something I'd consider a bug in attention affecting LLMs.
12
27
293
25,692
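A sketch of what critique 2 could look like in practice. The scaling rule here (multiply logits by log(n) / log(ref_len)) is an illustrative choice and not taken from the post. The distractor logits, the target logit of 5.0, and the name `target_weight` are all assumptions.

```python
import math
import numpy as np

def target_weight(seq_len, scale_with_length, target_logit=5.0, ref_len=16, seed=0):
    """Softmax weight on one fixed-logit token, with or without a
    length-dependent inverse temperature."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(0.0, 1.0, size=seq_len)  # distractors (assumed standard normal)
    logits[0] = target_logit
    if scale_with_length:
        # Sharpen logits as the sequence grows; equals 1.0 at seq_len == ref_len.
        logits = logits * (math.log(seq_len) / math.log(ref_len))
    logits = logits - logits.max()
    weights = np.exp(logits)
    return float(weights[0] / weights.sum())

for n in (16, 256, 4096):
    print(n, target_weight(n, False), target_weight(n, True))
```

With a fixed temperature the target's weight decays as n grows. With this particular length-aware scaling it holds up much better in the toy setup, which is the intuition behind tying temperature to sequence length.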
I didn't see the vision before But Now I Do. just swap out e^x with your favorite n in x^n, adjust regularization. You're now free from the most unwieldy bits of attention computation: the Max and Sum
5
20
280
21,925
We are slowly fashioning DiT back into Unets.
5
27
275
20,211
today i trained a diffusion model that generates LoRAs, and the images that come out are at the very least not garbled.
18
20
262
66,145
Compared to most optimizer research, Muon comes a bit out of left field. I thought I'd share some notes on what might be happening under the hood as it doesn't appear to be traditional preconditioning. Instead, my guess is it lies in amplifying noise and maintaining relativity.
5
31
263
21,272
Someone gotta explain to me how this is a working business model, renting a single a100 generally goes for 1.20/hr at the lowest
Replying to @simonw
@DeepInfra for $0.65/million tokens
18
3
237
140,003
Made a new post! Trying to answer why the models we know and love today are the ones that have claimed the throne, and what traits make for a good generative model? As well as how many layers that were once static have now become dynamic (creating weights on the fly)
8
19
229
14,641
once again, artem kirsanov is a fantastic teacher
4
9
221
14,865
Made a post covering Diffusion Transformers, do they deserve the hype? What makes them good? Are non-transformer methods worthless now? Also me ranting about how awful patch embedding methods are and some experiments showing some really annoying weaknesses of them! link below
12
32
222
41,132
Little stake here as saying humans are thinking in an autoregressive or diffusion manner is overly reductive. But a fun fact is that human associative memory does share some similarities with Hopfield networks, which somewhat resemble a denoising process
> The way humans think look a lot more like diffusion than autoregressive. i will never, ever understand this claim or the intuitions behind it. ah yes. the human mind is... learning a scoring function to... reverse gaussian noise... (?) ... spatially (???)
8
13
225
19,388
They trained a CLIP competitive model on 1/400 of the data It’s so over arxiv.org/abs/2208.13628
They trained on only 6GB of text, distilled from the Pile and got a BERT model up to the quality of T5 with 745x less data 💀 It's so over arxiv.org/pdf/2304.08442
6
23
220
22,035
how it feels to be working in diffusion and watching stuff happen in LLM world
3
6
203
9,221
This could be a real dude who’s not that funny
1
185
5,528
probably the most in-depth explanation I've seen of FSDP at the most granular levels, props to the authors dev-discuss.pytorch.org/t/fs…
21
213
10,102
UnCLIP showed us the power of decoding from representation spaces, and really the power of hierarchical modeling across different levels of abstraction. But really this approach (imho wrongfully) kinda fell out of style
8
23
207
35,570
Ok this is interesting, because this paper basically says the opposite. So long as 2% of your data is real, collapse is substantially mitigated. thoughts? feelings?
🚨 Fascinating insights from the paper “Strong Model Collapse”! (arxiv.org/abs/2410.04840) It concludes that even the smallest fraction of synthetic data (as little as 1% of the total dataset) can lead to model collapse. 🧠 Larger datasets don’t necessarily improve performance!
10
15
195
30,349
This reminds me a bit of non-denoising score matching. The loss function incentivizes:
- gradient/score of data points pushed to zero
- trace of fisher matrix to represent curvature in p(x) around data points
This resembles the 1st/2nd derivative test in calculus for detecting minima/maxima. In other words, we encourage data points to be maxima of p(x) and everything around data points to be lower in p(x).
Diffusion models also learn to push off-manifold data towards regions of higher probability, but do this with respect to the noisy distribution at a given timestep, varying in levels of noise across timesteps, as opposed to this method, which does so with respect to the unnoised static data distribution.
Generative EBMs and diffusion models are close relatives in this regard, though successfully learning p(x) and being able to sample cleanly without getting “caught” has historically been difficult for EBMs. Exciting to see how it’s moving.
Replying to @du_yilun
In EqM, we supervise the gradient of the energy landscape so that points on the data manifold are local minima of the energy function. We supervise points far away from the data manifold to point to the data manifold, and points on the data manifold to have zero gradient.
3
15
203
21,779
smallest attention layer code golf
13
6
196
13,764
I love that we're finally doing entropy-based sampling. This paper advocated that human speech alternates between high entropy and low entropy words to convey information, and do it smoothly. but the approach never took off, its about time its revisited. arxiv.org/abs/2202.00666
the implementation in the last push is stable enough even with fixed thresholds (surprisingly) to make a few observations about the sampler capabilities beyond CoT or reasoning:
1) Prompt Optimizer: By evaluating the attention entropy after the prefill stage, you can optimize your prompt towards a low entropy state. This also extends to ICL and RAG. There is now a measurable target to see how well the model understands the input, per head, per layer, beyond just evaluating the output tokens
2) Steering Vector training loop: Instead of measuring the general direction of similar prompts and injecting steering vectors to drive further in that direction, you can now train the control vectors with an objective of lowering entropy in the attention heads as well as directly driving activations towards lower entropy state
3) Significant improvement of narrative consistency issues due to entropy collapse for long generation lengths: We introduce just enough entropy to prevent collapse while still driving towards consistent (low varentropy) outputs. We could substantially improve this by introducing and combining logits from a base model as well, but even the sampler only version has had surprising improvements in capabilities, especially for the small model we are mostly testing on
4) Reduce Hallucinations: Instead of randomly sampling over a distribution of low probability logits, we can now drive the model to either dig deeper via branching until it is confident or to eventually the only thing it will be sure of is an eot token, which is ideal.
There are many more interesting things i have observed up to this point, so many in fact that i am almost ready to say that even if we do not make any more progress on reasoning (we are and we will), just the improvements outlined above would lead this new technique to be, in my opinion, a wild success.
In the before times, i would run a set of evals on every github push to measure improvements / regressions in order to guide next steps, so it looks like we might be at the point in this project where doing that might be the most important next thing to work on. I need to measure entropy and consistency over a large number and wide variety of prompts and I need to run actual evals against argmax and nucleus sampling baselines over a variety of evals and tasks. I am not prepared to say we have achieved anything at this time, but everything i have seen up to this point leads me to believe that this could be something worth spending some real dedicated time on and that has a good chance of leading to improved model capabilities in the long run
8
13
193
19,473
i refuse to believe this is the cost of agi, there are way too many other paths still untouched. this is the same mentality that drove pure RL into the grave and now we don't even think twice on it.
Garry Tan says Sam Altman has told him he wants to do a $1 trillion training run for an AI model, which could lead to understanding the nature of physics and solving engineering problems like nuclear fusion
19
7
183
89,368
I just released the code for a couple different maze solving methods + small post here ethansmith2000.com/post/diff… the 2d diffusion method learns quite quickly, I think the other methods struggled around learning their position.
trained a diffusion model to solve mazes
2
19
191
15,234
we were given universal function approximators, it is our god-given right to fit the models to anything and everything
1
12
186
5,516
Oh wow, FID is fragile...
10
18
187
33,452
say this out loud then take it to heart
5
14
175
40,125
Much more than anything else, I am blown away by img2img capabilities #stabledifussion #ai #aiart #aiartcommunity
9
35
169
They trained an LLM on manim code, 3blue would be proud
Introducing Gatekeep, the text-to-video AI that transforms your questions into engaging educational explainer videos. Now available on web and Discord. Get started at gatekeep.ai
2
8
175
15,252
This actually is really similar in spirit to one of the last significant transformer improvements: GEGLU. Going from Q * K -> Q * K1 * K2 is very similar in spirit to GEGLU’s gate
This will probs be one of the most influential papers of the year
3
14
172
13,506
These don’t hit the same since openai has taken on a slight grifter aura
The Seal of Solomon binds them and the Lithograph of Jensen releases them
8
2
163
13,157
yo howd they get Kingma on the track
4
7
173
10,470
Here's DipoleAttention, a novel attention method designed to extract different signals from positive and negative correlations, rather than a single spectrum. Give it a try, let's see if we can replicate the results. ethansmith2000.com/post/dipo… 🧵/N
this whole subset of attention variants reliably reaches a substantially reduced loss compared to baseline attention. What other tests can i run?
7
19
163
28,008
I don’t understand the fascination with colonizing Mars, at least now. It’s gotta be vastly easier to make earth more habitable than to somehow fix all of what currently makes life impossible on Mars. Instead of terraforming Mars, why not the 90% of Australia that’s not lived in?
98
6
169
23,300
OP is correct that SD VAE deviates from typical behavior. but there are several things wrong with their line of reasoning and the really unnecessary sounding of alarms. I did some investigations in this thread to show you can rest assured, its really not a big deal.
A Reddit user found a problem with the VAE of SD 1,2 (and Dall-E 3?). A spot where the VAE is trying to smuggle global information about the image through latent space (demonstrated there) something KL-divergence loss was supposed to prevent. teddit.net/r/StableDiffusion…
3
15
175
53,150
2010, a foreshadowing moment for diffusion models.
8
16
166
13,775
heres a cool paper to go along with this post
2
4
162
8,221
I think this already feels wrong.
14
166
15,340
> Autoregressive image generation without vector quantization > look inside > it's diffusion
7
5
159
27,250
Same deal for diffusion! Two DMs trained on different subsets are more similar to each other than reality. The universal approximators with the same inductive biases trained on similar distributions discovered near identical functions! Who would have thought? Now Knowing this how can we avoid spending 100M+ training nearly the same model from scratch?
All LLMs are converging towards the same point 🤔
6
17
166
19,013
Ignore the first part i've since learned what speculative decoding is. But multi-token prediction was massively under appreciated, you're getting more tasks/losses from near identical forward pass cost. Thats how you increase sample efficiency. Thank you deepseekV3
it was only briefly touched upon, but is it correct that multi-token prediction is only valid in the case of greedy decoding? also IMHO it seems like it'd make more sense and parameter efficient to have 1 shared head but then 1-2 branching transformer layers for each token.
7
11
166
18,198
random access memories
11
18
159
7,780
This is a cool idea, but you won't have a good time past the MNIST toy example. No backprop means needing... 128 forward passes, for grad estimate with only 0.009 cos similarity with true grad. increasing this to 2048 forwards only gives about 0.035. That's just the start
wrote a paper: it lets you *train* in 1.58b! could use 97% less energy, 90% less weight memory. leads to a new model format which can store a 175B model in ~20mb. also, no backprop!
6
5
162
34,987
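To get a feel for why backprop-free gradient estimates are so noisy, here is a generic random-direction estimator on a synthetic gradient. This is not the quoted paper's method and will not reproduce the numbers above. The dimension, probe counts, and the name `random_direction_cosine` are assumptions chosen so the sketch runs quickly.

```python
import numpy as np

def random_direction_cosine(dim, num_probes, seed=0):
    """Cosine similarity between a true gradient and an estimate built
    from `num_probes` random directional derivatives."""
    rng = np.random.default_rng(seed)
    true_grad = rng.normal(size=dim)              # stand-in for the real gradient
    probes = rng.normal(size=(num_probes, dim))   # random probe directions
    directional = probes @ true_grad              # one scalar per probe
    estimate = directional @ probes / num_probes  # average of (g . v) * v
    return float(estimate @ true_grad
                 / (np.linalg.norm(estimate) * np.linalg.norm(true_grad)))

# Each probe only reveals one scalar about a dim-sized vector, so with
# far fewer probes than parameters the estimate stays nearly orthogonal.
for k in (8, 128):
    print(k, random_direction_cosine(10_000, k))
```

In this toy setup the similarity grows slowly with the number of probes, roughly with its square root, which is consistent with 16x more forward passes buying only a modest improvement.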
I gave a presentation on diffusion models today
6
4
155
12,011