I sometimes do AI research. I also play piano and climb. NYC. Previously @GoogleDeepMind, @Google Brain. Opinions my own

New York, NY
Making LLMs run efficiently can feel scary, but scaling isn’t magic, it’s math! We wanted to demystify the “systems view” of LLMs and wrote a little textbook called “How To Scale Your Model” which we’re releasing today. 1/n
25
392
1,917
465,856
Today we're putting out an update to the JAX TPU book, this time on GPUs. How do GPUs work, especially compared to TPUs? How are they networked? And how does this affect LLM training? 1/n
38
518
3,435
404,555
Super super happy to be able to talk about DIDACT, the first code LLM trained to model real software developers editing code, fixing builds, and doing code review end-to-end. Developers don't write code in one go and neither should our models! 1/n
20
200
1,102
305,537
Glad to see my ex-colleagues at x.ai have been hard at work making "TruthGPT" unbiased
Painful to see: the kind of brute alignment that can fry latent space. Even DeepSeek's CCP-friendly approach is relatively mild by comparison, mostly deflating sensitive questions.
24
25
946
97,010
We've finally put out a detailed IEEE/ACM paper on @Google's multi-year effort to ease the burden of code review with ML. Google engineers now resolve 7.5% of all code review comments with an ML-suggested edit. But the path to that number has been a fun ML and UX journey!
14
134
740
135,667
Excited to see a blog post on one of the coolest projects I've worked on at Google: using LLMs to automatically resolve code-review comments for Google engineers! 1/n
7
70
508
108,942
This is something I've worked on for a while! You can save the activations of one LLM call and reuse them for a follow-up that overlaps with the first. This means asking a question about a big codebase can take 30 seconds the first time and 1s after that!
Gemini’s context caching is one of the most exciting releases that came out of Google I/O. ai.google.dev/gemini-api/doc…
14
47
435
104,434
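A minimal sketch of the reuse idea described in the tweet above: work done for one request is stored and only the new suffix of a follow-up request is processed. Everything here is hypothetical (the `PrefixCache` class, the `_compute` placeholder); it is a toy stand-in, not the Gemini context caching API, and real per-token activations depend on the preceding context, which this placeholder ignores.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Toy stand-in for activation caching, keyed by the token prefix."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], List[int]] = {}
        self.tokens_computed = 0  # tokens we actually had to process

    def _compute(self, token: int) -> int:
        # Placeholder for the expensive per-token forward pass.
        self.tokens_computed += 1
        return token * 2

    def run(self, tokens: List[int]) -> List[int]:
        # Find the longest previously seen request that is a prefix of this one.
        best: Tuple[int, ...] = ()
        for n in range(len(tokens), 0, -1):
            candidate = tuple(tokens[:n])
            if candidate in self._store:
                best = candidate
                break
        activations = list(self._store.get(best, []))
        # Only the uncached suffix is processed.
        for token in tokens[len(best):]:
            activations.append(self._compute(token))
        self._store[tuple(tokens)] = activations
        return activations


cache = PrefixCache()
document = list(range(1000))        # e.g. a large codebase, tokenized
cache.run(document)                 # first call pays for all 1000 tokens
cache.run(document + [7, 8])        # follow-up pays only for the 2 new tokens
```

The first call processes every token; the follow-up reuses the stored result for the shared document and processes only the appended question.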
the hardest thing about being an AI researcher is having to smell homeless people every morning while munching a tartine croissant outside your $4k house on the way to work
23
2
235
108,892
Our new paper! We study how well large language models (244M-137B parameters) can write code, collaborate with humans via dialog (exciting!) and understand/execute the code they write (they don't/can't). TLDR: exciting tech with lots of limitations and room for future work.
5
35
194
We found that code models get better when you prompt them with "I'm an expert Python programmer". The new Anthropic paper did something similar, prefixing the model's response with "I’ve tested this function myself so I know that it’s correct:"
5
30
206
Replying to @jxmnop
Every Google model in recent memory has had a 256k vocab size
2
2
175
32,597
Happy to share our work on discrete denoising diffusion models (D3PMs) @NeurIPSConf 2021: arxiv.org/pdf/2107.03006.pdf. D3PMs are diffusion models for discrete data like text or (quantized) images, and they’re flexible! A thread (with code!) 1/n
3
27
169
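As a rough illustration of what "diffusion for discrete data" can look like, here is a sketch of an absorbing-state forward (corruption) process, where tokens are progressively replaced by a mask token. The `MASK` id, the function names, and the schedule `beta_t = 1 / (T - t + 1)` are assumptions chosen for this example (the schedule is picked so that the final step masks everything); this is not the code from the thread, and the learned reverse model is omitted.

```python
import random
from typing import List

MASK = -1  # hypothetical id for the absorbing [MASK] token


def forward_step(tokens: List[int], beta: float, rng: random.Random) -> List[int]:
    """One corruption step: each unmasked token becomes MASK with probability beta."""
    return [MASK if tok != MASK and rng.random() < beta else tok for tok in tokens]


def corrupt(tokens: List[int], num_steps: int, seed: int = 0) -> List[List[int]]:
    """Run the full forward chain and return every intermediate sequence."""
    rng = random.Random(seed)
    trajectory = [list(tokens)]
    for t in range(1, num_steps + 1):
        beta = 1.0 / (num_steps - t + 1)  # reaches 1.0 at the final step
        trajectory.append(forward_step(trajectory[-1], beta, rng))
    return trajectory


for step, seq in enumerate(corrupt([5, 3, 9, 2, 7, 1], num_steps=6)):
    print(step, seq)
```

Once a token is masked it stays masked, which is what makes the mask state "absorbing"; a denoising model would be trained to undo these steps.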
This may be the most magical new developer tool we've made at Google. Nothing since code completion has felt so seamless to use: devs paste code constantly, and Smart Paste instantly fixes all the little issues: syntax errors, misnamed variables, indentation, and more 1/2
Code development often involves frequent copy & pasting of code that must be adjusted for the surrounding context. Here we describe Smart Paste, an internal tool that streamlines the code authoring workflow by automating adjustments to pasted code. More at goo.gle/4elzb3S
5
17
145
20,203
Read about our recent work on ML-powered code completion models trained on the @Google codebase. A small but specialized LM trained on extremely high-quality data and backed by static analysis beats much larger models in production.
Learn more about how code completion is transforming the developer experience of internal @Google engineers! 👩‍💻 We measured an acceptance rate of 25-34% on >3% of production code, while reducing the coding iteration time by 6% (equating to hundreds of years of SWE hours saved).
2
12
134
You can read the new chapter here: jax-ml.github.io/scaling-boo… n/n
2
21
133
11,138
The secret is to think in terms of basic system resources — compute, memory, and bandwidth — and calculate which one limits our performance. From this we can estimate the cost, runtime, and optimal parallelism strategy for any given LLM: jax-ml.github.io/scaling-boo… 2/n
2
8
119
11,939
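A minimal sketch of the resource-based reasoning in the tweet above, applied to a single matrix multiplication: estimate the time spent on compute and on memory traffic, and see which one is larger. The hardware constants and all names are illustrative assumptions, not the specs of any real chip and not code from the book.

```python
from dataclasses import dataclass

# Illustrative placeholder numbers, NOT the specs of any real accelerator.
PEAK_FLOPS = 1.0e15       # floating point operations per second
HBM_BANDWIDTH = 1.0e12    # bytes per second


@dataclass
class MatmulEstimate:
    compute_seconds: float
    memory_seconds: float

    @property
    def bound(self) -> str:
        return "compute" if self.compute_seconds >= self.memory_seconds else "memory"

    @property
    def seconds(self) -> float:
        # Lower bound, assuming compute and memory traffic overlap perfectly.
        return max(self.compute_seconds, self.memory_seconds)


def estimate_matmul(b: int, d: int, f: int, bytes_per_elem: int = 2) -> MatmulEstimate:
    """Estimate X[b, d] @ W[d, f] -> Y[b, f]."""
    flops = 2 * b * d * f  # one multiply and one add per (b, d, f) triple
    bytes_moved = bytes_per_elem * (b * d + d * f + b * f)  # read X and W, write Y
    return MatmulEstimate(flops / PEAK_FLOPS, bytes_moved / HBM_BANDWIDTH)


print(estimate_matmul(1, 4096, 4096).bound)
print(estimate_matmul(4096, 4096, 4096).bound)
```

With these placeholder numbers, a batch of one spends far longer loading the weights than multiplying them, while a large batch amortizes the weight load over many more FLOPs.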
GPT-4 makes big gains on coding (e.g. 48% -> 67% on HumanEval) but it's still a long way from 100% pass@1, not to mention writing a 1000-line program from scratch. GPT-4 shows that scale won't solve everything. Models need to write and debug code iteratively, like humans do
14
10
103
26,248
Gemini 1.5 Pro is widely available now. Long context is great but it's also just a great model, better than GPT-4 on most of our metrics. And it's free!
We're starting to roll out API support for Gemini 1.5 Pro for developers. We're excited to see what you build with the 1M token context window! We'll be onboarding people to the API slowly at first, and then we'll ramp it up. In the meantime, developers can try out Gemini 1.5 Pro in the AI Studio UI right now: aistudio.google.com
7
3
102
18,634
100%, Rishabh has written some of my favorite papers in the RL universe, and done so in a period where publishing in industry was challenging!
Rishabh is an amazing researcher. His algorithms underpin post training at Gemini. I got to work together at meta for a short while and was truly impressed. Whichever group got Rishabh is so lucky to have him!
1
3
92
15,567
I won’t be at ICLR this year, but it’s the 200th anniversary of the premiere of Beethoven’s 9th in Vienna and you should go! The Wiener Philharmonic and many other symphonies have concerts! wienerphilharmoniker.at/en/k…
7
6
90
71,307
At the chip level, GPUs and TPUs look a lot alike; they both have a SIMD vector unit, a matmul accelerator, and a similar memory hierarchy. But where a TPU has 1-2 big cores per chip, GPUs have hundreds of little ones. This makes them more flexible but also more costly! 2/n
1
4
85
14,199
The Blueshift team has done awesome work pushing Hendrycks' MATH above 90%. MATH isn't the hardest dataset in the world but it's surprisingly tricky: some problems take me 5-10 minutes to solve. Getting an LLM to solve more than 90% feels meaningful. Try one yourself!
I'm excited about this! Our team has been working really hard to improve Gemini 1.5 capabilities significantly on multiple fronts and in particular MATH/STEM! Please see the report here: goo.gle/GeminiV1-5
1
7
72
11,398
Very proud to launch coding for Bard! The model is actually pretty good, try it out!
New capabilities in Bard will help programmers and software developers with code generation, debugging and code explanation. It’s an exciting next step in how generative AI can accelerate innovation across industries. blog.google/technology/ai/co…
3
9
69
18,880
Replying to @RichardMCNgo
I find many of these questions exhausting. I don't want to psychoanalyze what about me surprises people to a stranger at 3AM after a few beers. Ask me 1:1 when it's appropriate.
1
1
64
This is still a draft, so please leave comments or questions if you have any. HT to my coauthors @reinerpope, @apaszke, and Swapnil Patil and so many GPU experts who helped me understand GPUs better 6/n
2
1
67
10,309
From the LLM standpoint, the same parallelism strategies work (FSDP, TP, PP, EP) but have different phase transitions. GPU collective cost changes dramatically beyond the node level (~8 GPUs), and pipelining becomes important much sooner 4/n
1
4
68
11,707
One thing I'm proud of is how Google's gen media team has prioritized building tools for artists rather than text-to-X tools. GenAI can either replace or augment people, let's do the latter!
We put our cutting-edge video generation model Veo in the hands of filmmaker @DonaldGlover and his creative studio, Gilga. Let’s take a look. ↓ #GoogleIO
1
6
66
11,892
Like before, we have lots of good practice problems. How many CUDA cores does a B200 have? At what point is a matmul compute bound? How long should an AllGather take? What is the optimal sharding for LLaMA-3 or DeepSeek v3? 5/n
1
3
67
10,271
The networking story is similar. GPUs aim for flexibility, using a hierarchy of switches to send data from any GPU to any other in only a few hops. This is great as a user but requires lots of expensive switches to scale up 3/n
2
4
68
11,383
A big chunk of this book is dedicated to understanding the hardware that provides those system resources. We emphasize TPUs in this book, but the principles and math can be adapted to GPUs too. Part 2 explains the TPU in detail: jax-ml.github.io/scaling-boo… 3/n
1
4
66
9,165
FWIW I think this is how you make long-context economical. Long queries aren't all unique, they typically share the same source documents. Low latency, low cost full repo completion can reuse the same KV caches
1
1
57
3,422
We want this to be a living book, so please ask questions and give us feedback. We'll continue adding to it as time goes on. Without further ado, here’s a link to the beginning: jax-ml.github.io/scaling-boo… 11/11
1
1
54
4,621
5 years ago, there were many ML architectures, but today, there is (mostly) only one. _You should know the Transformer inside and out!_ How many FLOPs or params in LLaMA-3? How expensive is attention vs. a feed-forward block? You'll know after reading jax-ml.github.io/scaling-boo… 5/n
1
2
55
5,976
Now for the good stuff! You may have heard of data or tensor parallelism, FSDP or pipelining. But why choose one over the other? Short answer: each adds communication, and the one with the lowest cost depends on the model. Part 5 dives into this: jax-ml.github.io/scaling-boo… 6/n
1
3
57
6,693
Some awesome stuff here about LLM scaling (esp. on GPUs). Their LLAMA sharding/memory diagram is great. Glad to see it becoming easier to understand scaling in the open
The Ultra-Scale Playbook: Training LLMs on GPU Clusters Learn how to train your own DeepSeek-V3 model using 5D parallelism, ZeRO, fast kernels, compute/comm overlap and bottlenecks with theory, interactive plots and 4000+ scaling experiments and audio! huggingface.co/spaces/nanotr…
2
3
53
4,920
Scaling an LLM involves distributing — a.k.a. "sharding" — its weights across multiple TPUs. To run it, we have to add cross-chip communication. Part 3 describes the TPU's communication primitives, and simple rules for multiplying sharded matrices: jax-ml.github.io/scaling-boo… 4/n
1
1
55
6,650
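One way to picture "multiplying sharded matrices", as a sketch in plain Python with lists standing in for device memory: split the contracting dimension across hypothetical devices, let each compute a partial product, then sum the partials (the step a real system would do with cross-chip communication such as an all-reduce). The function names are made up for this example and it covers only this one sharding pattern.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matmul: A[m, k] @ B[k, n]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def shard_columns(a: Matrix, n: int) -> List[Matrix]:
    """Split A[m, k] into n blocks along k (its columns)."""
    step = len(a[0]) // n
    return [[row[i * step:(i + 1) * step] for row in a] for i in range(n)]


def shard_rows(b: Matrix, n: int) -> List[Matrix]:
    """Split B[k, n] into n blocks along k (its rows)."""
    step = len(b) // n
    return [b[i * step:(i + 1) * step] for i in range(n)]


def sharded_matmul(a: Matrix, b: Matrix, num_devices: int) -> Matrix:
    assert len(a[0]) % num_devices == 0, "contracting dim must divide evenly"
    # Each "device" holds one shard of A and one of B and computes locally.
    partials = [
        matmul(a_shard, b_shard)
        for a_shard, b_shard in zip(shard_columns(a, num_devices), shard_rows(b, num_devices))
    ]
    # Summing the partial results stands in for the cross-chip all-reduce.
    out = [[0.0] * len(b[0]) for _ in a]
    for partial in partials:
        for i, row in enumerate(partial):
            for j, value in enumerate(row):
                out[i][j] += value
    return out


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 0], [0, 1]]
assert sharded_matmul(a, b, num_devices=2) == matmul(a, b)
```

No device ever sees the full matrices, but the summed result matches the unsharded product; the cost of that final sum is the communication the tweet refers to.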
This book was co-written with @_sholtodouglas, @charliexychen, @pchoy95, @albertwebson, @vinayramasesh, @froystig, @anselmlevskaya, @sharadvikram, and Fede Lebron, building on prior ideas by @reinerp and @jekbradbury. 10/n
1
51
6,839
Please note that the doctors’ responses come from…Reddit
New study! We compared ChatGPT responses to people's medical questions with those of doctors. Healthcare professionals preferred ChatGPT 79% of the time, rating its responses as more empathetic and higher quality. I'm excited to figure out how to use LLMs to help doctors! jamanetwork.com/journals/jam…
4
1
51
10,045
Most LLM evals are leaked. A decent heuristic is to ignore reported numbers on evals over a year old
1
3
44
10,225
I just stumbled across this awesome book, which covers a lot of the nitty gritty details of GPU hardware, SLURM, cloud providers, and LLM training/serving. Probably the most practical guide to the infrastructure of LLM scaling I've seen
Got a chance to measure Maximum Achievable Matmul TFLOPS on NVIDIA B200. With each new NVIDIA generation the efficiency keeps on dropping: A100: 86.9%, H100: 80.3%, B200: 77.6%. The updated table is here: github.com/stas00/ml-enginee…
2
1
47
4,085
PaLM 2 is really good. Like surprisingly good. And it’s exciting to see it rolling out across a wide array of Google products
*cracks knuckles* and thus, we begin the "🌴PaLM v2" drinking game (but with coffee, tea, or your favorite caffeinated beverage of choice, as it's early! 😉) #GoogleIO2023 #GoogleIO
2
46
8,286
Codex-style LLMs are trained on static code snapshots (GitHub files at HEAD) without history or context from the developer's environment (like their IDE or build system). We're throwing away all the data of how the software was built, and why! 2/n
1
2
44
6,355
UL2 is a new training objective with big implications for LLM training. UL2 combines the span corruption objective that gives T5 its exceptional finetuning ability with causal and prefix-LM objectives which let UL2-trained LLMs outperform purely-causal LMs on few-shot tasks
Introducing UL2, a novel language pre-training paradigm that improves performance of language models across datasets and setups by using a mixture of training objectives, each with different configurations. Read more and grab model checkpoints at goo.gle/3euHrEo

ALT An overview of the objectives mixed together in the UL2 framework.

1
10
44
There's so much hype around "LLMs as agents" and when building LLMs for software, i think that's exactly the right approach. Our LLMs can build software like humans, iteratively and using developer tools, and be immediately useful for real developers! 5/n
1
1
41
5,069
Google developers work in a monorepo and build errors, test failures, code review comments, and resulting edits are all tracked. DIDACT models are trained on this data to build software iteratively *based on the history of a dev's work so far!* 3/n
1
2
39
5,061
Replying to @EigenGender
This is absolutely not true. They could test the explosive design, the subcritical assembly, the gun design. They could detonate the explosives and watch fast X-ray data. And then they had the trinity test
1
1
38
2,561
DIDACT powers a ton of cool dev tools, like our recently announced ML-powered code review tool and a bunch of others, like a tool to fix build errors, predict code review comments, and do GitHub Copilot-style completion conditioned on _your_ development history! 4/n
1
1
41
4,831
Replying to @andrew_n_carr
CUDA and the collective decades spent installing drivers
1
35
2,849
Code LLMs are everywhere, but making them useful to real developers is hard. We trained an LLM on data from _real_ Google developers: fixing builds, performing code review, and editing files, then deployed it within the code-review UI! 2/n
1
5
37
6,807
Replying to @denny_zhou
If true, this highlights one of the complexities of the half-open OpenAI/GPT-3 ecosystem. I'm a fan of the API, but it's v hard to know what DaVinci-002 is, whether it had a given eval set in its training data, etc.
2
2
37
Penzai is one of the coolest ML libraries out there. Not only can you inspect every weight matrix and attention head in a Colab, you can trivially knock out heads, skip or repeat layers, or extract intermediates with a one line change. A beautiful tool for interpretability.
Excited to share Penzai, a JAX research toolkit from @GoogleDeepMind for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere. Check it out on GitHub: github.com/google-deepmind/p…
6
38
6,582
Now that we’ve talked about training, we need to talk about serving. How expensive should a model be to serve? What kind of latency can we expect? What are prefill and generation? How do we build an efficient inference service? We talk about this here: jax-ml.github.io/scaling-boo… 7/n
1
38
4,493
Hiking in the shadow of the eastern Sierras, it feels like another world. What a high.
3
33
LLM systems programming is super fun! It's hard to do good ML research without it these days, and you don't need much compute to work on it. I hope this book will make it easier for more people (esp. academics) to work on this stuff 9/n
1
35
3,903
More work from Google on AI for SWE, here automatically fixing build errors! The cool thing about fixing builds is you can check if the build succeeds before showing the user the fix. Results in a measurable shortening of code submission time too!
Excited to share a new blog on ML-based repair for build errors at Google! We found that automatically repairing build errors in the IDE increases productivity as measured by overall task completion with no detectable negative impact on code safety!
6
35
5,892
The rest of the book is a set of practical guides: how to write and profile parallel JAX code, and how to apply the previous two sections to real models like LLaMA-3. We also have worked problems at the end of each section if you like homework: jax-ml.github.io/scaling-boo… 8/n
1
34
4,133
I genuinely love this work, a dedicated team spent years building the first human-level table tennis bot and wrote a thoughtful and deeply principled paper about both its strengths and weaknesses. Good research!
I have been dreaming of this moment for a very long time. Our robot got good enough to play games with humans and win, whilst also being fun to play with. I am so so happy this is out & very grateful I got to work on this with so many wonderful & talented people. 👇 for details
1
3
33
3,718
A hot take is that LLMs are bad at writing because many of the people writing SFT data are bad at writing. Tech has never cared about writing skills...
a controversial opinion i hold deeply is that AI is not superhuman at writing (and isn't close). there are 10x and 100x human writers. here's a random excerpt from David Foster Wallace, widely agreed to be one of the greatest modern writers. if you sincerely think anything like this could be written by DeepSeek or Claude, you need to read more
3
30
4,960
Google is in the game! A lot of hard work is going into building an exciting, helpful, and responsible new generation of LLM-based tools at Google
1/ In 2021, we shared next-gen language + conversation capabilities powered by our Language Model for Dialogue Applications (LaMDA). Coming soon: Bard, a new experimental conversational #GoogleAI service powered by LaMDA. blog.google/technology/ai/ba…
31
7,244
A couple lessons from this:
* IDE wars are coming. Collecting data in the same dev environment you deploy in is a huge advantage.
* LLMs make great demos but it's hard to trust them at complex tasks. Reviewing code is harder than writing it. High-precision, low-recall is OK!
1
2
28
2,278
Happy to share our work on multilingual evals for code LLMs, led by @GOrlanski. We open-source BabelCode, a framework for running execution-based coding evals across >10 languages (including Rust and Julia) and study the effect of language balancing on low-resource languages 1/2
📢Measuring The Impact Of Programming Language Distribution We present the BabelCode framework for multi-lingual code evaluation and an investigation into the impact of PL distributions in training data. Paper: arxiv.org/abs/2302.01973 Code: github.com/google-research/b… 🧵
1
6
30
7,397
Sholto wrote the first version of this book and most of its big ideas are his (or come from @reinerpope or @jekbradbury). Working on new tech like this is fun because you see ideas evolve from crazy research topics to everyday engineering principles, in just a few years
A distillation of our mental models that we use to think about the systems perspective on training and inference at scale. The most important takeaway - you should be able to describe everything about your model with simple equations, and deeply understand how long it should take.
1
27
2,408
Sad to see the real evil of tech's "my hour of reading about this qualifies me to make decisions" mindset. It's a fun exercise until you go off and smugly destroy the arts & sciences
27
3,076
Replying to @cHHillee
Wow, according to docs.nvidia.com/dgx-superpod… it goes up to `18 * 400 * 4 / 8 = 3.6TB/s`.
1
1
28
7,497
A huge amount of credit goes to the UX team for helping us make model edits understandable, so developers can audit the code that's being changed. Model calibration also becomes surprisingly important: building developer trust by only showing highly confident predictions
1
1
25
5,277
i found Oppenheimer, like most of Christopher Nolan’s movies, lacking in emotional resonance. Nolan seems to make films about concepts that interest him (time, space, a biography he just read), without worrying about their relevance to the present moment
6
24
10,901
Replying to @_jasonwei
Cost is an important drawback: generalist models will always be outperformed by smaller task-specific models when cost and latency are factored in, except for tasks only the largest models can do. With that said, distillation is likely to play a role
1
1
23
4,120
2290 tons of CO2 is a lot, but it's also roughly...38 flights from NYC to London on a 737. More CO2 was probably emitted by Meta employees flying back and forth during model development
So LLaMa 3's carbon footprint is... huge? 🤯 They estimate it to be 2,290 tons of CO2eq, compared to 550t for training GPT-3 and 66t for training *all* of the BLOOM models (1B-176B) 🌬️
1
23
3,978
Please consider joining the Blueshift Team! They're wonderful people doing amazing work on reasoning, AI for science, and more
Interested in Reasoning with Large Language Models? We are hiring! Internship: forms.gle/fZzFhsy5yVH6R97m9 Full-Time Research Scientist: forms.gle/9NB5LaCHjQgXR1wb9 Full-Time Research Engineer: forms.gle/rCRnh5Q1nWmoAKcU7 Learn more about Blueshift Team: research.google/teams/bluesh…
1
23
Returning from NeurIPS, I flew an hour the wrong way to Fort Worth, and then missed my flight to NYC. Now I get to experience the cozy embrace of this hard airport floor
5
20
the people I trust most are loudly and persistently expressing doubt about their beliefs and actions
1
1
22
3,678
Replying to @amasad
The next generation of code LLMs will exhaust the code available at GitHub HEAD. The amount of diff data is several orders of magnitude larger
20
To shill my favorite: arxiv.org/abs/2108.13264 is awesome! Rishabh and folks reran a bunch of RL algorithms from papers and found that many cherry-picked the reported performance for their papers. Beautifully simple result, well executed.
3
4
21
5,767
Gemini is here and it’s actually pretty decent!
The Gemini era is here. Thrilled to launch Gemini 1.0, our most capable & general AI model. Built to be natively multimodal, it can understand many types of info. Efficient & flexible, it comes in 3 sizes each best-in-class & optimized for different uses blog.google/technology/ai/go…
20
2,139
Replying to @polynoamial
Inference cost feels like a bit of a scary x-axis, because it's so dependent on engineering time spent optimizing serving. Absolute "intelligence" feels more important, in the sense of actually being able to solve meaningful problems regardless of the cost
1
1
20
2,263
You can find the paper here: research.google/pubs/resolvi…. I think it's an awesome case study in applied LLM deployment. Huge shoutout to Peter Choy, Alex Frömmgen, @lerakharatyan, @gssurita, Kevin Villela, @dtarlow2, Maxim Tabachnyk, really too many people to list!
2
1
20
2,628
Enjoyed this post from my friend Sarah about the current bottlenecks in LLM research velocity. While it's fun to think about a singularity involving LLM AI researchers, LLM research is bottlenecked by expensive experiments and insidious bugs, not more intelligence.
1/ Some pundits are predicting that the AI bubble will burst. I doubt it. But more ideas or compute won't unlock an "intelligence explosion." The biggest bottleneck AI research faces is the pace and quality of experimentation.
2
2
21
2,909
Replying to @xeophon
Yes, we have a DSL that decomposes the process of writing a PR into actions like "<run build [target]>" or "<make edit [location] [diff]>". The goal is to represent any action a developer could take as a small, local change, instead of making the LLM somehow output a big file
1
16
4,451
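To make the idea of "a PR as a sequence of small actions" concrete, here is a hypothetical sketch. Only the two action templates quoted in the reply above come from the source; the class names, fields, example file path, and build target are invented for illustration and are not the actual DSL.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class RunBuild:
    target: str

    def render(self) -> str:
        return f"<run build {self.target}>"


@dataclass(frozen=True)
class MakeEdit:
    location: str
    diff: str

    def render(self) -> str:
        return f"<make edit {self.location} {self.diff}>"


Action = Union[RunBuild, MakeEdit]


def render_episode(actions: List[Action]) -> str:
    """Serialize a developer session as one action per line."""
    return "\n".join(action.render() for action in actions)


# Hypothetical session: a small local edit followed by a build.
episode = [
    MakeEdit(location="foo.py:12", diff="-x = 1 +x = 2"),
    RunBuild(target="//app:main"),
]
print(render_episode(episode))
```

Each action is a small, local change, so a model predicting the next line of such a transcript never has to emit a whole file at once.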
Most exciting news of the year so far!
Dishoom NYC 2026
18
1,624
To be clear, I don't mean the "scale won't solve everything" line as a criticism of scaling. I just find it implausible that LLMs can solve arbitrary problems without decomposing them or adapting to feedback from an environment
1
18
1,103
Bard is alive. Try it out!
Bard is now available in the US and UK, w/more countries to come. It’s great to see early @GoogleAI work reflected in it—advances in sequence learning, large neural nets, Transformers, responsible AI techniques, dialog systems & more. You can try it at bard.google.com
18
3,613
Speaking from personal experience, the code completion feature in Colab is magical!
Your new coding assistant is almost here! Check out these new Colab features: natural language to code generation, code completion, and an integrated chatbot. Read all about it at blog.google/technology/devel… authored by @thechrisperry and @shresbm
2
18
2,943
this isn’t really true, Noam and Daniel intended from the beginning to “solve loneliness”
1
16
916
Desalination plants can't prevent flooding when sea-levels rise several meters due to Antarctic ice sheets melting. Burying power lines will reduce wildfire frequency at massive cost, but it won't stop them when rising temperatures lead to ever more arid conditions.
8
13
it’s frightening walking around Williamsburg hearing tech grifters talk about their “AI for media” startups. it feels better to work upstream of that, on core tech, but it’s not obvious if my hands are cleaner
2
14
2,740
Another aspect of this work to note: it (partly) solves the "specification" problem of program synthesis: how do we tell the computer what code we want it to write? TLDR: rather than tell a model what to do, let it learn from context what you'll want to do next. A thread 1/n
Very happy to share our work on activating Google's software dev process as an engine for ML-powered dev tools. A multi-year effort from many across Alphabet. Special shout-out to @jacobaustin132 @blip42 @PManzagol @dancherp & Petros Maniatis. See Jacob's🧵& the blog for more.
1
1
14
2,162
Replying to @natfriedman
Is this toolformer? Toolformer seems specifically about using prompting + log-likelihood based filtering to enable tool use. The idea of tool use in this form has been around for years
15
2,841
We first talked about this project in mid-2022 in a @GoogleAI blog post (here's a thread at the time: nitter.app/jacobaustin132/status/…), but this paper talks in much more detail about the model and the design process we went through.
Excited to see a blog post on one of the coolest projects I've worked on at Google: using LLMs to automatically resolve code-review comments for Google engineers! 1/n
1
12
3,755
Replying to @_jasonwei
Character can make money without "getting something right". As you point out, exploiting loneliness/insecurity is lucrative. The fact that character.ai shamelessly monetizes a desire for connection (where OAI/Anthro refused) speaks badly, ironically, of their character
2
13
7,961
Awesome stuff! I continue to be hopeful for discrete diffusion done right. A short thread 1/n
Interested in Discrete Diffusion? I've just released a Github repo where you can learn about and play with discrete diffusion algorithms with simple and performant "nano-style" implementations. (link below) I've started with the Absorbing D3PM from @jacobaustin132 and @_ddjohnson that performs much better than the original with some updated settings. More stuff coming! Star it and follow here for updates.

ALT animated text: [nano] Discrete Diffusion

1
3
14
1,708
Smart Paste highlights the core UX challenge of AI for SWEs. The more context switching is required to verify a suggestion, the less useful it is. Tools like code completion and Smart Paste that make suggestions at the cursor and are instantly verifiable are the easiest to adopt
14
807
Strongly agree, I still find this one of the clearest explanations of dynamical systems and stochastic processes, it's quite a joy to read
1
1
11
3,278
I haven't read the report yet but recurrent depth computation is an alternative to chain-of-thought that I really love. Chain of thought relies on human reasoning traces, while arbitrary depth allows the model to learn latent reasoning via SGD
Ok, so I can finally talk about this! We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale. The model has an internal latent space in which it can adaptively spend more compute to think longer. I think the tech report ...🐦‍⬛
13
1,768
Even Bear has LaTeX support before Google Docs
#update Math formulas, now available in Bear.
3
12
2,505
I loved people like Anthony Bourdain for this reason. You can see him grappling with both the beauty and horror of his life and his art. I wish the AI world had more of this. We cannot know if what we make is good, no matter how well-intentioned we are
13
543
To grad school applicants: the single best advice I got was that you’re generally admitted by a single faculty member who’ll bet on you, not by the department. Pick a few people and target your application to them
12
1,967
Please vote y'all!
There are four days left to vote in the US election. I strongly encourage everyone who is eligible and hasn't already voted to make a plan to go and vote! 🗳️ Obviously there is the presidential election but there are lots of other important races and issues on the ballot around the country. Look at them all and offer your considered opinion by voting!
11
1,615