Spent ~24 hrs on planes w/ a 1.5 year old over break (not advised!) and listened to A LOT of podcasts. Some notes from my favorite ones.
What is an agent?
@erikschluntz + @barry_zyj define an agent as an LLM that autonomously performs actions (e.g., calling tools in a loop) [1], similar to the ReAct architecture [2].
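A rough sketch of the "LLM calling tools in a loop" idea, not anyone's actual implementation. `fake_llm`, `TOOLS`, and the `tool:arg` / `final:answer` string format are all made up for illustration; a real agent would call a model API where `fake_llm` is.

```python
from typing import Callable

# Hypothetical tools the agent may call; names are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
    "upper": lambda arg: arg.upper(),
}


def fake_llm(history: list[str]) -> str:
    """Stand-in for a model call: returns 'tool:arg' or 'final:answer'."""
    observations = [h for h in history if h.startswith("observation:")]
    if not observations:
        # No tool result yet, so ask for one.
        return "add:2+3"
    # A tool result is available, so finish with it.
    return "final:" + observations[-1].removeprefix("observation:")


def run_agent(task: str, llm=fake_llm, max_steps: int = 5) -> str:
    """The model picks the next action each turn until it says 'final'."""
    history = [f"task:{task}"]
    for _ in range(max_steps):
        action, _, arg = llm(history).partition(":")
        if action == "final":
            return arg
        # Run the chosen tool and feed the result back to the model.
        history.append(f"observation:{TOOLS[action](arg)}")
    return "gave up"


print(run_agent("what is 2+3?"))
```

The point is that the control flow lives in the model's outputs: the loop only executes whatever action comes back.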
How well do agents perform?
@erikschluntz + team achieved 49% on SWE-bench Verified with a Claude 3.5 Sonnet ReAct agent [3].
@claybavor gives an overview of Tau-Bench, a customer support eval benchmark, and notes that frontier models with ReAct have poor reliability on it (GPT-4o achieves 61% and 35% on the retail and airline customer support evaluations, respectively) [4, 5].
How to address agent shortcomings today?
@erikschluntz + @barry_zyj define “agentic workflow” as a system where LLMs + tools are orchestrated through predefined code paths (chain, parallelization, orchestrator-worker, etc.) [1].
@claybavor mentions these types of workflows (or “reasoning scaffolding”) perform better than ReAct on Tau-Bench [4]. But @polynoamial argues that reasoning scaffolding may not scale w/ data, and that agents as defined above, built on high-capacity reasoning models (e.g., o-series) + tool use, may ultimately prevail [6].
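For contrast with an agent loop, here is a minimal sketch of a workflow where the code path is predefined. Everything here is hypothetical (`fake_llm`, the `classify:` / `reply:` prompt prefixes, the refund example); it only illustrates that the code, not the model, decides which step runs next.

```python
def fake_llm(prompt: str) -> str:
    """Stand-in for a model call; a real one would hit an LLM API."""
    if prompt.startswith("classify:"):
        return "refund" if "refund" in prompt else "other"
    return f"draft reply for a {prompt.split(':', 1)[1]} request"


def support_workflow(message: str) -> str:
    # Step 1 (fixed): ask the model to classify the message.
    label = fake_llm(f"classify:{message}")
    # Step 2 (fixed): the branch is chosen by code, not by the model.
    if label == "refund":
        return fake_llm("reply:refund")
    return "escalate to human"


print(support_workflow("I want a refund"))
print(support_workflow("hello"))
```

The model still does the fuzzy work (classification, drafting), but it can never wander outside the two branches written here.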
What is happening with pre-training scaling?
@dylan522p highlights what the people who know the most are doing [7]: Anthropic is working on a 400k Trainium chip cluster with Amazon, Zuck has a 2GW datacenter planned in Louisiana [8], Elon will have a 100k H100 cluster come online in the next few months [9].
@DarioAmodei mentioned that we will probably see a $100B cluster by 2027 [10].
@polynoamial says to think about pre-training scaling in terms of economics rather than a hard (e.g., data) wall: we know it’s viable to spend ~hundreds of millions, but at what scale do the returns stop justifying the cost [6]?
What about test time compute (TTC) scaling?
@polynoamial says we are much earlier in this curve [6]. But @dylan522p points out that TTC is less profitable than pre-training: MSFT has reported $10B in inference at ~50-70% gross margin on hosting OpenAI models, but reasoning models (e.g., o-series) use ~10x more tokens to generate answers and allow less batching (~4-5x more servers to handle the same number of users), so the cost may be ~50x higher [7].
@polynoamial points out that some reasoning problems are extremely high value (pay ~millions to solve them).
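The ~50x figure above is just the two multipliers combined. A back-of-envelope sketch, assuming the token and server factors simply multiply (my reading of the numbers, with a hypothetical helper name):

```python
def ttc_cost_multiplier(token_mult: float, server_mult: float) -> float:
    """Combined cost factor, assuming the two effects multiply."""
    return token_mult * server_mult


low = ttc_cost_multiplier(10, 4)   # ~10x tokens, ~4x servers
high = ttc_cost_multiplier(10, 5)  # ~10x tokens, ~5x servers
print(f"~{low:.0f}x to ~{high:.0f}x the cost per answer")
```

So the quoted ~50x is the upper end of a ~40-50x range under that assumption.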
What problems can TTC address?
@polynoamial framed this [11]: it works well on cases where there is a clear “generator-verifier gap” (it is hard to generate a solution, but easy to verify a correct one). Coding and math are obvious examples. SWE-bench Verified went from 49% w/ Claude 3.5 Sonnet [12] to 71% w/ o3 [13].
@swyx mentioned he uses o1 for AI news (writing w/ strong curation / summarization) [14].
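A toy illustration of a generator-verifier gap, not how any reasoning model actually works. It assumes a made-up task (find a nontrivial factorization of `n`) where checking a candidate is one multiplication but guessing one is mostly misses, so spending more samples at test time buys a higher chance of a verified answer. `generate`, `verify`, and `solve_with_samples` are hypothetical names.

```python
import random


def verify(n: int, candidate: tuple[int, int]) -> bool:
    """Cheap check: is this a nontrivial factorization of n?"""
    a, b = candidate
    return a > 1 and b > 1 and a * b == n


def generate(n: int, rng: random.Random) -> tuple[int, int]:
    """Weak generator: guess one factor at random, derive the other."""
    a = rng.randint(2, n - 1)
    return a, n // a


def solve_with_samples(n: int, max_samples: int, seed: int = 0):
    """Sample candidates until the verifier accepts one or the budget runs out."""
    rng = random.Random(seed)
    for attempt in range(1, max_samples + 1):
        candidate = generate(n, rng)
        if verify(n, candidate):
            return candidate, attempt
    return None, max_samples


answer, attempts = solve_with_samples(91, max_samples=10_000)
print(answer, "found after", attempts, "samples")
```

Without a cheap `verify`, extra samples would not help, because there would be no way to tell which one to keep.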
What do these scaling trends mean for NVDA?
@dylan522p argued that, for LLMs today, the software moat is smaller for inference vs training [7]: MSFT can justify deploying models on AMD if it lowers costs b/c they're running relatively few models at scale. But TTC may shift this:
@polynoamial says that scaling inference for TTC was one of his primary concerns about the timeline for AGI [6], but apparently they’ve done a lot of work to resolve this. Jensen said Blackwell plays into the higher inference load for TTC (e.g., 10k+ tokens of thinking and much greater demand on high-bandwidth memory) [15], which is provided by SK Hynix + Micron.
What do these scaling trends mean for the application layer?
@chetanp points out the rapidly dropping cost of inference, in part due to open source models (w/ routers that pass requests between different models to cost optimize) [16]. Benchmark has made 25 AI bets (21 are application layer and 4 are infrastructure), the most they’ve invested since 2009 (mobile) and 1995 (internet). He is seeing fast sales cycles at application layer companies because they are typically selling workforce displacement and targeting big / incumbent spend markets (sales automation, legal, accounting, ad networks, game development, circuit board design, new document processing tools).
When and what is AGI?
@DarioAmodei argues there is no discrete threshold for AGI; it’s a smooth progression of capabilities, like the term “supercomputing” in the 1990s. One “we’ll know it when we see it” heuristic: we’ll see Nobel-prize level work across many domains [10].
@polynoamial says that he underestimated prior timelines (e.g., to solve inference for test time compute) and that he expects progress to accelerate in 2025: “the problems we’ve already solved are harder than the problems we have ahead.”
Sources
[1] anthropic.com/research/build…
[2] react-lm.github.io/
[3] latent.space/p/claude-sonnet
[4] sequoiacap.com/podcast/train…
[5] sierra.ai/blog/benchmarking-…
[6] piped.video/watch?v=OoL8K_AF…
[7] piped.video/watch?v=QVcSBHhc…
[8] datacenterdynamics.com/en/ne…
[9] datacenterdynamics.com/en/ne…
[10] piped.video/ugvHCXCOmm4?si=ZuxK…
[11] piped.video/watch?v=jPluSXJp…
[12] anthropic.com/research/swe-b…
[13] nitter.app/arankomatsuzaki/status…
[14] latent.space/p/2024-review
[15] teddit.net/r/singularity/com…
[16] open.spotify.com/episode/16Y…