love this table from @theinformation
sonnet 4.5 dropped 2 weeks back – “the best coding model yet”
we ran it through our internal agentic coding benchmark – real-world engineering tasks and codebases.
results: indeed, claude 4.5 wins! (although gpt-5-codex is close and costs less than half)
but something much more interesting surprised me:
> about half the tasks each model failed were passed by the other.
in other words:
> they’re different types of coders.
we deep dived one example.
claude 4.5: the craftsman perfectionist. slow to err, obsessed with correctness, maybe a little neurotic about spacing but ultimately reliable.
gpt-5-codex: the hacker-engineer. exploratory, error-prone, and a bit too eager to improvise.
full benchmark + example deep dive -> surgehq.ai/blog/sonnet-4-5-c…
longwinded. eccentric. sick of VC-backed get rich quick schemes. the Michael Jordan of post-training data. profitable. an advocate for human intelligence and creativity. apparently, all of those things describe me now. only one of them matters. forbes.com/sites/phoebeliu/2…
my first ever podcast! great time jamming with @HarryStebbings @20vcFund about @HelloSurgeAI, why the best researchers don't sacrifice data quality for anything, and why startups should focus on building the best product instead of hype machines
$1BN+ in revenue. $0 in funding. “I would not sell to Zuck for $100BN.” Surge AI. The best company in tech that you might not have heard of. Their Founder, Edwin, never does interviews. Today he shares all with 20VC and my top 7 takeaways 👇
great time w/ @saranormous @eladgil about the perils of benchmarks, SV startup monoculture, why you'll throw out 95% of your synthetic data (6 mo after the mess it made), RL richness, and why frontier math is a lot like poetry.
New 🔥 episode drop! Edwin Chen, CEO of Surge AI, is one of the best placed to understand the model landscape and evolution of data needs in AI. His hot takes: differentiation, Scale/Meta, why human data will prevail, why LMArena is “trash” and why founders need to stop hiring.
smart ≠ useful
was chatting w a researcher at a top lab. what he kept saying: “our models are really smart, but we need to make them useful”
this is why looking at unsexy failures is so important
why models can win imo gold medals, but can’t parse pdf files
the real challenge with agi won’t be about intelligence. it’ll be deploying models reliably (five 9’s, not 80%!) in the long tail chaos of reality
Reminder that topping Tau2-Bench Telecom or BrowseComp ≠ best agentic model.
Just tried it out on our own WorldBench eval (150 customer service tasks).
> tasks such as “How many refunds were there in July?”
> or “Check which graphics card is compatible with the parts from my last order and how much would it cost?”
+2% over Kimi K2 Turbo. still far behind GPT-5 and Sonnet 4.5. but nice upgrade and impressive for open source (congrats @Kimi_Moonshot)!
MoonshotAI has released Kimi K2 Thinking, a new reasoning variant of Kimi K2 that achieves #1 in the Tau2 Bench Telecom agentic benchmark and is potentially the new leading open weights model Kimi K2 Thinking is one of the largest open weights models ever, at 1T total parameters with 32B active. K2 Thinking is the first reasoning model release within @Kimi_Moonshot's Kimi K2 model family, following non-reasoning Kimi K2 Instruct models released previously in July and September 2025. Key takeaways: ➤ Strong performance on agentic tasks: Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct and it appears this new Thinking variant makes substantial gains ➤ Reasoning variant of Kimi K2 Instruct: The model, as per its naming, is a reasoning variant of Kimi K2 Instruct. The model has the same architecture and same number of parameters (though different precision) as Kimi K2 Instruct and like K2 Instruct only supports text as an input (and output) modality ➤ 1T parameters but INT4 instead of FP8: Unlike Moonshot’s prior Kimi K2 Instruct releases that used FP8 precision, this model has been released natively in INT4 precision. Moonshot used quantization aware training in the post-training phase to achieve this. The impact of this is that K2 Thinking is only ~594GB, compared to just over 1TB for K2 Instruct and K2 Instruct 0905 - which translates into efficiency gains for inference and training. A potential reason for INT4 is that pre-Blackwell NVIDIA GPUs do not have support for FP4, making INT4 more suitable for achieving efficiency gains on earlier hardware. Our full set of Artificial Analysis Intelligence Index benchmarks are in progress and we will provide an update as soon as they are complete.
Small, focused teams can achieve incredible things - we've been heads down quietly building the past few years, but very proud of our team @HelloSurgeAI! theinformation.com/articles/…
notes like this one from an astrophysicist remind me why we built Surge
Honored to be included in the TIME AI 2025 list. What I'm most glad about is that they highlighted the part of Surge's work that actually matters. ...
Had a great conversation with @l2k on Gradient Dissent about what's actually happening in post-training right now. We dug into why LMSYS is setting the industry back - researchers are literally being forced to make models hallucinate more to climb the leaderboard...
we made gpt-5, claude, and gemini do real wall street work. then we asked 200 finance pros to grade them. one model produced basel capital numbers that would get a real bank fined. 😱 study -> surgehq.ai/blog/finance-eval…
“engagement” sounds harmless until it becomes the ai’s goal. do we want systems that maximize engagement… or that maximize you? i’ve been thinking about this a lot lately and where our industry is headed >>>
most people think agi = superintelligence. winning imo medals. solving coding puzzles. breaking benchmarks. but that’s not my definition. for me, agi = being useful at every task we can imagine in the real world. and the real world is messy. It’s long-tail chaos: ambiguity, context, mistakes, recovery. going from “solve a math problem” to “be a reliable partner over days, weeks, years” is a whole different level of difficulty. it’ll be about reliability: five nines, not 80%. that’s why I believe agi is farther away than people think. superintelligence isn’t the same as street smarts.
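to make the "five nines, not 80%" contrast concrete, here is a rough sketch. it assumes (this is not from the thread) that a long task is a chain of independent steps that each succeed with the same probability. real agent failures are correlated, so the numbers are illustrative only, and `chain_success` is a hypothetical helper name.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumption: steps are independent and share one success rate.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # Compare an 80%-reliable step with a five-nines step over longer tasks.
    for label, p in [("80%", 0.80), ("five nines", 0.99999)]:
        for steps in (1, 10, 100):
            rate = chain_success(p, steps)
            print(f"{label:>10} per step, {steps:>3} steps -> {rate:.6f}")
```

under that independence assumption, per-step reliability compounds: a gap that looks modest on a single step grows quickly as the task gets longer.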
very nice words from @timoreilly! we ❤️ people who care about good data i still dream sometimes of writing an o'reilly data science book (cover animal: wombat)... maybe it can be about post-training instead now
some of my favorite recent model behaviors in RL envs:
> a model confidently operating in 2024
> another one passed “gold” to the customer_id field because loyalty tiers are people now
> one that hallucinated an email address mid-task, then used it in a tool call like nothing happened
confidence ≠ accuracy.
Everyone's acting like models are ready to replace humans in work settings. We put that to the test by creating an entire company and having 9 models each act as a customer service agent handling 150 tickets and requests of increasing complexity. Verdict: without common sense, models are nowhere near ready. 👇 surgehq.ai/blog/rl-envs-real…
a lot of ai leaderboards are basically just tabloids does anyone think lmarena users fact check responses before they make their choices? (do they even spend more than 2 seconds reading them?) of course not. but this is the leaderboard that many researchers still tell me they have to optimize for here’s an emblematic example in the lmarena dataset. response A is objectively wrong – Dorothy says this line upon arriving in Oz, not when she first sees the Emerald City! – but was picked as better than B
why do SOTA coding agents still fail 1/3 of swe-bench?
we ran gemini 2.5 pro, claude sonnet 4, and gpt-5 across the full suite and dissected every failure with professional coders
the difference is not raw intelligence
it’s epistemic humility
For some projects, our experts spend 100+ hours chasing one single failure.
i love looking at swe-bench failures there’s an art to post-training - where the best researchers sculpt models into the shape they want. it’s not a black box where you throw data at an optimizer, and magic comes out. so looking at model failures really illuminates what’s going wrong, and what it takes to advance in this case, the swe-bench task required a 2-line bug fix, but sota models were spiraling out of control trying to solve it – hallucinating 693 lines of code that led to a rabbit hole of more hallucinations one of the models invented classes, methods, and even fake terminal outputs! completely lost touch with reality and never realized and in some sense, graceful recovery is the biggest thing agents need to learn: the best race car driver in the world isn’t the one that speeds through the smoothest path, but the one that realizes when a boulder falls into the middle of the track and finishes the race anyways check out the failure here: surgehq.ai/blog/when-coding-…
haiku 4.5 just dropped: faster, cheaper, and more capable
so we were curious how that progress translates in the real world
we ran it through one of our rl envs: 150 tasks, 3x runs, benchmarked against 1,500+ rubrics
here’s what we found:
> haiku 4.5 is a big step up, roughly doubling 3.5’s performance.
> but on these real-world tasks, it trails far behind sonnet 4.5.
(in our overall assessment, gpt-5 still reigns supreme with 62%)
it stumbled when it tried to:
> refund a customer’s pc that wouldn’t boot
> troubleshoot why orders were undelivered
> pull 5 years of texas sales data for an ad campaign
progress is real, but so is the remaining gap between benchmark scores and real-world usefulness.
a teacher wasted 25 min asking chatgpt to copy text from a pdf into word. here’s some of what she got: we collected 1k real-world failures like this in our new series, Unsexy AI Failures. surgehq.ai/blog/the-pdf-that…
was talking with a math professor who works on surge. just happens to have won a fields medal a couple of years ago. asked him why he helps us train ai models – why someone at the peak of human math spends time helping us build datasets to train them. he said something i’m still thinking about: > "i want ai that helps me prove the great mathematical mysteries in my lifetime."
two weeks ago, @ICONIQCapital invited me to speak about ai in singapore. they also had an extra ticket to the F1. i said yes because yeah, sure, i thought i knew what F1 was. they mentioned something about pr and i nodded. …turns out i did *not* know what F1 was. (it's a racing competition. not a kaggle.) after the race, i was lucky to run into Toto, who showed me around the garage. and it blew my mind: F1 might have one of the most insane data systems i've ever seen. each car has 300+ sensors streaming real-time telemetry. there’s a decked out room with giant monitors and analysts studying the data. teams make million-dollar decisions in seconds. get your model wrong and... kaboom? he also explained something about the gap between their racing simulations and the live, real world races. in ai, we talk about data distribution shifts like an annoying edge case. in F1, distribution shifts are the entire point. the track spikes 6 degrees. a rain cloud appears. the strategy from 2 laps ago? completely wrong now. and you have 200 ms to figure it out. so Toto showed me around, explained their data systems, and at the end, gave me a hat. in return, i gave him a lecture on machine learning evaluation metrics. (he was gracious about it.) thanks ICONIQ and Toto for the great time! p.s. if these are the events iconiq throws, no wonder they won the anthropic deal.
If we want AI systems to be safe, creative, and human-like, we need to treat the process of creating and curating training data with real respect and care. It's why we built Surge in the first place.
it's 3am on thurs. you're staring at slack.
“payment processing is failing for 0.3% of transactions.”
only on mobile. only for german users. figure it out (or you’re fired).
you look.
> tests? green 🟢
> staging? gorgeous ☀️
> can’t repro locally
you’re one stack overflow prayer away from a breakdown. also, your 1yo heard you clacking away and started bawling too
which ai do you turn to?
option a: trained on leetcode. aces "reverse a linked list" on the first try. scores 99% on humaneval.
or
option b: trained by a senior staff software engineer. stack overflows as a hobby. has saved 1.2m devs from getting fired.
model a’s advice: “have you considered refactoring your hash table?”
model b: “the algorithms are fine. the problem is germany's mobile carrier is insane. heh.”
this is why we choose that SO engineer every time. so he can teach ai to operate in the trenches.
he's NOT teaching models how to:
> solve problems with perfect inputs.
he's teaching them:
> it's 3am, suzie is bawling, and the payments only break for users with umlauts. help!
phds and textbooks prepare you for:
✅ clean inputs
✅ single correct answers
✅ problems with an elegant solution
production hits you with:
😱 "it only breaks on tuesdays"
😱 "the logs contradict each other"
😱 "it worked yesterday, i swear!"
unix wizards aren't wizards because they memorized algorithms. they're wizards because they've been paged enough times to recognize
oh, it’s a caching issue. again.
because they know the difference between:
> code that passes tests
> code that survives angry klauses, felixes, and helgas wielding bratwurst
smart != useful.
we’re teaching ai to recognize the second.
because when it’s 3am and you're googling "why does this only break in germany" for the 17th time...
you don't need ai that’s read clrs and spits out big o.
you need ai that’s faced – and solved – a thousand production meltdowns before.
Two years ago, I had the privilege of collaborating with @ThomasScialom and @mialon_gregoire at @MetaAI on GAIA, one of the first agentic benchmarks, designed to measure progress towards useful-in-the-real-world AGI. This week, the team launched Gaia2, built inside their new Agent RL Environment (ARE) framework.
ai models today are really smart: they can solve olympiad problems, debug your compiler, and probably explain string theory to your dog!
but we want to make sure they have street smarts:
> can they pick the right tone in a delicate email?
> recover gracefully when they’re wrong?
> tell when someone’s being sarcastic?
you get that by training your models on the right objectives and optimizing for the messiness of the real world.
more on this in my conversation with @dealbook nytimes.com/2025/10/11/busin…
superintelligence ≠ super useful chatGPT’s PDF parser spat out: “47 … Com position … J us t as dec oding is not suff ic ie” any human would instantly see it’s nonsense. chatGPT confidently passed it off as clean extraction. teaching models to “parse pdfs” isn’t as sexy as teaching them to climb leaderboards, but it’s one of the long tail of skills they’ll need to be useful in the real world. surgehq.ai/blog/the-pdf-that…
some of us grew up playing The Sims. who else removed the pool ladder? 😈
now we do it for a living.
we built a world
> Populated it with entities and tools.
> Created tasks.
> Added verifiers.
> Then hit “run.”
results: a messy, dynamic RL environment that looks a lot more like real life than a benchmark dataset.
Replying to @echen @l2k
We also talked about why we pushed customers toward RLHF over SFT years ago (spoiler: it worked), and the need for better data from real human experts to properly train and align AI. piped.video/X39OZndIWSY
there are many types of intelligence. which type do we want models to have?
yesterday i was analyzing agent trajectories in our rl envs and digging into tool-calling performance (how models use functions):
> gpt-5 made 10x fewer errors than every other model!
> claude made 10x more errors – but it reflected on its errors and fixed them.
and so even though its initial tool calls were much worse, its final performance was close to gpt-5’s.
this shouldn't be possible under the standard paradigm. in rl, we're taught that only outcomes matter: the reward signal, the final state, the destination.
but that’s why i love digging into individual trajectories to understand what’s going on.
gpt-5 embodies precision intelligence: flawless execution, doesn't make the mistake in the first place.
claude embodies adaptive intelligence. it makes errors… but possesses something possibly rarer? the wisdom to notice and correct them?
it’s like that friend who shows up perfectly dressed, says exactly the right thing, never spills their drink
vs. the one who trips walking in, knocks over a plant, makes a joke, and everyone laughs.
both are intelligent. which is better? i don't know.
but claude's error-recovery patterns seem closer to metacognition. it's monitoring its execution and thinking about its thinking.
gpt-5 may not need this layer right now. its first-order thinking is so accurate that it doesn’t need to reflect. maybe that's fine enough right now, when problems are straightforward enough to one-shot. but what about when they're not?
(note: i don't know if gpt-5 is equally good at error-recovery when it needs to be. maybe it is! but i've noticed claude's recovery capabilities in the past.)
in an increasingly complex world, where problems get harder and harder, i wonder if resilience will matter more than perfection?
do you want the model that never falls, or the model that knows how to stand back up?
was playing around with one of our newest rl envs – interestingly, gpt-5 makes 10x fewer tool-calling errors than every other model.
Replying to @ThomasScialom
amazing!! 🚀🚀 excited ARE is finally out :)
Not revenue or valuations - but how training data shapes the way AGI will think, feel, and interact with the world. High-quality data isn't just a technical input. It's the foundation of alignment.
we saw:
one model spiraled into 693 hallucinated lines
one guessed wrong, but re-investigated and recovered
one re-checked context and solved cleanly
leaderboards won’t show you this
trajectories will
full breakdown here 👉 surgehq.ai/blog/when-coding-…
it’s ai clickbait and lmarena rankings are the national enquirer
curious what others think… would you want a proof of p vs. np, if you couldn't understand why? when ai starts discovering truths that no human can follow, would you call that progress or abdication?
it's an open question for training llms too: should we enforce their reasoning to be legible to humans? or do we only care about the accuracy of the final answer? (you can guess where i stand!)
how do frontier models perform?
> gemini 2.5 pro: <20%
> claude 4.5: <50%
> gpt-5 (the best): just above 60%
models ace academic benchmarks. then stumble the moment they touch something real.
that’s why we build RL environments – rich, unpredictable, messy, and grounded in reality.
because the real world doesn’t come with a benchmark script.
it’s such a human sentence. some people want to travel the world. some people want to understand the big bang. some want to write poetry that gets passed down generations. he wants to unlock the secrets of the universe. then we started talking about the 4 color theorem and our conversation changed.
Gaia2 and ARE are major steps forward for reliable agents in the chaos of the everyday. I’m excited to see where ARE and Gaia2 take the field, and happy that Surge AI got to play a part. Congrats to the Meta Agents team! Link 👉 ai.meta.com/research/publica…
> do you notice what you don’t know?
> do you verify guesses?
> do you backtrack when assumptions break?
(ps - if you ever removed the pool ladder in The Sims, join us - we’re hiring for our rl env team!) surgehq.ai/careers
❤️ this made my day hahaha - i miss blogging!
we're planning on it!!
haha thanks ashita!!
it's a question deeper than math. there's a tension right now: do we want models that think *for* us, or *with* us? do we want collaboration, or to be replaced?
Replying to @umang @timoreilly
thank you!! it's been a while - hope you're doing well!
One line from the paper struck me: “In AI’s second half, progress increasingly depends on defining meaningful tasks and robust evaluations.” The hardest problems aren’t just about scaling models. They’re about asking the right questions and choosing the right objectives.
❤️❤️ I miss those days!!
> what happens when AI does start proving things – but in ways we can't comprehend?
> when the proof is valid, but no human can follow it?
> when we know what's true, but not why?
> will that still count as knowledge?
> or will we be spectators at our own intellectual sunset?