echen · Jul 5, 2025 · 10:22 PM UTC

Pinned Tweet

love this table from @theinformation

809

82,691

echen · Oct 14, 2025 · 11:20 PM UTC

sonnet 4.5 dropped 2 weeks back – “the best coding model yet”
we ran it through our internal agentic coding benchmark – real-world engineering tasks and codebases.
results: indeed, claude 4.5 wins! (although gpt-5-codex is close and costs less than half)
but something much more interesting surprised me:
> about half the tasks each model failed were passed by the other.
in other words:
> they’re different types of coders.
we deep dived one example.
claude 4.5: the craftsman perfectionist. slow to err, obsessed with correctness, maybe a little neurotic about spacing but ultimately reliable.
gpt-5-codex: the hacker-engineer. exploratory, error-prone, and a bit too eager to improvise.
full benchmark + example deep dive -> surgehq.ai/blog/sonnet-4-5-c…

437

50,872
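
The "about half the tasks each model failed were passed by the other" observation can be computed from per-task pass/fail sets. The sketch below is only an illustration: the task IDs, the pass sets, and the `complementarity` helper are hypothetical and are not taken from the benchmark.

```python
def complementarity(passed_a: set[str], passed_b: set[str], all_tasks: set[str]) -> dict[str, float]:
    """Share of each model's failures that the other model passed."""
    failed_a = all_tasks - passed_a
    failed_b = all_tasks - passed_b
    rescued_a = failed_a & passed_b  # tasks A failed but B passed
    rescued_b = failed_b & passed_a  # tasks B failed but A passed
    return {
        "a_failures_passed_by_b": len(rescued_a) / len(failed_a) if failed_a else 0.0,
        "b_failures_passed_by_a": len(rescued_b) / len(failed_b) if failed_b else 0.0,
    }


# Hypothetical data: 10 tasks, each model fails 4, and the failure sets overlap on 2.
tasks = {f"t{i}" for i in range(10)}
passed_a = tasks - {"t0", "t1", "t2", "t3"}
passed_b = tasks - {"t2", "t3", "t4", "t5"}

result = complementarity(passed_a, passed_b, tasks)
print(result)
```

With these made-up sets, half of each model's failures are passed by the other, which is the pattern the post describes.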

echen · Sep 18, 2025 · 10:42 PM UTC

longwinded. eccentric. sick of VC-backed get rich quick schemes. the Michael Jordan of post-training data. profitable. an advocate for human intelligence and creativity. apparently, all of those things describe me now. only one of them matters. forbes.com/sites/phoebeliu/2…

How A Google Alum Became A Low-Key AI Billionaire And The Youngest Member Of The Forbes 400

An alum of Google and Twitter, Edwin Chen built his data labeling company, Surge, in the shadows of the AI revolution. Now the 37-year-old wants to make his voice heard.

forbes.com

339

41,687

echen · Jul 22, 2025 · 3:54 PM UTC

my first ever podcast! great time jamming with @HarryStebbings @20vcFund about @HelloSurgeAI, why the best researchers don't sacrifice data quality for anything, and why startups should focus on building the best product instead of hype machines

Harry Stebbings

@HarryStebbings

21 Jul 2025

$1BN+ in revenue. $0 in funding. “I would not sell to Zuck for $100BN.” Surge AI. The best company in tech that you might not have heard of. Their Founder, Edwin, never does interviews. Today he shares all with 20VC and my top 7 takeaways 👇

240

54,397

echen · Jul 24, 2025 · 10:06 PM UTC

great time w/ @saranormous @eladgil about the perils of benchmarks, SV startup monoculture, why you'll throw out 95% of your synthetic data (6 mo after the mess it made), RL richness, and why frontier math is a lot like poetry.

No Priors

@NoPriorsPod

24 Jul 2025

New 🔥 episode drop! Edwin Chen, CEO of Surge AI, is one of the best placed to understand the model landscape and evolution of data needs in AI. His hot takes: differentiation, Scale/Meta, why human data will prevail, why LMArena is “trash” and why founders need to stop hiring.

29,265

echen · Sep 26, 2025 · 5:30 PM UTC

smart ≠ useful
was chatting w a researcher at a top lab. what he kept saying: “our models are really smart, but we need to make them useful”
this is why looking at unsexy failures is so important
why models can win imo gold medals, but can’t parse pdf files
the real challenge with agi won’t be about intelligence. it’ll be deploying them reliably (five 9’s, not 80%!) in the long tail chaos of reality

5,381
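
To put "five 9's, not 80%" in concrete terms, here is a small arithmetic sketch. The one-million-task volume and the `failures_per_million` helper are illustrative assumptions, not figures from the post.

```python
def failures_per_million(success_rate: float) -> int:
    """Expected number of failed tasks per one million attempts."""
    return round((1.0 - success_rate) * 1_000_000)


five_nines = failures_per_million(0.99999)  # "five 9's" reliability
eighty_pct = failures_per_million(0.80)     # 80% reliability

print(five_nines, eighty_pct, eighty_pct // five_nines)
```

Under these assumptions, an 80% system fails about 200,000 times per million tasks, against roughly 10 for a five-nines system: a gap of four orders of magnitude.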

echen · Nov 7, 2025 · 11:42 PM UTC

Reminder that topping Tau2-Bench Telecom or BrowseComp ≠ best agentic model.
Just tried it out on our own WorldBench eval (150 customer service tasks).
> tasks such as “How many refunds were there in July?”
> or “Check which graphics card is compatible with the parts from my last order and how much would it cost?”
+2% over Kimi K2 Turbo. still far behind GPT-5 and Sonnet 4.5. but nice upgrade and impressive for open source (congrats @Kimi_Moonshot)!

Artificial Analysis

@ArtificialAnlys

6 Nov 2025

MoonshotAI has released Kimi K2 Thinking, a new reasoning variant of Kimi K2 that achieves #1 in the Tau2 Bench Telecom agentic benchmark and is potentially the new leading open weights model
Kimi K2 Thinking is one of the largest open weights models ever, at 1T total parameters with 32B active. K2 Thinking is the first reasoning model release within @Kimi_Moonshot's Kimi K2 model family, following non-reasoning Kimi K2 Instruct models released previously in July and September 2025.
Key takeaways:
➤ Strong performance on agentic tasks: Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct and it appears this new Thinking variant makes substantial gains
➤ Reasoning variant of Kimi K2 Instruct: The model, as per its naming, is a reasoning variant of Kimi K2 Instruct. The model has the same architecture and same number of parameters (though different precision) as Kimi K2 Instruct and like K2 Instruct only supports text as an input (and output) modality
➤ 1T parameters but INT4 instead of FP8: Unlike Moonshot’s prior Kimi K2 Instruct releases that used FP8 precision, this model has been released natively in INT4 precision. Moonshot used quantization aware training in the post-training phase to achieve this. The impact of this is that K2 Thinking is only ~594GB, compared to just over 1TB for K2 Instruct and K2 Instruct 0905 - which translates into efficiency gains for inference and training. A potential reason for INT4 is that pre-Blackwell NVIDIA GPUs do not have support for FP4, making INT4 more suitable for achieving efficiency gains on earlier hardware.
Our full set of Artificial Analysis Intelligence Index benchmarks are in progress and we will provide an update as soon as they are complete.

6,727
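
The checkpoint sizes quoted above can be sanity-checked with back-of-envelope arithmetic. This sketch assumes exactly 1T parameters, every weight stored at the named precision, and decimal gigabytes; `weight_bytes` is a hypothetical helper, not part of any library.

```python
def weight_bytes(n_params: float, bits_per_param: int) -> float:
    """Raw storage for the weights alone, ignoring metadata and any mixed precision."""
    return n_params * bits_per_param / 8


GB = 1e9          # decimal gigabyte (assumption)
params = 1e12     # "1T total parameters"

fp8_gb = weight_bytes(params, 8) / GB   # 8 bits per weight
int4_gb = weight_bytes(params, 4) / GB  # 4 bits per weight

print(fp8_gb, int4_gb)
```

The FP8 estimate lands at about 1 TB, in line with the "just over 1TB" figure. The INT4 estimate is a 500 GB floor; the reported ~594GB sits above it, which would be consistent with some tensors being kept at higher precision, though the post does not say so.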

echen · Jun 23, 2025 · 7:12 PM UTC

Small, focused teams can achieve incredible things - we've been heads down quietly building the past few years, but very proud of our team @HelloSurgeAI! theinformation.com/articles/…

2,982

echen · Sep 26, 2025 · 1:15 AM UTC

notes like this one from an astrophysicist remind me why we built Surge

4,078

echen · Aug 28, 2025 · 9:43 PM UTC

Honored to be included in the TIME AI 2025 list. What I'm most glad about is that they highlighted the part of Surge's work that actually matters. ...

6,123

echen · Sep 16, 2025 · 6:10 PM UTC

Had a great conversation with @l2k on Gradient Dissent about what's actually happening in post-training right now. We dug into why LMSYS is setting the industry back - researchers are literally being forced to make models hallucinate more to climb the leaderboard...

3,109

echen · Nov 6, 2025 · 12:33 AM UTC

we made gpt-5, claude, and gemini do real wall street work. then we asked 200 finance pros to grade them. one model produced basel capital numbers that would get a real bank fined. 😱 study -> surgehq.ai/blog/finance-eval…

How do frontier models perform on real-world finance problems?

We stress-tested GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5 on 200+ expert finance tasks. Here's where even the best models break when they move from benchmarks to Wall Street.

surgehq.ai

7,166

echen · Nov 10, 2025 · 11:50 PM UTC

“engagement” sounds harmless until it becomes the ai’s goal. do we want systems that maximize engagement… or that maximize you? i’ve been thinking about this a lot lately and where our industry is headed >>>

echen

@echen

7 Nov 2025

x.com/i/article/198622651529…

You are your objective function. Which fork do you choose?

Imagine two AI systems. Identical base models, pre-trained with the same knowledge. One is optimized for engagement. The other for usefulness. They start from the same foundation. Same architecture.

5,796

echen · Oct 1, 2025 · 4:45 PM UTC

most people think agi = superintelligence. winning imo medals. solving coding puzzles. breaking benchmarks.
but that’s not my definition.
for me, agi = being useful at every task we can imagine in the real world.
and the real world is messy. It’s long-tail chaos: ambiguity, context, mistakes, recovery.
going from “solve a math problem” to “be a reliable partner over days, weeks, years” is a whole different level of difficulty.
it’ll be about reliability: five nines, not 80%.
that’s why I believe agi is farther away than people think.
superintelligence isn’t the same as street smarts.

2,225

echen · Jun 25, 2025 · 2:35 PM UTC

very nice words from @timoreilly! we ❤️ people who care about good data
i still dream sometimes of writing an o'reilly data science book (cover animal: wombat)... maybe it can be about post-training instead now

2,377

echen · Nov 11, 2025 · 8:09 PM UTC

some of my favorite recent model behaviors in RL envs:
> a model confidently operating in 2024
> another one passed “gold” to the customer_id field because loyalty tiers are people now
> one that hallucinated an email address mid-task, then used it in a tool call like nothing happened
confidence ≠ accuracy.

Surge AI

@HelloSurgeAI

11 Nov 2025

Everyone's acting like models are ready to replace humans in work settings. We put that to the test by creating an entire company and having 9 models act as a customer service agent handling 150 tickets and requests of increasing complexity. Verdict: without common sense, models are nowhere near ready. 👇 surgehq.ai/blog/rl-envs-real…

8,161

echen · Oct 7, 2025 · 9:40 PM UTC

a lot of ai leaderboards are basically just tabloids
does anyone think lmarena users fact check responses before they make their choices? (do they even spend more than 2 seconds reading them?)
of course not.
but this is the leaderboard that many researchers still tell me they have to optimize for
here’s an emblematic example in the lmarena dataset. response A is objectively wrong – Dorothy says this line upon arriving in Oz, not when she first sees the Emerald City! – but was picked as better than B

2,077

echen · Sep 23, 2025 · 12:32 AM UTC

why do SOTA coding agents still fail 1/3 of swe-bench?
we ran gemini 2.5 pro, claude sonnet 4, and gpt-5 across the full suite and dissected every failure with professional coders
the difference is not raw intelligence
it’s epistemic humility

4,829

echen · Sep 17, 2025 · 10:55 PM UTC

For some projects, our experts spend 100+ hours chasing one single failure.

2,267

echen · Sep 30, 2025 · 11:39 PM UTC

i love looking at swe-bench failures
there’s an art to post-training - where the best researchers sculpt models into the shape they want. it’s not a black box where you throw data at an optimizer, and magic comes out.
so looking at model failures really illuminates what’s going wrong, and what it takes to advance
in this case, the swe-bench task required a 2-line bug fix, but sota models were spiraling out of control trying to solve it – hallucinating 693 lines of code that led to a rabbit hole of more hallucinations
one of the models invented classes, methods, and even fake terminal outputs! completely lost touch with reality and never realized
and in some sense, graceful recovery is the biggest thing agents need to learn: the best race car driver in the world isn’t the one that speeds through the smoothest path, but the one that realizes when a boulder falls into the middle of the track and finishes the race anyways
check out the failure here: surgehq.ai/blog/when-coding-…

When Coding Agents Spiral Into 693 Lines of Hallucinations

When coding models spiral into self-reinforcing hallucinations, small mistakes compound into catastrophic failure. In SWE-bench, we saw SOTA models invent whole classes, methods, and terminal outputs...

surgehq.ai

7,450

echen · Oct 16, 2025 · 10:02 PM UTC

haiku 4.5 just dropped: faster, cheaper, and more capable
so we were curious how that progress translates in the real world
we ran it through one of our rl envs: 150 tasks, 3x runs, benchmarked against 1,500+ rubrics
here’s what we found:
> haiku 4.5 is a big step up, roughly doubling 3.5’s performance.
> but on these real-world tasks, it trails far behind sonnet 4.5.
(in our overall assessment, gpt-5 still reigns supreme with 62%)
it stumbled when it tried to:
> refund a customer’s pc that wouldn’t boot
> troubleshoot why orders were undelivered
> pull 5 years of texas sales data for an ad campaign
progress is real, but so is the remaining gap between benchmark scores and real-world usefulness.

2,163

echen · Sep 9, 2025 · 8:02 PM UTC

a teacher wasted 25 min asking chatgpt to copy text from a pdf into word. here’s some of what she got:
we collected 1k real-world failures like this in our new series, Unsexy AI Failures. surgehq.ai/blog/the-pdf-that…

2,544

echen · Oct 13, 2025 · 2:10 PM UTC

was talking with a math professor who works on surge. just happens to have won a fields medal a couple of years ago.
asked him why he helps us train ai models – why someone at the peak of human math spends time helping us build datasets to train them.
he said something i’m still thinking about:
> "i want ai that helps me prove the great mathematical mysteries in my lifetime."

1,643

echen · Oct 23, 2025 · 3:05 PM UTC

two weeks ago, @ICONIQCapital invited me to speak about ai in singapore. they also had an extra ticket to the F1.
i said yes because yeah, sure, i thought i knew what F1 was. they mentioned something about pr and i nodded.
…turns out i did *not* know what F1 was. (it's a racing competition. not a kaggle.)
after the race, i was lucky to run into Toto, who showed me around the garage. and it blew my mind:
F1 might have one of the most insane data systems i've ever seen.
each car has 300+ sensors streaming real-time telemetry. there’s a decked out room with giant monitors and analysts studying the data. teams make million-dollar decisions in seconds. get your model wrong and... kaboom?
he also explained something about the gap between their racing simulations and the live, real world races.
in ai, we talk about data distribution shifts like an annoying edge case. in F1, distribution shifts are the entire point.
the track spikes 6 degrees. a rain cloud appears. the strategy from 2 laps ago? completely wrong now. and you have 200 ms to figure it out.
so Toto showed me around, explained their data systems, and at the end, gave me a hat. in return, i gave him a lecture on machine learning evaluation metrics. (he was gracious about it.)
thanks ICONIQ and Toto for the great time!
p.s. if these are the events iconiq throws, no wonder they won the anthropic deal.

1,765

echen · Aug 28, 2025 · 9:44 PM UTC

If we want AI systems to be safe, creative, and human-like, we need to treat the process of creating and curating training data with real respect and care. It's why we built Surge in the first place.

2,019

echen · Oct 21, 2025 · 10:48 PM UTC

it's 3am on thurs. you're staring at slack.
"payment processing is failing for 0.3% of transactions.” only on mobile. only for german users. figure it out (or you’re fired).
you look.
> tests? green 🟢
> staging? gorgeous ☀️
> can’t repro locally
you’re one stack overflow prayer away from a breakdown. also, your 1yo heard you clacking away and started bawling too
which ai do you turn to?
option a: trained on leetcode. aces "reverse a linked list" on the first try. scores 99% on humaneval.
or option b: trained by a senior staff software engineer. stack overflows as a hobby. has saved 1.2m devs from getting fired.
model a’s advice: “have you considered refactoring your hash table?”
model b: “the algorithms are fine. the problem is germany's mobile carrier is insane. heh.”
this is why we choose that SO engineer every time. so he can teach ai to operate in the trenches.
he's NOT teaching models how to:
> solve problems with perfect inputs.
he's teaching them:
> it's 3am, suzie is bawling, and the payments only break for users with umlauts. help!
phds and textbooks prepare you for:
✅ clean inputs
✅ single correct answers
✅ problems with an elegant solution
production hits you with:
😱 "it only breaks on tuesdays"
😱 "the logs contradict each other"
😱 "it worked yesterday, i swear!"
unix wizards aren't wizards because they memorized algorithms. they're wizards because they've been paged enough times to recognize oh, it’s a caching issue. again.
because they know the difference between:
> code that passes tests
> code that survives angry klauses, felixes, and helgas wielding bratwurst
smart != useful. we’re teaching ai to recognize the second.
because when it’s 3am and you're googling "why does this only break in germany" for the 17th time... you don't need ai that’s read clrs and spits out big o.
you need ai that’s faced – and solved – a thousand production meltdowns before.

2,988

echen · Sep 25, 2025 · 1:41 AM UTC

Two years ago, I had the privilege of collaborating with @ThomasScialom and @mialon_gregoire at @MetaAI on GAIA, one of the first agentic benchmarks, designed to measure progress towards useful-in-the-real-world AGI. This week, the team launched Gaia2, built inside their new Agent RL Environment (ARE) framework.

2,484

echen · Oct 15, 2025 · 5:31 PM UTC

ai models today are really smart: they can solve olympiad problems, debug your compiler, and probably explain string theory to your dog!
but we want to make sure they have street smarts:
> can they pick the right tone in a delicate email?
> recover gracefully when they’re wrong?
> tell when someone’s being sarcastic?
you get that by training your models on the right objectives and optimizing for the messiness of the real world.
more on this in my conversation with @dealbook nytimes.com/2025/10/11/busin…

Searching for Meaning in the Gold Rally

The price of gold often rises during periods of economic turmoil. This time around, bond markets are stable and stocks are at record highs. So what gives?

nytimes.com

1,566

echen · Oct 3, 2025 · 3:40 PM UTC

superintelligence ≠ super useful
chatGPT’s PDF parser spat out: “47 … Com position … J us t as dec oding is not suff ic ie”
any human would instantly see it’s nonsense. chatGPT confidently passed it off as clean extraction.
teaching models to “parse pdfs” isn’t as sexy as teaching them to climb leaderboards, but it’s one of the long tail of skills they’ll need to be useful in the real world.
surgehq.ai/blog/the-pdf-that…

Unsexy AI Failures: The PDF That Broke ChatGPT

The AI world loves climbing leaderboards. Companies race to hit #1 on LMSYS, chase perfect scores on academic benchmarks, and demo SVGs of pelicans on bicycles. These achievements make for great...

surgehq.ai

1,549

echen · Nov 7, 2025 · 5:54 PM UTC

x.com/i/article/198622651529…

You are your objective function. Which fork do you choose?

Imagine two AI systems. Identical base models, pre-trained with the same knowledge. One is optimized for engagement. The other for usefulness. They start from the same foundation. Same architecture.

8,160

echen · Oct 9, 2025 · 6:15 PM UTC

some of us grew up playing The Sims. who else removed the pool ladder? 😈
now we do it for a living.
we built a world
> Populated it with entities and tools.
> Created tasks.
> Added verifiers.
> Then hit “run.”
results: a messy, dynamic RL environment that looks a lot more like real life than a benchmark dataset.

1,355

echen · Sep 16, 2025 · 6:10 PM UTC

Replying to @echen @l2k

We also talked about why we pushed customers toward RLHF over SFT years ago (spoiler: it worked), and the need for better data from real human experts to properly train and align AI. piped.video/X39OZndIWSY

2,005

echen · Oct 30, 2025 · 3:28 PM UTC

there are many types of intelligence. which type do we want models to have?
yesterday i was analyzing agent trajectories in our rl envs and digging into tool-calling performance (how models use functions):
> gpt-5 made 10x fewer errors than every other model!
> claude made 10x more errors – but it reflected on its errors and fixed them. and so even though its initial tool calls were much worse, its final performance was close to gpt-5’s.
this shouldn't be possible under the standard paradigm. in rl, we're taught that only outcomes matter: the reward signal, the final state, the destination.
but that’s why i love digging into individual trajectories to understand what’s going on.
gpt-5 embodies precision intelligence: flawless execution, doesn't make the mistake in the first place.
claude embodies adaptive intelligence. it makes errors… but possesses something possibly rarer? the wisdom to notice and correct them?
it’s like that friend who shows up perfectly dressed, says exactly the right thing, never spills their drink vs. the one who trips walking in, knocks over a plant, makes a joke, and everyone laughs.
both are intelligent. which is better? i don't know.
but claude's error-recovery patterns seem closer to metacognition. it's monitoring its execution and thinking about its thinking.
gpt-5 may not need this layer right now. its first-order thinking is so accurate that it doesn’t need to reflect. maybe that's fine enough right now, when problems are straightforward enough to one-shot. but what about when they're not?
(note: i don't know if gpt-5 is equally good at error-recovery when it needs to be. maybe it is! but i've noticed claude's recovery capabilities in the past.)
in an increasingly complex world, where problems get harder and harder, i wonder if resilience will matter more than perfection?
do you want the model that never falls, or the model that knows how to stand back up?

1,318
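
One way to separate "made fewer errors" from "recovered from its errors" when reading trajectories is to count both per run. This is a minimal sketch under stated assumptions: the `ToolCall` record, the tool names, and the rule that an error counts as recovered when a later call to the same tool succeeds are all hypothetical, not the format or metric used in the envs described above.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str  # hypothetical tool name
    ok: bool   # whether the call succeeded


def summarize(trajectory: list[ToolCall]) -> dict[str, int]:
    """Count tool-call errors and how many were later recovered."""
    errors = 0
    recovered = 0
    for i, call in enumerate(trajectory):
        if call.ok:
            continue
        errors += 1
        # Assumed definition: recovered if a later call to the same tool succeeds.
        if any(later.ok and later.tool == call.tool for later in trajectory[i + 1:]):
            recovered += 1
    return {"calls": len(trajectory), "errors": errors, "recovered": recovered}


# Two made-up trajectories that reach the same end state in different ways.
precise = [ToolCall("lookup_order", True), ToolCall("issue_refund", True)]
adaptive = [
    ToolCall("lookup_order", False),
    ToolCall("lookup_order", True),
    ToolCall("issue_refund", False),
    ToolCall("issue_refund", True),
]

print(summarize(precise))
print(summarize(adaptive))
```

Both made-up runs finish with every tool eventually succeeding, but only the per-step counts show that one got there by erring and correcting.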

echen · Aug 28, 2025 · 9:52 PM UTC

time.com/collections/time100…

TIME100 AI 2025: Edwin Chen

Find out why Edwin Chen made TIME’s list of the most influential people in artificial intelligence

time.com

1,572

echen · Oct 29, 2025 · 3:05 PM UTC

was playing around with one of our newest rl envs – interestingly, gpt-5 makes 10x fewer tool-calling errors than every other model.

1,258

echen · Sep 22, 2025 · 3:03 PM UTC

Replying to @ThomasScialom

amazing!! 🚀🚀 excited ARE is finally out :)

760

echen · Aug 28, 2025 · 9:44 PM UTC

Not revenue or valuations - but how training data shapes the way AGI will think, feel, and interact with the world. High-quality data isn't just a technical input. It's the foundation of alignment.

458

echen · Sep 23, 2025 · 12:33 AM UTC

we saw:
one model spiraled into 693 hallucinated lines
one guessed wrong, but re-investigated and recovered
one re-checked context and solved cleanly
leaderboards won’t show you this
trajectories will
full breakdown here 👉 surgehq.ai/blog/when-coding-…

When Coding Agents Spiral Into 693 Lines of Hallucinations

surgehq.ai

1,164

echen · Oct 7, 2025 · 9:40 PM UTC

it’s ai clickbait and lmarena rankings are the national enquirer

796

echen · Oct 13, 2025 · 2:10 PM UTC

curious what others think… would you want a proof of p vs. np, if you couldn't understand why? when ai starts discovering truths that no human can follow, would you call that progress or abdication?

747

echen · Oct 13, 2025 · 2:10 PM UTC

it's an open question for training llms too: should we enforce their reasoning to be legible to humans? or do we only care about the accuracy of the final answer? (you can guess where i stand!)

790

echen · Oct 9, 2025 · 6:15 PM UTC

how do frontier models perform?
> gemini 2.5 pro: <20%
> claude 4.5: <50%
> gpt-5 (the best): just above 60%
models ace academic benchmarks. then stumble the moment they touch something real.
that’s why we build RL environments – rich, unpredictable, messy, and grounded in reality.
because the real world doesn’t come with a benchmark script.

758

echen · Oct 13, 2025 · 2:10 PM UTC

it’s such a human sentence. some people want to travel the world. some people want to understand the big bang. some want to write poetry that gets passed down generations. he wants to unlock the secrets of the universe. then we started talking about the 4 color theorem and our conversation changed.

205

echen · Sep 25, 2025 · 1:44 AM UTC

Gaia2 and ARE are major steps forward for reliable agents in the chaos of the everyday. I’m excited to see where ARE and Gaia2 take the field, and happy that Surge AI got to play a part. Congrats to the Meta Agents team! Link 👉 ai.meta.com/research/publica…

804

echen · Sep 23, 2025 · 12:32 AM UTC

> do you notice what you don’t know?
> do you verify guesses?
> do you backtrack when assumptions break?

589

echen · Oct 9, 2025 · 6:15 PM UTC

(ps - if you ever removed the pool ladder in The Sims, join us - we’re hiring for our rl env team!) surgehq.ai/careers

637

echen · Jul 5, 2025 · 4:22 PM UTC

Replying to @Mallrat9000 @gerstenzang

❤️ this made my day hahaha - i miss blogging!

140

echen · Jun 24, 2025 · 4:04 PM UTC

Replying to @pr337h4m @HelloSurgeAI

we're planning on it!!

132

echen · Jun 24, 2025 · 4:04 PM UTC

Replying to @ashitaa @HelloSurgeAI

haha thanks ashita!!

159

echen · Oct 13, 2025 · 2:10 PM UTC

it's a question deeper than math. there's a tension right now: do we want models that think *for* us, or *with* us? do we want collaboration, or to be replaced?

137

echen · Jul 5, 2025 · 10:15 PM UTC

Replying to @umang @timoreilly

thank you!! it's been a while - hope you're doing well!

152

echen · Sep 25, 2025 · 1:43 AM UTC

One line from the paper struck me: “In AI’s second half, progress increasingly depends on defining meaningful tasks and robust evaluations.” The hardest problems aren’t just about scaling models. They’re about asking the right questions and choosing the right objectives.

835

echen · Jul 3, 2025 · 3:29 AM UTC

Replying to @rajhans_samdani @Neeva @SurgeHQ

❤️❤️ I miss those days!!

142

echen · Oct 13, 2025 · 2:10 PM UTC

> what happens when AI does start proving things – but in ways we can't comprehend?
> when the proof is valid, but no human can follow it?
> when we know what's true, but not why?
> will that still count as knowledge?
> or will we be spectators at our own intellectual sunset?

170