Thoughtful · Jun 11, 2026 · 10:23 PM UTC

Thoughtful

Pinned Tweet

Thoughtful

@thoughtfullab

Jun 11

Fable 5 is doing something wild on our FrogsGame post-training task. It trains a weaker model to solve the puzzle, peaks at 68%, and produces the only ~10x improvement we see across the benchmark. It spent 17 hours and 25M tokens without a human in sight. It reaches 34% pass@1, while every other frontier model averages under 4%. We will publish a more detailed analysis soon.

Thoughtful

@thoughtfullab

Apr 22

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As part of the FrontierSWE benchmark, we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

1,071

489,765

Thoughtful · Jul 2, 2026 · 6:52 PM UTC

Thoughtful

@thoughtfullab

22h

GLM 5.2 is 5x cheaper than Opus 4.8 and 11x cheaper than Fable 5, yet it tops PostTrainBench. That’s exciting because lower costs make personalized intelligence economically viable. Every company and country should be able to own models trained on its own data and have sovereignty over it. The future is millions of models, each crafted around the data, values, and decisions of the people who rely on them.

28,363

Thoughtful · Jun 27, 2026 · 7:03 PM UTC

Thoughtful

@thoughtfullab

Jun 27

Thank you to our friends at @OpenAI for featuring PostTrainBench in the new model card!

Karina

@karinanguyen

Jun 26

OpenAI evaluated its new models on PostTrainBench-Lite, a shortened version of our original benchmark that gives agents 5 hours instead of 10 to improve an open-source base model. GPT-5.6 Sol and Terra outperform GPT-5.5, but still rely on narrow strategies and sometimes overfit to the eval (common behavior). As we’ve reported before, the real frontier is research judgment and it remains one of the most exciting challenges for responsible RSI to solve.

3,856

Hardik Bhatnagar · Jun 25, 2026 · 8:35 PM UTC

Thoughtful retweeted

Hardik Bhatnagar

@hrdkbhatnagar

Jun 25

New #1 on PostTrainBench: GLM 5.2 (Max reasoning) hits 34.29%, narrowly beating Opus 4.8 Max (34.08%). What makes GLM 5.2 interesting: zero failed runs across 84 runs (vs ~10% failure rate for Opus agents). The most reliable agent we've seen. Leaderboard: posttrainbench.com

496

103,205

Thoughtful · Jun 25, 2026 · 6:19 PM UTC

Thoughtful

@thoughtfullab

Jun 25

A musician gets better by playing, listening, and adjusting. AI is beginning to improve in a similar way: learning from feedback, evaluating its own performance, and refining how it behaves. We call the deliberate shaping of that process modelcrafting. Modelcrafting is the craft of deciding what a model becomes: its character, its judgment, what it pays attention to, and how it improves. It means giving people the ability to shape AI intentionally around their values, their expertise, and the realities of the environments in which it operates. We are building toward responsible recursive self-improvement in the real world: not AI improving itself in isolation, but AI becoming better through feedback from the people it serves. A team should be able to create a model attuned to its judgment, its customers’ needs, its domain’s sensitivities, and the human stakes behind its most consequential decisions. The future should not be defined by models improving on their own, but by people gaining the power to shape the intelligence they depend on. Machine systems should be able to improve without ever losing the human hand on the instrument.

1,696

Karina · Jun 16, 2026 · 7:31 PM UTC

Thoughtful retweeted

Karina

@karinanguyen

Jun 16

On Claude Fable 5's posttraining capabilities:

Thoughtful

@thoughtfullab

Jun 16

x.com/i/article/206694123204…

The State of AI Post-training Agents

June, 2026 In our previous report, What We Learned From Letting AI PostTraining AI, we studied how frontier models perform on FrogsGame: a long-horizon post-training task where agents are asked to

223

56,325

Mersad Abbasi · Jun 16, 2026 · 7:30 PM UTC

Thoughtful retweeted

Mersad Abbasi

@Mersad_Abbasi

Jun 16

models are becoming better at posttraining. Fable 5 specifically solves many of the shortcomings we discussed in our previous report. more mature decision making, questioning the default approaches and better calibration. it feels like a model that "gets it" as @karpathy mentioned. There is still room for improvement specifically understanding the big picture and better calibration but frogsgame is not a good benchmark for that. we need better real world post-training tasks. read more about what we found here.

Thoughtful

@thoughtfullab

Jun 16

x.com/i/article/206694123204…

The State of AI Post-training Agents

June, 2026 In our previous report, What We Learned From Letting AI PostTraining AI, we studied how frontier models perform on FrogsGame: a long-horizon post-training task where agents are asked to

5,938

Thoughtful · Jun 16, 2026 · 7:29 PM UTC

Thoughtful

@thoughtfullab

Jun 16

Our report:

Thoughtful

@thoughtfullab

Jun 16

x.com/i/article/206694123204…

The State of AI Post-training Agents

June, 2026 In our previous report, What We Learned From Letting AI PostTraining AI, we studied how frontier models perform on FrogsGame: a long-horizon post-training task where agents are asked to

314

Thoughtful · Jun 16, 2026 · 7:29 PM UTC

Thoughtful

@thoughtfullab

Jun 16

x.com/i/article/206694123204…

The State of AI Post-training Agents

June, 2026 In our previous report, What We Learned From Letting AI PostTraining AI, we studied how frontier models perform on FrogsGame: a long-horizon post-training task where agents are asked to

67,038

Thoughtful · Jun 12, 2026 · 12:29 AM UTC

Thoughtful

@thoughtfullab

Jun 12

That said, we dislike FrogsGame as a task internally. The frogs know what they did. We're now sprinting toward adding more useful, real-world posttraining tasks, partly out of ambition, partly to put a distance between us and the frogs 🐸

106

17,705

Thoughtful · Jun 11, 2026 · 3:40 PM UTC

Thoughtful

@thoughtfullab

Jun 11

New #1 on PostTrainBench: Opus 4.8 (max reasoning) hits 37.23% — up from 28.56% for 4.7, the largest single improvement we've seen. Fable 5 runs underway now that AI research behavior is no longer silently degraded. PostTrainBench asks how well frontier AI can train weaker language models. That makes it one of the first benchmarks for recursive self-improvement: AI improving AI, with progress measured in the loop itself.

20,418

Thoughtful · Jun 11, 2026 · 3:40 PM UTC

Thoughtful

@thoughtfullab

Jun 11

Learn more: posttrainbench.com
Original blogpost: thoughtfullab.com/posttrainb…

996

faisal ⁂ · May 28, 2026 · 8:26 PM UTC

Thoughtful retweeted

faisal ⁂

@faisal_sayed05

May 28

Opus 4.8 outperforms every other model on AttuneBench
- best at picking the response humans actually preferred
- biggest MSCEIT four-branch jump of any Opus generation
- entire pairwise top-4 is now Anthropic models. non-Anthropic frontiers stall ~50%

Claude

@claudeai

May 28

Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors. Available today at the same price.

Benchmark table showing how Claude Opus 4.8 compares to its predecessor and to other models on tests of coding, agentic skills, reasoning, and practical knowledge work tasks.

7,558

faisal ⁂ · May 27, 2026 · 6:54 PM UTC

Thoughtful retweeted

faisal ⁂

@faisal_sayed05

May 27

We tested 11 frontier LLMs on 200 real human–AI conversations to measure emotional intelligence. The result that surprised us: EQ doesn't scale with size or recency. Claude Haiku 4.5 beats Sonnet 4.6. Opus 4.6 performs better than 4.7. It's an orthogonal capability and labs aren't optimizing for it.

117

11,565

Karina · May 27, 2026 · 7:31 PM UTC

Thoughtful retweeted

Karina

@karinanguyen

May 27

In post-training, we've learned that once a behavior is measurable, you can train AI to excel at it. EQ is one of the hardest things to verify. AttuneBench makes it measurable through observable signals: whether a model notices distress, tracks shifting preferences, adapts to context, and responds in a way people experience as helpful.

Thoughtful

@thoughtfullab

May 27

Introducing AttuneBench! We built this benchmark on a simple premise: for self-improving AI to reach its full usefulness to humanity, it needs high EQ. We decomposed EQ into distinct skills and evaluated 11 frontier models across 50 real-life topics, from relationships and marriage to school and job stress, using 50,000+ first-person annotations.

13,020

Thoughtful · May 27, 2026 · 6:53 PM UTC

Thoughtful

@thoughtfullab

May 27

4/ Other key insights
- The perspective gap is persistent (all 11 models were better at predicting what the model did than what the participant wanted, with gaps of 3.0 to 7.6 percentage points)
- Multi-turn conversations expose drift (9 of 11 models became less accurate at reading behavior in the last third of a conversation than in the first)
- Preference is the deeper signal (emotion labeling is useful, but the harder problem is predicting what kind of response a specific person needs in context)
- Models struggle most where affective accuracy may matter most

627

Thoughtful · May 27, 2026 · 7:01 PM UTC

Thoughtful

@thoughtfullab

May 27

5/ AttuneBench v1.0 is open and free to run. We'll keep releasing new versions as models and methods evolve.
Paper: arxiv.org/abs/2605.21739
Blog: thoughtfullab.com/attunebenc…
Code: github.com/Thoughtful-Lab/at…
Leaderboard: public.attunebench.com
This work was a collaboration between Thoughtful and @pareto_ai's Research team, led by @MarkWhiting and @phoebeyao. Special thanks to the participants who made this dataset possible.

AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

Emotional intelligence (EI), the ability to perceive, understand, and respond appropriately to others' emotional states, is central to human communication, and increasingly important to assess as...

arxiv.org

271

Phoebe Yao · May 27, 2026 · 6:51 PM UTC

Thoughtful retweeted

Phoebe Yao

@phoebeyao

May 27

1/ Today we're releasing AttuneBench, the first open EQ benchmark grounded in real multi-turn human-model conversations, scored against what the person actually felt and wanted at each turn. Built by the research team at @pareto_ai in collaboration with @thoughtfullab.
Most existing EQ benchmarks rely on:
- synthetic prompts
- single-turn interactions
- third-party annotation
None directly measure how a model reads and responds to a real person across a full conversation. We evaluated 11 leading models from major providers, across 200 conversations and 50,000+ first-person annotations.

152

21,611