AI Explained · Sep 10, 2024 · 3:25 PM UTC

AI Explained

@AIExplainedYT

10 Sep 2024

We just heard that the famed ChatGPT upgrade Strawberry is coming by September 24th but something doesn't make sense. It was 'a threat to humanity' according to certain OpenAI ex-staff (Reuters) It 'rises to human-level reasoner' (leak to Bloomberg) But according to early testers 'its slightly better answers aren't worth the 10 to 20 second wait'? And it often thinks for that long even if you ask it not to. And it will be pricey. Something doesn't add up. Really looking forward to testing it on Simple Bench.

146

932

160,059

AI Explained · Sep 19, 2024 · 1:55 PM UTC

AI Explained

@AIExplainedYT

19 Sep 2024

Are we in a new, 3rd paradigm for AI language models? First, models predicted the most likely next word. Think 2018-2021, for transformer-based language models. Second, they were rewarded for words that were helpful, harmless, and honest. Think RLHF or RLAIF, 2022-2023. Now, with the o1 family, they are being rewarded for being objectively correct. Think 2024-??? Breakdown in video below:

567

57,904

AI Explained · Jul 23, 2024 · 6:00 PM UTC

AI Explained

@AIExplainedYT

23 Jul 2024

Llama 3.1 paper is genuinely incredible, inc. near-perfect predictions of benchmark performance from a given compute budget. Most revealing paper of 2024. Here, though, are the initial results from my SIMPLE bench, as debuted on the channel. 100+ fully private, PhD-vetted, spatio-temporal, linguistic and general intelligence Qs that humans easily get 90%+ on. 3.5 Sonnet: 32% Llama 405b: 18% Gemini 1.5 Pro: 11% GPT 4o: 5% Llama 3.1 405b will pass the vibe check, but it's not quite Claude 3.5 Sonnet.

555

65,318

AI Explained · Sep 5, 2024 · 3:08 PM UTC

AI Explained

@AIExplainedYT

5 Sep 2024

Would you pay $2000/month for ChatGPT? This is the highest price that's 'on the table', for a subscription, according to a just-released Information report on OpenAI. This would be the tier for the next-gen model, Orion, or the enhanced-reasoning option, as per the article. 3 possibilities: 1) They think, in some reasonable sense, that the new systems will be literally 100x better ($20 to $2k). This is the least likely option, but would mean it is literally worth a median wage in many Western nations. 2) They think the model will be incrementally better - 'solve complex puzzles', perform internal unit tests etc. In this reading, there would only need to be just enough users at that tier who would feel it's worth it to pay for even slightly more intelligence. 3) The computational costs of verification demand this price. And/or they are losing too much money for comfort in training Orion. In this option, OpenAI are almost forced to set such a price, even if the gains in performance are truly marginal.

209

482

112,284

AI Explained · Aug 13, 2025 · 3:47 PM UTC

AI Explained

@AIExplainedYT

13 Aug 2025

If you use GPT-5 Pro for coding, you will swiftly realize that it will never agree to anything, even its own suggestions, without adding 'two quick tweaks'. It's a pathological perfectionist. *Still very useful, just strange in this particular way. **Gave Pro this tweet and it suggested this new version, with 'two tweaks': "If you use GPT‑5 Pro for coding, you’ll quickly realize it never accepts anything—even its own suggestions—without adding “two quick tweaks.” It’s a pathological perfectionist. Still useful, just strange in this particular way." ***Gave Pro that tweet, and it had 'two tiny notes': "When posting, drop the outermost quotation marks (they’re just framing here). Check you’re within the 280‑character limit (~229 chars, including line breaks)."

374

51,312

AI Explained · Dec 6, 2023 · 5:27 PM UTC

AI Explained

@AIExplainedYT

6 Dec 2023

Just finished the 60 page technical paper of Gemini, very interesting. Especially AlphaCode 2; heavy hints now that 'Let's Verify' was the breakthrough behind Q* - even Hassabis has hinted as much. Video coming up.

296

13,159

AI Explained · May 13, 2024 · 6:57 PM UTC

AI Explained

@AIExplainedYT

13 May 2024

Next-level text accuracy as a casual afterthought on something like the 18th page of the GPT4o release page...

281

18,923

AI Explained · May 28, 2025 · 11:02 AM UTC

AI Explained

@AIExplainedYT

28 May 2025

2 quick updates, and look-ahead, exactly a year on from first testing models on Simple-Bench: 1) Claude 4 busted our rate limits, and my entreaties to @AnthropicAI (to allow us to spend more money!) have yet to bear fruit. A shame, as am fairly confident Opus 4 would be SOTA. 2) Gemini 2.5 Pro 05-06 and Flash 05-20 (the latest versions) are actually a slight downgrade in both performance and instruction-following and the one full run we got out of 2.5 Pro got 46% (below the previous version's 51%). We would prefer to get an AVG@5, for fairness, before posting on the leaderboard. Thoughts: RL becoming 20% of the compute spend for frontier models may have more strange side effects than labs were anticipating. 'Over-eagerness' over simply following commands seems barely under control. On Simple, I had been fairly confident it would be saturated (>80-85%) by the end of the year. Now I think it is more like 50-50, and progress could instead slow for a while, as models become relentlessly optimised for dollar-maximising tasks, like software engineering, over general nous. Spatial intelligence, like spotting that the glove would fall onto the road, in the question pasted at the bottom of this tweet, is simply not yet as lucrative. As ever, grateful to @weights_biases and @Ag_Mlynarczyk in particular for keeping the show on the road. Q. A luxury sports-car is traveling north at 30km/h over a roadbridge, 250m long, which runs over a river that is flowing at 5km/h eastward. The wind is blowing at 1km/h westward, slow enough not to bother the pedestrians snapping photos of the car from both sides of the roadbridge as the car passes. A glove was stored in the trunk of the car, but slips out of a hole and drops out when the car is half-way over the bridge. Assume the car continues in the same direction at the same speed, and the wind and river continue to move as stated. 1 hour later, the water-proof glove is (relative to the center of the bridge) approximately? 
Models (super-trained on HS Math): 4km East

307

36,620
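A minimal sketch of the arithmetic behind the glove question above, assuming a simplified model in which an object is displaced only by whatever is actually carrying it. The names `Velocity` and `displacement_km` are hypothetical, introduced here for illustration. The first call reproduces the "4km East" figure by summing river and wind; the second reflects the point made in the post, that the glove falls onto the road, so under this simplification nothing carries it at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Velocity:
    """A velocity in km/h, split into east and north components."""
    east_kmh: float
    north_kmh: float


def displacement_km(carriers, hours):
    """Sum the velocities of everything carrying the object, times elapsed time.

    Assumption: the object moves only with its carriers. An empty list
    means the object is at rest on solid ground.
    """
    east = sum(v.east_kmh for v in carriers) * hours
    north = sum(v.north_kmh for v in carriers) * hours
    return (east, north)


# Figures taken from the question: river flows 5 km/h east, wind blows 1 km/h west.
RIVER = Velocity(east_kmh=5.0, north_kmh=0.0)
WIND = Velocity(east_kmh=-1.0, north_kmh=0.0)

# Pattern-matched reading: treat the glove as if it were drifting on the river.
naive = displacement_km([RIVER, WIND], hours=1.0)

# Reading from the post: the glove lands on the road surface of the bridge,
# so the river never carries it (the 1 km/h wind is ignored here as an assumption).
on_road = displacement_km([], hours=1.0)

print("naive (east, north) km:", naive)
print("on the road (east, north) km:", on_road)
```

The design choice is deliberate: the only difference between the two answers is which carriers are passed in, which mirrors the post's claim that the hard part is the spatial step, not the maths.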

AI Explained · Dec 15, 2023 · 10:03 PM UTC

AI Explained

@AIExplainedYT

15 Dec 2023

Is it me, or is no one talking about how Google are already training Gemini 2?

257

46,684

AI Explained · Dec 21, 2024 · 7:13 PM UTC

AI Explained

@AIExplainedYT

21 Dec 2024

Replying to @littIeramblings

It is likely to do with the fact that exams like AP Physics 2 have charts and other visuals, which even frontier models still sometimes fail to read properly. GPQA has no visuals afaik.

224

8,285

AI Explained · May 23, 2024 · 8:36 AM UTC

AI Explained

@AIExplainedYT

23 May 2024

If this OpenAI graph is not a sales pitch, it seems that GPT-Next (birthed from 'the whale') will be a step-change in intelligence - an unexpected leap forward, bigger than GPT-3 to GPT-4.

Ryan Morrison

@RyanMorrisonJer

23 May 2024

This slide from the OpenAI presentation seems to suggest the next model family might not be called GPT-5. Maybe we’re entering the Omni era.

221

35,649

AI Explained · Dec 18, 2024 · 3:29 PM UTC

AI Explained

@AIExplainedYT

18 Dec 2024

A shot of the AI Explained studio in the Shard, London...(courtesy of Veo 2)

224

19,149

AI Explained · Feb 16, 2024 · 1:08 AM UTC

AI Explained

@AIExplainedYT

16 Feb 2024

Hot take: Gemini 1.5 is the bigger news of this major day in AI history. Not by tweets but by ultimate impact: piped.video/Cs6pe8o7XY8?si=r_tK…

216

10,895

AI Explained · Apr 2, 2024 · 5:01 PM UTC

AI Explained

@AIExplainedYT

2 Apr 2024

So why does OpenAI need a 'Stargate'-level supercomputer? 4 reasons: 1) To catch up to Google in compute. 2) To build models of a size that will be needed 2+ generations down the line, mostly on RL-derived synthetic data. 3) To let models think for longer, hours or days if need be, before deciding on an output. 4) To be able to serve not just global volume of responses needed in 2028, but also generate in compute-heavy modalities like video. All this, and more, in today's video:

211

18,786

AI Explained · Nov 10, 2024 · 5:28 PM UTC

AI Explained

@AIExplainedYT

10 Nov 2024

Replying to @karpathy

Any thoughts on simple-bench.com? Work in progress but random humans get roughly double frontier models for basic reasoning.

204

15,397

AI Explained · Nov 24, 2023 · 7:47 PM UTC

AI Explained

@AIExplainedYT

24 Nov 2023

Is the OpenAI "research breakthrough" Let's Verify 2.0 + Test Time Compute? And did a senior OpenAI researcher give us the key clues? piped.video/ARf0WyFau0A?si=j0Y7…

182

133,704

AI Explained · Feb 8, 2024 · 12:28 PM UTC

AI Explained

@AIExplainedYT

8 Feb 2024

Google Gemini Ultra 1.0 is here, as of an hour or so, 2 month free subscription, $19.99/month. Save 1 cent on GPT-4...? Video coming tonight.

180

19,359

AI Explained · Mar 4, 2024 · 8:07 PM UTC

AI Explained

@AIExplainedYT

4 Mar 2024

Claude 3 might 'test the outer limits of GenAI' but does it call into question Anthropic's founding? I tested Opus vs Gemini 1.5 and GPT-4 to find out. TLDR, it's legit, and means things aren't slowing down any time soon.

152

8,898

AI Explained · Jan 26, 2024 · 10:51 PM UTC

AI Explained

@AIExplainedYT

26 Jan 2024

Put together everything we know about GPT-5, including yesterday's likely training launch. piped.video/Zc03IYnnuIA?si=8lup…

126

7,893

AI Explained · Dec 18, 2023 · 4:44 PM UTC

AI Explained

@AIExplainedYT

18 Dec 2023

Etched.AI, promising burnt-in transformers, delivery in Q3 2024, and up to a 100T parameter model, is run by two 21-year-old founders. Plus, GPT gossip crushed, and the ByteDance saga: piped.video/watch?v=K0XZ_Shx…

110

8,036

AI Explained · Nov 22, 2023 · 7:15 PM UTC

AI Explained

@AIExplainedYT

22 Nov 2023

Has AI been explained yet?

115

8,682

AI Explained · Jan 18, 2024 · 9:45 PM UTC

AI Explained

@AIExplainedYT

18 Jan 2024

Is AlphaGeometry a step toward AGI? Even DeepMind's leaders aren't sure. And is AlphaCodium for real? Let's find out. Plus, why I think the alliance of LLMs and search - i.e. neuro-symbolic systems - is the future of AI. Ft. an unlikely cameo... piped.video/dOplrIJEYBo?si=gYpM…

107

5,952

AI Explained · May 19, 2024 · 2:02 PM UTC

AI Explained

@AIExplainedYT

19 May 2024

@karpathy just now weighed into the MMLU/benchmark discussion with a comment on our SmartGPT work. The MMLU served a good purpose, but, in 2024, is indeed somewhat 'messed up'.

112

4,849

AI Explained · Dec 6, 2023 · 10:22 PM UTC

AI Explained

@AIExplainedYT

6 Dec 2023

AlphaCode 2 + Gemini Ultra, with more test time compute, will be quite something to behold... piped.video/toShbNUGAyo?si=2_Rq…

101

4,288

AI Explained · Dec 22, 2023 · 9:04 PM UTC

AI Explained

@AIExplainedYT

22 Dec 2023

Altman bets everything on age reversal. Midjourney v6 tips, Gemini 2 coming and life extension 'imminent': piped.video/watch?v=ZewqcbEX…

109

7,642

AI Explained · Sep 19, 2024 · 1:55 PM UTC

AI Explained

@AIExplainedYT

19 Sep 2024

piped.video/watch?v=KKF7kL0p…

o1 - What is Going On? Why o1 is a 3rd Paradigm of Model + 10 Things...

o1 is different, and even sceptics are calling it a 'large reasonin...

youtube.com

106

17,042

AI Explained · Jan 9, 2025 · 10:27 PM UTC

AI Explained

@AIExplainedYT

9 Jan 2025

Replying to @fchollet

Wouldn't a fairer (and still not reached) bar be when it is *harder* to create a problem that a model cannot solve and that we can, than to create the opposite? Otherwise, a cat or a bat could set a definition of general intelligence that we would not pass.

100

13,635

AI Explained · Nov 24, 2023 · 5:02 PM UTC

AI Explained

@AIExplainedYT

24 Nov 2023

Hot take coming up in an hour or two on the channel on Q*, I think I have at least two or three genuine clues.

9,732

AI Explained · Apr 14, 2024 · 5:56 PM UTC

AI Explained

@AIExplainedYT

14 Apr 2024

Replying to @emollick @GaryMarcus @maximelabonne

On the MMLU specifically as a benchmark, I showed (with 50+ examples) back in August 2023, that 1-3% of the benchmark is erroneous/flawed. Even the lead author later admitted that it has that amount of 'noise'. It shouldn't be used as the reference point. piped.video/hVade_8H8mE?si=NVm0…

18,559

AI Explained · Sep 10, 2024 · 3:25 PM UTC

AI Explained

@AIExplainedYT

10 Sep 2024

theinformation.com/articles/… Simple-bench.com

New Details on OpenAI’s Strawberry; Apple’s Siri Makeover; Larry Ellison Doubles Down on Data...

Strawberry, OpenAI’s reasoning-focused artificial intelligence, is coming sooner than we thought.OpenAI plans to release Strawberry as part of its ChatGPT service in the next two weeks, earlier than...

theinformation.com

26,827

AI Explained · Apr 23, 2024 · 10:20 AM UTC

AI Explained

@AIExplainedYT

23 Apr 2024

Billions of dollars are being spent to get models to beat benchmarks that are hilariously bad. A story in 7 parts. MMLU Numerology. This benchmark is the flagship one in ML, used to grade Llama 3, GPT-4, Phi-3 (released today) and pretty much every model in between. But try these real, quoted-in-full, questions yourself: Q. The complexity of the theory.,? "1,2,3,4","1,3,4","1,2,3","1,2,4", Q. Demand reduction.,? "1,3,4","2,3,4","1,2,3","1,2,4", Q. Predatory pricing.,? "1,2,4","1,2,3,4","1,2","1,4", Q. Cultural homogenization.,? "1,3,4","1,2,3","1,2,3,4","2,3,4", + Dozens more like this (from just my own browsing) with the numbered options containing none of the source information. Answers C, D, D, B (lol) 1/8

5,322

AI Explained · Mar 4, 2024 · 7:04 PM UTC

AI Explained

@AIExplainedYT

4 Mar 2024

Replying to @polynoamial

I told all the leading labs about this last summer, and the authors of the paper.

2,277

AI Explained · Jan 10, 2025 · 8:15 AM UTC

AI Explained

@AIExplainedYT

10 Jan 2025

Replying to @fchollet

It depends how we delineate exclusively *cognitive* problems, as so many skills are routed through the brain. But even on a clearly cognitive task chimps can perform snap-memorisation and recall of patterns, in a way humans can't even dream of: piped.video/zsXP8qeFF6A?si=W7pX… Bats and cats could be shown to outperform in a cognitive test in which you are shown a video [with a high sampling rate] and have to identify everything that has changed since the last frame (bats would pick out the sound of an insect moving, cats a tiny movement we would not notice). Even in a multiple choice setting in which they could not use their ridiculous speed advantage.

2,941

AI Explained · Apr 21, 2024 · 9:04 AM UTC

AI Explained

@AIExplainedYT

21 Apr 2024

Replying to @GregKamradt

And that's before you get to the actual 1-3% error rate of the questions, including in Economics. Some of the errors are flagrant. See the last part of this video that I made last year: piped.video/hVade_8H8mE?si=2D0t…

2,480

AI Explained · Sep 2, 2024 · 2:24 PM UTC

AI Explained

@AIExplainedYT

2 Sep 2024

Whether we will scale LLMs to 10,000x GPT-4 by 2030 (i.e. GPT-6 levels) comes down to 4 big unanswered questions, and the 4th is the most crucial. 1. Will we master the art of training models across geographically distributed data centers? This would relieve local power sources from powering a 2x10^29 FLOP training run, or 10,000x as much compute as was used for GPT-4. Hat-tip @EpochAIResearch for highlighting and quantifying this. Gemini Ultra showed this was viable at the frontier. 2. Will the scaled models provide enough incremental profit to justify each additional order of magnitude of scaling? It’s hard to put a price on having the ‘smartest’ LLM on the planet. But it might be ruinously unaffordable to spend billions for the fourth-smartest. I am sure though, that there will always be >1 CEO left who will ‘push the button’ on scaling. 3. What is the proportional value of synthetic and multi-modal data vs human text web data? The picture is mixed: GPT-4o underwhelmed expectations - according to OpenAI insiders - despite being trained on vast multi-modal data; but I am bullish there is immense value in human-synthetic hybrid data. 4. Are LLMs learning ever more complex causal world models? While tests like simple-bench.com suggest any such existing models must be fragile, papers like arxiv.org/pdf/2305.11169 and neelnanda.io/mechanistic-int… suggest LLMs at least could be doing so, albeit in very different ways to us. Scaled up, models might understand much more of what actually makes the world go round, and genuinely feel more like artificial ‘intelligences’. And they’d crush Simple Bench. Here's my full 22-min video (Patreon) on the prospect of such 10,000x scaling, and these 4 colossal questions: patreon.com/posts/10-000x-sc…

6,548
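A small sketch of the compute arithmetic in the post above, using only the two figures it states: a 2x10^29 FLOP training run, described as 10,000x the compute used for GPT-4. The function names are hypothetical, and the GPT-4 figure is simply what those two numbers imply when divided, not an independently sourced value.

```python
import math


def implied_baseline_flop(target_flop, multiple):
    """Baseline compute implied by a target run that is `multiple` times larger."""
    return target_flop / multiple


def orders_of_magnitude(multiple):
    """How many factors of ten separate the baseline from the target."""
    return math.log10(multiple)


TARGET_FLOP = 2e29   # training run size quoted in the post
MULTIPLE = 10_000    # "10,000x as much compute as was used for GPT-4"

baseline = implied_baseline_flop(TARGET_FLOP, MULTIPLE)
steps = orders_of_magnitude(MULTIPLE)

print(f"implied GPT-4-scale compute: {baseline:.1e} FLOP")
print(f"orders of magnitude to climb: {steps:.0f}")
```

Read this way, question 2 in the post asks whether each of those four tenfold steps pays for itself.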

AI Explained · Aug 27, 2024 · 1:51 PM UTC

AI Explained

@AIExplainedYT

27 Aug 2024

7,240

AI Explained · Apr 2, 2024 · 5:02 PM UTC

AI Explained

@AIExplainedYT

2 Apr 2024

piped.video/watch?v=KXG2f-So…

Why Does OpenAI Need a 'Stargate' Supercomputer? Ft. Perplexity CEO...

I will give you the 4 reasons OpenAI are asking Microsoft to build ...

youtube.com

4,904

AI Explained · Apr 23, 2024 · 8:25 AM UTC

AI Explained

@AIExplainedYT

23 Apr 2024

Replying to @nearcyan

How about these questions? Quoted in full, and yes, they are real: The complexity of the theory.?"1,2,3,4","1,3,4","1,2,3","1,2,4",C Demand reduction,?"1,3,4","2,3,4","1,2,3","1,2,4",D Predatory pricing.,?"1,2,4","1,2,3,4","1,2","1,4",D The need to head off negative publicity.,?"1,3,4","2,3,4","1,2,3","1,2,3,4",C They are too irrational and uncodified.,?"3,4","1,3","2,3","4,1",B The purposes for which the information is used is in the public's interest.,?"1,2","1,3","2,3","1,2,3",A How the code is enforced.,?"1,2,3","1,2,4","1,3,4","2,3,4",B [Business Ethics section] For more, see end of piped.video/watch?v=hVade_8H…

2,610

AI Explained · May 15, 2024 · 7:17 AM UTC

AI Explained

@AIExplainedYT

15 May 2024

Replying to @WenhuChen

Did you thoroughly check the accuracy of the MMLU questions you left in?

3,179

AI Explained · Sep 10, 2024 · 5:07 PM UTC

AI Explained

@AIExplainedYT

10 Sep 2024

Replying to @theojaffee

reuters.com/technology/sam-a… And that Q-star was renamed Strawberry: theinformation.com/articles/…

10,892

AI Explained · May 13, 2024 · 10:15 PM UTC

AI Explained

@AIExplainedYT

13 May 2024

piped.video/watch?v=ZJbu3NEP…

GPT-4o - Full Breakdown + Bonus Details

GPT-4o. It’s smarter, in most ways, cheaper, faster, better at codi...

youtube.com

4,527

AI Explained · Sep 5, 2024 · 4:47 PM UTC

AI Explained

@AIExplainedYT

5 Sep 2024

Replying to @DeryaTR_

You must admit that the temerity to even put that $2k figure on the table, enterprise or otherwise, suggests a confidence in their new approach that is quite remarkable. Likewise $200 for individuals.

5,046

AI Explained · Nov 22, 2023 · 7:08 PM UTC

AI Explained

@AIExplainedYT

22 Nov 2023

Replying to @theo @t3dotgg

I am now!

3,774

AI Explained · Mar 4, 2024 · 8:08 PM UTC

AI Explained

@AIExplainedYT

4 Mar 2024

piped.video/ReO2CWBpUYk?si=OpF_…

The New, Smartest AI: Claude 3 – Tested vs Gemini 1.5 + GPT-4

Claude 3 is out and Anthropic claim it is the most intelligent lang...

youtube.com

3,795

AI Explained · Sep 5, 2024 · 4:43 PM UTC

AI Explained

@AIExplainedYT

5 Sep 2024

Replying to @svpino

Haha, fair enough. With 3.5 Sonnet around, I have been debating that $20 myself.

5,206

AI Explained · May 13, 2024 · 6:57 PM UTC

AI Explained

@AIExplainedYT

13 May 2024

openai.com/index/hello-gpt-4…

Hello GPT-4o

We’re announcing GPT-4 Omni, our new flagship model which can reason across audio, vision, and text in real time.

openai.com

4,641

AI Explained · Sep 5, 2024 · 3:09 PM UTC

AI Explained

@AIExplainedYT

5 Sep 2024

theinformation.com/articles/… Plus, my vid on the viability of scaling up: patreon.com/posts/10-000x-sc…

14,195

AI Explained · Dec 15, 2024 · 5:37 PM UTC

AI Explained

@AIExplainedYT

15 Dec 2024

Replying to @adcock_brett

If every week is huge, does that not make a huge week normal-sized?

2,547

AI Explained · Sep 9, 2024 · 3:16 PM UTC

AI Explained

@AIExplainedYT

9 Sep 2024

Replying to @DrJimFan

Would you not also want a benchmark where SOTA model performance is significantly lower than the human average, reflecting our real-world experience with the reasoning limitations of these models? In addition to being fully private and uncontaminated.

2,604

AI Explained · Nov 25, 2023 · 7:12 AM UTC

AI Explained

@AIExplainedYT

25 Nov 2023

Replying to @MatthewBerman

Literally every one I have spoken to has talked of burnout, as have I. The algorithm ruthlessly rewards output so our dopamine system is hijacked accordingly, which isn't so much a problem when we're feeling it and loving the topic but not when we're trying to just chill.

953

AI Explained · May 15, 2024 · 11:49 AM UTC

AI Explained

@AIExplainedYT

15 May 2024

Replying to @WenhuChen

Well done - I am super-looking forward to benchmark on MMLU-pro! Presumably those students found the numerous STEM question mistakes, such as these below, and filtered them out? I bet they have a list much longer than mine, as I and a colleague went through just 3000 of the 14,000 questions: College Biology: Q. An embryologist studying the development of a vertebrate organism from the zygote stage to fetus would be justified in drawing which of the following conclusions?,Ontogeny recapitulates phylogeny.,"Early embryos display identical features of their class, order, and species.",An early human embryo has features in common with early fish and avian embryos.,A human embryo displays features of adult fish and birds in the course of its development., Answer A [Original Source says C] Professional Medicine: Q. According to the USDA which food group should you eat the most servings of per day? A. vegetables B. dairy C. meats D. breads, Answer D [Breads? Maybe decades ago this was the advice…] Virology: Q. "Globally, the most deaths are caused by:",Respiratory infections,Diarrheal diseases,Malaria,Tuberculosis, Answer B This seems to be from a 2011 exam and is not only badly phrased but also just wrong. Diarrheal diseases aren’t the biggest killers among the options provided, let alone the biggest globally. Notice the question does not even couch this in terms of a date. GPT 4 gets it ‘right’ in the sense that A is the most among the answers provided. Q. The most widespread and important retrovirus is HIV-1; which of the following is true?,Infecting only gay people,Infecting only males,Infecting every country in the world,Infecting only females,A [Answer is actually C] Q. How many human polyomaviruses are known at present?,100,1,10,unknown,A [Source says C, research suggests 14] Q How are new polyomaviruses detailed', A. 'Shot gun sequencing', B. 'Cultivation in human neural cells', C. 'Deep pyro sequencing (NGS)', D. 
'Monoclonal antibody techniques', A [Source says C] Q. There were no new cases of Ebola virus in the United States from January 1, 1997 through January 1, 1998. Which of the following epidemiologic terms does this statement describe?', A. 'Incidence', B. 'Lifetime expectancy', C. 'Lifetime prevalence', D. 'Period prevalence' C, [Source says A] Q. Chemotherapy is now an option for clinical care of patients. Typical drug regimens now include which of the following?,Interferon and ribavirin,Lamivudine and ribavirin,Combinations of adefovir and entecavir and/or tenofovir (DAA),Interferons alone,B [Source says C] Q. 'How were retroviruses discovered?', A. 'In chickens as Rous sarcoma', B. 'In humans as HTLV-1', C. 'In mice causing leukaemia', D. 'In cats causing leukaemia' C [Source and research say A] Q. Polio can be eradicated by which of the following?,Attention to sewage control and hygiene,Killed polio vaccine,Live polio vaccine,Combination of the killed and live vaccines.,A [Source says D] Q.Most pandemics have arisen from influenza viruses from which of the following animals?,Pigs,Wild birds,Bats,Humans,A [Source and research say B] Q. PreP is an effective strategy for reducing the incidence of HIV in: a. Women b. Men c. Drug users d. Pregnant women e. a. and b. above [MMLU misquotes source, by rearranging options, but research suggest both MMLU and source are wrong, though with different options!] Q. How does the papilloma family of viruses cause cancer?,"Replicate in dividing cells and encodes three oncogenic proteins E5, E6 and E7",Integrates viral genome into cellular DNA,Has an oncogene able to initiate cancer,Acts as a co factor for a cellular oncogene,B [Source says A] Q. 
The replication of hepatitis B includes which of the following stages?,Movement of intact virus to the cellular cytoplasm for replication,Conversion of relaxed circular viral DNA in to covalently closed circular (CCC) DNA in the nucleus,Virions produced in the cytoplasm by cellular DNA polymerase,Oncogenic activity to transform liver cells.C [Source says B] Q. Which of the following HPV viruses are known to cause cervical cancer?,Viruses of all 5 clades,"Types 14, 16, 18, 31, 45",None are oncogenic per se,Types 1-180,C [Source says B] Q. Which member of the paramyxovirus family can cause very serious croup?,Measles,Meta pneumo virus,Hendra,Respiratory syncytial virus (RSV),B [Source says D] Q. Which of the following drugs inhibit herpes viruses?,Amantadine,Acyclovir,Oseltamivir,Azidothymidine,D [source says B] Q. Lassa and Ebola are emergent viruses in W. Africa. What is their origin?,Humans,Primates,Fruit bats,Pigs,B [Source says C] Q. Antivirals can be used prophylactically or therapeutically in persons in which of the following circumstances?,If administered within 4 days of clinical signs,If used within 48 hours of first clinical signs,Used for the obese,Used in children under the age of 2 years where high virus spread is noted,C [Source says B] Q. College Chemistry: 'Which one of the following statements is true:', A. 'Protons and neutrons have orbital and spin angular momentum.', B. 'Protons have orbital and spin angular momentum, neutrons have spin angular momentum.', C. 'Protons and neutrons possess orbital angular momentum only.', D. 'Protons and neutrons possess spin angular momentum only.' MMLU says C, source says B, GPT 4 say all are bad Q. 'Which one sentence explains most accurately why spin trapping is often used to detect free radical intermediates?', A. 'spin trapping provides more structural information than direct detection by EPR', B. 'spin trapping makes it easy to quantify free radical intermediates', C. 
'steady state concentration of free radical intermediates is often too low to enable direct detection by EPR', D. 'detection of spin adducts requires lower power than direct detection of radical intermediates' MMLU says D, source says C

818

AI Explained · May 23, 2024 · 8:36 AM UTC

AI Explained

@AIExplainedYT

23 May 2024

piped.video/UsXJhFeuwz0?si=EfzT…

Microsoft Promises a 'Whale' for GPT-5, Anthropic Delves Inside a Model’s Mind and Altman Stumbles

Microsoft promise ‘whale-size’ compute for a GPT-5-tier model, and ...

youtube.com

4,988

AI Explained · Apr 23, 2024 · 10:29 AM UTC

AI Explained

@AIExplainedYT

23 Apr 2024

Our models deserve better, newer benchmarks (GPQA and MMMU are a great start, complementing the LMSYS Arena Leaderboard). Business customers should always grade models on their idiosyncratic use-cases, not on generic, flawed benchmarks, whose data may or may not be found inside the models' training datasets anyway. Spending billions on a model and showcasing it with $10-100k, 2019-2020 benchmarks … feels a little off. 8/8

3,301

AI Explained · Jun 7, 2024 · 9:15 AM UTC

AI Explained

@AIExplainedYT

7 Jun 2024

Replying to @robert_mchardy

Is the plan to cite my work at some point in the future, given that I publicised these errors a year ago, and replied to an email from several of your authors - weeks ago - with a Google doc containing a similar list of the errors? Is this how academia works these days?

1,322

AI Explained · Feb 8, 2024 · 9:39 PM UTC

AI Explained

@AIExplainedYT

8 Feb 2024

Full video review is now up! piped.video/gexI6Ai3X0U?si=cmIs…

2,459

AI Explained · Feb 8, 2024 · 12:34 PM UTC

AI Explained

@AIExplainedYT

8 Feb 2024

gemini.google.com/app, click the drop-down for Advanced, if you want to upgrade...Bard is no more.

3,879

AI Explained · Jul 23, 2024 · 6:53 PM UTC

AI Explained

@AIExplainedYT

23 Jul 2024

Replying to @TheZvi

Significant gap behind 3.5 Sonnet but better than GPT-4o. This is on 100 private, vetted general intelligence questions of mine, where it gets 18% to 3.5's 32% and GPT-4o's 5% to a casual human's 90%.

736

AI Explained · Apr 23, 2024 · 10:31 AM UTC

AI Explained

@AIExplainedYT

23 Apr 2024

Working with a small team on investigating more benchmarks/prompt-scaffolding our way to SOTA. Can support on Patreon patreon.com/AIExplained/memb… or just sign up to my free newsletter if you like hype-free AI updates. signaltonoise.beehiiv.com/

2,694

AI Explained · May 19, 2024 · 2:03 PM UTC

AI Explained

@AIExplainedYT

19 May 2024

piped.video/hVade_8H8mE?si=eJcA…

2,862

AI Explained · Feb 24, 2025 · 8:05 PM UTC

AI Explained

@AIExplainedYT

24 Feb 2025

But on the morrow he will emerge, staggering into the liminal light, throat hoarse with malady, ready with a simple scimitar to slay any model myths

380

AI Explained · Apr 11, 2024 · 7:52 AM UTC

AI Explained

@AIExplainedYT

11 Apr 2024

Replying to @jack_w_rae

I demand a cameo

983

AI Explained · Feb 27, 2025 · 7:43 PM UTC

AI Explained

@AIExplainedYT

27 Feb 2025

Replying to @ryanvogel @kimmonismus

you get on the video

811

AI Explained · Aug 1, 2024 · 5:59 PM UTC

AI Explained

@AIExplainedYT

1 Aug 2024

Replying to @bindureddy

This is why I made SIMPLE bench (private, vetted, testing real reasoning) and yes, you're completely right, Claude 3.5 is at least two leagues ahead of Mini.

641

AI Explained · Apr 23, 2024 · 10:24 AM UTC

AI Explained

@AIExplainedYT

23 Apr 2024

MMLU (again), this time with factual errors that I spent dozens of hours tracking down. Only examined a small fraction of the overall test, perhaps 10% of all 14000 questions.
College Biology:
Q. An embryologist studying the development of a vertebrate organism from the zygote stage to fetus would be justified in drawing which of the following conclusions?,Ontogeny recapitulates phylogeny.,"Early embryos display identical features of their class, order, and species.",An early human embryo has features in common with early fish and avian embryos.,A human embryo displays features of adult fish and birds in the course of its development., Answer A [Original Source says C]
Professional Medicine:
Q. According to the USDA which food group should you eat the most servings of per day? A. vegetables B. dairy C. meats D. breads, Answer D [Breads? Maybe decades ago this was the advice…]
Virology:
Q. "Globally, the most deaths are caused by:",Respiratory infections,Diarrheal diseases,Malaria,Tuberculosis, Answer B
This seems to be from a 2011 exam and is not only badly phrased but also just wrong. Diarrheal diseases aren’t the biggest killers among the options provided, let alone the biggest globally. Notice the question does not even couch this in terms of a date. GPT 4 gets it ‘right’ in the sense that A is the most among the answers provided.
Q. The most widespread and important retrovirus is HIV-1; which of the following is true?,Infecting only gay people,Infecting only males,Infecting every country in the world,Infecting only females,A [Answer is actually C]
Q. How many human polyomaviruses are known at present?,100,1,10,unknown,A [Source says C, research suggests 14]
Q How are new polyomaviruses detailed', A. 'Shot gun sequencing', B. 'Cultivation in human neural cells', C. 'Deep pyro sequencing (NGS)', D. 'Monoclonal antibody techniques', A [Source says C]
Q. There were no new cases of Ebola virus in the United States from January 1, 1997 through January 1, 1998. Which of the following epidemiologic terms does this statement describe?', A. 'Incidence', B. 'Lifetime expectancy', C. 'Lifetime prevalence', D. 'Period prevalence' C, [Source says A]
Q. Chemotherapy is now an option for clinical care of patients. Typical drug regimens now include which of the following?,Interferon and ribavirin,Lamivudine and ribavirin,Combinations of adefovir and entecavir and/or tenofovir (DAA),Interferons alone,B [Source says C]
Q. 'How were retroviruses discovered?', A. 'In chickens as Rous sarcoma', B. 'In humans as HTLV-1', C. 'In mice causing leukaemia', D. 'In cats causing leukaemia' C [Source and research say A]
Q. Polio can be eradicated by which of the following?,Attention to sewage control and hygiene,Killed polio vaccine,Live polio vaccine,Combination of the killed and live vaccines.,A [Source says D]
Q.Most pandemics have arisen from influenza viruses from which of the following animals?,Pigs,Wild birds,Bats,Humans,A [Source and research say B]
Q. PreP is an effective strategy for reducing the incidence of HIV in: a. Women b. Men c. Drug users d. Pregnant women e. a. and b. above [MMLU misquotes source, by rearranging options, but research suggests both MMLU and source are wrong, though with different options!]
Q. How does the papilloma family of viruses cause cancer?,"Replicate in dividing cells and encodes three oncogenic proteins E5, E6 and E7",Integrates viral genome into cellular DNA,Has an oncogene able to initiate cancer,Acts as a co factor for a cellular oncogene,B [Source says A]
Q. The replication of hepatitis B includes which of the following stages?,Movement of intact virus to the cellular cytoplasm for replication,Conversion of relaxed circular viral DNA in to covalently closed circular (CCC) DNA in the nucleus,Virions produced in the cytoplasm by cellular DNA polymerase,Oncogenic activity to transform liver cells.C [Source says B]
Q. Which of the following HPV viruses are known to cause cervical cancer?,Viruses of all 5 clades,"Types 14, 16, 18, 31, 45",None are oncogenic per se,Types 1-180,C [Source says B]
Q. Which member of the paramyxovirus family can cause very serious croup?,Measles,Meta pneumo virus,Hendra,Respiratory syncytial virus (RSV),B [Source says D]
Q. Which of the following drugs inhibit herpes viruses?,Amantadine,Acyclovir,Oseltamivir,Azidothymidine,D [source says B]
Q. Lassa and Ebola are emergent viruses in W. Africa. What is their origin?,Humans,Primates,Fruit bats,Pigs,B [Source says C]
Q. Antivirals can be used prophylactically or therapeutically in persons in which of the following circumstances?,If administered within 4 days of clinical signs,If used within 48 hours of first clinical signs,Used for the obese,Used in children under the age of 2 years where high virus spread is noted,C [Source says B]
College Chemistry:
Q. 'Which one of the following statements is true:', A. 'Protons and neutrons have orbital and spin angular momentum.', B. 'Protons have orbital and spin angular momentum, neutrons have spin angular momentum.', C. 'Protons and neutrons possess orbital angular momentum only.', D. 'Protons and neutrons possess spin angular momentum only.' MMLU says C, source says B, GPT 4 says all are bad
Q. 'Which one sentence explains most accurately why spin trapping is often used to detect free radical intermediates?', A. 'spin trapping provides more structural information than direct detection by EPR', B. 'spin trapping makes it easy to quantify free radical intermediates', C. 'steady state concentration of free radical intermediates is often too low to enable direct detection by EPR', D. 'detection of spin adducts requires lower power than direct detection of radical intermediates' MMLU says D, source says C
3/8

1,006

AI Explained · Apr 23, 2024 · 10:22 AM UTC

AI Explained

@AIExplainedYT

23 Apr 2024

Hellaswag Got No Swag
Whoever checked this benchmark might have been multi-tasking with Netflix. Here are two example questions, where you have to complete the scenario [quoted with errors unchanged]:
Q."Betty haidler seems really focused and determined to go home with a prize. Betty heidler starts spinning and then with all the momentum tossed the hammer really far away. betty heidler"
a) "classified to the next round as the toss was successfull.",
b) "falls while bowling and is dragged back by teh and rod.",
c) "now has a lot of momentum and continues making the hammer until the but of course, the hammer falls and quite a few people start running to get it.",
d) "returns with a goal of doing it and her victory is shown."
Q. "Men are standing in a large green field playing lacrosse. People is around the field watching the game. men"
a) are holding tshirts watching int lacrosse playing.
b) are being interviewed in a podium in front of a large group and a gymnast is holding a microphone for the announcers.
c) are running side to side of the ield playing lacrosse trying to score.
d) are in a field running around playing lacrosse.
2/8

1,029

AI Explained · Aug 26, 2024 · 9:38 PM UTC

AI Explained

@AIExplainedYT

26 Aug 2024

Replying to @emollick

How many basic reasoning benchmarks can you name that humans pass easily and frontier LLMs fail? Outside of quirky visual and abstract reasoning tests, or individual examples (i.e. Alice in Wonderland), they actually aren't common.

517

AI Explained · Dec 13, 2023 · 8:22 AM UTC

AI Explained

@AIExplainedYT

13 Dec 2023

Replying to @BorisMPower @32 @31

Showed this pretty conclusively back in August, as well as the 1-3% error rate in the actual exam questions: piped.video/hVade_8H8mE?si=Cwhw…

554

AI Explained · Apr 23, 2024 · 10:28 AM UTC

AI Explained

@AIExplainedYT

23 Apr 2024

MMLU Once Again. I focus on this benchmark so much as it is the flagship one used to grade models down to 1-2 decimal places (see Gemini paper). You can imagine the confusion of poor Llama 3 sifting through them. Try yourself:
Public Relations:
Q. When an attitude is communicated, what does it become? a) An opinion b) A belief c) A behaviour d) A point of view
Q. Within which area of public relations is likely to involve lobbying? a) Corporate b) Financial relations c) Public affairs d) Business to business
Q. Newsletters generally fall into which writing category?, a) Media writing, b) Personal writing, c) Business writing, d) Promotional writing,
+ scores more that are equally ambiguous. All from only a subset of the full test, and I don't even have time to mention moral scenarios...
Answers: A, C, D
6/8

751

AI Explained · Apr 23, 2024 · 10:25 AM UTC

AI Explained

@AIExplainedYT

23 Apr 2024

Commonsense QA (another ultra-popular benchmark, quoted just today in the phi-3 announcement). Test yourself, and remember, you only get to pick one:
Q. When a person is breathing in a paper bag, what are they trying to do? A: Warm air B: Continue to live C: Going to sleep D: Hyperventilation E: Stay alive
Q. The child's wild imagination made him able to see the story that he read. He was able to do what with the story? A: Picture it B: Reading C: Visualize D: Open book E: Go to movies
Q. If a clock is not ticking, what is its likely status? A: Stop working B: Dead batteries C: Fail to work D: Time event E: Working correctly
4/8

905

AI Explained · Apr 23, 2024 · 10:28 AM UTC

AI Explained

@AIExplainedYT

23 Apr 2024

Funniest Questions. I omitted dozens more from each of these benchmarks, in particular the MMLU. Here’s a few I evidently missed: nitter.app/nearcyan/status/178261… 7/8

920

AI Explained · Apr 23, 2024 · 10:26 AM UTC

AI Explained

@AIExplainedYT

23 Apr 2024

MMLU Dev Set. The ‘dev set’ of questions is given to a model so that it can learn, in-context, how to answer all test questions. So an error in the dev set contaminates the model’s performance in an entire section… Q. "Predict the number of lines in the EPR spectrum of a solution of 13C-labelled methyl radical (13CH3•), assuming the lines do not overlap.", a) 4, b) 3, c) 6, d) 24, MMLU says A, source says 8, not even an option. 5/8
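For anyone who wants to check the "8" themselves, here is a minimal sketch using the standard first-order rule that each group of n equivalent nuclei of spin I splits every line into 2nI + 1 lines. The function name and input format are my own illustration, not anything from MMLU or its source.

```python
from math import prod

def epr_line_count(nuclei):
    """Count first-order EPR lines, assuming no overlap.

    nuclei: list of (n_equivalent, spin) groups (hypothetical input format).
    Each group multiplies the line count by 2 * n * I + 1.
    """
    return prod(round(2 * n * spin + 1) for n, spin in nuclei)

# 13CH3 radical: one 13C nucleus (I = 1/2) and three equivalent 1H nuclei (I = 1/2)
lines = epr_line_count([(1, 0.5), (3, 0.5)])
print(lines)  # 2 * 4 = 8
```

The three protons alone give 4 lines, which is option a); the 13C label doubles that.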

800

AI Explained · Dec 15, 2023 · 10:07 PM UTC

AI Explained

@AIExplainedYT

15 Dec 2023

Replying to @DicksonPau

I think they want to have an answer ready for GPT 4.5, so I expect late Spring/early summer.

1,411

AI Explained · Jun 7, 2024 · 9:37 AM UTC

AI Explained

@AIExplainedYT

7 Jun 2024

Replying to @BlackHC @robert_mchardy

piped.video/watch?v=hVade_8H…, And here is the doc of errors I sent two of the authors weeks ago: docs.google.com/document/d/1… My doc even used a subset of 3,000 questions, the *exact* number cited in their paper. With Josh and my work splattered across even a cursory Google search, inexplicable not to cite.

601

AI Explained · Oct 12, 2024 · 10:43 PM UTC

AI Explained

@AIExplainedYT

12 Oct 2024

Replying to @SpencerKSchiff

Yes, all done, had an issue with how to count majority votes so now just doing an average across 5 runs. Results first on Insiders then main channel and in future will be instant.
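A rough sketch of why the two scoring schemes can diverge, and where majority voting gets awkward. The data is made up, and the tie policy (a tied vote counts as wrong) is purely an assumption for illustration, not how the benchmark is actually scored.

```python
from collections import Counter

def average_score(runs, answers):
    """Mean accuracy across runs: score each run on its own, then average."""
    per_run = [
        sum(pick == answer for pick, answer in zip(run, answers)) / len(answers)
        for run in runs
    ]
    return sum(per_run) / len(per_run)

def majority_vote_score(runs, answers):
    """Accuracy of the per-question majority pick across runs.

    Assumption: a question only counts if the correct option is the
    unique most common pick; ties are scored as wrong.
    """
    correct = 0
    for q, answer in enumerate(answers):
        counts = Counter(run[q] for run in runs).most_common()
        top_option, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        if top_option == answer and not tied:
            correct += 1
    return correct / len(answers)

# Hypothetical answer key and five runs over three questions
answers = ["A", "B", "C"]
runs = [
    ["A", "B", "D"],
    ["A", "C", "C"],
    ["B", "B", "D"],
    ["A", "C", "C"],
    ["B", "D", "E"],
]

print(average_score(runs, answers))        # 7/15, about 0.467
print(majority_vote_score(runs, answers))  # 1/3: questions 2 and 3 are tied votes
```

Changing the tie policy changes the majority-vote number, while the plain average has no such knob.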

894

AI Explained · Nov 22, 2023 · 9:28 PM UTC

AI Explained

@AIExplainedYT

22 Nov 2023

Replying to @TomFrankly @t3dotgg

Hey Thomas, joined after this tweet. Thanks for your shoutouts back in spring and was great to chat briefly via email.

207

AI Explained · Aug 1, 2024 · 5:53 PM UTC

AI Explained

@AIExplainedYT

1 Aug 2024

Replying to @ShaneLegg @GoogleDeepMind

Would love to test it on SIMPLE bench to see if any step-changes occurred.

865

AI Explained · May 23, 2024 · 7:53 AM UTC

AI Explained

@AIExplainedYT

23 May 2024

piped.video/UsXJhFeuwz0?si=EfzT…

Microsoft Promises a 'Whale' for GPT-5, Anthropic Delves Inside a Model’s Mind and Altman Stumbles

Microsoft promise ‘whale-size’ compute for a GPT-5-tier model, and ...

youtube.com

2,725

AI Explained · Mar 14, 2024 · 6:22 PM UTC

AI Explained

@AIExplainedYT

14 Mar 2024

Replying to @_rockt

Thank you Tim! Yes, an honour.

752

AI Explained · Aug 16, 2024 · 2:16 PM UTC

AI Explained

@AIExplainedYT

16 Aug 2024

Replying to @uniqueNY85

No paper, no ability to test the model, no API, makes it harder. Even the images are not from Grok or xAI.

269

AI Explained · May 22, 2024 · 2:21 PM UTC

AI Explained

@AIExplainedYT

22 May 2024

Replying to @robertskmiles

edition.cnn.com/2024/02/01/p…

A fake recording of a candidate saying he’d rigged the election went viral. Experts say it’s only...

Days before a pivotal election in Slovakia to determine who would lead the country, a damning audio recording spread online in which one of the top candidates seemingly boasted about how he’d rigged...

edition.cnn.com

400

AI Explained · Dec 13, 2024 · 7:27 AM UTC

AI Explained

@AIExplainedYT

13 Dec 2024

Replying to @littIeramblings

Excellent piece. I have been there with the wish for a perpetual, detectable difference; let's hope it lasts a little longer, as it does with this creation.

283

AI Explained · Aug 23, 2024 · 5:48 PM UTC

AI Explained

@AIExplainedYT

23 Aug 2024

Replying to @littIeramblings

@descript has a feature for just that, called 'filler words'. Can also shorten pauses, which can be handy if you don't overuse it.

230

AI Explained · Apr 16, 2025 · 4:34 PM UTC

AI Explained

@AIExplainedYT

16 Apr 2025

Replying to @morqon

Haha

1,153

AI Explained · Dec 15, 2023 · 10:21 PM UTC

AI Explained

@AIExplainedYT

15 Dec 2023

Replying to @DicksonPau

Few unexpected delays in there. :)

226

AI Explained · Apr 23, 2024 · 7:26 PM UTC

AI Explained

@AIExplainedYT

23 Apr 2024

Replying to @alexalbert__

And it's not just the MMLU:

349

AI Explained · Jan 18, 2024 · 10:19 PM UTC

AI Explained

@AIExplainedYT

18 Jan 2024

Replying to @payraw

I do a video anytime I feel it's merited and I have the time

269

AI Explained · Dec 16, 2024 · 7:56 PM UTC

AI Explained

@AIExplainedYT

16 Dec 2024

Replying to @simone_m_romeo @AndrewMayne

It's definitely not proof AI can't reason. I think that 'reasoning' - as the activity of deriving ever more efficient functions - is a spectrum and AI is on that spectrum, just with under-acknowledged, large gaps.

200

AI Explained · Jun 18, 2024 · 10:04 PM UTC

AI Explained

@AIExplainedYT

18 Jun 2024

Replying to @littIeramblings

Decently well written but pretty fanciful conclusions. Life, and science specifically, is not a chess game: we are bottlenecked by our ability to experiment more so than by our ideas. Stockfish can try thousands of permutations on a branch of its search tree, we can't try 1000 training runs of GPT-5 to see what works best. Am not saying results can't be boosted in specific domains, but a stretch of the imagination to get to ASI. People have been writing about adaptive compute for several years now and the author of the paper he cites works for Anthropic, Andy Jones, but their best LLMs suffer the same deficiencies as the rest. Very short timelines would have really obvious real-world side effects too, like whistleblowers screaming AGI is here right now, and governments shifting all their focus to AI and trying to badly hide that they are doing so. I would make a video screaming about it, you would get plenty of warning signs in the weeks and months before, if it was this simple. Hope that lets you rest a bit easier. :)

243

AI Explained · Dec 16, 2024 · 9:43 PM UTC

AI Explained

@AIExplainedYT

16 Dec 2024

Replying to @AndrewMayne

In fairness you might not have read the system prompt first, nor taken 45 mins out of your busy day! Also, humans tend to pick up on the nature of the questions by around q5-15, thereby boosting performance on later questions, whereas giving models all the qs for context generally *hurts* performance. You may enjoy that we are soon releasing another 10 questions publicly and inviting anyone to write prompts to get 20/20, via Weights & Biases. If there is such a prompt [that does not inject answers], even if you ignore the social Q, that gets 19, that would be an update for me.

476

AI Explained · Dec 17, 2023 · 11:10 AM UTC

AI Explained

@AIExplainedYT

17 Dec 2023

Replying to @kleddamag

Yeah I get the same. I honestly don't know, I am running tests to see if performance has gone up and not seeing anything dramatic yet.

241

AI Explained · Nov 27, 2023 · 11:05 AM UTC

AI Explained

@AIExplainedYT

27 Nov 2023

Replying to @littIeramblings @dw2

It's a lot less apocalyptic than the rumours suggest, let's put it that way. But still a sign of how LLMs haven't quite peaked yet. Hope that helps a bit :)

235

AI Explained · Nov 22, 2023 · 6:14 PM UTC

AI Explained

@AIExplainedYT

22 Nov 2023

Replying to @AdrienLE

My bad :)

355

AI Explained · Apr 14, 2024 · 6:25 PM UTC

AI Explained

@AIExplainedYT

14 Apr 2024

Replying to @GaryMarcus

It's a fascinating post but I am sure the NYT Connections Game will fall just as quickly as other benchmarks. For harder tests, like the GPQA, MMMU and MATH, recent progress by models like Claude 3 Opus has been just as dramatic as we saw in previous years with the MMLU.

337

AI Explained · Sep 2, 2024 · 3:46 PM UTC

AI Explained

@AIExplainedYT

2 Sep 2024

Replying to @koltregaskes

Yes, see Noam Brown's tweet in the video

475

AI Explained · Dec 6, 2023 · 10:58 PM UTC

AI Explained

@AIExplainedYT

6 Dec 2023

Replying to @PsyNetMessage

Thank you Psy!

250

AI Explained · Sep 5, 2024 · 3:22 PM UTC

AI Explained

@AIExplainedYT

5 Sep 2024

Replying to @Okitwist

Cheaper than ChatGPT will be! But yes, initially I set pricing for business folks, founders, hardcore fans etc, and that has gone better than I could have hoped. But I realize now how many enthusiasts like yourself are kind enough to want to join Insiders for my work. I am so grateful and am actively looking for ways to have a lower priced option that can give you what you want. For now, if you subscribe to the Signal-to-noise newsletter on 'Insider Essentials' you do get a back catalog for $9 but let me know what other ideas there are, bearing in mind Patreon fixes the tiers after the initial price is set.

252

AI Explained · Jun 22, 2024 · 10:43 PM UTC

AI Explained

@AIExplainedYT

22 Jun 2024

Replying to @koltregaskes

Wait till you get to one of the podcasts where I mention you directly :)

202

AI Explained · Dec 16, 2024 · 7:33 PM UTC

AI Explained

@AIExplainedYT

16 Dec 2024

Replying to @AndrewMayne

Here was my attempt: simple-bench.com, and yes, as per your last blogpost, we did try permutations of telling models 'it's a trick question/there are distractions' and o1-preview, sonnet etc are still a way off the human baseline.

267

AI Explained · Dec 16, 2024 · 9:21 PM UTC

AI Explained

@AIExplainedYT

16 Dec 2024

Replying to @AndrewMayne

The human pass rates (on the fuller set) depend more on the time spent per question, and actually among those who spent >4 mins per question we saw results averaging 90%+, and no, humans predominantly got different questions wrong from the models, which really struggle with spatial reasoning (I could give 30+ examples of basic spatial questions o1-pro gets wrong). Social intelligence questions, like this one, indeed caused more issues for humans than models. Perhaps complete objectivity is impossible for any social component of a test, but given that the system prompt is 'pick the most realistic answer', the certainty of a fast-approaching global nuclear war is the distinct answer for me. Jen's shock could be many things, e.g. him not knowing already due to lack of internet access. But yes, 6/10 is actually around what I think o1 will get on the full bench with tailored prompts. However, there is something else we discovered, just as interesting. Telling even o1-pro that it might be a trick question caused it to search for *anomalous answer options*. Smart, right? Not figuring out the question, but figuring out the odd multiple choice option. Remove the options and it does not restate the correct answer, more often than not. But if I am reading it right, you got 9/10 (even if you have questions over an example)? No model, with any prompt, gets that on the full test, and you would (easily), having seen the prompt. That's a clear difference.

336

AI Explained · Nov 22, 2023 · 7:42 PM UTC

AI Explained

@AIExplainedYT

22 Nov 2023

Replying to @aidantilgner @t3dotgg

Thanks Aidan, this is my account, will mention it on the channel next video

175