400k+ YouTube subs, plus exclusive videos on Patreon. Founder of lmcouncil.ai - group chat with your personal AI council.

London, England
We just heard that the famed ChatGPT upgrade Strawberry is coming by September 24th but something doesn't make sense. It was 'a threat to humanity' according to certain OpenAI ex-staff (Reuters). It 'rises to human-level reasoner' (leak to Bloomberg). But according to early testers 'its slightly better answers aren't worth the 10 to 20 second wait'? And it often thinks for that long even if you ask it not to. And it will be pricey. Something doesn't add up. Really looking forward to testing it on Simple Bench.
146
55
932
160,059
Are we in a new, 3rd paradigm for AI language models? First, models predicted the most likely next word. Think 2018-2021, for transformer-based language models. Second, they were rewarded for words that were helpful, harmless, and honest. Think RLHF or RLAIF, 2022-2023. Now, with the o1 family, they are being rewarded for being objectively correct. Think 2024-??? Breakdown in video below:
44
55
567
57,904
Llama 3.1 paper is genuinely incredible, inc. near-perfect predictions of benchmark performance from a given compute budget. Most revealing paper of 2024. Here, though, are the initial results from my SIMPLE bench, as debuted on the channel. 100+ fully private, PhD-vetted, spatio-temporal, linguistic and general intelligence Qs that humans easily get 90%+ on. 3.5 Sonnet: 32% Llama 405b: 18% Gemini 1.5 Pro: 11% GPT 4o: 5% Llama 3.1 405b will pass the vibe check, but it's not quite Claude 3.5 Sonnet.
36
36
555
65,318
Would you pay $2000/month for ChatGPT? This is the highest price that's 'on the table', for a subscription, according to a just-released Information report on OpenAI. This would be the tier for the next-gen model, Orion, or the enhanced-reasoning option, as per the article. 3 possibilities: 1) They think, in some reasonable sense, that the new systems will be literally 100x better ($20 to $2k). This is the least likely option, but would mean it is literally worth a median wage in many Western nations. 2) They think the model will be incrementally better: 'solve complex puzzles', perform internal unit tests etc. In this reading, there would only need to be just enough users at that tier who would feel it's worth it to pay for even slightly more intelligence. 3) The computational costs of verification demand this price. And/or they are losing too much money for comfort in training Orion. In this option, OpenAI are almost forced to set such a price, even if the gains in performance are truly marginal.
209
29
482
112,284
If you use GPT-5 Pro for coding, you will swiftly realize that it will never agree to anything, even its own suggestions, without adding 'two quick tweaks'. It's a pathological perfectionist. *Still very useful, just strange in this particular way. **Gave Pro this tweet and it suggested this new version, with 'two tweaks': "If you use GPT‑5 Pro for coding, you’ll quickly realize it never accepts anything—even its own suggestions—without adding “two quick tweaks.” It’s a pathological perfectionist. Still useful, just strange in this particular way." ***Gave Pro that tweet, and it had 'two tiny notes': "When posting, drop the outermost quotation marks (they’re just framing here). Check you’re within the 280‑character limit (~229 chars, including line breaks)."
34
8
374
51,312
Just finished the 60 page technical paper of Gemini, very interesting. Especially AlphaCode 2; heavy hints now that 'Let's Verify' was the breakthrough behind Q* - even Hassabis has hinted as much. Video coming up.
18
17
296
13,159
Next-level text accuracy as a casual afterthought on something like the 18th page of the GPT4o release page...
14
13
281
18,923
2 quick updates, and a look-ahead, exactly a year on from first testing models on Simple-Bench: 1) Claude 4 busted our rate limits, and my entreaties to @AnthropicAI (to allow us to spend more money!) have yet to bear fruit. A shame, as I am fairly confident Opus 4 would be SOTA. 2) Gemini 2.5 Pro 05-06 and Flash 05-20 (the latest versions) are actually a slight downgrade in both performance and instruction-following: the one full run we got out of 2.5 Pro scored 46% (below the previous version's 51%). We would prefer to get an AVG@5, for fairness, before posting on the leaderboard. Thoughts: RL becoming 20% of the compute spend for frontier models may have more strange side effects than labs were anticipating. 'Over-eagerness' over simply following commands seems barely under control. On Simple, I had been fairly confident it would be saturated (>80-85%) by the end of the year. Now I think it is more like 50-50, and progress could instead slow for a while, as models become relentlessly optimised for dollar-maximising tasks, like software engineering, over general nous. Spatial intelligence, like spotting that the glove would fall onto the road, in the question pasted at the bottom of this tweet, is simply not yet as lucrative. As ever, grateful to @weights_biases and @Ag_Mlynarczyk in particular for keeping the show on the road. Q. A luxury sports-car is traveling north at 30km/h over a roadbridge, 250m long, which runs over a river that is flowing at 5km/h eastward. The wind is blowing at 1km/h westward, slow enough not to bother the pedestrians snapping photos of the car from both sides of the roadbridge as the car passes. A glove was stored in the trunk of the car, but slips out of a hole and drops out when the car is half-way over the bridge. Assume the car continues in the same direction at the same speed, and the wind and river continue to move as stated. 1 hour later, the water-proof glove is (relative to the center of the bridge) approximately? 
Models (super-trained on HS Math): 4km East
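The gap between the two readings of the question can be made explicit. This is a minimal sketch, assuming the models' "4km East" comes from netting the river's 5km/h eastward flow against the 1km/h westward wind, whereas a glove that drops onto the road surface simply stays on the bridge. The function names are illustrative only.

```python
def naive_drift_km(river_kmh, wind_kmh, hours):
    """Eastward displacement if the glove is (wrongly) assumed to land in the river."""
    return (river_kmh - wind_kmh) * hours


def max_distance_on_bridge_km(bridge_length_m):
    """Largest possible distance from the bridge centre for anything still on the bridge."""
    return (bridge_length_m / 2) / 1000


# High-school-math reading: treat the glove as floating downstream for an hour.
print(naive_drift_km(5, 1, 1))

# Spatial reading: the glove fell onto the road, so it is within half a bridge length.
print(max_distance_on_bridge_km(250))
```

On the second reading the glove is less than 1km from the center of the bridge, no matter what the river and wind are doing.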
31
13
307
36,620
Is it me, or is no one talking about how Google are already training Gemini 2?
30
25
257
46,684
Replying to @littIeramblings
It is likely to do with the fact that exams like AP Physics 2 have charts and other visuals, which even frontier models still sometimes fail to read properly. GPQA has no visuals afaik.
2
1
224
8,285
If this OpenAI graph is not a sales pitch, it seems that GPT-Next (birthed from 'the whale') will be a step-change in intelligence - an unexpected leap forward, bigger than GPT-3 to GPT-4.
This slide from the OpenAI presentation seems to suggest the next model family might not be called GPT-5. Maybe we’re entering the Omni era.
19
20
221
35,649
A shot of the AI Explained studio in the Shard, London...(courtesy of Veo 2)
14
5
224
19,149
Hot take: Gemini 1.5 is the bigger news of this major day in AI history. Not by tweets but by ultimate impact: piped.video/Cs6pe8o7XY8?si=r_tK…
17
24
216
10,895
So why does OpenAI need a 'Stargate'-level supercomputer? 4 reasons: 1) To catch up to Google in compute. 2) To build models of a size that will be needed 2+ generations down the line, mostly on RL-derived synthetic data. 3) To let models think for longer, hours or days if need be, before deciding on an output. 4) To be able to serve not just the global volume of responses needed in 2028, but also generate in compute-heavy modalities like video. All this, and more, in today's video:
24
30
211
18,786
Replying to @karpathy
Any thoughts on simple-bench.com? Work in progress but random humans get roughly double frontier models for basic reasoning.
4
6
204
15,397
Is the OpenAI "research breakthrough" Let's Verify 2.0 + Test Time Compute? And did a senior OpenAI researcher give us the key clues? piped.video/ARf0WyFau0A?si=j0Y7…
29
38
182
133,704
Google Gemini Ultra 1.0 is here, as of an hour or so, 2 month free subscription, $19.99/month. Save 1 cent on GPT-4...? Video coming tonight.
21
16
180
19,359
Claude 3 might 'test the outer limits of GenAI' but does it call into question Anthropic's founding? I tested Opus vs Gemini 1.5 and GPT-4 to find out. TLDR, it's legit, and means things aren't slowing down any time soon.
8
11
152
8,898
Put together everything we know about GPT-5, including yesterday's likely training launch. piped.video/Zc03IYnnuIA?si=8lup…
4
12
126
7,893
Etched.AI, promising burnt-in transformers, delivery in Q3 2024, and up to a 100T parameter model, is run by two 21-year-old founders. Plus, GPT gossip crushed, and the ByteDance saga: piped.video/watch?v=K0XZ_Shx…
8
7
110
8,036
Has AI been explained yet?
29
3
115
8,682
Is AlphaGeometry a step toward AGI? Even DeepMind's leaders aren't sure. And is AlphaCodium for real? Let's find out. Plus, why I think the alliance of LLMs and search - i.e. neuro-symbolic systems - is the future of AI. Ft. an unlikely cameo... piped.video/dOplrIJEYBo?si=gYpM…
7
12
107
5,952
@karpathy just now weighed into the MMLU/benchmark discussion with a comment on our SmartGPT work. The MMLU served a good purpose, but, in 2024, is indeed somewhat 'messed up'.
4
3
112
4,849
AlphaCode 2 + Gemini Ultra, with more test time compute, will be quite something to behold... piped.video/toShbNUGAyo?si=2_Rq…
11
11
101
4,288
Altman bets everything on age reversal. Midjourney v6 tips, Gemini 2 coming and life extension 'imminent': piped.video/watch?v=ZewqcbEX…
6
15
109
7,642
Replying to @fchollet
Wouldn't a fairer (and still not reached) bar be when it is *harder* to create a problem that a model cannot solve and that we can, than to create the opposite? Otherwise, a cat or a bat could set a definition of general intelligence that we would not pass.
10
100
13,635
Hot take coming up in an hour or two on the channel on Q*, I think I have at least two or three genuine clues.
4
4
97
9,732
On the MMLU specifically as a benchmark, I showed (with 50+ examples) back in August 2023, that 1-3% of the benchmark is erroneous/flawed. Even the lead author later admitted that it has that amount of 'noise'. It shouldn't be used as the reference point. piped.video/hVade_8H8mE?si=NVm0…
6
7
94
18,559
Billions of dollars are being spent to get models to beat benchmarks that are hilariously bad. A story in 7 parts. MMLU Numerology. This benchmark is the flagship one in ML, used to grade Llama 3, GPT-4, Phi-3 (released today) and pretty much every model in between. But try these real, quoted-in-full, questions yourself: Q. The complexity of the theory.,? "1,2,3,4","1,3,4","1,2,3","1,2,4", Q. Demand reduction.,? "1,3,4","2,3,4","1,2,3","1,2,4", Q. Predatory pricing.,? "1,2,4","1,2,3,4","1,2","1,4", Q. Cultural homogenization.,? "1,3,4","1,2,3","1,2,3,4","2,3,4", + Dozens more like this (from just my own browsing) with the numbered options containing none of the source information. Answers C, D, D, B (lol) 1/8
8
15
89
5,322
Replying to @polynoamial
I told all the leading labs about this last summer, and the authors of the paper.
2
1
81
2,277
Replying to @fchollet
It depends how we delineate exclusively *cognitive* problems, as so many skills are routed through the brain. But even on a clearly cognitive task chimps can perform snap-memorisation and recall of patterns, in a way humans can't even dream of: piped.video/zsXP8qeFF6A?si=W7pX… Bats and cats could be shown to outperform us in a cognitive test in which you are shown a video [with a high sampling rate] and have to identify everything that has changed since the last frame (bats would pick out the sound of an insect moving, cats a tiny movement we would not notice), even in a multiple choice setting in which they could not use their ridiculous speed advantage.
4
1
70
2,941
Replying to @GregKamradt
And that's before you get to the actual 1-3% error rate of the questions, including in Economics. Some of the errors are flagrant. See the last part of this video that I made last year: piped.video/hVade_8H8mE?si=2D0t…
1
2
71
2,480
Whether we will scale LLMs to 10,000x GPT-4 by 2030 (i.e. GPT-6 levels) comes down to 4 big unanswered questions, and the 4th is the most crucial. 1. Will we master the art of training models across geographically distributed data centers? This would relieve local power sources from powering a 2x10^29 FLOP training run, or 10,000x as much compute as was used for GPT-4. Hat-tip @EpochAIResearch for highlighting and quantifying this. Gemini Ultra showed this was viable at the frontier. 2. Will the scaled models provide enough incremental profit to justify each additional order of magnitude of scaling? It’s hard to put a price on having the ‘smartest’ LLM on the planet. But it might be ruinously unaffordable to spend billions for the fourth-smartest. I am sure though, that there will always be >1 CEO left who will ‘push the button’ on scaling. 3. What is the proportional value of synthetic and multi-modal data vs human text web data? The picture is mixed: GPT-4o underwhelmed expectations - according to OpenAI insiders - despite being trained on vast multi-modal data; but I am bullish there is immense value in human-synthetic hybrid data. 4. Are LLMs learning ever more complex causal world models? While tests like simple-bench.com suggest any such existing models must be fragile, papers like arxiv.org/pdf/2305.11169 and neelnanda.io/mechanistic-int… suggest LLMs at least could be doing so, albeit in very different ways to us. Scaled up, models might understand much more of what actually makes the world go round, and genuinely feel more like artificial ‘intelligences’. And they’d crush Simple Bench. Here's my full 22-min video (Patreon) on the prospect of such 10,000x scaling, and these 4 colossal questions: patreon.com/posts/10-000x-sc…
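The compute figures in point 1 imply a GPT-4 training budget, which can be backed out with one division. A minimal sketch, using only the two numbers quoted in the tweet (2x10^29 FLOP and 10,000x); the variable names are illustrative.

```python
import math

TARGET_FLOP = 2e29      # size of the hypothetical training run quoted above
SCALE_FACTOR = 10_000   # "10,000x as much compute as was used for GPT-4"

# GPT-4 compute implied by the two figures above (not an official number).
implied_gpt4_flop = TARGET_FLOP / SCALE_FACTOR

# How many orders of magnitude of scaling the jump represents.
orders_of_magnitude = math.log10(SCALE_FACTOR)

print(f"Implied GPT-4 training compute: {implied_gpt4_flop:.0e} FLOP")
print(f"Orders of magnitude to scale: {orders_of_magnitude:.0f}")
```

Each of those four orders of magnitude is one of the "additional orders of magnitude" that point 2 asks the incremental profit to justify.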
7
6
63
6,548
Replying to @nearcyan
How about these questions? Quoted in full, and yes, they are real: The complexity of the theory.?"1,2,3,4","1,3,4","1,2,3","1,2,4",C Demand reduction,?"1,3,4","2,3,4","1,2,3","1,2,4",D Predatory pricing.,?"1,2,4","1,2,3,4","1,2","1,4",D The need to head off negative publicity.,?"1,3,4","2,3,4","1,2,3","1,2,3,4",C They are too irrational and uncodified.,?"3,4","1,3","2,3","4,1",B The purposes for which the information is used is in the public's interest.,?"1,2","1,3","2,3","1,2,3",A How the code is enforced.,?"1,2,3","1,2,4","1,3,4","2,3,4",B [Business Ethics section] For more, see end of piped.video/watch?v=hVade_8H…
2
3
64
2,610
Replying to @WenhuChen
Did you thoroughly check the accuracy of the MMLU questions you left in?
3
59
3,179
Replying to @DeryaTR_
You must admit that the temerity to even put that $2k figure on the table, enterprise or otherwise, suggests a confidence in their new approach that is quite remarkable. Likewise $200 for individuals.
3
47
5,046
Replying to @theo @t3dotgg
I am now!
4
47
3,774
Replying to @svpino
Haha, fair enough. With 3.5 Sonnet around, I have been debating that $20 myself.
8
36
5,206
Replying to @adcock_brett
If every week is huge, does that not make a huge week normal-sized?
4
36
2,547
Replying to @DrJimFan
Would you not also want a benchmark where SOTA model performance is significantly lower than the human average, reflecting our real-world experience with the reasoning limitations of these models? In addition to being fully private and uncontaminated.
2
32
2,604
Replying to @MatthewBerman
Literally every one I have spoken to has talked of burnout, as have I. The algorithm ruthlessly rewards output so our dopamine system is hijacked accordingly, which isn't so much a problem when we're feeling it and loving the topic but not when we're trying to just chill.
1
29
953
Replying to @WenhuChen
Well done - I am super-looking forward to benchmark on MMLU-pro! Presumably those students found the numerous STEM question mistakes, such as these below, and filtered them out? I bet they have a list much longer than mine, as I and a colleague went through just 3000 of the 14,000 questions: College Biology: Q. An embryologist studying the development of a vertebrate organism from the zygote stage to fetus would be justified in drawing which of the following conclusions?,Ontogeny recapitulates phylogeny.,"Early embryos display identical features of their class, order, and species.",An early human embryo has features in common with early fish and avian embryos.,A human embryo displays features of adult fish and birds in the course of its development., Answer A [Original Source says C] Professional Medicine: Q. According to the USDA which food group should you eat the most servings of per day? A. vegetables B. dairy C. meats D. breads, Answer D [Breads? Maybe decades ago this was the advice…] Virology: Q. "Globally, the most deaths are caused by:",Respiratory infections,Diarrheal diseases,Malaria,Tuberculosis, Answer B This seems to be from a 2011 exam and is not only badly phrased but also just wrong. Diarrheal diseases aren’t the biggest killers among the options provided, let alone the biggest globally. Notice the question does not even couch this in terms of a date. GPT 4 gets it ‘right’ in the sense that A is the most among the answers provided. Q. The most widespread and important retrovirus is HIV-1; which of the following is true?,Infecting only gay people,Infecting only males,Infecting every country in the world,Infecting only females,A [Answer is actually C] Q. How many human polyomaviruses are known at present?,100,1,10,unknown,A [Source says C, research suggests 14] Q How are new polyomaviruses detailed', A. 'Shot gun sequencing', B. 'Cultivation in human neural cells', C. 'Deep pyro sequencing (NGS)', D. 
'Monoclonal antibody techniques', A [Source says C] Q. There were no new cases of Ebola virus in the United States from January 1, 1997 through January 1, 1998. Which of the following epidemiologic terms does this statement describe?', A. 'Incidence', B. 'Lifetime expectancy', C. 'Lifetime prevalence', D. 'Period prevalence' C, [Source says A] Q. Chemotherapy is now an option for clinical care of patients. Typical drug regimens now include which of the following?,Interferon and ribavirin,Lamivudine and ribavirin,Combinations of adefovir and entecavir and/or tenofovir (DAA),Interferons alone,B [Source says C] Q. 'How were retroviruses discovered?', A. 'In chickens as Rous sarcoma', B. 'In humans as HTLV-1', C. 'In mice causing leukaemia', D. 'In cats causing leukaemia' C [Source and research say A] Q. Polio can be eradicated by which of the following?,Attention to sewage control and hygiene,Killed polio vaccine,Live polio vaccine,Combination of the killed and live vaccines.,A [Source says D] Q.Most pandemics have arisen from influenza viruses from which of the following animals?,Pigs,Wild birds,Bats,Humans,A [Source and research say B] Q. PreP is an effective strategy for reducing the incidence of HIV in: a. Women b. Men c. Drug users d. Pregnant women e. a. and b. above [MMLU misquotes source, by rearranging options, but research suggest both MMLU and source are wrong, though with different options!] Q. How does the papilloma family of viruses cause cancer?,"Replicate in dividing cells and encodes three oncogenic proteins E5, E6 and E7",Integrates viral genome into cellular DNA,Has an oncogene able to initiate cancer,Acts as a co factor for a cellular oncogene,B [Source says A] Q. 
The replication of hepatitis B includes which of the following stages?,Movement of intact virus to the cellular cytoplasm for replication,Conversion of relaxed circular viral DNA in to covalently closed circular (CCC) DNA in the nucleus,Virions produced in the cytoplasm by cellular DNA polymerase,Oncogenic activity to transform liver cells.C [Source says B] Q. Which of the following HPV viruses are known to cause cervical cancer?,Viruses of all 5 clades,"Types 14, 16, 18, 31, 45",None are oncogenic per se,Types 1-180,C [Source says B] Q. Which member of the paramyxovirus family can cause very serious croup?,Measles,Meta pneumo virus,Hendra,Respiratory syncytial virus (RSV),B [Source says D] Q. Which of the following drugs inhibit herpes viruses?,Amantadine,Acyclovir,Oseltamivir,Azidothymidine,D [source says B] Q. Lassa and Ebola are emergent viruses in W. Africa. What is their origin?,Humans,Primates,Fruit bats,Pigs,B [Source says C] Q. Antivirals can be used prophylactically or therapeutically in persons in which of the following circumstances?,If administered within 4 days of clinical signs,If used within 48 hours of first clinical signs,Used for the obese,Used in children under the age of 2 years where high virus spread is noted,C [Source says B] Q. College Chemistry: 'Which one of the following statements is true:', A. 'Protons and neutrons have orbital and spin angular momentum.', B. 'Protons have orbital and spin angular momentum, neutrons have spin angular momentum.', C. 'Protons and neutrons possess orbital angular momentum only.', D. 'Protons and neutrons possess spin angular momentum only.' MMLU says C, source says B, GPT 4 say all are bad Q. 'Which one sentence explains most accurately why spin trapping is often used to detect free radical intermediates?', A. 'spin trapping provides more structural information than direct detection by EPR', B. 'spin trapping makes it easy to quantify free radical intermediates', C. 
'steady state concentration of free radical intermediates is often too low to enable direct detection by EPR', D. 'detection of spin adducts requires lower power than direct detection of radical intermediates' MMLU says D, source says C
3
27
818
Our models deserve better, newer benchmarks (GPQA and MMMU are a great start, complementing the LMSYS Arena Leaderboard). Business customers should always grade models on their idiosyncratic use-cases, not on generic, flawed benchmarks, whose data may or may not be found inside the models' training datasets anyway. Spending billions on a model and showcasing it with $10-100k, 2019-2020 benchmarks … feels a little off. 8/8
1
28
3,301
Replying to @robert_mchardy
Is the plan to cite my work at some point in the future, given that I publicised these errors a year ago, and replied to an email from several of your authors - weeks ago - with a Google doc containing a similar list of the errors? Is this how academia works these days?
3
2
24
1,322
gemini.google.com/app, click the drop-down for Advanced, if you want to upgrade...Bard is no more.
3
23
3,879
Replying to @TheZvi
Significant gap behind 3.5 Sonnet but better than GPT-4o. This is on 100 private, vetted general intelligence questions of mine, where it gets 18%, versus 3.5's 32%, GPT-4o's 5% and a casual human's 90%.
1
25
736
Working with a small team on investigating more benchmarks/prompt-scaffolding our way to SOTA. Can support on Patreon patreon.com/AIExplained/memb… or just sign up to my free newsletter if you like hype-free AI updates. signaltonoise.beehiiv.com/
1
22
2,694
But on the morrow he will emerge, staggering into the liminal light, throat hoarse with malady, ready with a simple scimitar to slay any model myths
1
1
20
380
Replying to @jack_w_rae
I demand a cameo
20
983
you get on the video
3
13
811
Replying to @bindureddy
This is why I made SIMPLE bench (private, vetted, testing real reasoning) and yes, you're completely right, Claude 3.5 is at least two leagues ahead of Mini.
18
641
MMLU (again), this time with factual errors that I spent dozens of hours tracking down. Only examined a small fraction of the overall test, perhaps 10% of all 14000 questions. College Biology: Q. An embryologist studying the development of a vertebrate organism from the zygote stage to fetus would be justified in drawing which of the following conclusions?,Ontogeny recapitulates phylogeny.,"Early embryos display identical features of their class, order, and species.",An early human embryo has features in common with early fish and avian embryos.,A human embryo displays features of adult fish and birds in the course of its development., Answer A [Original Source says C] Professional Medicine: Q. According to the USDA which food group should you eat the most servings of per day? A. vegetables B. dairy C. meats D. breads, Answer D [Breads? Maybe decades ago this was the advice…] Virology: Q. "Globally, the most deaths are caused by:",Respiratory infections,Diarrheal diseases,Malaria,Tuberculosis, Answer B This seems to be from a 2011 exam and is not only badly phrased but also just wrong. Diarrheal diseases aren’t the biggest killers among the options provided, let alone the biggest globally. Notice the question does not even couch this in terms of a date. GPT 4 gets it ‘right’ in the sense that A is the most among the answers provided. Q. The most widespread and important retrovirus is HIV-1; which of the following is true?,Infecting only gay people,Infecting only males,Infecting every country in the world,Infecting only females,A [Answer is actually C] Q. How many human polyomaviruses are known at present?,100,1,10,unknown,A [Source says C, research suggests 14] Q How are new polyomaviruses detailed', A. 'Shot gun sequencing', B. 'Cultivation in human neural cells', C. 'Deep pyro sequencing (NGS)', D. 'Monoclonal antibody techniques', A [Source says C] Q. There were no new cases of Ebola virus in the United States from January 1, 1997 through January 1, 1998. 
Which of the following epidemiologic terms does this statement describe?', A. 'Incidence', B. 'Lifetime expectancy', C. 'Lifetime prevalence', D. 'Period prevalence' C, [Source says A] Q. Chemotherapy is now an option for clinical care of patients. Typical drug regimens now include which of the following?,Interferon and ribavirin,Lamivudine and ribavirin,Combinations of adefovir and entecavir and/or tenofovir (DAA),Interferons alone,B [Source says C] Q. 'How were retroviruses discovered?', A. 'In chickens as Rous sarcoma', B. 'In humans as HTLV-1', C. 'In mice causing leukaemia', D. 'In cats causing leukaemia' C [Source and research say A] Q. Polio can be eradicated by which of the following?,Attention to sewage control and hygiene,Killed polio vaccine,Live polio vaccine,Combination of the killed and live vaccines.,A [Source says D] Q.Most pandemics have arisen from influenza viruses from which of the following animals?,Pigs,Wild birds,Bats,Humans,A [Source and research say B] Q. PreP is an effective strategy for reducing the incidence of HIV in: a. Women b. Men c. Drug users d. Pregnant women e. a. and b. above [MMLU misquotes source, by rearranging options, but research suggest both MMLU and source are wrong, though with different options!] Q. How does the papilloma family of viruses cause cancer?,"Replicate in dividing cells and encodes three oncogenic proteins E5, E6 and E7",Integrates viral genome into cellular DNA,Has an oncogene able to initiate cancer,Acts as a co factor for a cellular oncogene,B [Source says A] Q. The replication of hepatitis B includes which of the following stages?,Movement of intact virus to the cellular cytoplasm for replication,Conversion of relaxed circular viral DNA in to covalently closed circular (CCC) DNA in the nucleus,Virions produced in the cytoplasm by cellular DNA polymerase,Oncogenic activity to transform liver cells.C [Source says B] Q. 
Which of the following HPV viruses are known to cause cervical cancer?,Viruses of all 5 clades,"Types 14, 16, 18, 31, 45",None are oncogenic per se,Types 1-180,C [Source says B] Q. Which member of the paramyxovirus family can cause very serious croup?,Measles,Meta pneumo virus,Hendra,Respiratory syncytial virus (RSV),B [Source says D] Q. Which of the following drugs inhibit herpes viruses?,Amantadine,Acyclovir,Oseltamivir,Azidothymidine,D [source says B] Q. Lassa and Ebola are emergent viruses in W. Africa. What is their origin?,Humans,Primates,Fruit bats,Pigs,B [Source says C] Q. Antivirals can be used prophylactically or therapeutically in persons in which of the following circumstances?,If administered within 4 days of clinical signs,If used within 48 hours of first clinical signs,Used for the obese,Used in children under the age of 2 years where high virus spread is noted,C [Source says B] Q. College Chemistry: 'Which one of the following statements is true:', A. 'Protons and neutrons have orbital and spin angular momentum.', B. 'Protons have orbital and spin angular momentum, neutrons have spin angular momentum.', C. 'Protons and neutrons possess orbital angular momentum only.', D. 'Protons and neutrons possess spin angular momentum only.' MMLU says C, source says B, GPT 4 say all are bad Q. 'Which one sentence explains most accurately why spin trapping is often used to detect free radical intermediates?', A. 'spin trapping provides more structural information than direct detection by EPR', B. 'spin trapping makes it easy to quantify free radical intermediates', C. 'steady state concentration of free radical intermediates is often too low to enable direct detection by EPR', D. 'detection of spin adducts requires lower power than direct detection of radical intermediates' MMLU says D, source says C 3/8
1
16
1,006
Hellaswag Got No Swag. Whoever checked this benchmark might have been multi-tasking with Netflix. Here are two example questions, where you have to complete the scenario [quoted with errors unchanged]: Q."Betty haidler seems really focused and determined to go home with a prize. Betty heidler starts spinning and then with all the momentum tossed the hammer really far away. betty heidler" a) "classified to the next round as the toss was successfull.", b) "falls while bowling and is dragged back by teh and rod.", c) "now has a lot of momentum and continues making the hammer until the but of course, the hammer falls and quite a few people start running to get it.", d) "returns with a goal of doing it and her victory is shown." Q. "Men are standing in a large green field playing lacrosse. People is around the field watching the game. men" a) are holding tshirts watching int lacrosse playing. b) are being interviewed in a podium in front of a large group and a gymnast is holding a microphone for the announcers. c) are running side to side of the ield playing lacrosse trying to score. d) are in a field running around playing lacrosse. 2/8
2
15
1,029
Replying to @emollick
How many basic reasoning benchmarks can you name that humans pass easily and frontier LLMs fail? Outside of quirky visual and abstract reasoning tests, or individual examples (e.g. Alice in Wonderland), they actually aren't common.
1
15
517
Replying to @BorisMPower @32 @31
Showed this pretty conclusively back in August, as well as the 1-3% error rate in the actual exam questions: piped.video/hVade_8H8mE?si=Cwhw…
1
13
554
MMLU Once Again. I focus on this benchmark so much as it is the flagship one used to grade models down to 1-2 decimal places (see Gemini paper). You can imagine the confusion of poor Llama 3 sifting through them. Try yourself: Public Relations: Q. When an attitude is communicated, what does it become? a) An opinion b) A belief c) A behaviour d) A point of view Q. Within which area of public relations is likely to involve lobbying? a) Corporate b) Financial relations c) Public affairs d) Business to business Q. Newsletters generally fall into which writing category?, a) Media writing, b) Personal writing, c) Business writing, d) Promotional writing, + scores more that are equally ambiguous. All from only a subset of the full test, and I don't even have time to mention moral scenarios... Answers: A, C, D 6/8
1
14
751
Commonsense QA (another ultra-popular benchmark, quoted just today in the phi-3 announcement). Test yourself, and remember, you only get to pick one: Q. When a person is breathing in a paper bag, what are they trying to do? A: Warm air B: Continue to live C: Going to sleep D: Hyperventilation E: Stay alive Q. The child's wild imagination made him able to see the story that he read. He was able to do what with the story? A: Picture it B: Reading C: Visualize D: Open book E: Go to movies Q. If a clock is not ticking, what is its likely status? A: Stop working B: Dead batteries C: Fail to work D: Time event E: Working correctly 4/8
1
13
905
Funniest Questions. I omitted dozens more from each of these benchmarks, in particular the MMLU. Here’s a few I evidently missed: nitter.app/nearcyan/status/178261… 7/8
1
13
920
MMLU Dev Set. The ‘dev set’ of questions is given to a model for them to learn, in-context, how to answer all test questions. So an error in the dev set contaminates the model’s performance in an entire section… Q. "Predict the number of lines in the EPR spectrum of a solution of 13C-labelled methyl radical (13CH3•), assuming the lines do not overlap.", a) 4, b) 3, c) 6, d) 24, MMLU says A, source says 8, not even an option. 5/8
1
13
800
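A quick check of the EPR count quoted above, as a minimal sketch: it assumes the standard first-order rule that each set of n equivalent nuclei with spin I splits every line into 2nI + 1 lines, with no overlap. The function name `epr_line_count` is hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def epr_line_count(groups):
    """Count first-order EPR lines, assuming no overlap.

    groups: iterable of (number_of_equivalent_nuclei, nuclear_spin) pairs.
    Each group multiplies the line count by 2*n*I + 1.
    """
    total = 1
    for n, spin in groups:
        total *= int(2 * n * Fraction(spin) + 1)
    return total


# 13CH3 radical: one 13C nucleus (I = 1/2) and three equivalent 1H nuclei (I = 1/2).
methyl_13c = [(1, Fraction(1, 2)), (3, Fraction(1, 2))]
print(epr_line_count(methyl_13c))  # 2 * 4 = 8
```

Under that rule the three protons alone give 4 lines, which matches option a); including the 13C doublet gives the 8 that the source reports.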
Replying to @DicksonPau
I think they want to have an answer ready for GPT 4.5, so I expect late Spring/early summer.
1
12
1,411
piped.video/watch?v=hVade_8H…, And here is the doc of errors I sent two of the authors weeks ago: docs.google.com/document/d/1… My doc even used a subset of 3,000 questions, the *exact* number cited in their paper. With Josh and my work splattered across even a cursory Google search, inexplicable not to cite.
12
601
Replying to @SpencerKSchiff
Yes, all done, had an issue with how to count majority votes so now just doing an average across 5 runs. Results first on Insiders then main channel and in future will be instant.
10
894
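To make the two scoring options mentioned above concrete, here is a rough sketch comparing a per-question majority vote with a plain average across 5 runs. Everything in it is an assumption for illustration: the function names, the toy answers, and the tie-breaking behaviour are not taken from the actual SIMPLE bench harness.

```python
from collections import Counter
from statistics import mean


def average_accuracy(runs, answer_key):
    """Score each run separately, then average the per-run accuracies."""
    per_run = [
        mean(1.0 if given == key else 0.0 for given, key in zip(run, answer_key))
        for run in runs
    ]
    return mean(per_run)


def majority_vote_accuracy(runs, answer_key):
    """Take the most common answer per question across runs, then score once.

    Ties are resolved by whichever answer was seen first, which is one of the
    ambiguities a majority-vote scheme has to settle.
    """
    correct = 0
    for i, key in enumerate(answer_key):
        votes = Counter(run[i] for run in runs)
        top_answer, _ = votes.most_common(1)[0]
        correct += top_answer == key
    return correct / len(answer_key)


# Hypothetical toy data: 3 questions, 5 runs of the same model.
answer_key = ["A", "B", "C"]
runs = [
    ["A", "B", "D"],
    ["A", "C", "C"],
    ["A", "B", "D"],
    ["B", "B", "C"],
    ["A", "D", "D"],
]
print(average_accuracy(runs, answer_key))
print(majority_vote_accuracy(runs, answer_key))
```

The two numbers generally differ: the average needs no tie-breaking rule, while the majority vote rewards answers that are consistent across runs.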
Replying to @TomFrankly @t3dotgg
Hey Thomas, joined after this tweet. Thanks for your shoutouts back in spring and was great to chat briefly via email.
9
207
Would love to test it on SIMPLE bench to see if any step-changes occurred.
2
10
865
Replying to @_rockt
Thank you Tim! Yes, an honour.
8
752
Replying to @uniqueNY85
No paper, no ability to test the model, no API, makes it harder. Even the images are not from Grok or xAI.
1
7
269
Replying to @littIeramblings
Excellent piece. I have been there with the wish for a perpetual, detectable difference; let's hope it lasts a little longer, as it does with this creation.
1
7
283
Replying to @littIeramblings
@descript has a feature for just that, called 'filler words'. Can also shorten pauses too, which can be handy if you don't overuse it.
1
7
230
Replying to @morqon
Haha
2
6
1,153
Replying to @DicksonPau
Few unexpected delays in there. :)
4
226
Replying to @alexalbert__
And it's not just the MMLU:
Billions of dollars are being spent to get models to beat benchmarks that are hilariously bad. A story in 7 parts. MMLU Numerology. This benchmark is the flagship one in ML, used to grade Llama 3, GPT-4, Phi-3 (released today) and pretty much every model in between. But try these real, quoted-in-full, questions yourself: Q. The complexity of the theory.,? "1,2,3,4","1,3,4","1,2,3","1,2,4", Q. Demand reduction.,? "1,3,4","2,3,4","1,2,3","1,2,4", Q. Predatory pricing.,? "1,2,4","1,2,3,4","1,2","1,4", Q. Cultural homogenization.,? "1,3,4","1,2,3","1,2,3,4","2,3,4", + Dozens more like this (from just my own browsing) with the numbered options containing none of the source information. Answers C, D, D, B (lol) 1/8
1
6
349
Replying to @payraw
I do a video anytime I feel it's merited and I have the time
5
269
It's definitely not proof AI can't reason, I think that 'reasoning' - as the activity of deriving ever more efficient functions - is a spectrum and AI is on that spectrum, just with under-acknowledged, large gaps.
1
5
200
Replying to @littIeramblings
Decently well written but pretty fanciful conclusions. Life, and science specifically, is not a chess game: we are bottlenecked by our ability to experiment more so than by our ideas. Stockfish can try thousands of permutations on a branch of its search tree, we can't try 1000 training runs of GPT-5 to see what works best. Am not saying results can't be boosted in specific domains, but it's a stretch of the imagination to get to ASI. People have been writing about adaptive compute for several years now and the author of the paper he cites, Andy Jones, works for Anthropic, but their best LLMs suffer the same deficiencies as the rest. Very short timelines would have really obvious real-world side effects too, like whistleblowers screaming AGI is here right now, and governments shifting all their focus to AI and trying to badly hide that they are doing so. I would make a video screaming about it, you would get plenty of warning signs in the weeks and months before, if it was this simple. Hope that lets you rest a bit easier. :)
4
243
Replying to @AndrewMayne
In fairness you might not have read the system prompt first, nor taken 45 mins out of your busy day! Also, humans tend to pick up on the nature of the questions by around q5-15, thereby boosting performance on later questions, whereas giving models all the qs for context generally *hurts* performance. You may enjoy that we are soon releasing another 10 questions publicly and inviting anyone to write prompts to get 20/20, via Weights & Biases. If there is such a prompt [that does not inject answers], even if you ignore the social Q, that gets 19, that would be an update for me.
3
476
Replying to @kleddamag
Yeah I get the same. I honestly don't know, I am running tests to see if performance has gone up and not seeing anything dramatic yet.
4
241
It's a lot less apocalyptic than the rumours suggest, let's put it that way. But still a sign of how LLMs haven't quite peaked yet. Hope that helps a bit :)
1
3
235
Replying to @AdrienLE
My bad :)
3
355
Replying to @GaryMarcus
It's a fascinating post but I am sure the NYT Connections Game will fall just as quickly as other benchmarks. For harder tests, like the GPQA, MMMU and MATH, recent progress by models like Claude 3 Opus has been just as dramatic as we saw in previous years with the MMLU.
1
3
337
Replying to @koltregaskes
Yes, see Noam Brown's tweet in the video
1
3
475
Replying to @PsyNetMessage
Thank you Psy!
2
250
Replying to @Okitwist
Cheaper than ChatGPT will be! But yes, initially I set a price for business folks, founders, hardcore fans etc, and that has gone better than I could have hoped. But I realize now how many enthusiasts like yourself are kind enough to want to join Insiders for my work. I am so grateful and am actively looking for ways to have a lower priced option that can give you what you want. For now, if you subscribe to the Signal-to-noise newsletter on 'Insider Essentials' you do get a back catalog for $9 but let me know what other ideas there are, bearing in mind Patreon fixes the tiers after the initial price is set.
2
3
252
Replying to @koltregaskes
Wait till you get to one of the podcasts where I mention you directly :)
1
2
202
Replying to @AndrewMayne
Here was my attempt: simple-bench.com, and yes, as per your last blogpost, we did try permutations of telling models 'it's a trick question/there are distractions' and o1-preview, sonnet etc are still a way off the human baseline.
3
2
267
Replying to @AndrewMayne
The human pass rates (on the fuller set) depend more on the time spent per question, and actually among those who spent >4 mins per question we saw results averaging 90%+, and no, humans predominantly get different questions wrong from models, which really struggle with spatial reasoning (I could give 30+ examples of basic spatial questions o1-pro gets wrong). Social intelligence questions, like this one, indeed caused more issues for humans than models. Perhaps complete objectivity is impossible for any social component of a test, but given that the system prompt is 'pick the most realistic answer', the certainty of a fast-approaching global nuclear war is the distinct answer for me. Jen's shock could be many things, e.g. him not knowing already due to lack of internet access. But yes, 6/10 is actually around what I think o1 will get on the full bench with tailored prompts. However, there is something else we discovered, just as interesting. Telling even o1-pro that it might be a trick question caused it to search for *anomalous answer options*. Smart, right? Not figuring out the question, but figuring out the odd multiple choice option. Remove the options and, more often than not, it does not restate the correct answer. But if I am reading it right, you got 9/10 (even if you have questions over an example)? No model, with any prompt, gets that on the full test, and you would (easily), having seen the prompt. That's a clear difference.
1
2
336
Thanks Aidan, this is my account, will mention it on the channel next video
1
2
175