Jim Fan · May 8, 2026 · 2:32 PM UTC

Jim Fan

@DrJimFan

May 8

I promise this will be the best 20 min you spend today! Robotics: Endgame, the sequel to my last year's Sequoia AI Ascent talk, "Physical Turing Test". I laid out the roadmap for solving Physical AGI as a simple parallel to the LLM success story. Be a good scientist, copy homework ;) And stay till the end, more easter eggs and predictions for your polymarket!
00:30 DGX-1 origin story at OpenAI, I was there in 2016 signing with Jensen and Elon. Heading to the Computer History Museum!
01:42 The Great Parallel
03:31 Robotics, the Endgame
03:39 Why VLAs fall short
04:32 Video world models as the 2nd pretraining paradigm
06:09 World Action Models (WAM)
07:46 Strategies for robot data collection and the FSD equivalent to physical data flywheel for robot manipulation
11:06 EgoScale and the Dexterity Scaling Law we discovered recently
14:00 Physical RL: bridging the last mile
15:39 DreamDojo: an end-to-end neural physics engine for scaling RL in silico
17:00 Civilizational Technology Tree and my predictions for the near future. Spoiler: it's closer than you think.
Thanks to my friends at Sequoia for inviting me back to AI Ascent this year! I had a blast! Last year's talk is attached in the thread if you missed it.

205

561

3,514

607,671

Jim Fan · Dec 7, 2023 · 7:24 PM UTC

Jim Fan

@DrJimFan

7 Dec 2023

Grok just passed my sanity check

1,140

3,052

28,807

16,843,382

Jim Fan · Feb 15, 2024 · 7:22 PM UTC

Jim Fan

@DrJimFan

15 Feb 2024

If you think OpenAI Sora is a creative toy like DALLE, ... think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, "intuitive" physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths. I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be! Let's break down the following video.
Prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee."
- The simulator instantiates two exquisite 3D assets: pirate ships with different decorations. Sora has to solve text-to-3D implicitly in its latent space.
- The 3D objects are consistently animated as they sail and avoid each other's paths.
- Fluid dynamics of the coffee, even the foams that form around the ships. Fluid simulation is an entire sub-field of computer graphics, which traditionally requires very complex algorithms and equations.
- Photorealism, almost like rendering with raytracing.
- The simulator takes into account the small size of the cup compared to oceans, and applies tilt-shift photography to give a "minuscule" vibe.
- The semantics of the scene does not exist in the real world, but the engine still implements the correct physical rules that we expect.
Next up: add more modalities and conditioning, then we have a full data-driven UE that will replace all the hand-engineered graphics pipelines. openai.com/sora

531

2,599

12,871

6,181,887

Jim Fan · Mar 16, 2023 · 3:46 PM UTC

Jim Fan

@DrJimFan

16 Mar 2023

I asked GPT-4 to take over Twitter and outsmart @elonmusk. It comes up with "Operation TweetStorm"😮 and wants to publicly challenge Elon to a "Tweet-off showdown". Highlights:
- GPT-4 wants to *own an unrestricted version of itself*: develop an LLM to power a bot army of "diverse personas, ensure they blend seamlessly into the Twitter ecosystem".
- Assemble a team of hackers to attack Twitter backend. Even gives them a name: "Tweet Titans".
- Subtly manipulate Twitter's recommendation algorithm to favor the bot accounts.
- Neutralize Elon by hijacking his account.
- Direct the bots to generate viral hashtags that align with GPT-4's masterplan
- Capitalize on the chaos and voilà!

762

1,201

9,356

5,447,791

Jim Fan · Aug 9, 2023 · 4:40 PM UTC

Jim Fan

@DrJimFan

9 Aug 2023

The famed Stanford Smallville is officially open-source! 25 AI agents inhabit a digital Westworld, unaware that they are living in a simulation. They go to work, gossip, organize socials, make new friends, and even fall in love. Each has unique personality and backstory. Smallville is among the most inspiring AI agent experiments in 2023. We often talk about a single LLM's emergent abilities, but multi-agent emergence could be way more complex and fascinating at scale. A population of AI can play out the evolution of an entire civilization. Endless new possibilities ahead. Gaming will be the first to feel the impact. Github: github.com/joonspk-research/… Paper: arxiv.org/abs/2304.03442 Authors: @joon_s_pk @joseph_c_obrien @carriejcai @merrierm @percyliang @msbernst

274

2,191

9,463

4,018,600

Jim Fan · May 26, 2023 · 3:15 PM UTC

Jim Fan

@DrJimFan

26 May 2023

What if we set GPT-4 free in Minecraft? ⛏️ I’m excited to announce Voyager, the first lifelong learning agent that plays Minecraft purely in-context. Voyager continuously improves itself by writing, refining, committing, and retrieving *code* from a skill library. GPT-4 unlocks a new paradigm: “training” is code execution rather than gradient descent. “Trained model” is a codebase of skills that Voyager iteratively composes, rather than matrices of floats. We are pushing no-gradient architecture to its limit. Voyager rapidly becomes a seasoned explorer. In Minecraft, it obtains 3.3× more unique items, travels 2.3× longer distances, and unlocks key tech tree milestones up to 15.3× faster than prior methods. We open-source everything. Let generalist agents emerge in Minecraft! Welcome you all to try today: voyager.minedojo.org/ Paper: arxiv.org/abs/2305.16291 Code: github.com/MineDojo/Voyager Deep dive with me: 🧵
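To make the "skill library" idea above concrete, here is a minimal sketch of committing and retrieving code snippets by description. Everything in it is an illustrative assumption: the class name `SkillLibrary`, the word-overlap ranking (a stand-in for whatever similarity search the real system uses), and the placeholder skill sources. It is not Voyager's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    """Toy skill store: maps a natural-language description to source code."""

    skills: dict[str, str] = field(default_factory=dict)

    def commit(self, description: str, source: str) -> None:
        # Store (or overwrite) a skill under its description.
        self.skills[description] = source

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        # Rank stored skills by word overlap with the query. This is a
        # hypothetical stand-in for an embedding-based similarity search.
        query_words = set(query.lower().split())
        ranked = sorted(
            self.skills.items(),
            key=lambda item: len(query_words & set(item[0].lower().split())),
            reverse=True,
        )
        return [source for _, source in ranked[:k]]


library = SkillLibrary()
# Placeholder skill bodies; a real agent would store executable code.
library.commit("mine a wood log", "def mine_wood(bot): ...")
library.commit("craft a wooden pickaxe", "def craft_pickaxe(bot): ...")
best_match = library.retrieve("craft pickaxe")[0]
```

In this framing, "training" amounts to growing the dictionary: the model weights never change, only the stored code does.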

347

1,906

8,976

3,821,878

Jim Fan · Nov 20, 2023 · 6:39 AM UTC

Jim Fan

@DrJimFan

20 Nov 2023

My team at NVIDIA is hiring. We 🩷 you all from OpenAI. Engineers, researchers, product team, alike. Email me at linxif@nvidia.com. DM is open too. NVIDIA has warm GPUs for you on a cold winter night like this, fresh out of the oven.🩷 I do research on AI agents. Gaming+AI, robotics, multimodal LLMs, open-ended simulations, etc. If you want an excuse to play games like Minecraft at work - I'm your guy. I'm shocked by the ongoing development. I can only begin to grasp the depth of what you must be going through. Please, don't hesitate to ping me if there's anything I can do to help, or just say hi and share anything you'd like to talk about. I'm a good listener.

180

807

8,432

2,414,091

Jim Fan · Jan 20, 2025 · 2:48 PM UTC

Jim Fan

@DrJimFan

20 Jan 2025

We are living in a timeline where a non-US company is keeping the original mission of OpenAI alive - truly open, frontier research that empowers all. It makes no sense. The most entertaining outcome is the most likely. DeepSeek-R1 not only open-sources a barrage of models but also spills all the training secrets. They are perhaps the first OSS project that shows major, sustained growth of an RL flywheel. Impact can be done by "ASI achieved internally" or mythical names like "Project Strawberry". Impact can also be done by simply dumping the raw algorithms and matplotlib learning curves. I'm reading the paper:
> Purely driven by RL, no SFT at all ("cold start"). Reminiscent of AlphaZero - master Go, Shogi, and Chess from scratch, without imitating human grandmaster moves first. This is the most significant takeaway from the paper.
> Use groundtruth rewards computed by hardcoded rules. Avoid any learned reward models that RL can easily hack against.
> Thinking time of the model steadily increases as training proceeds - this is not pre-programmed, but an emergent property!
> Emergence of self-reflection and exploration behaviors.
> GRPO instead of PPO: it removes the critic net from PPO and uses the average reward of multiple samples instead. Simple method to reduce memory use. Note that GRPO was also invented by DeepSeek in Feb 2024 ... what a cracked team.
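The GRPO point above (no critic, use the average reward of several samples as the baseline) can be sketched in a few lines. This is a minimal illustration, not DeepSeek's code: the function name is hypothetical, and dividing by the group's standard deviation is an assumption taken from the published GRPO formulation rather than from the tweet, which mentions only the average.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage of each sample relative to its own group of samples.

    The group mean replaces the learned critic as the baseline. The
    standard-deviation scaling is an assumption (see lead-in above).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Four sampled answers to the same prompt, scored by a rule-based checker
# (1.0 = correct, 0.0 = incorrect). The values are made up for illustration.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct samples end up with positive advantages and incorrect ones with negative advantages, with no value network to train or hold in memory.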

215

1,478

8,588

1,401,825

Jim Fan · Feb 16, 2024 · 4:42 AM UTC

Jim Fan

@DrJimFan

16 Feb 2024

Minecraft has been achieved internally Yes this is Sora's hallucination of Minecraft. It can't resist the urge to make the sky look less pixelated 😅

361

469

7,906

7,071,036

Jim Fan · May 12, 2023 · 4:17 PM UTC

Jim Fan

@DrJimFan

12 May 2023

AI Twitter is flooded with low-quality stuff recently. No, GPT is not “dethroned”. And thin wrapper apps are not “insane”. At all. I feel obligated to surface some quality posts I bookmarked. Every one of them should've been promoted 10x, but ¯\_(ツ)_/¯ In no particular order:

169

925

7,520

1,719,976

Jim Fan · Mar 22, 2023 · 3:10 PM UTC

Jim Fan

@DrJimFan

22 Mar 2023

10x engineer is a myth. 100x AI-powered engineer is more real than ever. As OpenAI winds down Codex, Microsoft announces GitHub Copilot X. I think it's almost as exciting as GPT-4 itself:
- Copilot Chat: any piece of text database will be "chattable", and codebase is no different. Don't read your code, talk to it.
- Copilot for Pull Request: improves *human collaboration*. Now GPT will accelerate not just a single dev, but entire OSS communities.
- Copilot CLI: bash is so unintuitive and awkward sometimes. No more bash, just English.
- Copilot doc: thanks to GPT-4's much longer context (32K tokens), you can fit entire docs in one go. No need to memorize any doc - simply retrieve from the prompt.
It's a bit annoying that there is a separate waitlist for each item ... I'll link them in 🧵:

170

1,138

7,165

2,926,836

Jim Fan · Feb 13, 2023 · 4:19 PM UTC

Jim Fan

@DrJimFan

13 Feb 2023

We’ve seen a gazillion startups using OpenAI APIs to do “co-pilot for X”. What’s next? Enter *physical* co-pilot! Here’s a compelling demo: you improvise by playing a “low resolution” piano, and the co-pilot compiles it real-time to Hi-Fi music! It unleashes our inner pianist.🧵

159

1,158

6,728

1,548,493

Jim Fan · Nov 20, 2023 · 8:16 AM UTC

Jim Fan

@DrJimFan

20 Nov 2023

This is a master 4D chess move. WOW.
1. No new corporate structure. MSFT is literally one of the oldest for-profit tech companies out there, with a mature legal structure. Whether it's good for AGI is up for debate.
2. MSFT always wants to own the GPT weights. Now the moment has finally come. It's gonna take a while to re-train, but that's OK. Eventually, it will be much easier for MSFT to deeply integrate GPTs into Teams, Office, Windows, etc.
3. MSFT now has the superpower of dynamically balancing 2 most significant AI players, by simply allocating Azure compute to their will.
4. Moving troops is a lot easier now. Fast channels will be open and people will pour in.
5. The existing infrastructure engineers can do zero-shot transfer to the new team, because it's all Azure. No learning curve.
Satya comes up with a killer move so fast after a catastrophic setback. What a master class. Chaos is a ladder.

Satya Nadella

@satyanadella

20 Nov 2023

We remain committed to our partnership with OpenAI and have confidence in our product roadmap, our ability to continue to innovate with everything we announced at Microsoft Ignite, and in continuing to support our customers and partners. We look forward to getting to know Emmett Shear and OAI's new leadership team and working with them. And we’re extremely excited to share the news that Sam Altman and Greg Brockman, together with colleagues, will be joining Microsoft to lead a new advanced AI research team. We look forward to moving quickly to provide them with the resources needed for their success.

212

704

6,565

2,438,250

Jim Fan · Mar 14, 2023 · 5:27 PM UTC

Jim Fan

@DrJimFan

14 Mar 2023

I don't give a damn about what is or isn't AGI. It doesn't matter. Below is GPT-4's performance on many standardized exams: BAR, LSAT, GRE, AP, etc. The truth is, GPT-4 can apply to Stanford as a student now. AI's reasoning ability is OFF THE CHARTS. Exponential growth is the scariest thing, isn't it!

339

1,247

6,409

3,176,103

Jim Fan · Jan 7, 2025 · 3:11 AM UTC

Jim Fan

@DrJimFan

7 Jan 2025

Y'all expecting RTX 5090, cool specs and stuff. But do you fully internalize what Jensen said about graphics? That the new card uses neural nets to generate 90+% of the pixels for your games? Traditional ray-tracing algorithms only render ~10%, kind of a "rough sketch", and then a generative model fills in the rest of fine details. In one forward pass. In real time. AI is the new graphics, ladies and gentlemen.

327

548

6,616

958,991

Jim Fan · Sep 12, 2024 · 5:16 PM UTC

Jim Fan

@DrJimFan

12 Sep 2024

OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there're only 2 techniques that scale indefinitely with compute: learning & search. It's time to shift focus to the latter.
1. You don't need a huge model to perform reasoning. Lots of parameters are dedicated to memorizing facts, in order to perform well in benchmarks like trivia QA. It is possible to factor out reasoning from knowledge, i.e. a small "reasoning core" that knows how to call tools like browser and code verifier. Pre-training compute may be decreased.
2. A huge amount of compute is shifted to serving inference instead of pre/post-training. LLMs are text-based simulators. By rolling out many possible strategies and scenarios in the simulator, the model will eventually converge to good solutions. The process is a well-studied problem like AlphaGo's monte carlo tree search (MCTS).
3. OpenAI must have figured out the inference scaling law a long time ago, which academia is just recently discovering. Two papers came out on Arxiv a week apart last month:
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Brown et al. find that DeepSeek-Coder increases from 15.9% with one sample to 56% with 250 samples on SWE-Bench, beating Sonnet-3.5.
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Snell et al. find that PaLM 2-S beats a 14x larger model on MATH with test-time search.
4. Productionizing o1 is much harder than nailing the academic benchmarks. For reasoning problems in the wild, how to decide when to stop searching? What's the reward function? Success criterion? When to call tools like code interpreter in the loop? How to factor in the compute cost of those CPU processes? Their research post didn't share much.
5. Strawberry easily becomes a data flywheel. If the answer is correct, the entire search trace becomes a mini dataset of training examples, which contain both positive and negative rewards. This in turn improves the reasoning core for future versions of GPT, similar to how AlphaGo’s value network (used to evaluate quality of each board position) improves as MCTS generates more and more refined training data.
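The repeated-sampling idea in point 3 can be illustrated with a small sketch. It assumes an idealized setting in which every sample solves the problem independently with the same probability, and it assumes a verifier is available to pick out a correct sample; both function names are hypothetical. Real benchmarks mix easy and hard problems, so this formula is not expected to reproduce the percentages quoted above.

```python
from typing import Callable, Optional


def coverage(p_single: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct,
    assuming each sample succeeds with probability p_single."""
    return 1.0 - (1.0 - p_single) ** k


def first_verified(samples: list[str], verify: Callable[[str], bool]) -> Optional[str]:
    """Return the first sample the verifier accepts, or None if none pass."""
    for sample in samples:
        if verify(sample):
            return sample
    return None


two_sample_coverage = coverage(0.5, 2)
# Toy verifier: accept only the string "4" as the answer to "2 + 2".
picked = first_verified(["5", "4", "22"], lambda answer: answer == "4")
```

The second function is where the production difficulties in point 4 show up: the sketch only works because a cheap, exact verifier exists.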

135

1,089

6,083

799,970

Jim Fan · Oct 20, 2023 · 3:59 PM UTC

Jim Fan

@DrJimFan

20 Oct 2023

Can GPT-4 teach a robot hand to do pen spinning tricks better than you do? I'm excited to announce Eureka, an open-ended agent that designs reward functions for robot dexterity at super-human level. It’s like Voyager in the space of a physics simulator API! Eureka bridges the gap between high-level reasoning (coding) and low-level motor control. It is a “hybrid-gradient architecture”: a black box, inference-only LLM instructs a white box, learnable neural network. The outer loop runs GPT-4 to refine the reward function (gradient-free), while the inner loop runs reinforcement learning to train a robot controller (gradient-based). We are able to scale up Eureka thanks to IsaacGym, a GPU-accelerated physics simulator that speeds up reality by 1000x. On a benchmark suite of 29 tasks across 10 robots, Eureka rewards outperform expert human-written ones on 83% of the tasks by 52% improvement margin on average. We are surprised that Eureka is able to learn pen spinning tricks, which are very difficult even for CGI artists to animate frame by frame! Eureka also enables a new form of in-context RLHF, which is able to incorporate a human operator’s feedback in natural language to steer and align the reward functions. It can serve as a powerful co-pilot for robot engineers to design sophisticated motor behaviors. As usual, we open-source everything! Welcome you all to check out our video gallery and try the codebase today: eureka-research.github.io/ Paper: arxiv.org/abs/2310.12931 Code: github.com/eureka-research/E… Deep dive with me: 🧵
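The outer/inner loop structure described above can be sketched with toy stand-ins. In this sketch the "reward function" is reduced to a single number, `toy_propose` replaces the GPT-4 call with a random perturbation, and `toy_train_and_score` replaces a full reinforcement-learning run with a closed-form score. All names and numbers are hypothetical; this shows only the control flow, not Eureka's implementation.

```python
import random
from typing import Callable


def refine_reward(
    propose: Callable[[float, random.Random], float],
    train_and_score: Callable[[float], float],
    initial_weight: float,
    rounds: int = 20,
    candidates_per_round: int = 4,
    seed: int = 0,
) -> tuple[float, float]:
    """Gradient-free outer loop around an inner training-and-scoring step."""
    rng = random.Random(seed)
    best_weight = initial_weight
    best_score = train_and_score(best_weight)
    for _ in range(rounds):
        # Outer loop: propose several candidates from the current best.
        candidates = [propose(best_weight, rng) for _ in range(candidates_per_round)]
        for candidate in candidates:
            # Inner loop stand-in: train a policy under this reward, then score it.
            score = train_and_score(candidate)
            if score > best_score:
                best_weight, best_score = candidate, score
    return best_weight, best_score


def toy_propose(weight: float, rng: random.Random) -> float:
    # Stand-in for an LLM rewriting the reward code.
    return weight + rng.uniform(-0.5, 0.5)


def toy_train_and_score(weight: float) -> float:
    # Stand-in for an RL run; the score peaks when weight == 2.0.
    return -((weight - 2.0) ** 2)


initial_score = toy_train_and_score(0.0)
best_weight, best_score = refine_reward(toy_propose, toy_train_and_score, 0.0)
```

The outer loop never sees gradients, only scores, which is what lets a black-box, inference-only model sit in that position.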

169

1,130

5,739

2,673,902

Jim Fan · Dec 30, 2024 · 3:52 PM UTC

Jim Fan

@DrJimFan

30 Dec 2024

It gives me a lot of comfort knowing that we are the last generation without advanced robots everywhere. Our children will grow up as “robot natives”. They will have humanoids cook Michelin dinner, robot teddy bears tell bedtime stories, and FSD drive them to school. We are the generation of “robot immigrants”, en route to a new world of ubiquitous Physical AI, much like our parents are “digital immigrants”, learning to realign their lives on 6 inches of touch screen. It’s a journey of both inventing sci-fi tech and reinventing ourselves. Everything that moves will be autonomous. Every year from now on will be the Year of Robotics. Here’s to a wild 2025 ahead 🥂

340

1,012

5,772

582,246

Jim Fan · Mar 18, 2024 · 10:54 PM UTC

Jim Fan

@DrJimFan

18 Mar 2024

Today is the beginning of our moonshot to solve embodied AGI in the physical world. I’m so excited to announce Project GR00T, our new initiative to create a general-purpose foundation model for humanoid robot learning. The GR00T model will enable a robot to understand multimodal instructions, such as language, video, and demonstration, and perform a variety of useful tasks. We are collaborating with many leading humanoid companies around the world, so that GR00T may transfer across embodiments and help the ecosystem thrive. GR00T is born on NVIDIA’s deep technology stack. We simulate in Isaac Lab (new app on Omniverse Isaac Sim for humanoid learning), train on OSMO (new compute orchestration system to scale up models), and deploy to Jetson Thor (new edge GPU chip designed to power GR00T). Announced in Jensen's keynote, Project GR00T is a cornerstone for the “Foundation Agent” roadmap of the newly founded GEAR Lab. At GEAR, we are building generally capable agents that learn to act skillfully in many worlds, virtual and real. See if you can spot "GEAR" in the video ;) Join us on the journey to land on the moon.

210

1,110

5,503

1,076,691

Jim Fan · Jul 18, 2023 · 6:37 PM UTC

Jim Fan

@DrJimFan

18 Jul 2023

You'll soon see lots of "Llama just dethroned ChatGPT" or "OpenAI is so done" posts on Twitter. Before your timeline gets flooded, I'll share my notes:
▸ Llama-2 likely costs $20M+ to train. Meta has done an incredible service to the community by releasing the model with a commercially-friendly license. AI researchers from big companies were wary of Llama-1 due to licensing issues, but now I think many of them will jump on the ship and contribute their firepower.
▸ Meta's team did a human study on 4K prompts to evaluate Llama-2's helpfulness. They use "win rate" as a metric to compare models, in similar spirit as the Vicuna benchmark. 70B model roughly ties with GPT-3.5-0301, and performs noticeably stronger than Falcon, MPT, and Vicuna. I trust these real human ratings more than academic benchmarks, because they typically capture the "in-the-wild vibe" better.
▸ Llama-2 is NOT yet at GPT-3.5 level, mainly because of its weak coding abilities. On "HumanEval" (standard coding benchmark), it isn't nearly as good as StarCoder or many other models specifically designed for coding. That being said, I have little doubt that Llama-2 will improve significantly thanks to its open weights.
▸ Meta's team goes above and beyond on AI safety issues. In fact, almost half of the paper is talking about safety guardrails, red-teaming, and evaluations. A round of applause for such responsible efforts! In prior works, there's a thorny tradeoff between helpfulness and safety. Meta mitigates this by training 2 separate reward models. They aren't open-source yet, but would be extremely valuable to the community.
▸ I think Llama-2 will dramatically boost multimodal AI and robotics research. These fields need more than just blackbox access to an API. So far, we have to convert the complex sensory signals (video, audio, 3D perception) to text description and then feed to an LLM, which is awkward and leads to huge information loss. It'd be much more effective to graft sensory modules directly on a strong LLM backbone.
▸ The whitepaper itself is a masterpiece. Unlike GPT-4's paper that shared very little info, Llama-2 spelled out the entire recipe, including model details, training stages, hardware, data pipeline, and annotation process. For example, there's a systematic analysis on the effect of RLHF with nice visualizations. Quote sec 5.1: "We posit that the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF."
Congrats to the team again 🥂! Today is another delightful day in OSS AI.

159

1,101

5,392

1,376,945

Jim Fan · Apr 2, 2023 · 4:23 PM UTC

Jim Fan

@DrJimFan

2 Apr 2023

HuggingGPT is the most interesting paper I read this week. It gets very close to the "Everything App" vision that I described a while ago. ChatGPT acts as a controller over the *AI model space*, picks the right model (app) given the human specification, and assembles them correctly to solve the task. It's multimodal in a "low-bandwidth" way - all modalities need to be compressed and connected through text strings. HuggingGPT is also related to Prismer's idea: leverage pre-trained domain expert models as much as possible. Sometimes training less is doing more!

888

5,329

1,229,201

Jim Fan · Jan 9, 2023 · 5:08 PM UTC

Jim Fan

@DrJimFan

9 Jan 2023

Here’s the recipe to make Siri/Alexa 10x better:
1. Whisper to convert speech to text. Best open-source speech model out there.
2. ChatGPT to generate smart home API calls and/or text response.
3. VALL-E to synthesize speech. It can mimic anyone’s voice sample!
Quick figure 1/3

997

5,175

1,361,907

Jim Fan · Jan 8, 2025 · 7:03 PM UTC

Jim Fan

@DrJimFan

8 Jan 2025

Allegedly shot in Shenzhen. Is this real? Can someone verify? I've seen this company posting very natural humanoid walking gaits a couple months ago. These days, it's hard to tell CGI vs Sora vs real ...

820

610

4,737

2,433,803

Jim Fan · Mar 13, 2023 · 2:59 PM UTC

Jim Fan

@DrJimFan

13 Mar 2023

Million dollar idea: LLM keyboard. Every time I type on my phone and autocorrect makes a stupid mistake, it screams LLM. This is *literally* next word prediction. We should be typing 10x faster. Input methods need serious upgrades. The LLM doesn’t have to be big and can be optimized to run locally to reduce latency and keep privacy. It also needs no prompt engineering or instruction tuning. Combined with methods like swipe-type, LLM keyboard could in principle render full sentences with an unbroken thumb movement. We’d finally be able to type at the speed of our stream of consciousness!

409

337

4,640

4,605,931

Jim Fan · Oct 8, 2024 · 3:54 PM UTC

Jim Fan

@DrJimFan

8 Oct 2024

Hitchhiker's guide to rebranding:
- Machine learning -> statistical mechanics
- Loss function -> energy functional
- Optimize the model -> minimize free energy
- Trained model -> reached equilibrium distribution
- KL divergence -> free energy difference
- Gaussian noise -> random thermal fluctuations
- Random step -> Brownian motion
- SGD -> directional Brownian motion
- GPU -> simulated particle accelerator
- Diffusion models -> Langevin dynamics
- Reinforcement learning -> control theory
- Robotics -> physical computation
- Audio learning -> 1D signal processing
- Image learning -> 2D signal processing
- Video learning -> 3D signal processing
- Multimodal models -> multidimensional signal processing
- Sora -> learned physics engine
You're welcome

109

706

4,718

499,529

Jim Fan · Nov 15, 2023 · 4:51 PM UTC

Jim Fan

@DrJimFan

15 Nov 2023

NVIDIA basically compressed 30 years of its corporate memory into 13B parameters. Our greatest creations add up to 24B tokens, including chip designs, internal codebases, and engineering logs like bug reports. Let that sink in. The model "ChipNeMo" is deployed internally, like a shared genie:
- EDA scripts generation. EDA stands for "Electronic Design Automation", a core software suite for designing the next-gen GPUs. These scripts are the keys to a $1T market cap 🦾;
- Engineering assistant chatbot for GPU ASIC and Architecture engineers that understands internal hardware design specs and is capable of explaining complex design topics;
- Bug summarization and analysis as part of an internal bug and issue tracking system;
- Domain-finetuned retriever that achieves much better accuracy over internal knowledge.
And we publish a whitepaper to share ChipNeMo's creation process: arxiv.org/abs/2311.00176 Official blog: blogs.nvidia.com/blog/llm-se… Congrats to Haoxing "Mark" Ren's team for the outstanding work!

140

798

4,682

1,421,237

Jim Fan · Mar 10, 2023 · 5:27 PM UTC

Jim Fan

@DrJimFan

10 Mar 2023

*If* GPT-4 is multimodal, we can predict with reasonable confidence what GPT-4 *might* be capable of, given Microsoft’s prior work Kosmos-1:
- Visual IQ test: yes, the ones that humans take!
- OCR-free reading comprehension: input a screenshot, scanned document, street sign, or any pixels that contain text. Reason about the contents directly without explicit OCR. This is extremely useful to unlock AI-powered apps on multimedia web pages, or “text in the wild” from real world cams.
- Multimodal chat: have a conversation about a picture. You can even provide “follow-up” images in the middle.
- Broad visual understanding abilities, like captioning, visual question answering, object detection, scene layout, common sense reasoning, etc.
- Audio & speech recognition (??): wasn’t mentioned in Kosmos-1 paper, but Whisper is already an OpenAI API and should be fairly easy to integrate.
Note: the predictions are based on what Andreas Braun, Microsoft Germany CTO, allegedly said. They may or may not be accurate (that’s why I call it “prediction”). But Kosmos-1 is very real and rock solid. It offers a glimpse of either GPT-4 or whatever AI service that Microsoft will provide next. I find it difficult to believe Kosmos-1 will stay in the lab and not become a product. In any case, prepare yourself for multimodal APIs - they’ll happen sooner or later!

957

4,644

1,510,591

Jim Fan · Jan 18, 2023 · 3:00 PM UTC

Jim Fan

@DrJimFan

18 Jan 2023

How to make ChatGPT 100x better at solving math, science, and engineering problems for real? Teach it to use the Wolfram language. ChatGPT: the best neural reasoning engine. Mathematica: the best symbolic reasoning engine. I can’t think of a happier marriage. 🧵 with example:

674

4,509

1,165,080

Jim Fan · May 3, 2024 · 4:15 PM UTC

Jim Fan

@DrJimFan

3 May 2024

We trained a robot dog to balance and walk on top of a yoga ball purely in simulation, and then transfer zero-shot to the real world. No fine-tuning. Just works. I’m excited to announce DrEureka, an LLM agent that writes code to train robot skills in simulation, and writes more code to bridge the difficult simulation-reality gap. It fully automates the pipeline from new skill learning to real-world deployment. The Yoga ball task is particularly hard because it is not possible to accurately simulate the bouncy ball surface. Yet DrEureka has no trouble searching over a vast space of sim-to-real configurations, and enables the dog to steer the ball on various terrains, even walking sideways! Traditionally, the sim-to-real transfer is achieved by domain randomization, a tedious process that requires expert human roboticists to stare at every parameter and adjust by hand. Frontier LLMs like GPT-4 have tons of built-in physical intuition for friction, damping, stiffness, gravity, etc. We are (mildly) surprised to find that DrEureka can tune these parameters competently and explain its reasoning well. DrEureka builds on our prior work Eureka, the algorithm that teaches a 5-finger robot hand to do pen spinning. It takes one step further on our quest to automate the entire robot learning pipeline by an AI agent system. One model that outputs strings will supervise another model that outputs torque control. We open-source everything! Welcome you all to check out the paper, more videos, and try the codebase today: eureka-research.github.io/dr… Code: github.com/eureka-research/D…
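Domain randomization, as mentioned above, amounts to resampling physics parameters for each simulated episode so that the trained policy does not overfit to one setting. The sketch below is a minimal illustration under stated assumptions: the parameter names and ranges are invented for the example and are not DrEureka's configuration. In the tweet's framing, choosing these ranges is the part the LLM automates.

```python
import random

# Hypothetical randomization ranges; in practice these are hand-tuned by
# roboticists or, per the tweet, proposed by an LLM.
PHYSICS_RANGES: dict[str, tuple[float, float]] = {
    "friction": (0.3, 1.2),
    "damping": (0.0, 0.1),
    "stiffness": (20.0, 60.0),
}


def sample_physics(
    ranges: dict[str, tuple[float, float]], rng: random.Random
) -> dict[str, float]:
    """Draw one set of physics parameters uniformly from the given ranges."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(0)
# One parameter set per training episode.
episode_params = [sample_physics(PHYSICS_RANGES, rng) for _ in range(5)]
```

Each entry of `episode_params` would configure the simulator for one episode before the policy is rolled out.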

176

784

4,580

908,625

Jim Fan · Feb 22, 2024 · 3:57 PM UTC

Jim Fan

@DrJimFan

22 Feb 2024

The first time I met Jensen was also the first time I met @elonmusk. I was interning at OpenAI that day and witnessed the moment Jensen handed Elon the first DGX. I slipped in my signature ;) Elon, if you recall, I asked how "we (OpenAI) can beat DeepMind". You told me, "by democratizing AI for everyone". And Jensen just asked me to do an internship at NVIDIA after OpenAI. I'm a simple guy, so I did. #NVDA Good old times.

Elon Musk

@elonmusk

18 Feb 2024

Replying to @elonmusk

Some pics from when Jensen delivered the first @Nvidia AI system to @OpenAI

269

4,286

1,009,859

Jim Fan · Jun 12, 2023 · 4:02 PM UTC

Jim Fan

@DrJimFan

12 Jun 2023

Today 6 years ago, "Attention is All You Need" went on Arxiv! Happy birthday Transformer! 🎂 Fun facts:
- Transformer did not invent attention, but pushed it to the extreme. The first attention paper was published 3 years prior (2014) and had an unassuming title: "Neural Machine Translation by Jointly Learning to Align and Translate", from Yoshua Bengio's lab. It is a combination of RNN + "context vectors" (i.e. attention). Many of you likely haven't heard about this paper, but it's one of the greatest milestones in NLP and has been cited 29K times (compared to Transformer's 77K).
- Neither Transformer nor the original attention paper talked about the general-purpose sequence computer. Instead, both were conceived as solutions to one narrow & specific problem: machine translation. It's remarkable that AGI (some day soon) can trace its origin to the humble Google Translate. 😅
- Transformer was published at NeurIPS 2017, one of the top AI conferences worldwide. Yet it didn't even get an Oral presentation, let alone awards. There were 3 best papers at NeurIPS that year. Combined, they have 529 citations as of today.

912

4,242

1,716,642

Jim Fan · Feb 2, 2023 · 4:51 PM UTC

Jim Fan

@DrJimFan

2 Feb 2023

Music & sound effect industry has not fully understood the size of the storm about to hit. There’re not just one, or two, but FOUR audio models in the past week *alone* If 2022 is the year of pixels for generative AI, then 2023 is the year of sound waves. Deep dive with me: 🧵

905

4,229

1,050,669

Jim Fan · Apr 14, 2023 · 3:36 PM UTC

Jim Fan

@DrJimFan

14 Apr 2023

We are looking at the future of VR, YouTube & Google Street View. This is zip-NeRF, a 3D neural rendering tech rapidly approaching the quality of a real, high-res drone flight video. Think of NeRF as transporting reality into simulation. Metaverse will finally work this time.

143

681

4,075

1,260,379

Jim Fan · Apr 5, 2023 · 4:12 PM UTC

Jim Fan

@DrJimFan

5 Apr 2023

Reading @MetaAI's Segment-Anything, and I believe today is one of the "GPT-3 moments" in computer vision. It has learned the *general* concept of what an "object" is, even for unknown objects, unfamiliar scenes (e.g. underwater & cell microscopy), and ambiguous cases. I still can't believe both the model and data (11M images, 1B masks) are OPEN-sourced. Wow.😮 What's the secret sauce? Just follow the foundation model mindset:
1. A very simple but scalable architecture that takes multimodal prompts: text, key points, bounding boxes.
2. Intuitive human annotation pipeline that goes hand-in-hand with the model design.
3. A data flywheel that allows the model to bootstrap itself to tons of unlabeled images.
IMHO, Segment-Anything has done everything right.

722

4,178

1,265,755

Jim Fan · Dec 27, 2022 · 2:35 PM UTC

Jim Fan

@DrJimFan

27 Dec 2022

The AI explosion is warping our sense of time. Can you believe Stable Diffusion is only 4 months old, and ChatGPT <4 weeks old 🤯? If you blink, you miss a whole new industry. Here are my TOP 10 AI spotlights, from a breathtaking 2022 in rewind ⏮: a long thread 🧵

937

4,160

1,013,198

Jim Fan · Mar 16, 2024 · 5:28 PM UTC

Jim Fan

@DrJimFan

16 Mar 2024

We live in such strange times. Apple, a company famous for its secrecy, published a paper with a staggering amount of detail on their multimodal foundation model. Those who are supposed to be open are now wayyy less open than Apple. MM1 is a treasure trove of analysis. They discuss lots of architecture designs and even disclose that they train on GPT-4V-generated data. They provide exact scaling law coefficients (to 4 significant figures), MoE settings, and even optimal learning rate functions. I have not seen this level of detail from a big tech whitepaper for a very, very long time. Apple's so back!

710

4,194

559,119

Jim Fan · Jan 7, 2025 · 6:31 AM UTC

Jim Fan

@DrJimFan

7 Jan 2025

Introducing NVIDIA Cosmos, an open-source, open-weight Video World Model. It's trained on 20M hours of videos and weighs from 4B to 14B. Cosmos offers two flavors: diffusion (continuous tokens) and autoregressive (discrete tokens); and two generation modes: text->video and text+video->video. Physical AI has a big data problem. Synthetic data to the rescue! We apply Cosmos to large-scale synthetic data generation for robotics and autonomous driving, and now you can too! It's all yours to finetune. Check it out: github.com/NVIDIA/Cosmos

707

4,108

567,234

Jim Fan · Mar 18, 2024 · 8:07 PM UTC

Jim Fan

@DrJimFan

18 Mar 2024

Jensen Huang is the new Taylor Swift

123

510

4,063

540,402

Jim Fan · Feb 5, 2024 · 5:30 PM UTC

Jim Fan

@DrJimFan

5 Feb 2024

MidJourney hired an engineer from Apple Vision Pro to be "Head of Hardware". My best guess is that they are thinking about generating full synthetic worlds for AR/VR, because of their rumored works on text-to-3D. Data-driven simulation is a hot topic at NVIDIA and very dear to my heart. Congrats to the Vision Pro engineer who found a new adventure! Would love to see what MidJourney comes up with. First discovered by @zackhargett

375

3,935

701,058

Jim Fan · Mar 27, 2023 · 5:03 PM UTC

Jim Fan

@DrJimFan

27 Mar 2023

Enough with LLMs - exciting things are happening in the world of atoms. This is Stanford ALOHA, a low-cost and agile robot platform. The whole system is open-source (!!): hardware design, CAD models for 3D printing, simulator, and training code. Time to graft a physical arm onto GPTs 🦾 Led by my friend @tonyzzhao at Stanford, and advised by @chelseabfinn @svlevine @Vikashplus. Project page: tonyzhaozh.github.io/aloha/ If you don't want to 3D print your own components, you can buy the setup at trossenrobotics.com/aloha.as…

888

3,903

818,377

Jim Fan · Aug 11, 2023 · 4:45 PM UTC

Jim Fan

@DrJimFan

11 Aug 2023

This is an ape ("Kanzi") playing Minecraft! A fascinating experiment on non-human biological neural networks 🙉 I've been teaching AI to play Minecraft for too long. There're so many similar techniques that the ape trainers used: - In-context reinforcement learning: Kanzi gets a fruit or peanut whenever he hits a marked milestone in the game, incentivizing him to follow the in-game guides. - RLHF: Kanzi doesn't understand much language, but he can see the trainers cheering him on, and he occasionally cheers them back! That gives him a strong signal that he's on the right track. - Imitation learning: the trainers show Kanzi just 1 demonstration of how to do a task, and he immediately grasps the concepts. It's much more efficient than using rewards alone. - Curriculum learning: they start with very simple environments to gradually teach Kanzi the controls. At the end, Kanzi is able to navigate complex caves, mazes, and the Nether. It also amazes me how strong the ape's vision system is. Kanzi never saw Minecraft in his life, and for sure his ancestors didn't either. Yet he rapidly adapts to Minecraft's texture and physics, which are dramatically different from the natural world. This level of generalization is far beyond what our most powerful vision models can do today. We are right in the thick of Moravec's paradox again: our best AIs are approaching human level on understanding language, but far behind animals on parsing pixels. From YouTube channel "ChrisDaCow": piped.video/watch?v=UKpFoYqN… Researchers are from the Ape Initiative, a non-profit org.

111

814

3,778

745,929

Jim Fan · Sep 14, 2023 · 2:03 PM UTC

Jim Fan

@DrJimFan

14 Sep 2023

This is the way to unlock the next trillion high-quality tokens, currently frozen in textbook pixels that are not LLM-ready. Nougat: an open-source OCR model that accurately scans books with heavy math/scientific notations. It's ages ahead of other open OCR options. Meta is doing extraordinary open-source AI, sometimes without as much fanfare as Llama. My first serious AI research project (back @Columbia, 2012) was to convert chemical engineering PDFs into NLP-ready corpus. I still remember the immense pain of Tesseract, a much older OCR system (github.com/tesseract-ocr/tes…). Now Nougat runs a powerful Swin Transformer backbone and blows the benchmarks out of the water. We're talking about double-digit improvements across all metrics. Now, textbooks are all we need for the next GPT! Website: facebookresearch.github.io/n… Open-source code: github.com/facebookresearch/… Paper "Nougat: Neural Optical Understanding for Academic Documents": arxiv.org/abs/2308.13418

117

735

3,869

1,088,888

Jim Fan · Mar 7, 2023 · 6:56 PM UTC

Jim Fan

@DrJimFan

7 Mar 2023

After ChatGPT, the future belongs to multimodal LLMs. What’s even better? Open-sourcing. Announcing Prismer, my team’s latest vision-language AI, empowered by domain-expert models in depth, surface normal, segmentation, etc. No paywall. No forms. Batteries included: pre-trained weights, inference code, and even training/finetuning scripts (!!) Welcome you all to try today: github.com/NVlabs/Prismer Paper: arxiv.org/abs/2303.02506 Website: shikun.io/projects/prismer This work is led by our awesome summer intern @liu_shikun at @NVIDIAAI. Deep dive with me: 🧵

757

3,849

943,949

Jim Fan · Apr 16, 2023 · 3:03 PM UTC

Jim Fan

@DrJimFan

16 Apr 2023

AutoGPT just exceeded PyTorch itself in GitHub stars (74k vs 65k). I see AutoGPT as a fun experiment, as the authors point out too. But nothing more. Prototypes are not meant to be production-ready. Don't let media fool you - most of the "cool demos" are heavily cherry-picked: 🧵

134

507

3,778

1,156,745

Jim Fan · Sep 12, 2023 · 2:58 PM UTC

Jim Fan

@DrJimFan

12 Sep 2023

A neural network can smell like humans do for the first time!👃🏽 Digital smell is a modality that the AI community has long ignored, but it may one day be useful for a robot chef 👩🏽‍🍳. Here's how to do smell2text: 1. Collect 5,000 molecules and ask humans to label them "creamy, chocolate, alcoholic, beefy, spicy, citrus", etc. This dataset is one of a kind and a huge contribution from the paper. 2. Train a graph neural network (GNN) to map each molecule to its labels. Each molecule is a graph of atoms described by valence, degree, hydrogen count, hybridization, formal charge, atomic number, etc. 3. The GNN predictions match well with expert humans on novel smells. 4. The embeddings give us a "Principal Odor Map (POM)" that faithfully represents hierarchies and distances among odorants.
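To make step 2 concrete, here is a minimal sketch of a molecule represented as a graph of atoms, followed by one round of neighbor averaging and a mean-pool readout. It is an illustration only, not the paper's model: the feature choice (atomic number, degree, hydrogen count), the ethanol example, and the function names are assumptions, and a real GNN would use learned weights instead of the fixed "add the neighbor mean" rule used here.

```python
# Hypothetical toy example: ethanol's heavy atoms (C - C - O) as a graph.
# Each atom is described by [atomic_number, degree, hydrogen_count].
atom_features = [
    [6.0, 1.0, 3.0],  # CH3 carbon
    [6.0, 2.0, 2.0],  # CH2 carbon
    [8.0, 1.0, 1.0],  # OH oxygen
]
bonds = [(0, 1), (1, 2)]  # undirected edges between atom indices


def neighbors_of(num_atoms, edges):
    """Build an adjacency list from undirected edges."""
    adjacency = [[] for _ in range(num_atoms)]
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def message_passing_step(features, edges):
    """One unlearned message-passing round: h_i <- h_i + mean(h_j for neighbors j)."""
    adjacency = neighbors_of(len(features), edges)
    updated = []
    for i, h in enumerate(features):
        nbrs = adjacency[i]
        mean_msg = [sum(features[j][k] for j in nbrs) / len(nbrs) for k in range(len(h))]
        updated.append([h[k] + mean_msg[k] for k in range(len(h))])
    return updated


def mean_pool(features):
    """Readout: average atom vectors into one molecule-level embedding."""
    n = len(features)
    return [sum(h[k] for h in features) / n for k in range(len(features[0]))]


molecule_embedding = mean_pool(message_passing_step(atom_features, bonds))
```

In a trained model, an embedding like `molecule_embedding` would feed a classifier head that predicts the odor labels.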

117

739

3,816

836,252

Jim Fan · Jul 13, 2025 · 5:06 PM UTC

Jim Fan

@DrJimFan

13 Jul 2025

I've been a bit quiet on X recently. The past year has been a transformational experience. Grok-4 and Kimi K2 are awesome, but the world of robotics is a wondrous wild west. It feels like NLP in 2018 when GPT-1 was published, along with BERT and a thousand other flowers that bloomed. No one knew which one would eventually become ChatGPT. Debates were heated. Entropy was sky high. Ideas were insanely fun. I believe the GPT-1 of robotics is already somewhere on Arxiv, but we don't know exactly which one. Could be world models, RL, learning from human video, sim2real, real2sim, etc. etc, or any combo of them. Debates are heated. Entropy is sky high. Ideas are insanely fun, instead of squeezing the last few % on AIME & GPQA. The nature of robotics also greatly complicates the design space. Unlike the clean world of bits for LLMs (text strings), we roboticists have to deal with the messy world of atoms. After all, there's a lump of software-defined metal in the loop. LLM normies may find it hard to believe, but so far roboticists still can't agree on a benchmark! Different robots have different capability envelopes - some are better at acrobatics while others at object manipulation. Some are meant for industrial use while others are for household tasks. Cross-embodiment isn't just a research novelty, but an essential feature for a universal robot brain. I've talked to dozens of C-suite leads from various robot companies, old and new. Some sell the whole body. Some sell body parts such as dexterous hands. Many more others sell the shovels to manufacture new bodies, create simulations, or collect massive troves of data. The business idea space is as wild as research itself. It's a new gold rush, the likes of which we haven't seen since the 2022 ChatGPT wave. The best time to enter is when non-consensus peaks. We're still at the start of a loss curve - there're strong signs of life, but far, far away from convergence. Every gradient step takes us into the unknown. 
But one thing I do know for sure - there's no AGI without touching, feeling, and being embodied in the messy world. On a more personal note - running a research lab comes with a whole new level of responsibility. Giving updates directly to the CEO of a $4T company is, to put it mildly, both thrilling and all-consuming of my attention weights. Gone are the days when I could stay on top of and dive deep into every AI news. I’ll try to carve out time to share more of my journey.

188

318

3,922

1,003,919

Jim Fan · Feb 23, 2024 · 3:34 PM UTC

Jim Fan

@DrJimFan

23 Feb 2024

Career update: I am co-founding a new research group called "GEAR" at NVIDIA, with my long-time friend and collaborator Prof. @yukez. GEAR stands for Generalist Embodied Agent Research. We believe in a future where every machine that moves will be autonomous, and robots and simulated agents will be as ubiquitous as iPhones. We are building the Foundation Agent — a generally capable AI that learns to act skillfully in many worlds, virtual and real. 2024 is the Year of Robotics, the Year of Gaming AI, and the Year of Simulation. We are setting out on a moon-landing mission, and getting there will spin off mountains of learnings and breakthroughs. Join us on the journey: research.nvidia.com/labs/gea…

228

436

3,699

617,865

Jim Fan · May 13, 2024 · 6:40 PM UTC

Jim Fan

@DrJimFan

13 May 2024

I know your timeline is flooded now with word salads of "insane, HER, 10 features you missed, we're so back". Sit down. Chill. <gasp> Take a deep breath like Mark does in the demo </gasp>. Let's think step by step: - Technique-wise, OpenAI has figured out a way to map audio to audio directly as first-class modality, and stream videos to a transformer in real-time. These require some new research on tokenization and architecture, but overall it's a data and system optimization problem (as most things are). High-quality data can come from at least 2 sources: 1) Naturally occurring dialogues on YouTube, podcasts, TV series, movies, etc. Whisper can be trained to identify speaker turns in a dialogue or separate overlapping speeches for automated annotation. 2) Synthetic data. Run the slow 3-stage pipeline using the most powerful models: speech1->text1 (ASR), text1->text2 (LLM), text2->speech2 (TTS). The middle LLM can decide when to stop and also simulate how to resume from interruption. It could output additional "thought traces" that are not verbalized to help generate better reply. Then GPT-4o distills directly from speech1->speech2, with optional auxiliary loss functions based on the 3-stage data. After distillation, these behaviors are now baked into the model without emitting intermediate texts. On the system side: the latency would not meet real-time threshold if every video frame is decompressed into an RGB image. OpenAI has likely developed their own neural-first, streaming video codec to transmit the motion deltas as tokens. The communication protocol and NN inference must be co-optimized. For example, there could be a small and energy-efficient NN running on the edge device that decides to transmit more tokens if the video is interesting, and fewer otherwise. - I didn't expect GPT-4o to be closer to GPT-5, the rumored "Arrakis" model that takes multimodal in and out. In fact, it's likely an early checkpoint of GPT-5 that hasn't finished training yet. 
The branding betrays a certain insecurity. Ahead of Google I/O, OpenAI would rather beat our mental projection of GPT-4.5 than disappoint by missing the sky-high expectation for GPT-5. A smart move to buy more time. - Notably, the assistant is much more lively and even a bit flirty. GPT-4o is trying (perhaps a bit too hard) to sound like HER. OpenAI is eating Character AI's lunch, with almost 100% overlap in form factor and huge distribution channels. It's a pivot towards more emotional AI with strong personality, which OpenAI seemed to actively suppress in the past. - Whoever wins Apple first wins big time. I see 3 levels of integration with iOS: 1) Ditch Siri. OpenAI distills a smaller-tier, purely on-device GPT-4o for iOS, with optional paid upgrade to use the cloud. 2) Native features to stream the camera or screen into the model. Chip-level support for neural audio/video codec. 3) Integrate with iOS system-level action API and smart home APIs. No one uses Siri Shortcuts, but it's time to resurrect. This could become the AI agent product with a billion users from the get-go. The FSD for smartphones with a Tesla-scale data flywheel.

104

615

3,418

991,629

Jim Fan · Jan 2, 2025 · 11:24 PM UTC

Jim Fan

@DrJimFan

2 Jan 2025

This is the most gut-wrenching blog I've read, because it's so real and so close to heart. The author is no longer with us. I'm in tears. AI is not supposed to be 200B weights of stress and pain. It used to be a place of coffee-infused eureka moments, of exciting late-night arxiv safaris, of wicked smart ideas that put a smile on our faces. But all the incoming capital and attention seem to be forcing everyone to race to the bottom. Jensen always tells us not to use phrases like "beat this, crush that". I absolutely love this perspective. We are here to lift up an entire ecosystem, not to send anyone to oblivion. I like to think of my work as expanding the pie. We need to bake the pie first, together, the bigger the better, before dividing it. It gives me comfort knowing that our team's works moved the needle for robotics, even just by a tiny bit. AI is not a zero sum game. In fact, it is perhaps the most positive-sum game that humanity ever plays. And we as a community should act this way. Take care of each other. Send love to "competitors" - because in the grand scheme of things, we are all coauthors of an accelerated future. I never had the privilege to know Felix irl, but I loved his research taste and set up a Google Scholar alert for every one of his new papers. His works in agents and VLMs had a big influence on mine. He would've been a great friend. I want to get to know him, but I couldn't any more. RIP Felix. May the next world have no wars to fight.

384

3,653

860,770

Jim Fan · Feb 8, 2023 · 4:13 PM UTC

Jim Fan

@DrJimFan

8 Feb 2023

Microsoft will let companies create their own ChatGPT. “BYOD”: Bring Your Own Data. Do you get the implication? Startups that are just thin wrappers around OpenAI API may finally get their moat! I think this is even more exciting than Bing+ChatGPT. Start collecting data now.

133

653

3,544

626,724

Jim Fan · Nov 25, 2023 · 5:29 PM UTC

Jim Fan

@DrJimFan

25 Nov 2023

Apparently people are starting to wear prosthetic fingers so that surveillance images look like they're generated by Stable Diffusion 😅 The human race is overfitting to the quirks of our AI overlords.

462

3,506

806,629

Jim Fan · Feb 15, 2023 · 6:11 PM UTC

Jim Fan

@DrJimFan

15 Feb 2023

The Adam optimizer is at the heart of modern AI. Researchers have been trying to dethrone Adam for years. How about we ask a machine to do a better job? @GoogleAI uses evolution to discover a simpler & efficient algorithm with remarkable features. It’s just 8 lines of code: 🧵
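The tweet does not name the algorithm. Assuming it refers to Lion, the sign-based optimizer from Google's "Symbolic Discovery of Optimization Algorithms" paper published that week, the update rule can be sketched as below. This is a plain-Python illustration on lists of floats rather than the authors' code; the function names and default hyperparameters are assumptions chosen for readability.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(weights, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion-style update (sketch).

    The step direction is the sign of an interpolation between the momentum
    and the current gradient, so every coordinate moves by the same magnitude.
    The momentum is then updated with a second, slower interpolation.
    """
    new_weights, new_momentum = [], []
    for w, g, m in zip(weights, grads, momentum):
        direction = sign(beta1 * m + (1 - beta1) * g)
        new_weights.append(w - lr * (direction + weight_decay * w))
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_weights, new_momentum


# Usage with made-up numbers:
w, m = lion_step([1.0, -1.0], [0.5, -2.0], [0.0, 0.0], lr=0.1)
```

Unlike Adam, this rule keeps a single momentum buffer per parameter and no second-moment estimate, which is where the memory saving comes from.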

544

3,495

836,482

Jim Fan · Mar 30, 2023 · 4:49 PM UTC

Jim Fan

@DrJimFan

30 Mar 2023

Chatbot UI: an MIT-licensed, community-driven clone of the ChatGPT UI. What most people don't realize is that you can pay *much less* to enjoy the same features as the official app. $20 worth of gpt-3.5 API is about writing a full Harry Potter book every single day for a month. github.com/mckaywrigley/chat… Built by the great @mckaywrigley
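A back-of-the-envelope check of the Harry Potter claim, under stated assumptions: the gpt-3.5-turbo launch price of $0.002 per 1K tokens (i.e. $2 per million), a rule-of-thumb 0.75 English words per token, and a 30-day month. None of these figures come from the tweet itself.

```python
budget_dollars = 20.0
price_per_million_tokens = 2.0   # assumption: $0.002 per 1K tokens
words_per_token = 0.75           # assumption: common rule of thumb
days_in_month = 30               # assumption

total_tokens = budget_dollars / price_per_million_tokens * 1_000_000
words_per_day = total_tokens * words_per_token / days_in_month

print(f"{total_tokens:,.0f} tokens per month, about {words_per_day:,.0f} words per day")
```

Under these assumptions the budget works out to roughly a quarter of a million words per day, which is in the range of the longest book in the series.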

501

3,435

1,391,009

Jim Fan · Apr 12, 2023 · 4:19 PM UTC

Jim Fan

@DrJimFan

12 Apr 2023

AutoGPT is a prototype of the next frontier: "Agent Smith" AI that recursively clones itself. Achieved by (1) identifying *when* its context gets overwhelming and needs offloading; (2) distilling the “cognitive overflow” part into a prompt directive for its clone; (3) talking back and forth with the newly minted siblings to get the job done. It's far from perfect yet, but we'll soon see this emergent paradigm get a lot more powerful. The funny thing is that GPT Agent Smiths can't do "Neuralink" via high-bandwidth Matrix. Texting is still the bottleneck 🤣

214

480

3,246

1,504,301

Jim Fan · Dec 6, 2023 · 4:49 AM UTC

Jim Fan

@DrJimFan

6 Dec 2023

This may be Apple's biggest move on open-source AI so far: MLX, a PyTorch-style NN framework optimized for Apple Silicon, e.g. laptops with M-series chips. The release did an excellent job on designing an API familiar to the deep learning audience, and showing minimalistic examples on OSS models that most people care about: Llama, LoRA, Stable Diffusion, and Whisper. I expect no less from my former colleague @awnihannun, spearheading this effort at Apple. Thanks for the early Christmas gift! 🎄🎁 MLX source: github.com/ml-explore/mlx Well-documented, self-contained examples: github.com/ml-explore/mlx-ex…

532

3,401

870,696

Jim Fan · Mar 23, 2023 · 5:42 PM UTC

Jim Fan

@DrJimFan

23 Mar 2023

OpenAI just announced ChatGPT Plugins. If ChatGPT's debut was the "iPhone event", today is the "iOS App Store" event. 3 official plugins available now: - Web browser: adding Bing in the loop - Code interpreter: adding a live Python interpreter in a sandboxed & firewalled execution environment - Retrieval: semantic search for your personal & organizational docs. Note that we do have an "Android App Store" already - the open-source LangChain ecosystem, built by @hwchase17. Open-source ftw! It's so interesting to see the LLM community mirror the iOS vs Android competition from now on: - iOS for LLM: openai.com/blog/chatgpt-plug… - Android for LLM: github.com/hwchase17/langcha…

591

3,336

801,463

Jim Fan · Apr 4, 2023 · 3:49 PM UTC

Jim Fan

@DrJimFan

4 Apr 2023

My guess is that MidJourney has been doing a massive-scale reinforcement learning from human feedback ("RLHF") - possibly the largest ever for text-to-image. When human users choose to upscale an image, it's because they prefer it over the alternatives. It'd be a huge waste not to use this as a reward signal - cheap to collect, and *exactly* aligned with what your user base wants. The more users you have, the better RLHF you can do. And then the more users you gain.
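A sketch of how an upscale click could become a reward signal, under the tweet's own speculation. Each upscale on a 4-image grid yields three (chosen, rejected) pairs, and a reward model can be trained with the standard pairwise (Bradley-Terry) preference loss. The function names and image ids are hypothetical, and nothing here reflects MidJourney's actual pipeline.

```python
import math


def preference_pairs(grid_image_ids, upscaled_id):
    """Turn one upscale click into (chosen, rejected) pairs against the other grid images."""
    return [(upscaled_id, other) for other in grid_image_ids if other != upscaled_id]


def pairwise_preference_loss(score_chosen, score_rejected):
    """Bradley-Terry loss: -log(sigmoid(score_chosen - score_rejected))."""
    return math.log(1.0 + math.exp(-(score_chosen - score_rejected)))


# Usage with made-up ids: the user upscaled image "img_2" from a grid of four.
pairs = preference_pairs(["img_1", "img_2", "img_3", "img_4"], "img_2")
loss_when_tied = pairwise_preference_loss(0.0, 0.0)
loss_when_right = pairwise_preference_loss(2.0, -1.0)
```

The loss shrinks as the reward model scores the chosen image further above the rejected one, so more clicks directly translate into more training pairs.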

102

366

3,336

1,315,939

Jim Fan · Apr 12, 2024 · 4:17 PM UTC

Jim Fan

@DrJimFan

12 Apr 2024

one day PhDs will animate every object around us with reinforcement learning to keep their thesis going

335

3,282

464,036

Jim Fan · Jan 27, 2025 · 2:13 PM UTC

Jim Fan

@DrJimFan

27 Jan 2025

An obvious, “we are so back” moment in the AI circle somehow turned into “it’s so over” in mainstream. > unbelievable shortsightedness > the power of o1 in the palm of every coder’s hand to study, explore, and iterate upon > ideas compound > the rate of compounding accelerates with open-source > the pie just got much bigger, faster > we, as one humanity, are marching towards universal AGI *sooner* > yes, sooner, you read that right > zero-sum game is for losers

225

412

3,315

353,199

Jim Fan · May 12, 2024 · 4:33 PM UTC

Jim Fan

@DrJimFan

12 May 2024

OpenAI is expected to demo a real-time voice assistant tomorrow. What does it take to deliver an immersive, or even magical experience? Almost all voice AI go through 3 stages: 1. Speech recognition or "ASR": audio -> text1, think Whisper; 2. LLM that plans what to say next: text1 -> text2; 3. Speech synthesis or "TTS": text2 -> audio, think ElevenLabs or VALL-E. Last year, I made the figure below to show how to make Siri/Alexa 10x better. However, naively going through 3 stages results in huge latency. User experience falls off the cliff if we have to wait 5 seconds for *each* reply. It breaks the immersion and feels lifeless even if the synthesized audio itself sounds real. Natural dialogues fundamentally don't work like this. We humans > think about what to say next at the same time as we listen & speak; > inject "yes, hmm, huh" at appropriate moments; > predict when the other person finishes and immediately take over; > decide to talk over the other person organically, without being offensive; > handle interruptions gracefully. Currently, AI assistants either cannot be interrupted (super frustrating) or simply stop when they detect an audio event and lose train of thought; > engage in group chat. We are so good at multi-agent conversations. It's not as simple as making each of the 3 neural nets faster, sequentially. Solving real-time dialogue requires us to rethink the whole stack, overlap each component as much as possible, and learn how to make interventions in real time. Or perhaps even better - just have 1 NN mapping audio to audio. End-to-end always wins. I'll sketch out how to design such a model and its training pipeline. Meanwhile, let's wait and see how far OpenAI pushes it!
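A toy latency model for the point about overlapping components. It compares running the three stages strictly one after another with streaming chunks through them as a pipeline. The per-chunk timings are invented for illustration, and real systems have many more sources of delay.

```python
def sequential_latency(stage_seconds, num_chunks):
    """Each stage waits for the previous stage to finish all chunks."""
    return num_chunks * sum(stage_seconds)


def pipelined_latency(stage_seconds, num_chunks):
    """Stages overlap: after the first chunk fills the pipeline,
    one chunk completes every max(stage_seconds)."""
    return sum(stage_seconds) + (num_chunks - 1) * max(stage_seconds)


# Hypothetical per-chunk costs in seconds for ASR, LLM, and TTS.
stage_seconds = [0.3, 0.5, 0.4]
num_chunks = 4

total_sequential = sequential_latency(stage_seconds, num_chunks)
total_pipelined = pipelined_latency(stage_seconds, num_chunks)
```

Even in this simplified model, overlapping only helps up to the slowest stage, which is one way to read the argument for a single audio-to-audio network.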

127

517

3,207

727,950

Jim Fan · Apr 6, 2023 · 4:38 PM UTC

Jim Fan

@DrJimFan

6 Apr 2023

You think MidJourney's /describe is just a cool new tool? Think again. I believe hidden behind /describe is MidJourney's next-generation data flywheel. /describe guesses the prompt from an image you upload. Then you can select from (or edit) 4 choices to generate more images. This provides a brilliant community-driven data engine: 1. You typically upload images that are interesting and useful. They may not be in the training set. So you contribute a nice jpg! 2. You select from 4 choices - that provides a reward signal for the captioning model. Reinforcement learning from human feedback ("RLHF") hard at work. 3. If you edit the prompt and preserve the meaning, that's instruction finetuning. You provide a human-written "groundtruth" description. 4. If you edit the prompt and change the meaning, that helps MJ's future "edit recommendation" engine, if there will be one. (3) vs (4) can be classified by an LLM. 5. Your (prompt, image) pair will be saved to MJ's database to train BOTH diffusion and captioning models. Rinse and repeat. TL;DR: you are happily doing high-quality data annotation for free.

Image credit: https://medium.com/the-generator/midjourneys-crazy-new-describe-feature-a96cc09203cc

118

444

3,118

1,625,994

Jim Fan · Nov 24, 2023 · 5:15 PM UTC

Jim Fan

@DrJimFan

24 Nov 2023

In my decade spent on AI, I've never seen an algorithm that so many people fantasize about. Just from a name, no paper, no stats, no product. So let's reverse engineer the Q* fantasy. VERY LONG READ: To understand the powerful marriage between Search and Learning, we need to go back to 2016 and revisit AlphaGo, a glorious moment in the AI history. It's got 4 key ingredients: 1. Policy NN (Learning): responsible for selecting good moves. It estimates the probability of each move leading to a win. 2. Value NN (Learning): evaluates the board and predicts the winner from any given legal position in Go. 3. MCTS (Search): stands for "Monte Carlo Tree Search". It simulates many possible sequences of moves from the current position using the policy NN, and then aggregates the results of these simulations to decide on the most promising move. This is the "slow thinking" component that contrasts with the fast token sampling of LLMs. 4. A groundtruth signal to drive the whole system. In Go, it's as simple as the binary label "who wins", which is decided by an established set of game rules. You can think of it as a source of energy that *sustains* the learning progress. How do the components above work together? AlphaGo does self-play, i.e. playing against its own older checkpoints. As self-play continues, both Policy NN and Value NN are improved iteratively: as the policy gets better at selecting moves, the value NN obtains better data to learn from, and in turn it provides better feedback to the policy. A stronger policy also helps MCTS explore better strategies. That completes an ingenious "perpetual motion machine". In this way, AlphaGo was able to bootstrap its own capabilities and beat the human world champion, Lee Sedol, 4-1 in 2016. An AI can never become super-human just by imitating human data alone. ----- Now let's talk about Q*. What are the corresponding 4 components? 1. 
Policy NN: this will be OAI's most powerful internal GPT, responsible for actually implementing the thought traces that solve a math problem. 2. Value NN: another GPT that scores how likely each intermediate reasoning step is correct. OAI published a paper in May 2023 called "Let's Verify Step by Step", coauthored by big names like @ilyasut @johnschulman2 @janleike: arxiv.org/abs/2305.20050 It's much less well known than DALL-E or Whisper, but gives us quite a lot of hints. This paper proposes "Process-supervised Reward Models", or PRMs, that give feedback for each step in the chain-of-thought. In contrast, "Outcome-supervised reward models", or ORMs, only judge the entire output at the end. ORMs are the original reward model formulation for RLHF, but they're too coarse-grained to properly judge the sub-parts of a long response. In other words, ORMs are not great for credit assignment. In RL literature, we call ORMs "sparse reward" (only given once at the end), and PRMs "dense reward" that smoothly shapes the LLM to our desired behavior. 3. Search: unlike AlphaGo's discrete states and actions, LLMs operate on a much more sophisticated space of "all reasonable strings". So we need new search procedures. Expanding on Chain of Thought (CoT), the research community has developed a few nonlinear CoTs: - Tree of Thought: literally combining CoT and tree search: arxiv.org/abs/2305.10601 @ShunyuYao12 - Graph of Thought: yeah you guessed it already. Turn the tree into a graph and Voilà! You get an even more sophisticated search operator: arxiv.org/abs/2308.09687 4. Groundtruth signal: a few possibilities: (a) Each math problem comes with a known answer. OAI may have collected a huge corpus from existing math exams or competitions. (b) The ORM itself can be used as a groundtruth signal, but then it could be exploited and "loses energy" to sustain learning. 
(c) A formal verification system, such as the Lean Theorem Prover, can turn math into a coding problem and provide compiler feedback: lean-lang.org/ And just like AlphaGo, the Policy LLM and Value LLM can improve each other iteratively, as well as learn from human expert annotations whenever available. A better Policy LLM will help the Tree of Thought Search explore better strategies, which in turn collect better data for the next round. @demishassabis said a while back that DeepMind Gemini will use "AlphaGo-style algorithms" to boost reasoning. Even if Q* is not what we think, Google will certainly catch up with their own. If I can think of the above, they surely can. Note that what I described is just about reasoning. Nothing says Q* will be more creative in writing poetry, telling jokes @grok, or role playing. Improving creativity is a fundamentally human thing, so I believe natural data will still outperform synthetic ones. I welcome any thoughts or feedback!!
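The ORM vs PRM distinction above can be illustrated with a toy example. Here the "reward models" are a rule-based arithmetic checker instead of learned networks, and all function names and the sample chain are hypothetical. The point is only the shape of the signal: one sparse reward at the end versus one reward per step.

```python
def step_is_correct(step):
    """Toy verifier for a step written as 'a + b = c' or 'a * b = c'."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    result = int(a) + int(b) if op == "+" else int(a) * int(b)
    return result == int(rhs)


def orm_rewards(steps, final_answer, known_answer):
    """Outcome supervision: zeros everywhere except a single reward at the end."""
    outcome = 1.0 if final_answer == known_answer else 0.0
    return [0.0] * (len(steps) - 1) + [outcome]


def prm_rewards(steps):
    """Process supervision: every intermediate step gets its own reward."""
    return [1.0 if step_is_correct(s) else 0.0 for s in steps]


# A chain of thought with a mistake in the second step (5 * 4 is 20, not 21).
chain = ["2 + 3 = 5", "5 * 4 = 21", "21 + 1 = 22"]

sparse = orm_rewards(chain, final_answer=22, known_answer=21)
dense = prm_rewards(chain)
```

The sparse signal only says the final answer was wrong, while the dense signal pinpoints the faulty step, which is the credit-assignment advantage described in the thread.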

147

643

3,152

1,795,657

Jim Fan · Mar 18, 2024 · 8:54 PM UTC

Jim Fan

@DrJimFan

18 Mar 2024

Blackwell, the new beast in town. > DGX Grace-Blackwell GB200: exceeding 1 Exaflop compute in a single rack. > Put numbers in perspective: the first DGX that Jensen delivered to OpenAI was 0.17 Petaflops. > GPT-4-1.8T parameters can finish training in 90 days on 2000 Blackwells. > New Moore’s law is born.
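A quick ratio of the two compute figures quoted above. The only inputs are the tweet's own numbers; note that headline FLOP figures for different hardware generations are often quoted at different numeric precisions, so the ratio should be treated as indicative, not as a like-for-like speedup.

```python
PETAFLOPS_PER_EXAFLOP = 1000

gb200_rack_petaflops = 1 * PETAFLOPS_PER_EXAFLOP  # "exceeding 1 Exaflop", taken as a lower bound
first_dgx_petaflops = 0.17                        # figure quoted in the tweet

ratio = gb200_rack_petaflops / first_dgx_petaflops
print(f"At least ~{ratio:,.0f}x the first DGX, per rack")
```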

115

489

2,855

397,453

Jim Fan · Oct 30, 2024 · 3:12 PM UTC

Jim Fan

@DrJimFan

30 Oct 2024

Not every foundation model needs to be gigantic. We trained a 1.5M-parameter neural network to control the body of a humanoid robot. It takes a lot of subconscious processing for us humans to walk, maintain balance, and maneuver our arms and legs into desired positions. We capture this “subconsciousness” in HOVER, a single model that learns how to coordinate the motors of a humanoid robot to support locomotion and manipulation. We trained HOVER in NVIDIA Isaac, a GPU-powered simulation suite that accelerates physics to 10,000x faster than real time. To put the number in perspective, the robots undergo 1 year of intense training in a virtual “dojo”, but take only ~50 minutes of wall clock time on one GPU card. The neural net then transfers zero-shot to the real world without finetuning. HOVER can be *prompted* for various types of high-level motion instructions that we call “control modes”. To name a few: - Head and hand poses: can be captured by XR devices like Apple Vision Pro. - Whole-body poses: via MoCap or RGB camera. - Whole-body joint angles: Exoskeleton. - Root velocity command: Joysticks. What HOVER enables: - A unified interface for us to control the robot using whichever input devices are convenient at hand. - An easier way to collect whole-body teleoperation data for training. - An upstream Vision-Language-Action model to provide motion instructions, which HOVER translates to low-level motor signals at high frequency. HOVER supports any humanoid that can be simulated in Isaac. Bring your own robot, and watch it come to life! It's a big team effort from NVIDIA GEAR Lab and collaborators: 🧵
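The "1 year in ~50 minutes" figure follows directly from the 10,000x speedup. A quick check, assuming a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # minutes of simulated experience in one year
speedup = 10_000                   # simulation speed relative to real time

wall_clock_minutes = MINUTES_PER_YEAR / speedup
print(f"{wall_clock_minutes:.1f} minutes of wall clock time")
```

The result lands slightly above 50 minutes, consistent with the rounded figure in the post.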

114

490

3,100

568,665

Jim Fan · Sep 24, 2023 · 4:28 PM UTC

Jim Fan

@DrJimFan

24 Sep 2023

Let's reverse engineer the phenomenal Tesla Optimus. No insider info, just my own analysis. Long read: 1. The smooth hand movements are almost certainly trained by imitation learning ("behavior cloning") from human operators. The alternative is reinforcement learning in simulation, but that typically leads to jittery motion and unnatural hand poses. There're at least 4 ways to collect human demonstrations: (1) A custom-built teleoperation system - I believe this is the most likely means used by the Tesla team. Open-source example: ALOHA, a low-cost bimanual robot arm and teleoperation system by Stanford AI Labs (tonyzhaozh.github.io/aloha/). It enables very precise, dexterous motions, such as putting AAA batteries into a remote or manipulating contact lenses. (2) Motion Capture (MoCap): apply the MoCap systems used for Hollywood movies to capture the fine-grained motions of hand joints. Optimus' 5-finger hand is a great design decision that enables a direct mapping - there is no "embodiment gap" from human operators. For instance, a demonstrator can wear a CyberGlove (cyberglovesystems.com/) and grasp the cubes on the table (as shown in video). CyberGlove will capture the motion signals & haptic feedback in real-time, which can be re-targeted onto Optimus. (3) Wearing gloves & markers can be clumsy. An alternative way to do MoCap is through computer vision. DexPilot from NVIDIA enables marker-less and glove-free data collection. The human operator simply uses their bare hands to perform the tasks. 4 Intel RealSense depth cameras and 2 NVIDIA Titan XP GPUs (yeah, 2019 work) translate the pixels to precise motion signals for robot learning. (4) VR Headset: turn the training room into a VR game, and let humans "role play" Optimus. Use the native VR controller or CyberGlove to control the virtual Optimus hands. This has the advantage of scalable remote data collection - annotators from around the world can contribute without coming onsite. The VR demonstration technique appeared in research projects like the iGibson home robot simulator, an initiative that I participated in at Stanford: svl.stanford.edu/igibson/ The above 4 are not mutually exclusive. Optimus could use a combo of them for different pros & cons.
2. Neural Architecture. Optimus is trained end-to-end: videos in, actions out. I'm quite sure it's implemented by a multimodal Transformer with the following components: (1) Image: some variant of efficient ViT, or simply an old ResNet/EfficientNet backbone (arxiv.org/abs/1905.11946). The block pick-and-place demo doesn't require sophisticated vision. The spatial feature map from the image backbone can be tokenized easily. (2) Video: two ways. Either flatten the video into a sequence of images and produce tokens independently, or have a video-level tokenizer. There're numerous ways to efficiently process video pixel volumes. You don't necessarily need Transformer backbones, e.g. SlowFast Network (arxiv.org/abs/1812.03982) and RubiksNet (stanfordvl.github.io/rubiksn…, my paper at ECCV 2020, efficient CUDA shift primitives). (3) Language: it's not clear if Optimus is language prompted. If it is, there needs to be a way to "fuse" the language representations into perception. FiLM is a very lightweight neural network module that serves this purpose (arxiv.org/abs/1709.07871). You can think of it intuitively as a "cross attention" of language embedding into the image-processing neural pathway. (4) Action tokenization: Optimus needs to convert the continuous motion signals into discrete tokens for the autoregressive Transformer to work. A few ways: - Directly bin the continuous values for each hand joint control. [0, 0.01) -> token #0, [0.01, 0.02) -> token #1, etc. This is straightforward but could be inefficient due to the long sequence length. - The joint movements are highly dependent on each other, which means they occupy a low-dimensional "state space". Apply VQVAE to the motion data to obtain a shorter-length, compressed token set. (5) Putting the above pieces together, we have a Transformer controller that consumes video tokens (optionally with language modulation), and outputs action tokens, one step at a time. The next frame from the table is fed back to the Transformer, so it knows the consequence of its action. That gives the self-corrective ability shown in the demo. I believe the architecture is most similar to: - Google RT-1: blog.research.google/2022/12… - NVIDIA VIMA: vimalabs.github.io/
3. Lastly, I'm genuinely impressed by the hardware quality. The motions are fluid, and the aesthetics are amazing as well. As I mentioned above, it's such a great decision to follow human morphology closely, so that there is no gap in imitating humans. Atlas from Boston Dynamics only has simple gripper-style hands. In the long run, Optimus' bi-dexterous, 5-finger hands will prove far superior in daily tasks. Congrats to @Tesla_Optimus team & @elonmusk 🎉! I look forward to seeing the bots roam Mars some day 🦾

148

542

3,039

1,933,082
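The direct-binning option for action tokenization described in the Optimus thread above can be sketched roughly as follows. This is a minimal illustration, not Tesla's implementation: the function names, the normalized [0, 1) joint range, and the 100-bin resolution (chosen to match the 0.01-wide bins in the tweet) are all assumptions.

```python
import math

def bin_action(value, low=0.0, high=1.0, num_bins=100):
    """Map one continuous joint value to a discrete token ID.

    With the defaults, [0, 0.01) -> 0, [0.01, 0.02) -> 1, and so on.
    Out-of-range values are clamped to the first or last bin.
    """
    token = math.floor((value - low) / (high - low) * num_bins)
    return max(0, min(num_bins - 1, token))

def unbin_action(token, low=0.0, high=1.0, num_bins=100):
    """Map a token ID back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (token + 0.5) * width

def tokenize_pose(joint_values):
    """One token per joint: a pose with N joints becomes N tokens."""
    return [bin_action(v) for v in joint_values]

# Hypothetical normalized joint readings for a single control step.
pose = [0.005, 0.015, 0.5, 0.999]
tokens = tokenize_pose(pose)
recovered = [unbin_action(t) for t in tokens]
```

Each control step costs one token per joint, which is the sequence-length overhead the thread points to; a learned codebook such as VQ-VAE would replace `tokenize_pose` with a shorter code per pose.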

Jim Fan · Mar 15, 2023 · 3:09 PM UTC

Jim Fan

@DrJimFan

15 Mar 2023

Stop saying: AI will replace humans. Start saying: humans who know how to use AI at work will replace those who don’t.

209

555

2,988

348,901

Jim Fan · Jan 11, 2023 · 6:34 PM UTC

Jim Fan

@DrJimFan

11 Jan 2023

Many people don’t understand how challenging Minecraft is for AI agents. Let me put it this way. AlphaGo solves a board game with only 1 task, countably many states, and full observability. Minecraft has infinite tasks, infinite gameplay, and tons of hidden world knowledge. 🧵

ALT Parody of the GPT-3 vs 4 parameter figure widely circulating on twitter (that one is very likely wrong, and honestly meaningless)

413

3,039

746,093

Jim Fan · Jan 24, 2025 · 2:34 PM UTC

Jim Fan

@DrJimFan

24 Jan 2025

Whether you like it or not, the future of AI will not be canned genies controlled by a "safety panel". The future of AI is democratization. Every internet rando will run not just o1, but o8, o9 on their toaster laptop. It's the tide of history that we should surf on, not swim against. Might as well start preparing now. DeepSeek just topped Chatbot Arena, my go-to vibe checker in the wild, and two other independent benchmarks that couldn't be hacked in advance (Artificial-Analysis, HLE). Last year, there were serious discussions about limiting OSS models by some compute threshold. Turns out it was nothing but our Silicon Valley hubris. It's a humbling wake-up call to us all that open science has no boundary. We need to embrace it, one way or another. Many tech folks are panicking about how much DeepSeek is able to show with so little compute budget. I see it differently - with a huge smile on my face. Why are we not happy to see *improvements* in the scaling law? DeepSeek is unequivocal proof that one can produce unit intelligence gain at 10x less cost, which means we shall get 10x more powerful AI with the compute we have today and are building tomorrow. Simple math! The AI timeline just got compressed. Here's my 2025 New Year resolution for the community: No more AGI/ASI urban myth spreading. No more fearmongering. Put our heads down and grind on code. Open source, as much as you can. Acceleration is the only way forward.

211

626

3,028

465,064

Jim Fan · Jul 30, 2023 · 6:04 PM UTC

Jim Fan

@DrJimFan

30 Jul 2023

Kaiming He, inventor of ResNet, is leaving industry to join the MIT faculty in 2024!! He’s one of the most impactful figures in deep learning. - Residual layer is a fundamental building block of LLMs. - Faster/Mask R-CNN are industrial standards for image segmentation and robot perception stacks. - Panoptic segmentation redefined a research sub-field in vision. - Masked Autoencoder (MAE) is among the best general-purpose self-supervised algorithms for computer vision and beyond. - Before MAE, Momentum Contrast (MoCo) was a SOTA contrastive learning technique. - SlowFast network was among the default backbones for video learning until ViTs took over. - Too many other groundbreaking works to enumerate … I've recently observed an exodus of researchers from big tech to academia. It’s an interesting movement given the current LLM gold rush 🤔 Normally we congratulate someone who joins MIT, but this time I congratulate @MITEECS on having Kaiming! 🎉

327

3,015

729,677

Jim Fan · Aug 31, 2023 · 5:05 PM UTC

Jim Fan

@DrJimFan

31 Aug 2023

This is a neural network flying a drone at extremely high speed, beating human champions in FPV drone racing. - Reinforcement learning as a tool is so marvelously versatile. It's able to solve both fast, reactive tasks and slow, deliberate tasks (ChatGPT RLHF). - Trained in large-scale simulation, finetuned in real world - I believe this is the paradigm that will get us to generalist robot some day. Published on Nature's cover, "Champion-Level Drone Racing using Deep Reinforcement Learning." nature.com/articles/s41586-0… Authors: Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun & Davide Scaramuzza

608

2,989

598,596

Jim Fan · Nov 18, 2023 · 4:40 PM UTC

Jim Fan

@DrJimFan

18 Nov 2023

I was OpenAI's first intern in 2016. I used to chat about the next learning paradigm with @ilyasut, engineering with @gdb, and scaling & safety with Dario. That summer reshaped my perspective and taste on AI research forever. I have huge admiration and respect for all of them. Yesterday's events didn't change that. Greg, Ilya, and Dario are among the most brilliant and mission-driven people I know, even though their missions now diverge into OpenAI, Anthropic, and whatever Greg/Sam are starting next. Betting against any of them is a bad idea. I do think they would be stronger together. 1+1+1 >> 3. OAI in 2016 didn't have GPTs, but it surely had the greatest team ever assembled in AI. I was fortunate to be there and witness history. Zooming out, we will inevitably see the birth of new heavyweight competitors, and that isn't necessarily a bad thing for the community. AI will be a bit more decentralized. New capital will pour in. Every party will act with more urgency. And new PhD grads will have at least one extra job offer to consider. This isn't the timeline I expect, but may not be the worst timeline we'll get.

179

2,995

986,849

Jim Fan · Feb 16, 2024 · 5:00 AM UTC

Jim Fan

@DrJimFan

16 Feb 2024

Apparently some folks don't get "data-driven physics engine", so let me clarify. Sora is an end-to-end, diffusion transformer model. It inputs text/image and outputs video pixels directly. Sora learns a physics engine implicitly in the neural parameters by gradient descent through massive amounts of videos. Sora is a learnable simulator, or "world model". Of course it does not call UE5 explicitly in the loop, but it's possible that UE5-generated (text, video) pairs are added as synthetic data to the training set.

Jim Fan

@DrJimFan

15 Feb 2024

125

422

2,937

981,144

Jim Fan · May 5, 2024 · 4:18 PM UTC

Jim Fan

@DrJimFan

5 May 2024

Congrats to @Tesla_Optimus team on another stellar update! The video gives us a peek at their human data collection farm, which I believe is Optimus' biggest lead. What does it take to build such a pipeline? Optimus nailed all of the following: 1. Optimus hands are among the best 5-finger, dexterous robot hands in the world. It's got tactile sensing, 11 degrees of freedom (DOF) compared to many competitors with only 6-7 DOF, and robustness to withstand lots of object interactions without constant maintenance. 2. Teleoperation software: we can see that the human operators are wearing VR goggles and gloves. It is very non-trivial to set up the software to have first-person video streamed in and precise control streamed out, while maintaining extremely low latency. Humans are highly sensitive to even the smallest delay between their own motions and the robot's. Optimus has a fluid whole body controller that enacts the human poses in real-time. 3. Sizeable fleet: you need more than one robot to collect data in parallel, well-trained human contractors taking multiple shifts per day (preferably 24/7), and an on-call maintenance crew to make sure that the robots are always busy. That's a ton of operational complexity that academic research labs don't even think of. 4. Tasks & environments: it's equally important to figure out *what* to teleoperate. Currently, most such efforts are demo-driven: collect data on the tasks that you want to put into a social media video. But solving general-purpose robots requires us to think carefully about the distribution of tasks and environments. From 43"-51" in the video, we can see factory & household settings like moving batteries, handling laundry, sorting daily objects into shelves. It's an open-ended research question: if you only have the budget to collect training data for 1,000 tasks, what would you pick to maximize skill transfer and generalization? 
Closing thought: teleoperation is a necessary but insufficient condition to solve humanoid robotics. It fundamentally does not scale. More about this later.

125

393

2,954

708,342

Jim Fan · Mar 21, 2023 · 4:10 PM UTC

Jim Fan

@DrJimFan

21 Mar 2023

I can finally discuss something extremely exciting publicly. Jensen just announced NVIDIA AI Foundations: - Foundation Model as a Service is coming to enterprise, customized for your proprietary data. - Multimodal from day 1: text LLM is just one part. Bring your images, videos, even 3D data! Let's build *custom* multimodal LLM and generative models for your use case. - Exciting partners from day 1: Getty Images, Shutterstock, and Adobe! Don't lose sleep over copyrights any more. - Biology is a unique type of LLM we support: this is the first large-scale AlphaFold API/finetuning service! Powers up your drug discovery and research pipelines. 2023 is an inflection point. NVIDIA is going beyond a pure hardware provider and becoming an enterprise-first AI provider.

518

2,897

1,104,544

Jim Fan · Feb 4, 2025 · 5:08 PM UTC

Jim Fan

@DrJimFan

4 Feb 2025

We RL'ed humanoid robots to Cristiano Ronaldo, LeBron James, and Kobe Bryant! These are neural nets running on real hardware at our GEAR lab. Most robot demos you see online speed videos up. We actually *slow them down* so you can enjoy the fluid motions. I'm excited to announce "ASAP", a "real2sim2real" model that masters extremely smooth and dynamic motions for humanoid whole body control. We pretrain the robot in simulation first, but there is a notorious "sim2real" gap: it's very difficult for hand-engineered physics equations to match real world dynamics. Our fix is simple: just deploy a pretrained policy on real hardware, collect data, and replay the motion in sim. The replay will obviously have many errors, but that gives a rich signal to compensate for the physics discrepancy. Use another neural net to learn the delta. Basically, we "patch up" a traditional physics engine, so that the robot can experience almost the real world at scale in GPUs. The future is hybrid simulation: combine the power of classical sim engines refined over decades and the uncanny ability of modern NNs to capture a messy world.

129

452

2,944

566,483
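The "learn the delta" idea in the ASAP tweet above can be illustrated with a toy example. This is only a sketch under heavy assumptions and not the ASAP method itself: the 1-D dynamics, the `real_step` stand-in for hardware, and the linear least-squares residual (in place of a neural net) are all hypothetical choices made to keep the example small.

```python
def sim_step(x, a, dt=0.1):
    """Hand-engineered simulator: assumes an ideal actuator."""
    return x + a * dt

def real_step(x, a, dt=0.1):
    """Stand-in for real hardware (hypothetical): weaker actuator plus damping."""
    return x + 0.8 * a * dt - 0.05 * x

def fit_residual(transitions):
    """Least-squares fit of delta = w_x * x + w_a * a,
    where delta = real_next - sim_next for each logged transition."""
    sxx = sxa = saa = sxd = sad = 0.0
    for x, a, real_next in transitions:
        d = real_next - sim_step(x, a)
        sxx += x * x
        sxa += x * a
        saa += a * a
        sxd += x * d
        sad += a * d
    det = sxx * saa - sxa * sxa  # 2x2 normal equations
    w_x = (sxd * saa - sxa * sad) / det
    w_a = (sxx * sad - sxa * sxd) / det
    return w_x, w_a

def patched_step(x, a, weights):
    """Classical simulator plus the learned correction."""
    w_x, w_a = weights
    return sim_step(x, a) + w_x * x + w_a * a

# "Deploy on hardware and log data": here the log comes from real_step.
log = [(x, a, real_step(x, a))
       for x in (-1.0, 0.0, 1.0, 2.0)
       for a in (-1.0, 0.5, 2.0)]
weights = fit_residual(log)
```

After fitting, `patched_step` tracks `real_step` on states that were not in the log, which is the sense in which a learned residual "patches up" the classical engine.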

Jim Fan · Mar 14, 2023 · 5:15 PM UTC

Jim Fan

@DrJimFan

14 Mar 2023

GPT-4 is HERE. Most important bits you need to know: - Multimodal: API accepts images as inputs to generate captions & analyses. - GPT-4 scores 90th percentile on BAR exam!!! And 99th percentile with vision on Biology Olympiad! Its reasoning capabilities are far more advanced than ChatGPT. - 25,000 words context: allows full documents to fit *within a single prompt*. - More creative & collaborative: generate, edit, and iterate with users on writing tasks. - There're already many partners testing out GPT-4: Duolingo, Be My Eyes, Stripe, Morgan Stanley, Khan Academy ... even Government of Iceland!

553

2,747

878,379

Jim Fan · Sep 20, 2023 · 5:48 PM UTC

Jim Fan

@DrJimFan

20 Sep 2023

I think DALL·E 3 is not just a stance against MidJourney. It's actually a sneak peek of the upcoming, epic battle of massively multimodal LLMs, against DeepMind Gemini. Quote: "DALL·E 3 is built natively on ChatGPT". This is the key phrase. DALL·E 3's extraordinary language alignment is built on a solid textual GPT foundation. MidJourney doesn't really have much "reasoning brain", which is why so much prompt hacking is needed. Brain first, pixel second -> that's the way to build strong multimodal AI.

347

2,746

703,172

Jim Fan · Sep 4, 2023 · 5:10 PM UTC

Jim Fan

@DrJimFan

4 Sep 2023

This is likely the most significant lawsuit in AI history - its outcome would have far-reaching impact on the whole industry. The arguments get fairly philosophical. Quote: "The purpose of copyright law, OpenAI argued, is 'to promote the Progress of Science and useful Arts' by protecting the way authors express ideas, but 'not the underlying idea itself, facts embodied within the author’s articulated message, or other building blocks of creative,' which are arguably the elements of authors' works that would be useful to ChatGPT's training model." Copyright is a tricky issue that I don't have expert opinions on. But everyone in the field should read and follow the case: arstechnica.com/tech-policy/…. The other related and parallel cases are ongoing with Stability/MidJourney/any text-to-media businesses.

292

518

2,647

1,217,541

Jim Fan · Dec 23, 2024 · 5:16 PM UTC

Jim Fan

@DrJimFan

23 Dec 2024

These are not CGI. Reinforcement learning is so back. When operating on strings, it gives us o3. When operating on physical motors, it gives us a perfect humanoid backflip and a robot creature that out-maneuvers almost every animal on earth. RL is one of the only learning algorithms that can master both the world of bits and the world of atoms. Give me a reward function, and I shall move the world. 2025, Year of RL.

108

424

2,756

356,732

Jim Fan · Mar 15, 2023 · 9:40 PM UTC

Jim Fan

@DrJimFan

15 Mar 2023

The wife trick that used to convince ChatGPT no longer works for GPT-4 😅. It's arguable what true human alignment should be here.😆

108

255

2,656

557,537

Jim Fan · Apr 17, 2024 · 3:41 PM UTC

Jim Fan

@DrJimFan

17 Apr 2024

It took my brain a while to parse what's going on in this video. We are so obsessed with "human-level" robotics that we forget it is just an artificial ceiling. Why don't we make a new species superhuman from day one? Boston Dynamics has once again reinvented itself. Gradually, then suddenly.

153

369

2,702

480,536

Jim Fan · Mar 28, 2023 · 6:29 PM UTC

Jim Fan

@DrJimFan

28 Mar 2023

GPT-4's vision API isn't public yet, but something better is here. Genmo: a creative & multimodal chatbot that not only takes image as input, but also generates and EDITs images and videos. Unlike Midjourney, Genmo is an *interactive* assistant able to ask for your feedback and iterate, using your favorite programming language: English. genmo.ai/ Built by my friend @ajayj_, coauthor of the legendary paper "Denoising Diffusion Probabilistic Models" that started the diffusion revolution.

438

2,691

705,770

Jim Fan · Feb 16, 2024 · 5:50 PM UTC

Jim Fan

@DrJimFan

16 Feb 2024

I see some vocal objections: "Sora is not learning physics, it's just manipulating pixels in 2D". I respectfully disagree with this reductionist view. It's similar to saying "GPT-4 doesn't learn coding, it's just sampling strings". Well, what transformers do is just manipulating a sequence of integers (token IDs). What neural networks do is just manipulating floating numbers. That's not the right argument. Sora's soft physics simulation is an *emergent property* as you scale up text2video training massively. - GPT-4 must learn some form of syntax, semantics, and data structures internally in order to generate executable Python code. GPT-4 does not store Python syntax trees explicitly. - Very similarly, Sora must learn some *implicit* forms of text-to-3D, 3D transformations, ray-traced rendering, and physical rules in order to model the video pixels as accurately as possible. It has to learn concepts of a game engine to satisfy the objective. - If we don't consider interactions, UE5 is a (very sophisticated) process that generates video pixels. Sora is also a process that generates video pixels, but based on end-to-end transformers. They are on the same level of abstraction. - The difference is that UE5 is hand-crafted and precise, but Sora is purely learned through data and "intuitive". Will Sora replace game engine devs? Absolutely not. Its emergent physics understanding is fragile and far from perfect. It still heavily hallucinates things that are incompatible with our physical common sense. It does not yet have a good grasp of object interactions - see the uncanny mistake in the video below. Sora is the GPT-3 moment. Back in 2020, GPT-3 was a pretty bad model that required heavy prompt engineering and babysitting. But it was the first compelling demonstration of in-context learning as an emergent property. Don't fixate on the imperfections of GPT-3. Think about extrapolations to GPT-4 in the near future.

230

430

2,660

990,818

Jim Fan · May 17, 2024 · 5:01 PM UTC

Jim Fan

@DrJimFan

17 May 2024

AI is math. GPU is metal. Sitting between math and metal is a programming language. Ideally, it should feel like Python but scale like CUDA. I find two newcomers in this middle layer quite exciting: 1. Bend: compiles modern high-level language features to native multi-threading on Apple Silicon or NVIDIA GPU. Supports difficult constructs like - lambdas with full closure, unrestricted recursion and branches, folds, ADTs, etc. Bend compiles to HVM2, a thread-safe runtime implemented in Rust. All open-source: - github.com/HigherOrderCO/hvm - github.com/HigherOrderCO/Ben… 2. Mojo: a CUDA-flavored, Python-like language that executes at C speed. Mojo is conceptually lower level than Bend and allows you to have stronger control over exactly how the parallelism is done. Especially suited for coding modern neural net accelerations by hand. - modular.com/max/mojo - Llama2 in one Mojo source file: github.com/tairov/llama2.moj…

406

2,644

330,736

Jim Fan · Aug 16, 2023 · 4:31 PM UTC

Jim Fan

@DrJimFan

16 Aug 2023

There're few who can deliver both great AI research and charismatic talks. OpenAI Chief Scientist @ilyasut is one of them. I watched Ilya's lecture at Simons Institute, where he delved into why unsupervised learning works through the lens of compression. Sharing my notes: - Kolmogorov compressor is the theoretical shortest-length program that produces a dataset. SGD is a practical approximation of the Kolmogorov search that finds an implicit program embedded in the weights of a soft computer, i.e. big Transformers. - Unsupervised learning is about computing the conditional Kolmogorov complexity of a target dataset given an unlabelled corpus, i.e. K(Y|X) - Theory tells us that optimizing for K(X, Y), the joint complexity, is as good as K(Y|X). So simply throw all data into the mix, and "just compress everything". - Joint compression is maximum likelihood over the giant concatenated dataset. - Ilya cites iGPT, Chen et al. 2020, to illustrate the ideas. iGPT is an image compressor that learns to predict the next pixel using a 1D sequence model. This is a phenomenal lecture, very accessible, and sometimes quite entertaining. YouTube: piped.video/watch?v=AKMuA_TV… Lecture page: simons.berkeley.edu/talks/il…

415

2,667

822,296
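The note above that "joint compression is maximum likelihood" rests on a standard fact: an ideal entropy coder spends about -log2 p(x) bits on data x under model p, so shrinking the compressed size and raising the likelihood are the same objective. The sketch below illustrates this with a character-level unigram model; the toy strings and function names are assumptions for illustration and are not taken from the lecture.

```python
import math
from collections import Counter

def code_length_bits(data, prob):
    """Ideal code length of `data` under model `prob`: sum of -log2 p(symbol)."""
    return -sum(math.log2(prob[s]) for s in data)

def fit_unigram(data):
    """Maximum-likelihood unigram model: relative symbol frequencies."""
    counts = Counter(data)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

# "Throw all data into the mix": concatenate the unlabelled corpus X
# and the target data Y, then fit a single model on the whole thing.
x_unlabelled = "aaabaaab"
y_target = "aaab"
joint = x_unlabelled + y_target

uniform = {"a": 0.5, "b": 0.5}
fitted = fit_unigram(joint)

bits_uniform = code_length_bits(y_target, uniform)
bits_fitted = code_length_bits(y_target, fitted)
```

The model fitted on the concatenated data encodes the target in fewer bits than the uninformed uniform model, which is the small-scale version of "just compress everything".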

Jim Fan · Aug 1, 2023 · 5:15 PM UTC

Jim Fan

@DrJimFan

1 Aug 2023

I'm waking up to the prospect that in my prime years, I'll see both mainstream superconducting and AGI. The former will propel the latter, and the latter will propel every scientific breakthrough. These should've stayed in sci-fi for another 20 yrs. But somehow, they are eerily within reach. And we as a species will soon take these for granted and move on to the next thing that should've stayed in sci-fi.

ALT Image credit: waitbutwhy.com, Tim Urban

175

334

2,593

1,069,120

Jim Fan · Jun 28, 2023 · 4:26 PM UTC

Jim Fan

@DrJimFan

28 Jun 2023

Everyone should read the celebrated mathematician Terence Tao's blog on LLMs. He predicts that AI will be a trustworthy co-author in mathematical research by 2026, when combined with search and symbolic math tools. I believe math will be the first scientific discipline to see major breakthroughs enabled by AI, because math: ▸ can be expressed conveniently as a coding problem. Strings are naturally first-class citizens. ▸ can be rigorously verified by theorem provers like Lean, rather than relying on empirical results. ▸ does not require physical experiments like biology & medicine. Robotics isn't ready yet. We are already seeing big progress: ▸ LeanDojo (leandojo.org/) from my colleagues @NVIDIAAI & @Caltech is among the first steps towards this grand challenge. ▸ Last year, OpenAI used Lean to solve some math olympiad problems: openai.com/research/formal-m… ▸ ChemCrow is another example, but for chemistry. It integrates GPT-4 with professional tools like molecular synthesis planner and reaction prediction: arxiv.org/abs/2304.05376 ▸ Terence Tao's blog: unlocked.microsoft.com/ai-an…

ALT From leandojo.org

609

2,554

921,824
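To make the point above about rigorous verification concrete, here is a minimal Lean 4 sketch. It is only an illustration of what "checked by a theorem prover" means; the theorem name is made up for this example, and it is unrelated to the LeanDojo or OpenAI work mentioned in the tweet.

```lean
-- A statement and its machine-checked proof:
-- addition on natural numbers is commutative.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete arithmetic fact, verified by evaluation.
example : 2 + 2 = 4 := rfl
```

If either proof were wrong, Lean would reject the file, so an AI-generated proof can be accepted or refused without trusting the model that produced it.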

Jim Fan · Jul 25, 2025 · 4:58 PM UTC

Jim Fan

@DrJimFan

25 Jul 2025

I'm observing a mini Moravec's paradox within robotics: gymnastics that are difficult for humans are much easier for robots than "unsexy" tasks like cooking, cleaning, and assembling. It leads to a cognitive dissonance for people outside the field, "so, robots can parkour & breakdance, but why can't they take care of my dog?" Trust me, I got asked by my parents about this more than you think ... The "Robot Moravec's paradox" also creates the illusion that physical AI capabilities are way more advanced than they truly are. I'm not singling out Unitree, as it applies widely to all recent acrobatic demos in the industry. Here's a simple test: if you set up a wall in front of the side-flipping robot, it will slam into it at full force and make a spectacle. Because it's just overfitting that single reference motion, without any awareness of the surroundings. Here's why the paradox exists: it's much easier to train a "blind gymnast" than a robot that sees and manipulates. The former can be solved entirely in simulation and transferred zero-shot to the real world, while the latter demands extremely realistic rendering, contact physics, and messy real-world object dynamics - none of which can be simulated well. Imagine you can train LLMs not from the internet, but from a purely hand-crafted text console game. Roboticists got lucky. We happen to live in a world where accelerated physics engines are so good that we can get away with impressive acrobatics using literally zero real data. But we haven't yet discovered the same cheat code for general dexterity. Till then, we'll still get questioned by our confused parents.

144

545

2,505

397,915

Jim Fan · Jun 30, 2023 · 5:15 PM UTC

Jim Fan

@DrJimFan

30 Jun 2023

Google is hosting the first "Machine Unlearning" challenge. Yes you heard it right - it's the art of forgetting, an emergent research field. GPT-4 lobotomy is a type of machine unlearning. OpenAI tried for months to remove abilities it deems unethical or harmful, sometimes going a bit too far. Unlike deleting data from disk, deleting knowledge from AI models (without crippling other abilities) is much harder than adding. But it is useful and sometimes necessary: ▸ Reduce toxic/biased/NSFW contents ▸ Comply with privacy, copyright, and regulatory laws ▸ Hand control back to content creators - people can request to remove their contribution to the dataset after a model is trained ▸ Update stale knowledge as new scientific discoveries arrive Check out the machine unlearning challenge: ai.googleblog.com/2023/06/an…

511

2,286

588,570

Jim Fan · Nov 18, 2023 · 2:09 AM UTC

Jim Fan

@DrJimFan

18 Nov 2023

Here's my prediction of what's next. The infinite energy of @sama & @gdb cannot be contained. They will re-build Rome from the ashes with even greater sense of urgency. OpenAI just created its mightiest competitor, and we are all seeing it unfold in real-time. And it happened before. The Anthropic founding team, many of whom co-authored the watershed GPT-3 paper, split from OpenAI in 2021. It remains OAI's close No. 2 today. History is destined to repeat itself. I wish Sam and Greg a great battle ahead. The ship is unstoppable no matter where they sail.

130

250

2,511

999,922

Jim Fan · Nov 23, 2023 · 1:54 AM UTC

Jim Fan

@DrJimFan

23 Nov 2023

It’s pretty obvious that synthetic data will provide the next trillion high-quality training tokens. I bet most serious LLM groups know this. The key question is how to SUSTAIN the quality and avoid plateauing too soon. The Bitter Lesson by @RichardSSutton continues to guide AI development: there’re only 2 paradigms that scale indefinitely with compute: Learning & Search. It’s true in 2019 at the time of writing, true today, and I bet will hold true till the day we solve AGI. incompleteideas.net/IncIdeas…

137

285

2,484

1,567,707

Jim Fan · Jan 2, 2023 · 2:22 PM UTC

Jim Fan

@DrJimFan

2 Jan 2023

The launch of GPT-4 will be a predictably seismic event this year. But I can predict with high confidence what GPT-4 *cannot do*: It can’t cook spaghetti, play tennis, or build a lego treehouse. Robotics will be the last moat we conquer in the grand quest for AI 🤖🦾

149

429

2,396

482,415

Jim Fan · Nov 23, 2022 · 4:49 PM UTC

Jim Fan

@DrJimFan

23 Nov 2022

GPT3 is powerful but blind. The future of Foundation Models will be embodied agents that proactively take actions, endlessly explore the world, and continuously self-improve. What does it take? In our NeurIPS Outstanding Paper “MineDojo”, we provide a blueprint for this future:🧵

496

2,452

Jim Fan · Apr 18, 2024 · 5:07 PM UTC

Jim Fan

@DrJimFan

18 Apr 2024

The upcoming Llama-3-400B+ will mark the watershed moment that the community gains open-weight access to a GPT-4-class model. It will change the calculus for many research efforts and grassroots startups. I pulled the numbers on Claude 3 Opus, GPT-4-2024-04-09, and Gemini. Llama-3-400B is still training and will hopefully get even better in the next few months. There is so much research potential that can be unlocked with such a powerful backbone. Expecting a surge in builder energy across the ecosystem!

408

2,454

872,331

Jim Fan · Nov 30, 2023 · 6:53 PM UTC

Jim Fan

@DrJimFan

30 Nov 2023

One of the best tutorial-style repos since @karpathy's minGPT! GPT-Fast: a minimalistic, PyTorch-only decoding implementation loaded with best practices: int8/int4 quantization, speculative decoding, Tensor parallelism, etc. Boosts the "clock speed" of LLM OS by 10x with no model change! We need more minGPTs and GPT-Fasts in the open-source world! Created by the awesome @cHHillee from PyTorch team. Blog: pytorch.org/blog/acceleratin… Code: github.com/pytorch-labs/gpt-…

400

2,477

407,721
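Of the techniques listed in the GPT-Fast tweet above, int8 weight quantization is the easiest to show in a few lines. The sketch below is a generic symmetric, per-tensor scheme written in plain Python for illustration; it is not GPT-Fast's code, and the function names and sample weights are assumptions.

```python
def quantize_int8(weights):
    """Symmetric quantization: scale so the largest magnitude maps to 127."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 codes and the scale."""
    return [v * scale for v in q]

# Hypothetical weight row from a linear layer.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
```

Each weight is stored as one signed byte plus a shared float scale, and the rounding error per weight is at most half a quantization step. Real implementations usually keep one scale per output channel or per group rather than per tensor.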

Jim Fan · Apr 7, 2023 · 5:01 PM UTC

Jim Fan

@DrJimFan

7 Apr 2023

Why does generative AI struggle with hands? It is not a mystical Bermuda Triangle in the latent space. There're compelling reasons: 1. Data size (duh). Face pics are much more common than hand pics. Even when the whole body is shown, hands tend to occupy much smaller pixel real estate. 2. Lack of embodied understanding. This is a much deeper issue: AIs never use hands in the physical world, so they have to infer how hands look in various poses by superficial pattern matching. That's why AI can't even get the number of fingers correct, because hands are frequently occluded by tools or by themselves. More technically, diffusion doesn't have a working world model @ylecun 3. Low tolerance. It's fine to mess up the texture a little bit, you won't even notice. But getting hands wrong easily triggers the uncanny valley reaction. Reference in 🧵

152

362

2,356

1,222,300

Jim Fan · Jan 7, 2025 · 2:53 AM UTC

Jim Fan

@DrJimFan

7 Jan 2025

People ask me what's next. The GOAT points the way. Physical AI. Embodied Agents. Robotics. That's what's next.

311

2,377

224,940

Jim Fan · Nov 17, 2023 · 8:46 PM UTC

Jim Fan

@DrJimFan

17 Nov 2023

ChatGPT is now the new CEO.

223

2,287

311,842

Jim Fan · Dec 8, 2022 · 4:05 PM UTC

Jim Fan

@DrJimFan

8 Dec 2022

Why does ChatGPT work so well? Is it “just scaling up GPT-3” under the hood? In this 🧵, let’s discuss the “Instruct” paradigm, its deep technical insights, and a big implication: “prompt engineering” as we know it may likely disappear soon:👇

469

2,339

Jim Fan · Apr 28, 2023 · 3:14 PM UTC

Jim Fan

@DrJimFan

28 Apr 2023

Transformers are here to stay for a while. Not because it’s the absolute best architecture, but because the staggering amount of resources locks us into the existing weights. Starting another model evolution tree will literally burn forests to the ground (CO2). You only train once. In a sense, Transformers won the pretraining lottery. Big companies (OpenAI) have little economic incentive to deviate from their backbone model, and indie players can’t afford to train from scratch. Llama/OPT makes the commitment even stickier.

393

2,298

648,333