My opinions only here. 👨‍🔬 RS DeepMind 1.8y, Midjourney 1y 🧑‍🎓 DPhil AIMS 4.5y 🧙‍♂️ RE DeepMind 1y 📺 SWE Google 3y 🎓 TUM 👀 @nwspk

Oxford, England
Ever wondered why presenting more facts can sometimes *worsen* disagreements, even among rational people? 🤔 It turns out, Bayesian reasoning has some surprising answers - no cognitive biases needed! Let's explore this fascinating paradox quickly ☺️
20
71
374
106,593
Everyone's arguing about whether current AI models could be conscious or not, as if it was a scientific discussion, yet I don't even know what consciousness is 🥺
203
73
1,749
Replying to @JSTOR
JSTOR vs Aaron Swartz?
12
7
1,632
42,889
When I think of Jensen's inequality, I think of the following sketch which helps me remember it. Maybe this is useful for you, too. #ML #Mathematics
33
194
1,343
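For reference, Jensen's inequality states that for a convex f, f(E[X]) <= E[f(X)], with the direction flipped for a concave f. A minimal numerical sketch; the sample values and the choices f = exp and f = log are illustrative assumptions:

```python
import math


def jensen_gap(samples, f):
    """Return E[f(X)] - f(E[X]) for equally weighted samples."""
    mean_x = sum(samples) / len(samples)
    mean_fx = sum(f(x) for x in samples) / len(samples)
    return mean_fx - f(mean_x)


# Convex f (exp): the gap is non-negative.
gap_convex = jensen_gap([-1.0, 0.0, 2.0], math.exp)
# Concave f (log): the gap is non-positive.
gap_concave = jensen_gap([1.0, 2.0, 4.0], math.log)
```

The gap is zero only when f is linear on the support of X or when X is constant.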
I'm late to review the "Illusion of Thinking" paper, so let me collect some of the best threads by and critical takes by @scaling01 in one place and sprinkle some of my own thoughts in as well. The paper is rather critical of reasoning LLMs (LRMs):
🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching? The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks, but we found their fundamental limitations are more severe than expected. In our latest work, we compared each “thinking” LRM with its “non-thinking” LLM twin. Unlike most prior works that only measure the final performance, we analyzed their actual reasoning traces, looking inside their long "thoughts". Our analysis reveals several interesting results ⬇️ 📄 machinelearning.apple.com/re… Work led by @ParshinShojaee and @i_mirzadeh, and with @KeivanAlizadeh2, @mchorton1991, Samy Bengio.
34
224
1,259
435,097
Replying to @FedorovMykhailo
Don't use Telegram. Use Signal please. Telegram is not encrypted and is based in Russia 🤦
27
118
1,007
After 4 years, I'm kinda like: maybe I should have focused on ML engineering instead of research 😂
27
41
988
Biggest regret: not spending more time getting the basics right at the beginning of my PhD. I started going full-time on research projects right away, and now three years later I'm still playing catch-up with some stuff I should have focused on right away
24
47
825
I'm incredibly excited about all the amazing progress in ML lately 🤯 but part of me really wished I had picked a different field because I have no idea how to keep up anymore or know what to focus on 🥺😇
26
34
712
Excited to publish a Python package that turns @karpathy's "A Recipe for Training Neural Networks" into easy-to-use diagnostics code! 🔧 No more randomly poking around in your custom @PyTorch DNN to debug it. Get simple diagnostics for your neural nets 🫶 #PyTorch 1/
8
115
707
62,633
I got rejected from @DeepMind and @MetaAI for internships now. I guess I shouldn't have quit being an engineer five years ago 😅
47
11
601
How does one keep up with papers in ML while still finding time for foundational studies? Not even talking about doing active research. Feeling overwhelmed every day and like an imposter more and more 🙈
21
57
542
I want to share my latest (very short) blog post: "Active Learning vs. Data Filtering: Selection vs. Rejection." What is the fundamental difference between active learning and data filtering? Well, obviously, the difference is that: 1/11
14
74
553
94,579
NeurIPS 2024 PCs being a bunch of clowns 🤡 the state of ML 🙄 All you get back a month after raising a concern:
14
20
526
174,627
Incredibly excited to join DeepMind again 🥳 I'll be a researcher on the Deep Learning Engineering team under the illustrious @davidmbudden 🔥 I can't wait to get started ✨
44
1
526
38,573
A new paper review by me! I'm reviewing the fascinating "Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding" from @GoogleDeepMind It introduces a novel method for active data selection in large-scale visual pretraining. 📉🤖 1/10
2
85
497
71,753
My experience with using pandas to operate on dataframes is usually: 1. read docs 2. spend an hour to try to get something to work 3. give up 4. write the equivalent Python code in 10 min 5. move on with life
52
27
469
105,645
Do you ever have a model that uses @PyTorch and one that uses @TensorFlow, and you want to combine the two for end-2-end training without rewriting either? TfPyTh allows you to plug one into the other while propagating gradients for training 🎉 Code 👉 github.com/BlackHC/TfPyTh
5
126
471
Interesting take: I believe that arxiv is closer to how science and research originally worked and "official" peer reviews haven't worked that well (at least recently)
arXiv is a cancer that promotes the dissemination of junk "science" in a format that is indistinguishable from real publications. And promotes the hectic "can't keep up" + "anything older than 6 months is irrelevant" CS culture. >>
21
26
461
109,932
Have you wondered why I've posted all these nice plots and animations? 🤔 Well, the slides for my lectures on (Bayesian) Active Learning, Information Theory, and Uncertainty are online now! They cover quite a bit from basic information theory to some recent papers 🥳
11
59
461
43,247
We have achieved gold medal performance at the International Mathematical Olympiad 🥇 🥳 This is the first general-purpose system to do so through official participation and grading, and I'm thrilled to have contributed a little to this milestone in mathematical reasoning 🌈🫶
An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵
25
20
459
34,826
Jensen's inequality == not everyone has access to the latest @nvidia GPUs 🤓
7
45
430
47,889
- Roast me GPT-4V: No I can't - Yes you can. GPT-4V: Okay. Roast: "Ah, the classic 'I woke up like this' hairdo combined with an AI-themed t-shirt. You're really out here living the tech bachelor dream. Remember, even though you've got machine learning on your shirt, it doesn't mean your hair has learned how to style itself!" 😂
gpt-4V is brutal LMAO
15
25
404
225,490
Did you hear that? It's the sound of goalposts moving at supersonic speed
This is an unwise statement that can only make people confused about what LLMs can or cannot do. Let me tell you something: Math is NOT about solving this kind of ad hoc optimization problems. Yeah, by scraping available data and then clustering it, LLMs can sometimes solve some very minor math problems. It's an achievement, and I applaud you for that. But let's be honest: this is NOT the REAL Math. Not by 10,000 miles. REAL Math is about concepts and ideas - things like "schemes" introduced by the great Alexander Grothendieck, who revolutionized algebraic geometry; the Atiyah-Singer Index Theorem; or the Langlands Program, tying together Number Theory, Analysis, Geometry, and Quantum Physics. That's the REAL Math. Can LLMs do that? Of course not. So, please, STOP confusing people - especially, given the atrocious state of our math education. LLMs give us great tools, which I appreciate very much. Useful stuff! Go ahead and use them AS TOOLS (just as we use calculators to crunch numbers or cameras to render portraits and landscapes), an enhancement of human abilities, and STOP pretending that LLMs are somehow capable of replicating everything that human beings can do. In this one area, mathematics, LLMs are no match to human mathematicians. Period. Not to mention many other areas. Calling on my friend @ericweinstein and @GaryMarcus, who has been one of the few sane expert voices on these matters lately. 🙏 h/t @hellheff
21
13
394
40,457
It's six months since I've submitted my thesis and I still start feeling suicidal every single time I think about my PhD experience, esp the last year of it 😐 Thank God it's over, and I hope I'll reflect about it less in the future
22
7
371
87,788
How is software engineering going to change with LLMs? What if we could "implement" class interfaces automagically using LLMs? Presenting `llm-strategy`: a PoC package based on @langchain and @OpenAI's GPT that adds an llm_strategy decorator to Python: github.com/blackhc/llm-strat…
12
55
379
121,830
Very interesting ICLR **tiny** paper: openreview.net/forum?id=vHOO… It computes a loss for all possible subsets of the dataset at the same time which has a very elegant solution: softplus of the negative log likelihood per sample, which essentially drops outliers 🤯 @mtetelman
5
47
372
32,967
I think Europe and the UK need their own DeepMind-like AI lab which is not connected to the US in any form
40
15
357
56,991
Research questions I'm excited about: * More sample-efficient RL and the bitter lesson * Meta-cognitive abilities in models * Active learning and curriculum approaches * Automated scientific discovery Things I focus on a lot instead: Software engineering, testing infra, and developer experience to accelerate good research
11
11
346
27,624
🎉 New blog post on a better (visual!) intuition for information theoretic quantities (eg entropy and mutual information) 🎉 🔥 Lots of visualisations 🔥 👉 Based on Yeung's "A new outlook on Shannon's information measures" from 1991 📖 #oldiebutgoldie blackhc.net/blog/2019/better…
2
89
314
Is it common or specific to ML that researchers try to add more maths to their papers and complexify their contributions to get through reviews? It is very frustrating to have to parse complexity to find nuggets of simplicity that might not warrant a paper 🙄
28
23
316
Can't believe I made it in the end 🎊😇 Thanks to everyone at @UniofOxford, @ExeterCollegeOx and @OATML_Oxford for the at times stressful, often beautiful, and always inspiring and memorable time 🙏🫶
37
3
301
15,064
The state of conferences. 40 weaknesses with 40 questions. Weak reject, confidence 5. Death by sea lioning openreview.net/forum?id=kDhA…
If you want to read a very bad ICLR drama, here you go: openreview.net/forum?id=kDhA…
14
16
322
122,968
I haven't done a single novel thing in my PhD. I'm just very lucky that reviewers have no clue about prior art 😅
8
6
276
Why are people excited about this paper ("Neural Networks are Decision Trees", arxiv.org/abs/2210.05189)? TL;DR: The result is obvious and useless by itself. Slightly longer "hot" take below 1/4
Neural Networks are Decision Trees! Could this finally open up the black box of deep NNs? Find out in this video (w/ @Alex_Mattick ): piped.video/_okxGdHM5b8
8
25
279
Replying to @JakeWSimons
The "six Hiroshima bombs equivalent" argument is so dumb. All it does is show that Israel has been using precision ammunition to minimize collateral damage because if they had dropped six Hiroshima bomb equivalents indiscriminately on Gaza it would indeed be a parking lot now
10
9
287
6,516
I've passed my viva 🥳 Thanks to my examiners @maosbot & @sirbayes for the discussion and feedback! To @tom_rainforth for mentoring; to @joost_v_amersf, @JishnuMukhoti, @fbickfordsmith & @seb_far for our joint papers; to @yaringal for supervising 👨‍🏫 & @OATML_Oxford for the fun 🎉
39
7
283
24,261
My Ph.D. thesis (mostly on active learning and information-theoretic intuitions and approaches related to it) is finally on arXiv 🥳 I'm looking forward to finding and fixing many more typos in the future 😂
Advancing Deep Active Learning & Data Subset Selection: Unifying Principles with Information-Theory Intuitions. arxiv.org/abs/2401.04305
21
21
267
38,893
Why autonomous weapons are inevitable And what we can still do about it medium.com/@BlackHC/why-auto…
15
77
237
🔥 Has your PyTorch code ever crashed because it ran out-of-memory in CUDA, and you had to fiddle with batch sizes repeatedly? 🔥 What if we could just write code that adapted to the available memory instead of resorting to brittle hand-tuning? 🤯 👉 github.com/BlackHC/toma 🤗
9
56
255
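The idea of adapting to available memory can be sketched in plain Python as a retry loop that halves the batch size on failure. This is not toma's API: `run_with_adaptive_batch` and `fake_step` are hypothetical names, and `MemoryError` stands in for whatever out-of-memory error your framework raises.

```python
def run_with_adaptive_batch(step_fn, items, initial_batch_size):
    """Process items in batches, halving the batch size whenever step_fn raises MemoryError."""
    batch_size = initial_batch_size
    start = 0
    results = []
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(step_fn(batch))
        except MemoryError:
            if batch_size == 1:
                raise  # cannot shrink any further
            batch_size = max(1, batch_size // 2)
            continue  # retry from the same position with a smaller batch
        start += len(batch)
    return results, batch_size


def fake_step(batch, limit=3):
    # Hypothetical stand-in for a GPU step that runs out of memory above `limit` items.
    if len(batch) > limit:
        raise MemoryError("simulated out-of-memory")
    return [x * 2 for x in batch]


outputs, final_batch_size = run_with_adaptive_batch(fake_step, list(range(10)), initial_batch_size=16)
```

The batch size that finally works is returned so later calls can start from it instead of rediscovering it.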
Very happy & proud to share some research in Deep Bayesian Active Learning from @yaringal, @joost_v_amersf and me at @OATML_oxford 🎉🎉🎉🤗🤗🤗 oatml.cs.ox.ac.uk/blog/2019/…
1
64
245
arxiv.org/abs/2010.06610 is such an insane paper and idea 🤯
5
34
250
24,499
Mandatory ELBO derivation in any lecture series: (I think I finally understand the unnecessarily confusing derivation in the VAE paper 😅)
12
24
233
21,355
Whoop whoop, I got my first single-author paper accepted 🎉🎉🎉
11
1
239
26,661
Replying to @bendreyfuss
What a banger 🤯
2
18
231
7,210
Intuition why adding Gaussian noise to parameters is nice for optimization: when we integrate/marginalize over the noise, we convolve/blur the loss surface with a Gaussian kernel -> making it smoother
11
16
237
66,186
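The smoothing intuition can be checked numerically: averaging the loss over Gaussian parameter noise matches the loss convolved with a Gaussian kernel. A minimal sketch, assuming a toy 1D loss (a quadratic bowl plus wiggles) chosen because its Gaussian convolution has a closed form; all names are illustrative.

```python
import math
import random


def loss(theta):
    # Toy loss (an assumption for illustration): quadratic bowl plus high-frequency wiggles.
    return theta ** 2 + math.cos(5.0 * theta)


def smoothed_loss_mc(theta, sigma, num_samples=100_000, seed=0):
    """Monte Carlo estimate of E[loss(theta + eps)] with eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += loss(theta + rng.gauss(0.0, sigma))
    return total / num_samples


def smoothed_loss_exact(theta, sigma):
    # Closed-form Gaussian convolution of this toy loss: the bowl only shifts
    # by sigma^2, while the wiggle is damped by exp(-(5 * sigma)^2 / 2).
    damping = math.exp(-0.5 * (5.0 * sigma) ** 2)
    return theta ** 2 + sigma ** 2 + damping * math.cos(5.0 * theta)


theta, sigma = 0.3, 0.5
mc_estimate = smoothed_loss_mc(theta, sigma)
exact_value = smoothed_loss_exact(theta, sigma)
```

With sigma = 0.5 the wiggle amplitude shrinks to about 4% of its original size, while the bowl keeps its shape.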
This is one of the best papers I have read in a while. It contains a crazy amount of insights and ideas 🤯
✨🎨🏰Super excited to share our new paper Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness Inspired by biology we 1) get adversarial robustness + interpretability for free, 2) turn classifiers into generators & 3) design attacks on vLLMs 1/12
32
240
33,932
Pretty amazing: they're lowering PyTorch to JAX using a custom torch tensor type and dispatch overrides (all natively supported by PyTorch). It's both amazing and I imagine painful to debug, but great that it works
Announcing the completely reimagined vLLM TPU! In collaboration with @Google, we've launched a new high-performance TPU backend unifying @PyTorch and JAX under a single lowering path for amazing performance and flexibility. 🚀 What's New? - JAX + Pytorch: Run PyTorch models on TPUs with no code changes, now with native JAX support. - Up to 5x Performance: Achieve nearly 2x-5x higher throughput compared to the first TPU prototype. - Ragged Paged Attention v3: A more flexible and performant attention kernel for TPUs. - SPMD Native: We've shifted to Single Program, Multi-Data (SPMD) as the default, a compiler-centric model native to TPUs for optimal execution. Dive deep into the new architecture and see the performance benchmarks in our latest blog post! blog.vllm.ai/2025/10/16/vllm… #vLLM #TPU #JAX #PyTorch #AI #OpenSource
6
11
245
27,111
Can we say that LLMs can't code bc of this paper. Is it fair? No tool use was allowed at all 🤯 How many people can write correct code like that, without running it once to debug or find typos? Even then, the models are in the top 1.5% of human coders. Bad news, indeed 😬
15
12
243
32,918
My dream work-life would be 50% basic research and 50% applied engineering work 🤗 And 50% reading papers and books 😅🫠
11
20
227
29,179
Working through my thesis corrections and GitHub Copilot gracefully auto-completed both the sentence and auto-generated a comment from my supervisor 😂
6
6
226
32,199
This is too real 😂 I've spent the last days reading through the chapters in PML
7
23
228
25,148
Why does Google Doc still not support LaTeX equations? 😭😭😭
10
5
235
19,731
TRM also provides an intuition that can be directly applied to reasoning LLMs (and RL): in the TRM algorithm, we can view z as thoughts and y as the final output (which is compared to some golden solution during training to compute a reward). The `latent recursion` is a refinement (self-critique) cycle of the thoughts with a final update to the solution. The point of `deep recursion` is that we need not take gradients through all steps; it is sufficient to focus on the last refinement cycle. Now the intuition is clear that for LLMs, we could repeatedly apply a self-improvement operator, starting from empty thoughts and solution (z, y = \empty) or an initial draft. We then compute rewards by comparing the proposed solution to a golden solution and only backprop through the last refinement cycle, but this should be sufficient to improve the refinement operation. Deep supervision actually has the benefit that bad reasoning will amplify across so many steps, so the final gradient will provide a stronger signal to rein that in
New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M parameters neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs. Blog: alexiajm.github.io/2025/09/2… Code: github.com/SamsungSAILMontre… Paper: arxiv.org/abs/2510.04871
9
23
233
25,663
Replying to @paulpowlesland
It's a very UK thing that you press the button and by the time it switches to green, there is no more traffic anyway 😂
3
1
205
21,778
Different PyTorch versions changing the significance of my results was the world I always dreamt of living in 🙈😅
10
4
209
Singapore is the most impressive city I've ever visited. Way more than NYC or SF. I'm so happy that ICLR (& AABI) decided on Singapore, and I hope we can avoid the US for a bit 🙏 big thanks to my friends for showing me around and to all the awesome people I met and talked to 😊
14
6
225
19,602
My current belief is that AI capabilities are more strongly correlated with compute availability than with research itself. Is this wrong?
51
12
204
60,028
Why are the NeurIPS reviews released on a Friday and then people are only given 3.5 working days to write rebuttals? Do we care about work-life balance at all? 🤔🍹
9
19
206
Isn't this the whole OpenAI Zürich office? 😂
Scoop: Meta has poached three OpenAI researchers: Lucas Beyer, Alexander Kolesnikov and Xiaohua Zhai, according to people familiar with the matter. An OpenAI spox confirmed the three have left the company.
12
5
214
35,069
Several recent papers connect the generalization error of a model to the model’s prediction disagreement. And oh no, I've taken a look at one of them, an ICLR 2022 spotlight, in more detail 🔥 And published my thoughts in TMLR 🥳 1/16 openreview.net/forum?id=oRP8…
3
30
204
My personal bitter lesson is that my random hot take tweets get more views than any of my papers or blog posts that I spend months on 😅🫠
17
4
193
17,012
After PhD: would I do it again? No not like that. Was it worth it? No not really. Would I recommend it? YMMV but prob not 😅
25
6
183
74,471
Replying to @martinmbauer
It also appears in machine learning a lot, just not sure if anyone has specifically recognized it as such. Eg. negative entropy is the log product integral over density using the density measure. It also appears in PAC-Bayes equations a lot
5
6
178
20,173
The most effective way to achieve better performance is through pre-training of RL. This unlocks a lot of high-quality data. Right now, pretraining on graduate physics or maths texts is allowed the same compute as text with low information density. The model cannot predict those tokens well right away. Allowing additional thinking tokens during pretraining enables the model to extract a lot more signal from such data
Are you ready for web-scale pre-training with RL ? 🚀 🔥 New paper: RLP : Reinforcement Learning Pre-training We flip the usual recipe for reasoning LLMs: instead of saving RL for post-training, we bring exploration into pretraining. Core idea: treat chain-of-thought as an action. Reward it by the information gain it provides for the very next token: This gives a verifier-free, dense reward on ordinary text with no task checkers, no labels, no filtering. Why this matters ? * 🧠 Models think before predicting during pretraining, not just after alignment. * 📈 Position-wise credit at every token = stable signal at full web-scale. * 🔍 No proxy filters or “easy-token” heuristics. Trains on the entire stream. Results: On the 8-benchmark math+science suite (AIME’25, MATH-500, GSM8K, AMC’23, Minerva Math, MMLU, MMLU-Pro, GPQA): • Qwen3-1.7B-Base: RLP improves the overall average by 24% ! • Nemotron-Nano-12B-v2-Base: RLP improves the overall average by 43% ! 📄Paper: tinyurl.com/rlp-pretraining ✍️Blog: research.nvidia.com/labs/adl… #AI #LLM #ReinforcementLearning #ChainOfThought #Pretraining #RLP
1
20
184
23,228
Replying to @NoFarmsNoFoods
Yep, totally support this. Why would anyone not support this?
3
172
4,662
1/ I just read the fascinating GaLore paper on memory-efficient LLM training using gradient low-rank projection. Kudos to the authors for this insightful work! My TL;DR and some thoughts below (as a little paper review) 🧵
For the first time, we show that the Llama 7B LLM can be trained on a single consumer-grade GPU (RTX 4090) with only 24GB memory. This represents more than 82.5% reduction in memory for storing optimizer states during training. Training LLMs from scratch currently requires huge computational resources with large memory GPUs. While there has been significant progress in reducing memory requirements during fine-tuning (e.g., LORA), they do not apply for pre-training LLMs. We design methods that overcome this obstacle and provide significant memory reduction throughout training LLMs. Training LLMs often requires the use of preconditioned optimization algorithms such as Adam to achieve rapid convergence. These algorithms accumulate extensive gradient statistics, proportional to the model's parameter size, making the storage of these optimizer states the primary memory constraint during training. Instead of focusing just on engineering and system efforts to reduce memory consumption, we went back to fundamentals. We looked at the slow-changing low-rank structure of the gradient matrix during training. We introduce a novel approach that leverages the low-rank nature of gradients via Gradient Low-Rank Projection (GaLore). So instead of expressing the weight matrix as low rank, which leads to a big performance degradation during pretraining, we instead express the gradient weight matrix as low rank without performance degradation, while significantly reducing memory requirements. @jiawzhao @BeidiChen @tydsh
4
30
170
53,037
Re memory layers: Shouldn't memory layers include a sink (ie. one memory slot with score 0 which is always included that can be used as none-of-the-above)? Performing softmax on the top-k means that the memory layer can never defer, so you will always grab some random memory locations even if they're not very relevant
🧠 How can we equip LLMs with memory that allows them to continually learn new things? In our new paper with @AIatMeta, we show how sparsely finetuning memory layers enables targeted updates for continual learning, w/ minimal interference with existing knowledge. While full finetuning and LoRA see drastic drops in held-out task performance (πŸ“‰-89% FT, -71% LoRA on fact learning tasks), memory layers learn the same amount with far less forgetting (-11%). 🧡:
6
10
178
22,450
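The sink suggestion can be sketched as a softmax over the top-k memory scores plus one extra slot whose score is fixed at 0. This is an illustrative sketch, not the paper's implementation: the function name and the example scores are assumptions, and the sink's value vector (for example all zeros) is left out.

```python
import math


def topk_softmax_with_sink(scores, k):
    """Softmax over the top-k scores plus an always-included sink slot with score 0.

    Returns (indices, weights, sink_weight); sink_weight is the probability mass
    assigned to "none of the above".
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    logits = [scores[i] for i in top] + [0.0]  # last entry is the sink
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    normalizer = sum(exps)
    weights = [e / normalizer for e in exps]
    return top, weights[:-1], weights[-1]


# All memories are poor matches (very negative scores): the sink absorbs the mass.
_, irrelevant_weights, irrelevant_sink = topk_softmax_with_sink([-8.0, -9.0, -7.5, -10.0], k=2)
# One memory is a strong match: the sink gets almost nothing.
relevant_idx, relevant_weights, relevant_sink = topk_softmax_with_sink([6.0, -9.0, 0.5, -10.0], k=2)
```

Without the sink, the first case would split all of the weight between two irrelevant slots.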
🎉🎉Happy & proud to share some research into Information Bottlenecks from @yaringal, @clarelyle and me at @OATML_oxford 🎉🎉 We provide intuition and practical IB objectives for modern DNN architectures, like ResNets. Check it out on arXiv 👉arxiv.org/abs/2003.12537
3
36
168
Yet again feeling slightly overwhelmed by the speed of progress in ML. Kinda wish I could go back in time a few years and re-prioritize certain studies ^^'
3
4
165
For what it's worth, I know lots of folks who essentially work as research scientists without PhD, and reflecting on my own journey, I could have done as much research (if not more with more compute) if I hadn't left to do a PhD in academia ^^' That said, doing a PhD has its own perks and benefits ofc
HOT TAKE: Reality is, you can't actually work in top-quality ML research labs without a PhD. Top research labs still look for people with PhDs and excellence in maths, stats, PyTorch, neural networks, and CUDA kernels. In India, quality ML research labs are virtually nonexistent. Most good research labs are in the US/UK and China. Implementing papers and working on T4 Colab is cool, but you won't cross the threshold to become a researcher. 99% of ML people belong to the applied side, which has better practical perks: - MNCs or SF startups - You can switch and get promoted every 1.5 years - You can move to product management or CTO - All you need is hands-on experience and not many research papers - Cashflow is best I really respect people who code research papers, but how long will you wait for your breakthrough? In 3 months, research evolves, and you're following it without actually building anything. Stop following blindly! The world's best research labs pick only from top universities, not because you've implemented papers and posted on X! Either go for a PhD outside India or stick to the applied ML side. The job market is saturated and will remain so because we're evolving post-COVID. On the other hand, no startup or research lab thinks about you. You must focus on your growth and money first, then look for impact.
10
4
174
30,802
I'm sorry for any hurt feelings for calling NeurIPS PCs clowns and pointing out an apparent domain conflict of interest. I didn't mean any PC individually or personally, but the organization and its (lack of) processes. Sadly, my sentiment was warranted - even if the phrasing might not be to everyone's liking. So, to avoid further tone policing and suggestions to remove the tweet, let me rephrase it to remove any personal mention: NeurIPS is a clown show ("a comically shambolic state of affairs"), and I'm disappointed by the unprofessional official response and the lack of seriousness at the biggest ML conference. I wonder if my takeaway and recommendations should be to take such papers as the role models they are presented as. The incentives are clear: Publish in an area with much hype and fewer knowledgeable reviewers; do not worry about attribution; put all of your related work section in the appendix; if someone complains informally, always agree but never act or only to the bare minimum; always wait until after the conference (or at least after the decision notice) to address anything; and if worst comes to worst and someone complains formally, fear not because there are no good processes anyway - everybody knows each other, and there are too many other incentives for anyone to be a party pooper. The ML community has neither teeth nor an appetite for academic integrity. There are too many things that taste sweeter.
NeurIPS 2024 PCs being a bunch of clowns 🤡 the state of ML 🙄 All you get back a month after raising a concern:
6
11
166
68,537
If you're not interested in your PhD topic and projects at all, what should you do?
49
6
148
Replying to @lastpositivist
It's about the University failing the student by not raising issues earlier and other academics disagreeing with that assessment in the first place. You are not supposed to fail a student at confirmation for things that should have been raised at the transfer of status viva
2
2
143
12,731
Somebody please explain high-dimensional embeddings to me and what things look like in those spaces 😅
38
7
156
72,719
I might be finishing my PhD within the next half year. What does one do after? I guess the time to apply for stuff was a few months ago 😅
30
1
156
94,653
Publishing a paper at ICML costs 650 USD, which is cheaper than some journals, but it cannot beat TMLR, which is free and comes with higher-quality reviews 🥳
7
5
164
25,561
Replying to @tunguz

[GIF: Forrest Gump Miracles]

141
59,422
Due to popular demand: for example, revising basic maths some more again (LA, calculus), some more ML basics (kernel methods, SVMs, etc), old DL papers. But also orthogonally: for various projects, spend more time on lit reviews/prior art and playing around with the baselines...
4
3
150
Finally catching up on MoE with @finbarrtimbers's great posts on this (link below). My thoughts on MoE and objectives:
1. Instead of the weirdly unprincipled additional losses, one can simply maximize the mutual information I[E;T], where E is the expert index and T is the token: I[E; T] = H[E] - H[E | T], so maximizing the mutual info maximizes the entropy of expert selection without knowing the token H[E], ie all experts selected uniformly across all data, while it minimizes the entropy knowing the token H[E | T], ie as one-hot as possible for a given token, equivalent to K=1.
2. The softmax(top-K of router logits + normal noise) is almost equivalent to a double softmax: Taking the top-K (router logits + Gumbel noise) is equivalent to sampling from softmax(router logits) K times w/o replacement. Applying the softmax to those samples simply distributes the credit accordingly between the top-K chosen experts. A potentially cleaner formulation would simply always use a full mixture and only look at the top-K sampling approach etc as performance optimizations.
3. The "router Z-loss" seems overcomplicated. Z seems to stand for the partition constant of the induced categorical distribution of the logits. The Z loss does not affect router predictions as it affects all expert logits in the router equally, and it is motivated by numerical stability. Instead of regularizing with Z loss explicitly as a loss, one could also simply adapt the bias of the router network and shift it by the mean logit activations of a training batch. Same effect and no loss needed.
4. Why do we use MoE only for FFNs and not for attention? MoE for QKV or at least for the Q matrices would seem quite valuable to make attention token-specific and either save FLOPs or get better attention for same FLOPs. Mixture of Depths seems to look at that finally.
3
23
162
53,658
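The mutual-information objective from point 1 can be computed directly from router probabilities. A minimal sketch, assuming tokens in a batch are weighted uniformly and that each row holds p(e | t) for one token; the function names and the three toy routers are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def router_mutual_information(router_probs):
    """I[E; T] = H[E] - H[E | T] for rows p(e | t), with tokens weighted uniformly."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # Marginal expert usage across the batch gives H[E].
    marginal = [sum(row[e] for row in router_probs) / num_tokens for e in range(num_experts)]
    # Average per-token entropy gives H[E | T].
    conditional = sum(entropy(row) for row in router_probs) / num_tokens
    return entropy(marginal) - conditional


balanced_one_hot = [[1.0, 0.0], [0.0, 1.0]]  # each token commits to its own expert
collapsed = [[1.0, 0.0], [1.0, 0.0]]         # every token picks expert 0
indifferent = [[0.5, 0.5], [0.5, 0.5]]       # router is uniform for every token

mi_balanced = router_mutual_information(balanced_one_hot)
mi_collapsed = router_mutual_information(collapsed)
mi_indifferent = router_mutual_information(indifferent)
```

Only the balanced, one-hot router reaches the maximum of log(2) nats for two experts. The collapsed and the indifferent routers both score zero, which covers the two failure modes the auxiliary losses target.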
Replying to @miniapeur
Had to check: it is real but book-bait... uses "a one-sided generalized derivative called a subdifferential" instead of derivatives link.springer.com/book/10.10…
5
4
143
7,133
“Stochastic Batch Acquisition: A Simple Baseline for Deep Active Learning” has been published @TmlrOrg 🎉 My last paper with @OATML_Oxford and the amazing @seb_far*, @parmi_atg, @anndvision, @fbranchaud1, and @yaringal Some details and a blog post below:
2
15
155
26,495
So riding a bicycle requires petrol bc your Amazon parcels are delivered by truck? Same logic 🙄 I wouldn't trust a founder making such an argument with anything
2
155
8,603
ML design interview question @ @Waymo: they log a lot of online telemetry data, too much to transmit it all. only some of that data is interesting when run in simulation later because it might lead to divergent and wrong behaviour by agents in the simulation.
4
5
154
arxiv.org/pdf/1905.12957.pdf “Neural Entropic Estimation: A faster path to mutual information estimation” 👈 this paper is beautiful It derives the Donsker-Varadhan representation en passant using simple straightforward steps to get to MINE’s estimator...
1
26
150
A small info-theory thread (or at least food for thought): Why is the Bayesian Model Average the best choice? Really why? I'll go through a naive argument (anyone has better references?), simple lower-bounds and decompositions, and pitch a "reverse mutual information" 1/15
4
22
151
17,495
We are though? The overall trajectory is still one of continued global mass extinction, and this is putting lipstick on a pig
8
142
2,988
Replying to @kellerjordan0
I'm bored, so let's examine how the two losses are maybe similar:
d_i = ||w_i - x||^2 = -2 w_i . x + ||w_i||^2 + ||x||^2
-log d_i \approx 1 - d_i
So then: 1/d_i = exp(-log d_i) \approx exp(2 w_i . x - ||w_i||^2 - ||x||^2 + 1) = exp(2 w_i . x + C_i) where C_i = - ||w_i||^2 - ||x||^2 + 1
The harmonic loss probability looks approximately like a softmax:
1/d_i / (sum_j 1/d_j) \approx exp(2 w_i . x + C_i) / (sum_j exp(2 w_j . x + C_j))
Softmax is invariant to shifts, so what if the C_j are approximately constant? Empirically, "Understanding Softmax Confidence and Uncertainty" by Pearce et al. (2021) [1] argues that all w_i are about the same magnitude for optimal decision boundaries when trained using CE loss (so assuming we train the model for long enough). Then, ||w_i|| \approx const, and all C_i are approx const for a given x. Thus:
exp(2 w_i . x + C_i) / (sum_j exp(2 w_j . x + C_j)) \approx exp(2 w_i . x + C_i) / (sum_j exp(2 w_j . x + C_i)) = exp(2 w_i . x) / (sum_j exp(2 w_j . x))
This is the same as the regular softmax in the cross-entropy loss with temperature 1/2, so slightly sharper. What does this tell us? Not much, but I was bored, and this was fun 😊
---
[1] arxiv.org/abs/2106.04972
10
6
150
11,841
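The approximation can be sanity-checked numerically in the regime the derivation assumes: class vectors w_i of equal norm and squared distances d_i close to 1, so that the first-order expansion of log holds. A minimal sketch with made-up unit-norm weights and a small input x; all values are illustrative assumptions.

```python
import math


def harmonic_probs(weights, x):
    """p_i proportional to 1 / d_i with d_i = ||w_i - x||^2, as in the derivation."""
    inverse_d = [1.0 / sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    total = sum(inverse_d)
    return [v / total for v in inverse_d]


def softmax_probs(weights, x, scale=2.0):
    """Softmax over scale * (w_i . x); scale=2 corresponds to temperature 1/2."""
    logits = [scale * sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


weights = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]  # equal norms, so the C_i coincide
x = (0.05, 0.02)                                 # small x keeps every d_i near 1
p_harmonic = harmonic_probs(weights, x)
p_softmax = softmax_probs(weights, x)
max_gap = max(abs(a - b) for a, b in zip(p_harmonic, p_softmax))
```

The two distributions agree to roughly three decimal places here. The gap grows as x moves the d_i away from 1, where the linearization of log stops being accurate.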
Replying to @ForensicArchi
The Forensic Architecture report appears to use a Russian rocket impact in Ukraine as evidence of an Israeli artillery impact in Gaza. The cited picture they use here fairly clearly shows the remains of a Russian 122mm Grad rocket that hit Kharkiv Oblast last year.
2
2
129
7,632
Finally, another paper summary from me for the amazing "Limitations of the Empirical Fisher Approximation for Natural Gradient Descent" by Kunstner, @lukas_balles & @PhilippHennig5 notion.so/Limitations-of-the…
3
27
139
Very happy to announce that my reproduction "Does ‘Deep Learning on a Data Diet’ reproduce? Overall yes, but GraNd at Initialization does not" has been published in @TmlrOrg 🥳 1/5
7
11
149
41,244
Replying to @suchenzang
Can't wait for the series about it: Game of Flops
4
148
9,840
Kinda depressing when you keep getting rejection emails for internships 10 months later without even being asked for an interview 😐
18
128
47,436
Weirdly, I'm rooting for China somehow. How did that happen?
28
1
135
11,506
A lot of finance bros are discovering that they wanted to do AI research all along right now
3
2
137
9,518
Replying to @pli_cachete
Nah it's a human em-dash bc real em-dashes don't have whitespace around them, and an LLM would know better
15
131
24,992