Production Engineer / ex-Meta

London
I was really excited that Mojo became publicly available and was thinking about which project I could implement to learn Mojo concepts. Since I had already ported llama2.c to pure Python, I decided: why not try to port llama2.py to Mojo now 😀 And here is what I got...
9
62
558
346,046
The out-of-the-box features of @Modular_AI's Mojo are just incredible. We applied unrolling and now llama2.🔥 outperforms @ggerganov's llama.cpp by almost 20% in CPU inference speed. github.com/tairov/llama2.moj…
11
46
355
192,984
BREAKING: I've implemented a prototype of a cutting-edge Q-Learning algorithm in Mojo 🔥, and now it's working 35,000x faster than any existing (!) implementation. Thanks to Mojo's incredible feature that lets you transparently import any Python module! I really hope this breakthrough won't lead to another wave of OAI leadership layoffs, though I'm not ruling this out, because it is an amazing leap forward. Check this out: github.com/tairov/QStarLearn…
14
29
228
106,863
llama2 inference in pure Mojo 🔥 github.com/tairov/llama2.moj… I found the Mojo SIMD primitives a really interesting feature, since they helped improve the pretty awful performance of the Python solution almost 250x.
2
31
178
21,787
Llama2.mojo performance on Mac is right up there with llama.cpp (!!!), and even outperforms plain C in many cases. This is insane!
8
18
137
23,139
Replying to @tldraw
INSANE !
5
12
115
15,267
Replying to @tairov @karpathy
5
9
110
11,232
Internally I used vectorisation helpers for matmul, so now the Mojo solution can beat the original llama2.c by @karpathy by 20%. I think there is still some room for further improvements.
3
3
59
9,392
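To illustrate what vectorising matmul means here, a minimal Python sketch, not the actual llama2.🔥 code: the matrix-vector product from llama2.c computed one element at a time, and the same product with the inner loop processed in fixed-width chunks, the way a SIMD unit would. The function names and the `width` parameter are hypothetical, and pure Python gets no real speedup from this; it only shows the loop structure.

```python
def matmul_scalar(w, x, d, n):
    """xout[i] = sum_j w[i*n + j] * x[j], with w stored row-major as a flat list."""
    out = [0.0] * d
    for i in range(d):
        acc = 0.0
        for j in range(n):
            acc += w[i * n + j] * x[j]
        out[i] = acc
    return out


def matmul_chunked(w, x, d, n, width=4):
    """Same product, with the inner loop split into `width`-wide lanes.

    `width` is a hypothetical stand-in for the hardware SIMD width.
    """
    out = [0.0] * d
    full = n - n % width  # last index covered by whole chunks
    for i in range(d):
        lanes = [0.0] * width  # one partial sum per lane
        for j in range(0, full, width):
            for k in range(width):
                lanes[k] += w[i * n + j + k] * x[j + k]
        acc = sum(lanes)  # horizontal reduction of the lanes
        for j in range(full, n):  # scalar tail for the leftover elements
            acc += w[i * n + j] * x[j]
        out[i] = acc
    return out
```

Both functions return the same values for small integer inputs; with general floats the chunked version may differ in the last bits because the additions happen in a different order.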
I've got early access to the Mojo SDK 🔥 for Mac from @Modular_AI. And of course I always wanted to run baby llama inference in pure Mojo on Apple Silicon... True story, not only in Mojo 😉 So far, the results are mind-blowing! Here are some benchmarks
6
2
37
3,759
Another milestone! We're baselining Mojo with Mojo 😉
Mojo 🔥 0.5.0 is released! 🚀 Even more epic updates unleashed! 😱 Checkout the highlights ⬇️ or read the full changelog here & happy weekend hacking! 👩🏼‍💻 ➡️ bit.ly/463X9eq
1
2
35
3,145
GitHub's integration of GPU enabled M1 Apple Silicon hosts for action runners may have flown under the radar, but its implications are vast. It's a strong indicator of Apple Silicon's rising adoption among developers, hinting at a future where M2 and M3 become central to the development ecosystem. #GitHubUniverse #AppleSilicon
1
3
24
1,817
Thanks @Modular_AI , 🔥 finally my open source contributions materialized in the form of nice color merch T-shirt 😎 👕
1
2
24
2,325
Wow! This is exciting! Thanks @Modular_AI for appreciating my efforts & congrats on the public release of Mojo! Seems that my port of llama2 inference is truly a "First prober ai written in Mojo" 😀
The first crack at llama2.🔥 is here 🚀 A Mojo 🔥 community member - Mojician - did a simple port from Python to Mojo, and shows its already 20% faster than Karpathys llama.c implementation 😱 How much faster can it go? 📈
1
20
2,267
Replying to @Modular @Modular_AI
Nice work! llama2.🔥 bumped the Mojo version to 0.5.0, and we got a few % boost. github.com/tairov/llama2.moj…
18
776
Replying to @andrew_n_carr
Oh, man! I had so much fun. I encourage you to finish & share it anyway. I'd love to check it out!
18
6,010
Replying to @Modular @Modular_AI
Exciting! I got early access to the Mojo SDK (Mac) a week ago and compared its performance on baby-llama inference: Mojo vs Rust, C, Cpp, Go, Zig, and Julia. In total, 12 implementations in 7 languages x 3 models x 30 rounds. Check this out: engiware.com/benchmark/llama…
2
4
19
2,532
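The benchmark shape described above (12 implementations x 3 models x 30 rounds) can be sketched as a simple run matrix. This is a minimal illustration, not the author's benchmarking framework: `run_matrix` and the `measure` callback are hypothetical names, and `measure` is assumed to return a tok/s figure for one run.

```python
from itertools import product
from statistics import mean


def run_matrix(implementations, models, rounds, measure):
    """Average `measure(impl, model)` over `rounds` runs for every pair.

    `measure` is a hypothetical callback that runs one inference and
    returns its throughput in tokens per second.
    """
    results = {}
    for impl, model in product(implementations, models):
        samples = [measure(impl, model) for _ in range(rounds)]
        results[(impl, model)] = mean(samples)
    return results
```

With 12 implementations, 3 models and 30 rounds, `measure` is called 12 * 3 * 30 = 1080 times.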
My shock in the shock! Appreciate @lexfridman highlighting our efforts – exciting to receive such feedback!
Replying to @clattner_llvm
Wow, impressive!
1
2
16
1,430
Another moment of fame, now on livestream with awesome @Modular_AI team 😎! Really exciting week! piped.video/watch?v=bMb7ieOD…
1
3
15
565
Baking something cool.. Now on Mac 👨‍💻
1
1
15
958
I have the honor of authoring the first ever guest-post on the Modular AI blog, about my journey with porting #llama2 inference into the #mojo lang Kudos to @shshnkp for the incredible cooperation and support with preparing this article!
New blog post by Mojician 🔥 and guest contributor @tairov 💯⬇️ Aydyn discusses his journey from discovering Mojo 🔥 to implementing llama2.🔥 which has over 1.2k stars 🤩 on GitHub! 🚀 modular.com/blog/community-s…
1
1
15
857
It'll disrupt not just AI/ML development industry, but much more.. It's a huuge story
1
11
1,637
Thanks to a PR from a Modular team member, parallelize works in llama2.🔥 github.com/tairov/llama2.moj… Why not compare parallel execution with llama2.c? And... llama2.c strikes back, now with OMP...
1
1
12
1,248
👀 "MLX has a Python API which closely follows NumPy. MLX also has a fully featured C++ API which closely mirrors the Python API" Yet another attempt to fix ML model usability by implementing Python libs written in C++. Now from Apple research
Just in time for the holidays, we are releasing some new software today from Apple machine learning research. MLX is an efficient machine learning framework specifically designed for Apple silicon (i.e. your laptop!) Code: github.com/ml-explore/mlx Docs: ml-explore.github.io/mlx/bui…
3
12
992
Repo got 1K stars. You truly understand you have made a meaningful contribution to the #mojo community when GitHub offers to nominate a successor
9
498
whisper.cpp now supports HF distill-whisper + full CUDA & Apple Metal offloading, that brings almost a 4x boost to transcribing on fp model
1
11
952
Despite my best efforts to attend #ModCon onsite this year, I sadly couldn't make it happen. But the event can't truly go on without the first ever Mojician v-attendance! 😅 Wishing everyone an insightful conference!
1
10
697
Seems that @samlakig took this challenge very seriously. I'm eager to see the details of what you can accomplish!
I think I'll need to output more debug symbols/ do you think better to read the report straight from perf
1
1
10
1,031
I tried to optimize all the #llama2c ports for max performance. Some don't support multithreading so the comparisons aren't completely apples-to-apples. But it's clear Mojo is here to stay
1
8
1,393
It turns out that Mistral's team is literally not using Mojo to speed up training and inference 35,000x and still raised €120+M. Investors, what are you doing 😢 this is horrendous!
Mistral's team literally just used learned weights over data instead of programmed rules and raised 120+M. Investors what are you doing 😢 this is horrendous
2
2
10
1,223
I secured early access to the Mojo SDK on Mac before general release. Put all #llama2c ports through extensive benchmarks across 7 languages and 12 variations. Crafted custom benchmarking framework to test performance. Quite intriguing battle on M1 Mac - results are telling 👇
1
9
474
Interestingly, within just 2 weeks, we are no longer using the C implementation as a baseline. I’m sure there are still discoveries to be made 💥
1
9
1,414
The Mojo talk from the LLVM Dev conference is available on YT. The @Modular_AI team unpacked the end-to-end reasoning behind the AI Engine and the Mojo language piped.video/watch?v=SEwTjZvy…
1
1
8
697
I built a test env to benchmark all @karpathy's #llama2c ports, including Mojo, Zig, Julia, Rust & #llamacpp converter by @ggerganov. Ran inference across 3 baby llama models in 30 rounds (multi/single threaded). Check out the full report engiware.com/benchmark/llama…
1
8
850
It seems that the Mistral team is setting a new trend in OS LLMs: MoE, mixture of experts. If I understand correctly, GPT-4's impressive performance largely stems from a similar technique. In the backend, it features eight 'heads' or 'experts', each being a 250-billion-parameter LLM similar to GPT-3.5
New open weights LLM from @MistralAI params.json: - hidden_dim / dim = 14336/4096 => 3.5X MLP expand - n_heads / n_kv_heads = 32/8 => 4X multiquery - "moe" => mixture of experts 8X top 2 👀 Likely related code: github.com/mistralai/megablo… Oddly absent: an over-rehearsed professional release video talking about a revolution in AI. If people are wondering why there is so much AI activity right around now, it's because the biggest deep learning conference (NeurIPS) is next week.
8
777
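The ratios quoted from params.json above can be checked with a few lines of Python. A minimal sketch: the dictionary holds only the four values named in the post (the real file may contain more fields), and the variable names `mlp_expand` and `queries_per_kv` are hypothetical.

```python
# Values as quoted in the post; the real params.json may have more fields.
params = {"dim": 4096, "hidden_dim": 14336, "n_heads": 32, "n_kv_heads": 8}

# MLP expansion factor: hidden_dim / dim
mlp_expand = params["hidden_dim"] / params["dim"]

# Number of query heads that share each key/value head
queries_per_kv = params["n_heads"] // params["n_kv_heads"]

print(mlp_expand, queries_per_kv)
```

This gives an MLP expansion of 3.5 and 4 query heads per key/value head, matching the "3.5X" and "4X" figures in the post.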
Next move by 4D chess grandmaster?
2
6
332
The thing is, llama2.mojo was also implemented for understanding Mojo concepts and to have a real-world example. However, that doesn't mean both projects couldn't be evolved further to squeeze everything you can out of the hardware while trying to maintain brevity
1
7
342
Kudos to the @ziglang community for improving and benchmarking llama2 inference in Zig under Apple M1/2 ! The llama2.zig implementation has solid single-threaded performance - it may be the fastest single-threaded inference of tiny-llama models so far on Macs. Surprisingly, no boost on multi-threaded mode? github.com/karpathy/llama2.c…
Replying to @tairov
Llama2.mojo performance on Mac is right up there with llama.cpp (!!!), and even outperforms plain C in many cases. This is insane!
1
1
8
1,198
llama2.🔥 is a port of @karpathy's llama2.c, biggest supported model so far is TinyLlama with 1.1B parameters. I hope soon we can run bigger quantized models as well
1
6
1,528
This is Wow
Project #2: LLM Visualization So I created a web-page to visualize a small LLM, of the sort that's behind ChatGPT. Rendered in 3D, it shows all the steps to run a single token inference. (link in bio)
7
684
Replying to @clattner_llvm
This is what happens when you launch a new exciting technology that many enthusiasts have been eager to try out right before the weekend 👨‍💻
7
722
The Mojo truck is unstoppable! It now outperforms the Porsche 911 ! I think this could be a great trailerbuster for ModCon'23 😎
7
506
In case anyone still doubts the importance of diving into learning and entering the AI field now. 👇
Are you kidding? There has never been a green pasture of this size with this low barrier to entry.
1
6
570
Replying to @tairov @Modular_AI
I am particularly pleased that my project llama2.🔥 was also mentioned as an example of what can be achieved with inbuilt Mojo features. github.com/tairov/llama2.moj…
2
2
7
462
I think it's worthwhile to share some interim results we got with the llama2.🔥 speedup. Our incredible GitHub contributors baked a draft PR. And here is what we have 👇
1
7
594
I'll double check all results and will prepare a write-up with full details.
1
7
1,518
#GPT4V can understand the essence of recursion even from a photo. Unbelievable!
3
1
6
690
Replying to @lexfridman
CEO/Board misalignment
4
2,210
Hope next year's Google presentation will be like this ( @Modular_AI ? )
1
6
1,685
It shows additional details on which loops were vectorized by the gcc compiler. So far it seems that the very first comparison of llama2.c vs Mojo was fair: gcc is aggressively vectorizing all loops it can find 😀
2
5
403
Replying to @elonmusk
Thanks for sharing! I have it already implemented in Mojo 🔥, so now it's 35,000x faster than any other implementation github.com/tairov/QStarLearn…
1
2
6
383
Who remembers this? From 2008, the Google Search Appliance led the way in on-prem search solutions for enterprises, deployed as a rack form factor black box device. It was discontinued post-2018
1
6
370
Is it just me, or did OpenAI kill N+ startups I was going to develop? 🤓 #OpenAIDevDay
1
6
498
I discovered podcasts on X. Today I was invited to a really nice one! Thanks @altryne for the opportunity to share highlights & my experience with early release of Mojo SDK on Mac from @Modular_AI . PS. My moment of fame starts at the 59th min 🔊
T-minus 2 hours for @thursdai_pod live recording, and as always, if you can't make it to the live one, make sure you're subscribed on thursdai.news to receive the episode in newsletter and podcast form
5
499
Looking forward. 🔥 on Mac, one ♥️
Mojo 🔥 is coming to Mac 💻 very soon 😱 Here’s a little sneak peek of us testing LLama2.🔥 out of the box by @tairov. Look for this to drop in the next couple of weeks 💯🚀

5
603
Replying to @ggerganov
Sounds cool, but I think it might be hard to compete with AWS on its own territory 😀 from the cost perspective. They already have the AWS Bedrock service rolling out; it’s a kind of API to many LLM models where you pay for tokens used
2
1
5
3,079
Make programming great again!
Epic work, the thing I love about this is how small and clean the code is - literally reimplementing everything down to the metal instead of depending on thick layers of magic.
5
1,721
I think everyone in the AI world should know about this, so feel free to be the first retweeter 😉
1
5
1,527
Gemini, the avant-garde and trailblazing multi-modal virtuoso of language models, state-of-the-art titan, infused with wit and wisdom far beyond its digital peers. It's an inventive, quick-witted behemoth, eclipsing predecessors with its sterling adaptability! Gemini:
1
4
233
Have you ever wanted to benchmark baby Llama2 models in 12 programming languages? No? Well, now you can! github.com/tairov/lamatune
1
1
5
327
There was some debate regarding the fairness of the C vs Mojo comparison. I was in doubt whether it was fair or not, since in Mojo I deliberately introduced SIMD operations. After some research I found an interesting gcc switch, `-fopt-info-vec`
1
4
733
Replying to @var_epsilon
I didn't know this model was capable of generating photo-realistic images of a supercar.
3
1,659
I wouldn't say llama.cpp is 10x faster on M1 Metal. It probably has a 2x boost. I'm eager to benchmark Mojo with GPU support once it's released.
1
5
1,004
Replying to @tokenbender
Hey @4evaBehindSOTA ! Wanna give it "Round 3"? :) We added unrolling improvements, now it hits 1000 tok/s for stories15M. Pull the latest changes and use -j 6; it seems that with threads = 6 it works even better
1
5
232
Solid optimization lifting #python port from slowest to more competitive! Great to see the Python community working hard on the perf challenge!
1
5
219
Replying to @clattner_llvm
What about "tok/s" ? 😀
2
62
Thank you for feedback! I'm sure there are even more achievements to come in this space
5
304
We've reached another milestone with support for the 1.1B TinyLlama, which can now generate advanced responses, like an explanation of the Pythagorean theorem or Python code for calculating the Fibonacci sequence. Impressive performance for a 4GB model!
5
315
Replying to @hnasr
Exactly. And this is how I leveraged SIMD primitives to speed up llama2 inference by 15-20% in comparison to the C implementation github.com/tairov/llama2.moj…
1
5
579
I’m eager to benchmark it against any other reference Q-Learning implementation as soon as one is available 🤓 This is probably the only opinionated prototype so far. I'm afraid the competitors don't stand a chance either way
Q-Learning works better in Mojo🔥🚀Amazing work @tairov 💯
5
679
462 vs 385 tok/s
5
240
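For reference, the gap between the two throughput figures above works out as follows. A minimal sketch: the post does not say which implementation each figure belongs to, so the code only assumes the first number is the faster run; the `speedup` helper is a hypothetical name.

```python
def speedup(fast_tok_s, slow_tok_s):
    """Ratio of two throughputs measured in tokens per second."""
    return fast_tok_s / slow_tok_s


ratio = speedup(462, 385)           # assumes 462 tok/s is the faster run
percent_faster = (ratio - 1) * 100  # relative gain in percent

print(f"{ratio:.2f}x, {percent_faster:.0f}% faster")
```

That is a 1.20x ratio, in line with the roughly 20% speedups mentioned elsewhere in the feed.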
Here is real value for whoever comes to the comments: Modular is giving away free tickets to ModCon 2023 + swag 🎁. It seems it's still wide open and the competition is low! I see it as a prime opportunity to implement some classic algo in Mojo for a solid chance at a win. I’d love to see submissions of more advanced algos too, like realistic Q-learning? :) Show what you can build in Mojo for the best shot. But hurry, the contest closes soon! nitter.app/Modular_AI/statu…
1
4
3,608
#gpt4V pretending it's not involved in this mess.
1
5
476
Replying to @karpathy
This is already last century. Let your LLM agent bring you information in a convenient format, without the need to surf at all.
5
2,319
Cloud-exits might become a trendy thing.
It's absolutely crazy how cheap the computing power is if you stay outside of the Cloud 🤯 The usage of iximiuz Labs keeps growing, so I'm upgrading my bare metal servers. And I just doubled the fleet's CPU capacity with the price going from $44 to $53 per server per month.
1
2
646
Gemini must be a beast. 85% performance on a typical Codeforces competition is wild! It’s like solving 4-5 medium/hard Leetcode problems within 2.5 hours.
So excited to share what the team and I have been working on these last months! #AlphaCode 2 is powered by Gemini and performs better than 85% of competition participants in 12 contests on Codeforces! More details at goo.gle/AlphaCode2 @GoogleDeepMind
1
4
413
Replying to @dannypostma
Why not then switch to this model permanently ?
2
674
Here is why stock inference implementations are not a good fit for production workloads. "The compute costs are eye-watering" (c) #ModCon23
3
4
368
llama2-py improved significantly, meet llama2-numpy. The SLOC dropped 3x as well. Nicely implemented inference in 350 lines of almost pure Python 😀
Replying to @tairov @karpathy
Thank you Aydyn! I love the idea of a pure Python implementation with no dependence to an external package. I started from your version and introduced numpy: github.com/chris-ch/llama2-p… ... using the standard interpreter we get to 2x slower from 10x slower compared to C version.
2
5
658
Meanwhile AMD is also presenting something, obviously for AI. I’m exhausted and have no time to keep up with everything. Hey Grok, maybe your qdrant based vector search can help summarise this? 😀 piped.video/live/tfSZqjxsr0M…
2
4
323
Context is not all you need. As this research highlights, LLMs struggle with basic contextual understanding as the reasoning context grows more complex. Without a framework firmly grounding symbols in reality, model performance degrades. As was demonstrated at OpenAI DevDay, techniques like prompt engineering and RAG remain crucial to overcoming these context evaluation failures. Interestingly, that means there are natural optimizations within context that can help information retrieval. However, models still lack natural understanding of real-world contextual relationships.
Claude 2.1 (200K Tokens) - Pressure Testing Long Context Recall
We all love increasing context lengths - but what's performance like? Anthropic reached out with early access to Claude 2.1 so I repeated the “needle in a haystack” analysis I did on GPT-4. Here's what I found:
Findings:
* At 200K tokens (nearly 470 pages), Claude 2.1 was able to recall facts at some document depths
* Facts at the very top and very bottom of the document were recalled with nearly 100% accuracy
* Facts positioned at the top of the document were recalled with less performance than the bottom (similar to GPT-4)
* Starting at ~90K tokens, performance of recall at the bottom of the document started to get increasingly worse
* Performance at low context lengths was not guaranteed
So what:
* Prompt Engineering Matters - It’s worth tinkering with your prompt and running A/B tests to measure retrieval accuracy
* No Guarantees - Your facts are not guaranteed to be retrieved. Don’t bake the assumption they will into your applications
* Less context = more accuracy - This is well known, but when possible reduce the amount of context you send to the models to increase its ability to recall
* Position Matters - Also well known, but facts placed at the very beginning and 2nd half of the document seem to be recalled better
Why run this test?:
* I’m a big fan of Anthropic! They are helping to push the bounds on LLM performance and creating powerful tools for the world
* As a practitioner of LLMs, it’s important to build an intuition for how they work, where they excel and their limits
* Tests like these, while not bulletproof, help showcase real world examples and get a feeling for how they work. The goal is to transfer this knowledge to productive use cases
Overview of the process:
* Use Paul Graham essays as ‘background’ tokens. With 218 essays it’s easy to get up to 200K tokens (repeated essays when necessary)
* Place a random statement within the document at various depths. Fact used: “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.”
* Ask Claude 2.1 to answer this question only using the context provided
* Evaluate Claude 2.1's answer with GPT-4 using @langchain evals
* Rinse and repeat for 35x document depths between 0% (top of document) and 100% (bottom of document) (sigmoid distribution) and 35x context lengths (1K Tokens > 200K Tokens)
Next Steps To Take This Further:
* For rigor, one should do a key:value retrieval step. However for relatability I did a San Francisco line within PGs essays for clarity and practical relevance
* Repeat test multiple times for increased statistical significance
Notes:
* Amount Of Recall Matters - The model's performance is hypothesized to diminish when tasked with multiple fact retrievals or when engaging in synthetic reasoning steps
* Changing your prompt, question, fact to be retrieved and background context will impact performance
* The Anthropic team reached out and offered credits to repeat this test. They also offered prompt advice to maximize performance. It's important to clarify that their involvement was strictly logistical. The integrity and independence of the results were maintained, ensuring that the findings reflect my unbiased evaluation and are not influenced by their support.
* This test cost ~$1,016 for API calls ($8 per million tokens)
1
4
660
Will 99% of TikTok influencers lose their jobs/audience because of this?
“Animate Anyone” was released last night for making pose guide videos. Lets dive in. Paper: arxiv.org/abs/2311.17117 Project: humanaigc.github.io/animate-… 🧵1/
2
1
4
732
A Mojician, of course 🔥
1
3
184
Replying to @Modular @Modular_AI
Hi @ylecun! 🙌 I've been diving deep into the new Mojo lang by implementing #Llama2 inference on it. We'd love to hear your insights on Mojo and its stated capabilities
1
1
4
347
Stable Diffusion video in real-time is fascinating, look how it dressed Lex as a pirate in this interview!
3
1,329
Ideal explanation of all aspects of LLMs, including the security concerns, in such a condensed form. It's brilliant how succinctly the information is conveyed. @karpathy's videos are examples of almost perfect compression of huge ML topics into an accessible form.
New YouTube video: 1hr general-audience introduction to Large Language Models piped.video/watch?v=zjkBMFhN… Based on a 30min talk I gave recently; It tries to be non-technical intro, covers mental models for LLM inference, training, finetuning, the emerging LLM OS and LLM Security.
3
663
Now, on average we're performing slightly better than multithreaded #llama2c github.com/tairov/llama2.moj… . We're able to further improve vectorization/parallelization of transformers forward pass
1
2
362
If you want to avoid this kind of oversight in the future, make sure you come to ModCon '23 😉
1
2
237
Replying to @radamar @Gradio
Man, are you counting how many startups closed because you implemented their primary product on Gradio? 😅
3
245
How to run your fine-tuned LM app on top of underlying LLM-OS kernel
You don’t have to train from scratch whenever developing a smaller model of an existing model family. Sharing our latest work - “Initializing Models with Larger Ones” arxiv preprint: arxiv.org/abs/2311.18823 code: github.com/OscarXZQ/weight-s…
3
440
Best place to keep up with latest changes in AI world 👍
Replying to @ptsi @tairov
Let's gooo! We actually had Aydyn on ThursdAI and talked about LlaMa.🔥 sub.thursdai.news/p/thursdai…
3
574