Building | Learning | Sharing | Previously Lead AI Engineer at RelevanceAI, AI Engineer @ Weights&Biases

Sydney, New South Wales
🧵 Most modern LLMs like Qwen, DeepSeek & gpt-oss use YaRN to extend context from 4K→128K tokens. But what led to YaRN? Today I'm proud and excited to share a comprehensive resource into the evolution of positional embeddings such as APE, RoPE, YaRN & variants👇 1/n
2
4
24
2,289
I have primarily switched to Claude 3.5 Sonnet and hardly use GPT-4. Anybody else?
399
56
2,059
355,048
1/ After weeks of learning, I am proud to share - "The Annotated GPT-2" ladies and gentlemen! In this post, I re-implement OpenAI's GPT-2 in PyTorch using @huggingface source code and try to explain all the magic that goes on inside the model. amaarora.github.io/2020/02/1…
15
323
1,245
I've been working on Object Detection for the past few weeks - and I am proud to announce "The Annotated DETR" !! amaarora.github.io/2021/07/2… In this post, I try to explain all the magic that goes on inside the architecture. 1/n
5
116
506
After days and hours of learning, I am very excited to share my latest blog post "The EfficientDet Architecture in PyTorch"! amaarora.github.io/2021/01/1… In this post, I reference @wightmanr's source code and try to explain all the magic that goes on inside the network. 1/n
6
75
369
What is currently the best way to extract JSON from unstructured text using open source models by passing in a Pydantic schema? So far I have been looking into: 1. Guidance (github.com/guidance-ai/guida…) 2. Instructor (github.com/jxnl/instructor) 3. DSPy (github.com/stanfordnlp/dspy) 4. Guardrails-AI (github.com/guardrails-ai/gua…) 5. jsonformer (github.com/1rgs/jsonformer) Guidance and Instructor seem to have better openai compatibility. Getting them to work with open-source models seemed like a pain. Anyone have a working demo already? Anything else I should be trying? Also, is Hermes-2-Pro-Mistral-7B.Q8_0.gguf still the best go to model for this task? huggingface.co/NousResearch/…. I don't see a LLAMA-3 version of Nous-Hermes out yet. Finally, I have also been looking into llama-cpp-agent, has anyone tried this before? Seems to be working pretty well so far! github.com/Maximilian-Winter…
21
58
314
51,956
Investing time in @fastdotai is one of the best investments I have ever made. To continue to learn, I am starting a new series #CodeFirst where I will be digging deep into the source code. This builds on top of @jeremyphoward code walkthrus. medium.com/@aman.arora0210/f…
4
38
291
Very excited to share my latest blog post on Optimizers called `"Adam" and friends`! amaarora.github.io/2021/03/1… In this blog post we are going to re-implement SGD, Momentum, RMSprop & Adam from scratch and also compare performance with PyTorch's implementation. 1/
1
45
288
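The update rules named in the tweet above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code from the blog post: the 1-D objective `(w - 3)^2`, the function names and the hyperparameter defaults are all assumptions chosen for the example, and the momentum and RMSprop rules follow the convention I believe PyTorch uses (velocity accumulates raw gradients; `eps` is added after the square root).

```python
def grad(w):
    """Gradient of the toy objective f(w) = (w - 3)^2 (an assumed example)."""
    return 2.0 * (w - 3.0)


def sgd(w, steps=500, lr=0.1):
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def momentum(w, steps=500, lr=0.1, mu=0.9):
    velocity = 0.0
    for _ in range(steps):
        velocity = mu * velocity + grad(w)  # accumulate past gradients
        w -= lr * velocity
    return w


def rmsprop(w, steps=500, lr=0.01, alpha=0.99, eps=1e-8):
    square_avg = 0.0
    for _ in range(steps):
        g = grad(w)
        square_avg = alpha * square_avg + (1 - alpha) * g * g  # running mean of g^2
        w -= lr * g / (square_avg ** 0.5 + eps)
    return w


def adam(w, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum-like)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (RMSprop-like)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (v_hat ** 0.5 + eps)
    return w


for name, optimizer in [("sgd", sgd), ("momentum", momentum),
                        ("rmsprop", rmsprop), ("adam", adam)]:
    print(name, optimizer(0.0))
```

Seen side by side, Adam is roughly momentum (first moment) combined with RMSprop (second moment) plus bias correction.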
I’m excited to share that I’ve joined @wandb! This means - more paper summaries, more research, more community events, more paper reading groups, more @fastdotai study groups, more open source contributions, more fun. :)
13
6
282
1/ Not only is @fastdotai great for building deep learning models, it is also an excellent place to learn! By reading 21 pages of cs231n.github.io/convolution… resource mentioned in the pets lesson of V2 bit.ly/34dUNtS, I had several AHA moments! Such as,
5
46
257
Excited to share a new blog post on Gemma 2 that goes into the details of: Grouped Query Attention, Sliding Window Attention, Rotary Position Embeddings (RoPE), Logit soft-capping & model-merging. **All with easy to follow PyTorch implementations!** 1/N
1
44
264
31,382
Super excited to present my latest blog post on ResNet-RS - "Revisiting ResNets: Improved Training and Scaling Strategies". bit.ly/2QT3yIU I also share code implementation in PyTorch using TIMM & more! 1/3
5
55
241
Trust me when I tell you that the below code implements Grouped Query Attention (GQA), Multi Head Attention (MHA) & Multi Query Attention (MQA). There is no magic to it. Paper (GQA): arxiv.org/abs/2305.13245 Implementation adapted from: github.com/meta-llama/llama/…
6
33
237
22,760
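The code the tweet above refers to was shared as an image and is not reproduced here. As a stand-in, here is a minimal pure-Python sketch of the idea, not the Llama implementation itself: every name below is hypothetical, there is no causal mask or projection layer, and the grouping convention (query head `h` uses key/value head `h // group_size`) is an assumption. The point is that a single function covers all three variants: as many key/value heads as query heads gives MHA, one key/value head gives MQA, and anything in between gives GQA.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head.

    q, k and v are nested lists of shape [seq_len][head_dim].
    """
    scale = math.sqrt(len(q[0]))
    out = []
    for q_t in q:
        weights = softmax([sum(a * b for a, b in zip(q_t, k_s)) / scale for k_s in k])
        out.append([sum(w * v_s[c] for w, v_s in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def grouped_query_attention(q_heads, k_heads, v_heads):
    """Share each key/value head across a group of consecutive query heads."""
    n_heads, n_kv_heads = len(q_heads), len(k_heads)
    if len(v_heads) != n_kv_heads or n_heads % n_kv_heads != 0:
        raise ValueError("query heads must be a multiple of key/value heads")
    group_size = n_heads // n_kv_heads
    return [attention(q, k_heads[h // group_size], v_heads[h // group_size])
            for h, q in enumerate(q_heads)]


# Toy example: 4 query heads sharing 2 key/value heads (GQA).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v_a = [[1.0, 2.0], [3.0, 4.0]]
v_b = [[10.0, 20.0], [30.0, 40.0]]
out = grouped_query_attention([q, q, q, q], [k, k], [v_a, v_b])
print(out[0] == out[1], out[2] == out[3], out[0] == out[2])
```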
I am not sure if I should be scared or happy - with Uber's latest Plug & Play Language Model (arxiv.org/abs/1912.02164) it is now possible to drive LM's activations (such as GPT-2) and generate text with a specific sentiment on a specific topic. Is this dangerous? Time will tell.
1
64
219
It brings me great excitement as I share my latest blog on EfficientNet for two reasons: - Efficientnet-B7 achieved new SOTA while being 8.4 times smaller and 6.1 times faster than GPIPE - Recent and current SOTA have all been related to EfficientNets amaarora.github.io/2020/08/1… 1/
6
54
221
It's Monday and I am pretty excited to release my latest blog post "U-Net: A PyTorch Implementation in 60 lines of Code". amaarora.github.io/2020/09/1… I was able to train this network (without pretrained weights) for SIIM ACR Pneumothorax Kaggle Competition and get 0.79 dice score. 1/
4
32
217
After digging deep into HF's implementation of the LongFormer architecture, I have written a new blog post that explains SWA and shows how to implement in PyTorch. amaarora.github.io/posts/202… Continue reading this thread for a short summary. 1/
3
30
217
25,002
I'd love to be able to make beautiful visualizations for my future blog posts. For example, below I share fig-2 from the Weight Standardization paper. Does anybody know any good tools that are fairly easy to use? (Don't want to spend months learning new tools.)
20
16
195
Here's a thread on why I write blogs and how that has completely changed my life. amaarora.github.io/ 1/
9
46
202
Now you can extract activation statistics from any module inside `timm` models easily using Unix filename pattern matching and @PyTorch hooks! In this example we extract average square channel mean of activations after every residual block inside a ResNet-50 model: 1/
6
36
188
What are some of the best books that really help you think about "how to design software?" Particularly after something that is: - Ideally for Python users - Mentions the key steps in designing/testing software - Mentions the tools - Helps think about key design decisions
12
26
176
1/ Wouldn't it be great if someone explained to you exactly what Resnet does in great detail and that too in a simple language? Fastbook's chapter 14 - ResNets (github.com/fastai/fastbook/b…) does exactly that! Thanks @jeremyphoward and @GuggerSylvain ! :)
2
31
173
"Could @huggingface Accelerate really be this easy?" I asked myself, and the result is this blog post where we take a deep-dive into the source code of the package. wandb.ai/wandb_fc/pytorch-im… Thanks @GuggerSylvain - you've done it again!! A thread: 1/n
5
37
173
Special thanks to @math_rachel and @jeremyphoward for the brilliant NLP course that really helped me in my journey to start learning about Transformers and NLP in general. fast.ai/2019/07/08/fastai-nl…
24
167
Super excited to share my latest blog on "Normalizer-Free ResNets" by @DeepMind !! Blog: bit.ly/3g9igFJ Paper: arxiv.org/abs/2101.08692 The idea is to explain everything in detail in a simple language & also show code implementation in @PyTorch. :) A thread: 1/5
4
39
160
.@dr_hb_ai and I have teamed up to present the 1st blog in "TIMM SERIES" on "Vision Transformer"! amaarora.github.io/2021/01/1… Thanks to @dr_hb_ai's contributions, IMHO, this is one of the prettiest and most detailed blogs on ViT so far. We also share code implementations! 1/n
3
44
159
Always wanted to write code some day that does something productive and fits in a single screen. New blog post on "U-Net using PyTorch" coming out this Monday 9am AEST! :)
3
17
157
It's hard for me to contain my excitement as I share this with you! @fastdotai has been at the core of all my learnings and I look forward to sharing the love for this library with you through fastbook reading sessions at @wandb for the next ~20 weeks! A thread: 1/n
Join @jeremyphoward & the @wandb team for our 1st Fastbook reading session on June 3, 2021 at 8pm PST / 1pm AEST! Over the next 20 weeks, we'll dive into this hands-on-guide to deep learning. 📍 Register: wandb.me/fastbook #deeplearning #machinelearning
6
26
149
Best post on Transformers till date - The annotated transformer. nlp.seas.harvard.edu/annotat…
1
34
147
9,367
1/ It's Monday and as promised I am back with another blog post - "Group Normalization". amaarora.github.io/2020/08/0… As a summary, we look at: - What is GN - In which cases you might want to try GN as opposed to BN - Other norm techniques like LayerNorm and InstanceNorm (briefly)
5
33
147
Excited to bring to you the only resource that you'll need to understand "Swin Transformers" V1 (with #PyTorch code implementation!). amaarora.github.io/2022/07/0… A 🧵:
4
32
147
I am elated and humbled to have won my first silver medal on Kaggle in recent "SIIM-ISIC Melanoma Classification" competition. This is all thanks to the wonderful community and open source projects around me especially @fastdotai! Detailed write-up and journey coming soon.
8
4
145
OMG! "Sliding Window Attention" is seriously a wild concept to wrap your head around! 🤯 github.com/huggingface/trans…
3
23
148
15,418
Excited to present part-2 of Annotated CLIP (the only 2 resources that you will need to understand CLIP completely with PyTorch code implementation). amaarora.github.io/posts/202… As part of this blog post we re-implement CLIP in PyTorch step-by-step using code from open clip. 1/
2
35
145
16,931
My biggest fear: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."
11
5
143
Thrilled to present my latest blog post: "Demystifying Document Question-Answering Chatbot - A Comprehensive Step-by-Step Tutorial with LangChain" 🔗: amaarora.github.io/posts/202… It's the most exhaustive resource that you'll find on the topic going into depths of @langchain. 1/
2
25
141
25,929
It's Monday and I am pretty excited to release my latest blog post "Introduction to Metric learning and Center Loss". amaarora.github.io/2020/10/1… This post is the first in a series of 4 blog posts on Metric Learning! We start with center loss and later look at other losses. 1/
3
26
132
Hello everyone! 👋 I'm thrilled to share with you the journey I've embarked on in the world of AI and Machine Learning. Over the past few years, I've had the privilege of diving deep into various topics, exploring new technologies, and sharing my insights through numerous blog posts and reports. I've compiled a list of all my writings so far, each one a stepping stone in my learning journey. I hope these resources can be of help to you as they have been to me. nebula-cow-fb1.notion.site/C… I'm incredibly proud of the work I've done and the knowledge I've gained. But more than that, I'm excited about the opportunity to share it with all of you. I believe in the power of community and the collective wisdom we can build by sharing our experiences and insights. I'd love to hear your thoughts on these topics. Have you found any of these resources helpful? Do you have any favourite articles or insights you'd like to share? Let's start a conversation!
2
30
129
27,848
We're discussing the MDETR paper with author @ashkamath20 tomorrow in our paper reading group at @wandb! Simply put, MDETR is a multi-modal transformer-based architecture that learns to assign free-form text to objects in an image. But, how does it do that? 1/
3
15
131
Did you know that the `timm` library also has over 15 available optimizers including Lookahead to choose from on top of hundreds of the pretrained models? This tutorial shows how to incorporate these optimizers in your custom PyTorch training scripts- fastai.github.io/timmdocs/Op…
1
23
123
Sure, a bit behind on the MLP mixer madness, but needed some time for the "Is MLP-Mixer a CNN in disguise?" debate to settle down! @dr_hb_ai and I spent hours on a call together to find the answer. The result? A new blog post! wandb.ai/wandb_fc/pytorch-im… 1/n
3
29
116
This is a thread on our fastbook reading group currently running at @wandb. We decided to finish the book in week-0 with @jeremyphoward, and I am so glad and happy that so many of you are with me on this journey. It's already week-12, and we're still going strong!! 1/n
1
20
115
Deep learning practitioners, I have a Q - Is there a nice visualization that explains why nn.Conv1d and nn.Linear are essentially the same when kernel size is 1 for conv?
5
9
114
Anyone interested in joining me on a journey to replicate the 1st place solution for Google Landmark Retrieval 2020 using Google Colab TPUs? solution summary: kaggle.com/c/landmark-retrie… arxiv: arxiv.org/abs/2009.05132
11
21
108
This just happened. ImageNet, here I come!
8
1
97
"Are fully connected and convolution layers with 1x1 kernel equivalent? If so, how?" In a quest to find the answer I ended up implementing both operations in MS Excel and compared the results to PyTorch outputs! bit.ly/2VvGYrG 1/n
3
19
95
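The equivalence asked about in the tweet above can be checked by hand. This is a small sketch in plain Python rather than Excel or PyTorch; the function names, shapes and numbers are made up for illustration. A convolution with kernel size 1 applies the same weight matrix to the channel vector at every position, which is exactly what a fully connected layer does to a single input vector.

```python
def linear(x, weight, bias):
    """Fully connected layer.

    x: [in_features], weight: [out_features][in_features], bias: [out_features].
    """
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def conv1d_kernel1(x, weight, bias):
    """1-D convolution with kernel size 1.

    x: [in_channels][length], weight: [out_channels][in_channels], bias: [out_channels].
    Returns [out_channels][length].
    """
    length = len(x[0])
    return [[sum(w * x[c][t] for c, w in enumerate(row)) + b for t in range(length)]
            for row, b in zip(weight, bias)]


# Two input channels, three positions, three output channels (assumed toy values).
x = [[1, 2, 3],
     [4, 5, 6]]
weight = [[1, 0], [2, -1], [0, 3]]
bias = [0, 1, -1]

conv_out = conv1d_kernel1(x, weight, bias)
for t in range(3):
    channel_vector = [x[c][t] for c in range(2)]    # features at position t
    conv_column = [conv_out[o][t] for o in range(3)]
    print(t, conv_column == linear(channel_vector, weight, bias))
```

The only difference is bookkeeping: the convolution repeats the same matrix multiply once per position.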
Personal Update thread: I've relocated to Delhi for a while. Sadly, my father was diagnosed with cancer, and I am here to support him in his fight. I've mostly been away from Twitter and work but slowly getting back to it now. 1/
29
1
92
New blogpost on ‘Group Normalization’ with PyTorch implementation coming tomorrow! :)
94
Before joining @wandb, I used to work for a medical startup in Sydney that was pretty heavy on compliance! Their product could diagnose around 127 diseases in Chest X-rays - but this also meant that any change in the deep learning model affected human lives directly. 1/
Need help with keeping track of your source code files and maintaining model & dataset versions that get served in production? In this webinar @amaarora will demo a robust framework to structure experiments with W&B. 📍Details & registration: wandb.me/webinar-registratio…
2
9
90
There's so much to learn from @kaggle.
5
6
93
Does anybody know where does the dot-product attention formula come from and why does it work?
14
10
91
Here's a wonderful article by @pandeyparul - "Building a compelling Data Science Portfolio with writing"! Having a good data science portfolio not only helps the people around you but also is great for your own personal growth! wandb.ai/parul_pandey/discus… 1/
3
18
88
I am always happy when I have a new blog post to present. I am happy again and excited to present my latest blog on ConViT architecture. bit.ly/3c2nuAf So what is ConViT? It's ViT with the first 10 SA layers replaced with GPSA layers. Wait, what? A thread: 1/n
1
27
83
1/ It's Monday morning for me and I am back with another blog post. This time it's "Squeeze and Excitation Networks Explained with PyTorch Implementation". amaarora.github.io/2020/07/2… Research paper : arxiv.org/abs/1709.01507
1
20
79
1/ I am really excited to share my new blog post "Label Smoothing Explained using Microsoft Excel" amaarora.github.io/2020/07/1… In this post, not only do we implement Label Smoothing in Microsoft Excel step by step but also,
2
13
80
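For readers who prefer code to spreadsheets, here is a small sketch of label smoothing in plain Python. It uses one common formulation (keep `1 - eps` on the true class and spread `eps` uniformly over all classes) and is not taken from the blog post; the function names, logits and `eps` value are assumptions for illustration.

```python
import math


def smooth_labels(target_index, num_classes, eps=0.1):
    """Turn a hard label into a smoothed distribution that still sums to 1."""
    labels = [eps / num_classes] * num_classes
    labels[target_index] += 1.0 - eps
    return labels


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_dist):
    """Cross entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_dist, probs))


logits = [2.0, 5.0, 0.5, 0.1]             # assumed model outputs; class 1 is correct
hard = smooth_labels(1, 4, eps=0.0)       # plain one-hot target
soft = smooth_labels(1, 4, eps=0.1)       # smoothed target

print(soft)
print(cross_entropy(logits, hard), cross_entropy(logits, soft))
```

With smoothing, the loss keeps a small penalty even when the model is very confident in the correct class, which discourages the logits from growing without bound.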
It's Monday again and I am keeping my promise of releasing a new blog post yet another week. This week's post is based on a "code-first" approach where we build a solution for SIIM-ACR Pneumothorax Segmentation competition using @PyTorch. amaarora.github.io/2020/09/0… 1/
2
18
80
1/ As is usual for Monday mornings, I am back with yet another blog post - "DenseNet Architecture Explained with PyTorch Implementation from TorchVision" amaarora.github.io/2020/08/0… In this blogpost, together, we look at-
3
14
75
I got my job at @wandb because of my blog. Having my own personal blog almost always adds an X factor to my profile every time I interview. It’s also a great way to document your learnings and I often find myself referring to my older blogs as part of revision. 1/
Starting to write online was one of the most impactful decisions I've made in my career so far ✍️ I've gotten job opportunities, met amazing people and learned a lot from writing online. We are running a "blogathon" at W&B to get more people writing. wandb.me/blogathon 1/4
3
5
70
Lots of information packed in this article "Make Delegation Work in Python" by @jeremyphoward ! bit.ly/2UeTpEB Apart from learning about how Python deals with `**kwargs`, I also got to know about the "DRY" principle! ;)
1
11
65
I recently summarised what’s new in CV. The talk summarises CV architectures’ progress over time. Starting with AlexNet in 2012 to Transformers, NfNet and MLP Mixers in 2021. I also shared my views about publicly sharing your work. Thanks @QLD_AI_Hub for hosting me.
Our mates @Queensland_AI were recently joined by @wandb Aman Arora to explore what's new in #computervision and how to publicly share your work. #artificialintelligence #machinelearning #deeplearning #data piped.video/watch?v=IYg46wNy…
4
17
66
Love this blog on @fastdotai's learning rate finder! Includes experimentation with different LRs on the PETs dataset and some pretty cool handwritten notes on discriminative learning rates too :) Also explains why we don't need to re-initialize learners after running lr_find()!
An amazing week 7 of #fastbook session with @amaarora from @wandb ! Aman explains the importance of tuning learning rates for training #DeepLearning models and I have summarized my understanding below elisonsherton.github.io//fas…
1
13
66
Slightly old but gold. Illustrating Reinforcement Learning from Human Feedback (RLHF) by @huggingface huggingface.co/blog/rlhf
1
13
63
4,018
Got bronze in Covid competition by the finest of margins. But, also - 1. First time with "object detection" 2. Wrote a blog post about DETR and hosted PRG at @wandb 3. Another PRG next week on MDETR 4. Learnt about Detectron2, MMDetection 5. Working on a library
There's so much to learn from @kaggle.
4
7
64
As deep learning practitioners, we are surrounded by frameworks that make our lives so much easier! One such framework that has been a part of almost all of my distributed training (multi-GPU/TPU) loops since its release has been -🤗accelerate! github.com/huggingface/accel… 1/4
1
6
62
Did you want to use the SGDR scheduler for your custom PyTorch training scripts? This tutorial shows you how you can implement it using timm and explains each of the hyperparameters. Oh btw, it is also possible to schedule other params apart from lr. fastai.github.io/timmdocs/SG…
3
12
62
Not long left now, in 2 days we are hosting the first ML-frameworks meetup at @wandb with @GuggerSylvain, for a deep-dive into hugging face accelerate! code: huggingface.co/docs/accelera… blog: huggingface.co/blog/accelera… rsvp: wandb.me/ml-frameworks See you all there!🤗
13
61
If you've tried and failed to understand Seq2Seq models (like I did many times), try @math_rachel's NLP course accompanied with these two excellent blog posts by @karpathy and @ch402 bit.ly/2ZuFwE0, bit.ly/34YugAU. Then, try implementing the model in code. :)
13
58
Did you know that the `timm` library can load the ImageNet pretrained weights for images with input number of channels != 3? Here is a tutorial that explains how this works. bit.ly/30tzftz 1/
5
7
60
I am so excited to share that at CTDS we are starting out with a new series on TIMM! GitHub: github.com/rwightman/pytorch… In this series, @bhutanisanyam1 and I are going to deep dive into the source code of TIMM over the next few weeks. CTDS: piped.video/c/ChaiTimeDataSc… 1/n
2
8
58
It was lovely to host @jeremyphoward yesterday for our introductory session on fastbook reading group at @wandb! :) This session is also available on YouTube here - piped.video/watch?v=X3tjlZL9… 1/n
2
15
57
In case you missed @GuggerSylvain talk about 🤗Accelerate - the video is now available on YouTube! Plenty of good advice from the master himself. :) piped.video/watch?v=A7lnu-Zs…
1
13
54
Have taken the first step towards updating my blog (and hopefully, making it better). You know what I am talking about, right? amaarora.github.io/
5
6
51
8,137
This is an exciting moment. I've successfully been able to re-implement `SGD`, `Momentum` & `RMSProp` from scratch. Blue is the loss curve when a model was trained using PyTorch's `RMSprop` and orange represents the new implementation from scratch. New blog post out soon! :)
2
3
51
Long way to go, but happy to be a Kaggle 2x expert. :)
4
46
Today, @bhutanisanyam1 and I will continue looking at the top solutions from "SIIM-FISABIO-RSNA Covid-19 Detection" Kaggle competition. Join us in ~3 hrs: piped.video/watch?v=HJDfV6Tj… We will train the Study Classification model from scratch using Segmentation AUX loss in PyTorch.
3
11
47
New Blogpost: In today's blogpost "SIIM-ISIC Melanoma Classification - my journey to a top 5% solution and first silver medal on Kaggle", I share my journey, solution summary and key learnings from having participated in this competition. amaarora.github.io/2020/08/2… 1/
1
6
47
Dear NLP experts, I want to train a model to do address segmentation. Trying to break a text address like: "Unit 12, 11-15 Myra Rd, Strathfield" into its constituents like: Unit: 12 Street number: 11-15 Street name: Myra Rd Suburb: Strathfield How could I do this please?
17
7
48
We're discussing the DETR paper in our paper reading group at @wandb tomorrow! Paper: arxiv.org/abs/2005.12872 RSVP: wandb.me/prg
I've been working on Object Detection for the past few weeks - and I am proud to announce "The Annotated DETR" !! amaarora.github.io/2021/07/2… In this post, I try to explain all the magic that goes on inside the architecture. 1/n
2
12
47
Recently I integrated my "ResNet Strikes Back" (arxiv.org/abs/2110.00476) related experiments with @wandb and wrote a blog post about it too: wandb.ai/amanarora/resnet_st… 1/
2
11
45
Every week after the fastbook sessions I have a big smile on my face! I think it’s the @wandb and @fastdotai magic! Love the community and people attending! Thanks guys for making it so fun! 😁
In our upcoming Fastbook Reading Group session, @amaarora will wrap up Chapter 2 and start Chapter 3! 🗓 June 23, 8pm PT (<1 hr to go) 🚀 RSVP - wandb.me/fastbook
5
9
43
Hey everybody! I know I’ve been away, but not today. We are hosting our first beginner-friendly live coding session at @wandb on ResNet! Join me live at wandb.me/resnet-stream today at 10:30pm IST! We’ll build the architecture from scratch in @PyTorch.
10
43
May 5th, 2023: Release of StarCoder & StarCoderBase. nitter.app/BigCodeProject/s…

I just finished reading the 54-page accompanying pre-print - arxiv.org/abs/2305.06161, & let me take you through all the finer details of dataset generation & curation, model training & evaluation below.

Big thanks to @ServiceNow, @BigCodeProject & @huggingface for the open-source model, dataset & training recipe.

----------------------------------------------------
KEY FEATURES:
1. StarCoder is a finetuned version of StarCoderBase, that has been finetuned using 35B Python tokens!
2. StarCoderBase is a 15.5B parameter model with an 8K context length, trained on 1 trillion tokens from The Stack (arxiv.org/abs/2211.15533).
3. 1T tokens consist of 80+ programming languages, GitHub issues, Git commits & Jupyter Notebooks.
4. StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI `code-cushman-001` model.
5. Both StarCoderBase & StarCoder have 8k context length, support Fill-in-the-Middle (arxiv.org/abs/2207.14255) & inference through Multi-Query-Attention (arxiv.org/abs/1911.02150); I will write about these two papers in follow-up Twitter threads.
6. OpenRAIL-M license agreement, a new attribution tool in the VSCode demo that can help users detect and locate model generations that may have been copied from the training set & a significantly improved PII redaction pipeline, built by collecting a PII dataset containing 12,000 files with 22,950 annotated entities.

----------------------------------------------------
DATA CURATION & CLEANING:
1. From the 358 programming languages in The Stack, 86 were chosen based on two filters:
- Languages with more than 500MB data
- Top-50 languages on GitHut (githut.info/) or TIOBE Index for December 2022.
(full list in table-1 & table-2 attached as imgs)
2. Swift was not chosen in the final list of languages due to human error!
3. Data was visually inspected - eighteen community annotators evaluated 300 programming language extensions. Here's what the process looked like:
- 30,000 files were randomly selected and categorized by extension
- Keep max 1,000 files per extension
- Annotators went through 50-100 files & confirmed whether the data appeared to be normal code.
4. Custom filters for specific formats:
- For HTML: custom HTML filter that targets excessive HTML boilerplate and links;
- For YAML: keep files with 50–5000 characters, an average line length smaller than 100, a maximum line length smaller than 1000, and more than 50% alphabetic characters;
- For JSON: keep files with 50–5000 characters and more than 50% alphabetic characters, which removes around 70% of the files and 98% of the volume.
5. Jupyter Notebooks were transformed into two different datasets - Jupyter-scripts & Jupyter-structured.
- For Jupyter-scripts, Jupytext (jupytext.readthedocs.io/) was used to convert notebooks to scripts. Some notebooks were missing metadata about their programming language; Guesslang (guesslang.readthedocs.io/) was used to automatically identify the language in this case.
- For Jupyter-structured, filter out notebooks that don't have Python code or Markdown text. Only notebooks explicitly marked as ‘Python’ in the metadata were kept, and consecutive Markdown blocks or code blocks were merged into a large Markdown or code block respectively. Total 1M structured Jupyter Notebooks after preprocessing.
6. For GitHub Issues, conversations from PR's & Issues were collected as part of The Stack. These were then filtered as below:
- Remove auto-generated text when users replied to issues via email. (see Regex expression as Listing A.1 img attached) - removed 18% of volume.
- Exclude comments from bots. Done by searching for keywords in username & comment's author.
- Keep conversations with two or more users, or total text within comment < 7,000 characters for single user.
- Use `fasttext` (fasttext.cc/docs/en/language…) to filter out non-English issues.
7. For Git Commits, data collected from BigQuery (For ), remove repos from users that opted out of The Stack. Keep 50% sample and apply following filters:
- Remove code files with >100k chars;
- Remove commits with empty commit subject;
- Subsample changes with ≤ 2 lines with 50% probability;
- Subsample changes spanning ≥ 200 lines with 10% probability;
- Remove commits with whitespace-separated words-to-character ratio >20;
- Subsample data formats (JSON, YAML, XML, HTML) with 50% probability.
8. For DeDuplication, same approach as in arxiv.org/abs/2301.03988.
- Calculate MinHashes of all src code files followed by Locality Sensitive Hashing (LSH) to map similar code files to same bucket.
* I am not sure about how this de-duplication part works, will have to further read about LSH & MinHashes.
9. Regarding Weighting of Data Sources, authors decided not to up-sample or down-sample certain programming languages. Why? Because, after the deduplication process, it was found that several high-resource programming languages, such as C, C++, C#, Java, Javascript, Python, and PHP, had a similar amount of data ranging from 44–87 GB.

----------------------------------------------------
PII REDACTION
Even though the Personally Identifiable Information (PII) redaction is a subset of the Data Curation section before, I share it separately in this tweet as it's quite interesting. Consists of three parts:
1. Data Collection (identifying PII entities such as names, usernames, emails, IP addresses, passwords..): the collected dataset comprises 12,000 files each containing approximately 50 lines of code in 31 programming languages. The annotators detected a total of 22,950 PII entities in the dataset.
2. Encoder only model called StarEncoder trained on data collected from step-1 using MLM (Masked Language Modelling) & NSP (Next Sentence Prediction) objectives - objectives from BERT! Takes ~2 days on 64 A100 GPUs for 400B tokens.
3. Finetune StarEncoder for NER (named entity recognition) task with 6 target classes: names, emails, keys, passwords, IP addresses, and usernames.

The finetuned version baseline achieves F1 scores of more than 90% on names, emails, and IP addresses and 73.39% on passwords. The observed model’s performance is comparatively low on keys and usernames, with F1 scores of only 56.66% and 59.39%, respectively.

Comparison against regex baseline: PII detection models still surpassed the regex approach in detecting all three entities supported by regex - Email, IP address & Key.

All PII entities were replaced with the following tokens: <NAME>, <EMAIL>, <KEY>, <PASSWORD>

----------------------------------------------------
MODEL TRAINING
StarCoderBase is the first model trained on 1 trillion tokens sourced from the curated dataset described above. StarCoder is the fine-tuned version of StarCoderBase, trained on another 35B Python tokens (roughly 2 epochs)
1. Data formatting using tokens performed prior to training.
- For code, authors prepended repository name, file name, # of stars, & code.
<reponame>REPONAME<filename>FILENAME<gh_stars>STARS\nCode<eos>
- For Issues, special tokens used to separate comments.
<issue_start>title + USERID: comment<issue_comment>USERID: Comment ... <issue_closed (optional)> <eos>
- Jupyter scripts were formatted in the same manner as code.
- For Git Commits, separated the code before the commit, the commit message, and the code after the commit with tokens.
<commit_before>code<commit_msg>text<commit_after>code<eos>
2. Tokenizer: used the Hugging Face Tokenizers library to train a byte-level Byte-Pair-Encoding with a vocabulary size of 49,152 tokens, including the sentinel tokens.
3. Model Architecture: trained a 15.5B parameter model with the same architecture as SantaCoder. It is a decoder-only Transformer with Fill-in-the-Middle, Multi-Query-Attention & learned absolute positional embeddings.
Introducing: 💫StarCoder StarCoder is a 15B LLM for code with 8k context and trained only on permissive data in 80+ programming languages. It can be prompted to reach 40% pass@1 on HumanEval and act as a Tech Assistant. Try it here: shorturl.at/cYZ06r Release thread🧵
11
39
10,726
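On the de-duplication step that the thread above leaves open: here is a rough sketch of how MinHash signatures and banded LSH can group near-duplicate files, in plain Python. This illustrates the general technique only and is not the StarCoder pipeline; the shingle size, number of hash functions, band count, helper names and example files are all assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Set of k-token shingles from whitespace-split text."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def _hash(seed, item):
    """Deterministic 64-bit hash; each seed acts as a different hash function."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value of the set under each hash function."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Split signatures into bands; files sharing any band land in one bucket."""
    buckets = {}
    for name, sig in signatures.items():
        rows = len(sig) // bands
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, set()).add(name)
    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        pairs.update((ordered[i], ordered[j])
                     for i in range(len(ordered)) for j in range(i + 1, len(ordered)))
    return pairs


files = {  # hypothetical source files
    "a.py": "def add(a, b): return a + b  # simple helper",
    "b.py": "def add(a, b): return a + b  # simple helper",
    "c.py": "class Stack: def push(self, item): self.items.append(item)",
}
signatures = {name: minhash_signature(shingles(text)) for name, text in files.items()}
print(lsh_candidate_pairs(signatures))
print(estimated_jaccard(signatures["a.py"], signatures["b.py"]))
```

Only the candidate pairs that share a bucket need a closer comparison, which is what makes this practical on a very large corpus.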
I'd like to think of Deep Learning in NLP and CV to broadly consist of: 1. Model architectures 2. Optimizers 3. Loss functions 4. Data augmentation techniques 5. Schedulers 6. Layer initializations What am I missing? Model distillation (can be part of category 1) and..?
7
2
38
Join me in 2 hours as we look into the EfficientNetV2 paper as part of our paper reading group at @wandb! arxiv: arxiv.org/abs/2104.00298 report: bit.ly/3y7ysx7
2
6
39
We asked and you answered! And now, we're excited to host our next paper reading group at @wandb this Sunday at 12pm PST on the "Vision Transformer"! Together, let's break this paper down into simple parts and learn all about it. Register here - wandb.ai/aarora/discussions/…
If I were to host a paper reading group at @wandb, which paper would you want us to discuss together with code implementation? If any other, please let me know. :)
1
6
41
Great to see the classic matrix multiplication problem that we've seen in fast.ai as part of Modular keynote by @jeremyphoward ! Check out the keynote: modular.com/ 1/
1
7
36
5,627
Planning to host beginner friendly paper reading groups at @wandb! (possibly ResNet, SeNet, EfficientNet & DenseNet) We could also go through code implementation in TIMM! How does that sound?
33% Cool
64% Super cool
4% Not cool
318 votes • Final results
4
10
38
We're discussing the CaiT paper today in our paper reading group at @wandb in 2 hours - and I am also going to reference code from TIMM to show everybody the implementation of the paper in PyTorch. RSVP: wandb.me/prg Paper: arxiv.org/abs/2103.17239
2
9
37
It's Monday again, and in today's blog post we will be looking at "GeM Pooling" and also a brief introduction to the Image Retrieval. amaarora.github.io/2020/08/3… We also look at PyTorch implementation and run a small experiment as usual. Jupyter nb: nbviewer.jupyter.org/github/… 1/
1
7
36
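As a quick illustration of the pooling operation named in the tweet above, here is a tiny plain-Python sketch of generalized-mean (GeM) pooling over one channel's activations. It is not the code from the blog post; the function name, the default `p=3.0` and the clamping with `eps` are assumptions for the example. The useful intuition is that `p = 1` gives average pooling, while a large `p` approaches max pooling.

```python
def gem_pool(values, p=3.0, eps=1e-6):
    """Generalized-mean pooling: (mean(x^p))^(1/p) over non-negative activations."""
    clamped = [max(v, eps) for v in values]  # avoid zero or negative bases
    return (sum(v ** p for v in clamped) / len(clamped)) ** (1.0 / p)


activations = [1.0, 2.0, 3.0]  # assumed feature-map values for one channel
for p in (1.0, 3.0, 50.0):
    print(p, gem_pool(activations, p=p))
```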
After having spent every day for the past two years working and learning continuously, I have decided to take a little break - relax and replenish my energy. It is also my birthday in around 10 days' time. I promise to continue writing more blogs when I come back. :)
3
36
Something I just discovered today is that my blog just crossed the 100K unique users milestone! amaarora.github.io/ It has now been read in 76 countries with over 150K sessions within ~2 years of starting it. This makes me very happy and motivated to write more! :)
2
35
I learnt "much more" from the friends I made in the `DS 101` course than from the actual curriculum itself. Then, someone pointed me to @fastdotai by @jeremyphoward. Having trained an img classifier within the first 2 hours of starting out, I was hooked to DL for life. 3/
1
2
33
It's the end of week-1 of "Deep learning for coders : Part-1" course and I spent this week looking into the *DataBlocks API*. Here is a code-first introduction to the wonderful API using five different single label CV applications: amaarora.github.io/fastaiexp…
1
3
35
Just joined Twitch! Who are some deep learning folks I should follow? Recommendations, please! :)
8
1
33
Sometimes it takes 8 windows to follow @fastdotai source code! Thanks @jeremyphoward for introducing me to VIM and TMUX. My most comprehensive article on DataBlocks API coming out soon!! #fastai #datablocks #python #vim #tmux
1
2
32
Do you understand the perplexity metric? If not, that's okay. I didn't understand it completely either and asked GPT-4 for help. The results are mind blowing! 🤯 "Can you please explain the "perplexity" with example sequence of words and predictions from a large language model?" 1/
1
5
33
5,517
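The GPT-4 answer was shared as an image and is not reproduced here. As a small worked example of the metric itself: perplexity is the exponential of the average negative log-probability a model assigns to the correct next tokens. The sketch below is plain Python, and the probabilities are made up for illustration.

```python
import math


def perplexity(token_probs):
    """exp(mean negative log-probability of the true tokens)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities assigned to each correct next token in a sequence.
confident_model = [0.9, 0.8, 0.95, 0.85]
guessing_model = [0.25, 0.25, 0.25, 0.25]  # uniform guess among 4 options

print(perplexity(confident_model))
print(perplexity(guessing_model))
```

A model that guesses uniformly among four options has a perplexity of about 4, so the number can be read as the effective count of choices the model is hesitating between at each step.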