May 5th, 2023: Release of StarCoder & StarCoderBase.
nitter.app/BigCodeProject/s…
I just finished reading the 54-page accompanying pre-print -
arxiv.org/abs/2305.06161, & let me take you through all the finer details of dataset generation & curation, model training & evaluation below.
Big thanks to
@ServiceNow,
@BigCodeProject &
@huggingface for the open-source model, dataset & training recipe.
----------------------------------------------------
KEY FEATURES:
1. StarCoder is StarCoderBase further fine-tuned on 35B Python tokens!
2. StarCoderBase is a 15.5B parameter model with an 8K context length, trained on 1 trillion tokens from The Stack (
arxiv.org/abs/2211.15533).
3. The 1T tokens span 80+ programming languages, GitHub issues, Git commits & Jupyter Notebooks.
4. StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI `code-cushman-001` model.
5. Both StarCoderBase & StarCoder have an 8K context length and support Fill-in-the-Middle (
arxiv.org/abs/2207.14255) & fast inference through Multi-Query-Attention (
arxiv.org/abs/1911.02150); I will write about these two papers in follow-up Twitter threads.
6. Also released: the OpenRAIL-M license agreement, a new attribution tool integrated into the VSCode demo that helps users detect and locate model generations that may have been copied from the training set, & a significantly improved PII redaction pipeline, built by collecting a PII dataset of 12,000 files with 22,950 annotated entities.
----------------------------------------------------
DATA CURATION & CLEANING:
1. From the 358 programming languages in The Stack, 86 were chosen based on two filters:
- Languages with more than 500MB data
- Top-50 languages on GitHut (
githut.info/) or TIOBE Index for December 2022.
(full list in table-1 & table-2 attached as imgs)
2. Swift was not chosen in the final list of languages due to human error!
3. Data was visually inspected - eighteen community annotators evaluated 300 programming language extensions. Here's what the process looked like:
- 30,000 files were randomly selected & categorized by extension
- At most 1,000 files were kept per extension
- Annotators went through 50-100 files & confirmed whether the data appeared to be normal code.
4. Extra filters for data formats:
- HTML: a custom HTML filter that targets excessive HTML boilerplate and links.
- YAML: keep files with 50–5000 characters, an average line length smaller than 100, a maximum line length smaller than 1000, and more than 50% alphabetic characters.
- JSON: keep files with 50–5000 characters and more than 50% alphabetic characters, which removes around 70% of the files and 98% of the volume.
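To make the YAML & JSON rules concrete, here's a minimal Python sketch. The function names are mine, and I'm assuming the character bounds are inclusive and that the alphabetic fraction is measured over all characters in the file; the authors' actual implementation may differ.

```python
def alpha_fraction(text: str) -> float:
    """Fraction of characters in the file that are alphabetic."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)


def keep_json(text: str) -> bool:
    # 50-5000 characters and more than 50% alphabetic characters.
    return 50 <= len(text) <= 5000 and alpha_fraction(text) > 0.5


def keep_yaml(text: str) -> bool:
    lines = text.splitlines() or [""]
    avg_line_len = sum(len(line) for line in lines) / len(lines)
    max_line_len = max(len(line) for line in lines)
    return (
        50 <= len(text) <= 5000
        and avg_line_len < 100
        and max_line_len < 1000
        and alpha_fraction(text) > 0.5
    )
```

A JSON file that is mostly numbers fails the alphabetic check, which fits the thread's point that most of the JSON volume gets dropped.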
5. Jupyter Notebooks were transformed into two different datasets - Jupyter-scripts & Jupyter-structured.
- For Jupyter-scripts, Jupytext (
jupytext.readthedocs.io/) was used to convert notebooks to scripts. Some notebooks were missing metadata about their programming language; in that case, Guesslang (
guesslang.readthedocs.io/) was used to identify the programming language automatically.
- For Jupyter-structured, notebooks without Python code or Markdown text were filtered out. Only notebooks explicitly marked as ‘Python’ in the metadata were kept, and consecutive Markdown blocks or code blocks were merged into a single large Markdown or code block respectively. This left a total of 1M structured Jupyter Notebooks after preprocessing.
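The cell-merging step for Jupyter-structured could look roughly like the sketch below. The `(cell_type, text)` representation and the newline joiner are my assumptions for illustration, not details from the paper.

```python
from itertools import groupby


def merge_consecutive_cells(cells):
    """Merge runs of same-type cells.

    cells: list of (cell_type, text) pairs, with cell_type being
    "markdown" or "code" (a hypothetical, simplified notebook format).
    """
    merged = []
    for cell_type, run in groupby(cells, key=lambda cell: cell[0]):
        # Join every cell in the run into one large block (joiner is assumed).
        merged.append((cell_type, "\n".join(text for _, text in run)))
    return merged
```

After merging, the notebook alternates strictly between Markdown and code blocks.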
6. For GitHub Issues, conversations from PRs & Issues were collected as part of The Stack. These were then filtered as below:
- Remove auto-generated text added when users replied to issues via email (see the regex in Listing A.1, img attached) - this removed 18% of the volume.
- Exclude comments from bots, found by searching for keywords in the username of the comment's author.
- Keep conversations with two or more users, or single-user conversations whose total comment text is < 7,000 characters.
- Use `fasttext` (
fasttext.cc/docs/en/language…) to filter out non-English issues.
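The "two or more users, or short single-user" rule from the issue filters above is easy to sketch. The `(username, text)` input shape and the function name are hypothetical, and dropping empty conversations is my own assumption.

```python
def keep_conversation(comments, max_single_user_chars=7000):
    """Decide whether to keep one issue/PR conversation.

    comments: list of (username, text) pairs (hypothetical format).
    """
    if not comments:
        return False  # assumption: nothing to keep in an empty conversation
    users = {username for username, _ in comments}
    if len(users) >= 2:
        return True
    # Single-user conversation: keep only if the total text is short.
    total_chars = sum(len(text) for _, text in comments)
    return total_chars < max_single_user_chars
```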
7. For Git Commits, data was collected from BigQuery, and repos from users that opted out of The Stack were removed. A 50% sample was kept and the following filters were applied:
- Remove code files with >100k chars;
- Remove commits with empty commit subject;
- Subsample changes with ≤ 2 lines with 50% probability;
- Subsample changes spanning ≥ 200 lines with 10% probability;
- Remove commits with whitespace-separated words-to-character ratio >20;
- Subsample data formats (JSON, YAML, XML, HTML) with 50% probability.
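Here's a rough Python sketch of the commit filters above. I'm assuming "subsample with X% probability" means "keep with probability X", that the 100k limit applies to the file both before and after the commit, and the dict keys and extension list are hypothetical. I left out the words-to-character ratio filter because its exact definition isn't clear to me from the thread.

```python
import random

# Assumed mapping of "data formats" to file extensions.
DATA_FORMAT_EXTENSIONS = {".json", ".yaml", ".yml", ".xml", ".html"}


def keep_commit(commit, rng):
    """commit: dict with hypothetical keys old_code, new_code, subject,
    lines_changed and extension. rng: a random.Random instance."""
    if len(commit["old_code"]) > 100_000 or len(commit["new_code"]) > 100_000:
        return False
    if not commit["subject"].strip():
        return False
    # Tiny changes: keep with 50% probability.
    if commit["lines_changed"] <= 2 and rng.random() >= 0.5:
        return False
    # Very large changes: keep with 10% probability.
    if commit["lines_changed"] >= 200 and rng.random() >= 0.1:
        return False
    # Data formats: keep with 50% probability.
    if commit["extension"] in DATA_FORMAT_EXTENSIONS and rng.random() >= 0.5:
        return False
    return True
```

Passing in a seeded `random.Random` keeps the subsampling reproducible across runs.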
8. For deduplication, the same approach as in
arxiv.org/abs/2301.03988 was used.
- Calculate MinHashes of all source code files, followed by Locality-Sensitive Hashing (LSH) to map similar code files to the same bucket.
* I am not sure about how this de-duplication part works, will have to further read about LSH & MinHashes.
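For anyone else fuzzy on this step, here's a toy sketch of the general MinHash + LSH idea (not the pipeline the authors ran). Each file becomes a set of token shingles. A MinHash signature keeps, for each of several hash functions, the smallest hash over that set, so the fraction of matching signature positions estimates the Jaccard similarity of two files. LSH then splits the signature into bands, and files that agree on every row of at least one band land in the same bucket and become duplicate candidates. The shingle size, number of hashes and number of bands below are illustrative assumptions.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64  # signature length (assumed)
BANDS = 16       # 16 bands x 4 rows each (assumed)


def shingles(text, n=5):
    """Set of n-token shingles for a file."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(seed, item):
    # One seeded 64-bit hash per signature position.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items):
    return [min(_hash(seed, item) for item in items) for seed in range(NUM_HASHES)]


def candidate_pairs(signatures):
    """signatures: dict of doc_id -> signature. Returns pairs sharing a bucket."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        ordered = sorted(doc_ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs
```

More bands with fewer rows each catch lower-similarity pairs at the cost of more false candidates, which is why candidates are usually re-checked before anything is removed.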
9. Regarding the weighting of data sources, the authors decided not to up-sample or down-sample certain programming languages. Why? Because after deduplication, several high-resource programming languages, such as C, C++, C#, Java, JavaScript, Python, and PHP, had a similar amount of data, ranging from 44–87 GB.
----------------------------------------------------
PII REDACTION
Even though Personally Identifiable Information (PII) redaction is part of the Data Curation section above, I share it separately in this tweet as it's quite interesting.
It consists of three parts:
1. Data Collection (identifying PII entities such as names, usernames, emails, IP addresses, passwords, etc.): the collected dataset comprises 12,000 files, each containing approximately 50 lines of code, in 31 programming languages. The annotators detected a total of 22,950 PII entities in the dataset.
2. An encoder-only model called StarEncoder was pre-trained with the MLM (Masked Language Modelling) & NSP (Next Sentence Prediction) objectives - objectives from BERT! Pre-training takes ~2 days on 64 A100 GPUs for 400B tokens, so it runs on the large code corpus rather than on the small annotated set from step 1.
3. Finetune StarEncoder for the NER (named entity recognition) task on the annotated data from step 1, with 6 target classes: names, emails, keys, passwords, IP addresses, and usernames.
The fine-tuned model achieves F1 scores of more than 90% on names, emails, and IP addresses and 73.39% on passwords. Performance is comparatively low on keys and usernames, with F1 scores of only 56.66% and 59.39%, respectively.
Comparison against the regex baseline: the PII detection model surpassed the regex approach on all three entities supported by regex - Email, IP address & Key.
All PII entities were replaced with the following tokens: <NAME>, <EMAIL>, <KEY>, <PASSWORD>
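Once the NER model has tagged entity spans, the replacement itself is simple. Here's a minimal sketch, assuming the model output is a list of non-overlapping `(start, end, label)` character spans; that format and the function name are hypothetical.

```python
PII_TOKENS = {
    "NAME": "<NAME>",
    "EMAIL": "<EMAIL>",
    "KEY": "<KEY>",
    "PASSWORD": "<PASSWORD>",
}


def redact(text, entities):
    """Replace tagged spans with their sentinel tokens.

    entities: list of (start, end, label) character spans, assumed
    non-overlapping. Labels without a token are left untouched.
    """
    # Work from the end of the file so earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        if label in PII_TOKENS:
            text = text[:start] + PII_TOKENS[label] + text[end:]
    return text
```

Replacing from the last span backwards avoids having to recompute offsets after each substitution.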
----------------------------------------------------
MODEL TRAINING
StarCoderBase is the first of the two models, trained on 1 trillion tokens sourced from the curated dataset described above.
StarCoder is the fine-tuned version of StarCoderBase, trained on another 35B Python tokens (roughly 2 epochs).
1. Data was formatted with special tokens prior to training.
- For code, the authors prepended the repository name, file name & # of stars to the code.
<reponame>REPONAME<filename>FILENAME<gh_stars>STARS\nCode<eos>
- For Issues, special tokens used to separate comments.
<issue_start>title + USERID: comment<issue_comment>USERID: Comment ... <issue_closed (optional)> <eos>
- Jupyter scripts were formatted in the same manner as code.
- For Git Commits, the code before the commit, the commit message, and the code after the commit were separated with tokens.
<commit_before>code<commit_msg>text<commit_after>code<eos>
2. Tokenizer: used the Hugging Face Tokenizers library to train a byte-level Byte-Pair-Encoding (BPE) tokenizer with a vocabulary size of 49,152 tokens, including the sentinel tokens.
3. Model Architecture: trained a 15.5B parameter model with the same architecture as SantaCoder. It is a decoder-only Transformer with Fill-in-the-Middle, Multi-Query-Attention & learned absolute positional embeddings.
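The sentinel-token templates from the data formatting step (item 1 above) can be written as plain string formatting. This is only a sketch of the code and commit templates as given in this thread; the function names are mine, and the real pipeline may handle the metadata differently.

```python
def format_code(repo_name, file_name, stars, code):
    """Code file: repo name, file name and star count, then the code."""
    return (
        f"<reponame>{repo_name}<filename>{file_name}"
        f"<gh_stars>{stars}\n{code}<eos>"
    )


def format_commit(code_before, message, code_after):
    """Git commit: code before, commit message, code after."""
    return (
        f"<commit_before>{code_before}<commit_msg>{message}"
        f"<commit_after>{code_after}<eos>"
    )
```

For example, `format_commit("x = 1", "Bump x", "x = 2")` gives `<commit_before>x = 1<commit_msg>Bump x<commit_after>x = 2<eos>`.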