BigCode (@BigCodeProject) | nitter

Pinned Tweet

BigCode @BigCodeProject

28 Feb 2024

Introducing: StarCoder2 and The Stack v2 ⭐️ StarCoder2 is trained with a 16k token context and repo-level information for 4T+ tokens. All built on The Stack v2 - the largest code dataset with 900B+ tokens. All code, data and models are fully open! hf.co/bigcode/starcoder2-15b

12

202

661

222,775

BigCode @BigCodeProject

4 May 2023

Introducing: 💫StarCoder StarCoder is a 15B LLM for code with 8k context and trained only on permissive data in 80+ programming languages. It can be prompted to reach 40% pass@1 on HumanEval and act as a Tech Assistant. Try it here: shorturl.at/cYZ06r Release thread🧵

69

629

2,589

882,232

BigCode @BigCodeProject

27 Oct 2022

Introducing 📑 The Stack - a 3TB dataset of permissively licensed code in 30 programming languages. hf.co/datasets/bigcode/the-s… You want your code excluded from the model training? There is an opt-out form and data governance plan: bigcode-project.org/docs/abo… Let's take a tour🧵

8

219

1,058

BigCode @BigCodeProject

22 Dec 2022

Announcing a holiday gift: 🎅SantaCoder - a 1.1B multilingual LM for code that outperforms much larger open-source models on both left-to-right generation and infilling! Demo: hf.co/spaces/bigcode/santa-d… Paper: hf.co/datasets/bigcode/admin… Attribution: hf.co/spaces/bigcode/santaco… A🧵:

9

196

824

264,104

BigCode @BigCodeProject

8 Jun 2023

📣 Introducing ⭐ StarCoder+ & StarChat Beta! We trained StarCoder on the Falcon model's English web dataset and Instruction-tuned it. Both models rank high in the LLM leaderboard, with strong natural language performance and coding capabilities. huggingface.co/HuggingFaceH4…

8

101

358

80,248

BigCode @BigCodeProject

5 Apr 2023

We started training something big and the daily training updates have degenerated to weather reports 🌦:

5

28

296

66,705

BigCode @BigCodeProject

29 Apr 2024

Releasing StarCoder2 Instruct! 🚀 Achieves 72% HumanEval score using only self-generated content without any GPT-3.5/4 data. This work demonstrates that self-instruct already works well at the 15B scale without data from proprietary models! Read more: huggingface.co/blog/sc2-inst…

4

72

284

38,690

BigCode @BigCodeProject

4 May 2023

Today we release two open-access models! StarCoderBase: trained on 1T tokens in 80+ programming languages huggingface.co/bigcode/starc… StarCoder: additionally trained on 35B Python tokens that can be prompted to reach 40.8% pass@1 huggingface.co/bigcode/starc…

8

57

270

86,162

BigCode @BigCodeProject

27 Jul 2023

🌌 News from the StarCoder cosmos! We trained smaller versions of StarCoder: 1B, 3B and 7B models. 1T tokens, 80+ programming languages with 8k context window, MQA & FIM.

3

71

242

77,702

BigCode @BigCodeProject

26 Sep 2022

print("Hello world! 🎉") Excited to announce the BigCode project led by @ServiceNowRSRCH and @huggingface! In the spirit of BigScience we aim to develop large language models for code in an open and responsible way. Join here: bigcode-project.org/docs/abo… A thread with our goals🧵

5

70

208

BigCode @BigCodeProject

18 Apr 2023

Day 18: Weather is clear and the loss is still going down ...

4

8

204

37,154

BigCode @BigCodeProject

22 May 2023

Introducing the BigCode Evaluation Harness for Code LLMs: github.com/bigcode-project/b… Inspired by the lm-evaluation-harness from @AiEleuther, it ensures ease-of-use, reproducibility and efficiency. Let’s explore its key features 🧵:

2

40

166

31,846

BigCode @BigCodeProject

1 Dec 2022

Today we are releasing The Stack v1.1! 🚀 We added more data, included more programming languages, and extended the list of permissive licenses used. huggingface.co/datasets/bigc… Also the first batch of opt-out requests was removed from the dataset.

2

42

159

BigCode @BigCodeProject

4 May 2023

We present the most extensive evaluation of code LLMs to date in the full tech report with 68 (!) authors. You can also read up on all the details from data preprocessing and governance to training at scale! drive.google.com/file/d/1cN-…

2

24

150

34,706

BigCode @BigCodeProject

16 May 2023

How can you near-deduplicate 1.4 TB of data in under 4 hours for $60? The secret ingredient of StarCoder's performance is data curation more than anything else. Besides manual inspection we did extensive deduplication. Great tutorial by @MouChenghao: hf.co/blog/dedup

Large-scale Near-deduplication Behind BigCode

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

1

32

143

24,479
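The tweet above does not say how the near-deduplication works. A common technique for this kind of task is MinHash over token shingles, and the following is a minimal toy sketch under that assumption. The function names, the shingle size and the number of hash functions are illustrative choices, not taken from the BigCode pipeline or the linked tutorial.

```python
import hashlib
import re


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-token shingles of a document (k is an illustrative choice)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _seeded_hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item`; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """Keep the minimum hash value per seeded hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity, for comparison with the estimate."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Files whose estimated similarity exceeds some threshold would be treated as near-duplicates. At the scale mentioned in the tweet, comparing every pair is not feasible, so a real pipeline would also need an indexing step (for example locality-sensitive hashing) that this sketch leaves out.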

BigCode @BigCodeProject

4 May 2023

In addition to chatting with StarCoder, it can also help you code in the new VSCode plugin. By pressing CTRL+ESC you can also check if the current code was in the pretraining dataset! marketplace.visualstudio.com…

9

15

142

30,375

BigCode @BigCodeProject

6 Jun 2023

👀 A glimpse of our latest mystery model's performance. Not just acing the coding tasks, but also mastering natural language! Intrigued yet? Join us at our StarCoder webinar this Thursday to find out: servicenow.zoom.us/j/9910373…

1

19

100

26,154

BigCode @BigCodeProject

4 May 2023

StarCoder was also trained on Jupyter notebooks, and with the Jupyter plugin from @JiaLi52524397 it can make use of previous code and markdown cells as well as outputs to predict the next cell. You can install it here or search for it on the Chrome store: github.com/bigcode-project/j…

1

9

88

18,949

BigCode @BigCodeProject

4 May 2023

For example the folks at @refact_ai are working on a shiny VSCode extension that can now make use of StarCoder to autocomplete or refactor code as well as writing code from an instruction! refact.ai/blog/2023/self-hos…

1

10

87

13,166

BigCode @BigCodeProject

6 Nov 2023

Exciting times: we are working on the next generation of StarCoder trained on a new dataset! 🚀 If you would like to have your code excluded from the training run you can check if your data is in the dataset and follow the link to opt-out: huggingface.co/spaces/bigcod…

Am I in The Stack? - a Hugging Face Space by bigcode

This tool checks if your GitHub repositories are included in The Stack dataset. Enter your GitHub username and select the dataset version to see if your code is part of it. If found, you can follow...

2

24

91

13,639

BigCode @BigCodeProject

29 Nov 2022

Between now and Christmas🎄 we are running a series of experiments to figure out what the best pre-processing is for code datasets such as The Stack. We'll share the W&B dashboards of these 🎅-models so if you are interested you can follow along!

1

14

78

BigCode @BigCodeProject

20 Mar 2023

Exciting new release: The Stack v1.2 with many new features! - The Stack Issues - The Stack Metadata - The Stack Commits (coming soon!) Along with a new and simplified opt-out request and an even better near-deduplicated version of the code dataset. hf.co/bigcode 🧵👇

1

20

69

18,597

BigCode @BigCodeProject

10 May 2023

BigCode was organized around the value of openness; open sharing of datasets and models, and also transparency of the project organization, motivations, and decisions! We're making all this information available in our new Governance Card 📚 1/4 🧵 hf.co/datasets/bigcode/gover…

1

11

68

19,079

BigCode @BigCodeProject

4 May 2023

With @TolokaAI we recruited 1,399 crowd-workers across 35 countries to annotate a diverse dataset for PII in code. Our PII detection model surpasses regex-based tools, especially for secret keys. PII dataset and model are available via gated access. hf.co/bigcode/starpii

bigcode/starpii · Hugging Face

2

6

64

12,559

BigCode @BigCodeProject

15 Nov 2022

Is your code in 📑 The Stack? Check if your repositories are in the dataset and whether large language models for code will learn from them! hf.co/spaces/bigcode/in-the-… You don't want your code to be part of The Stack? Follow the opt-out instructions and we'll remove it!

4

23

60

BigCode @BigCodeProject

4 May 2023

We are excited to see what people are gonna build with StarCoder. Get started with code examples in this repo to fine-tune and run inference on StarCoder: github.com/bigcode-project/s… You can find all models/datasets/demos at hf.co/bigcode

GitHub - bigcode-project/starcoder: Home of StarCoder: fine-tuning & inference!

Home of StarCoder: fine-tuning & inference! Contribute to bigcode-project/starcoder development by creating an account on GitHub.

2

7

55

11,514

BigCode @BigCodeProject

4 May 2023

We release StarCoder under an OpenRAIL license agreement. This OpenRAIL: (i) makes it more viable for companies to use and share the model; and (ii) promotes the sharing of AI documentation along the value chain. huggingface.co/spaces/bigcod…

BigCode Model License Agreement - a Hugging Face Space by bigcode

This application allows users to upload and display a PDF file directly in a web browser. Users need to provide a PDF file, and the app will show it embedded within the page.

1

5

55

15,336

BigCode @BigCodeProject

28 Feb 2024

StarCoder2 performs well on a wide range of coding and math tasks. The 15B model is best in its class, while the 3B model is at the performance of StarCoder1-15B. This makes StarCoder2 models more efficient and performant! Read the full report: hf.co/bigcode/report

2

3

51

4,592

BigCode @BigCodeProject

2 Jun 2023

Next week we'll host an online session about how StarCoder was built and showcase interesting demos that people have built since! Date: Thursday June 8th, 6-7:30pm CEST (9-10:30am PST) Link: servicenow.zoom.us/j/9910373… If you have an interesting demo to present please reach out!

3

14

41

7,998

BigCode @BigCodeProject

4 May 2023

You can find all the links here: huggingface.co/bigcode

bigcode (BigCode)

Org profile for BigCode on Hugging Face, the AI community building the future.

3

34

21,702

BigCode @BigCodeProject

28 Feb 2024

This is the result of hard work by the BigCode community and supported by @ServiceNowRSRCH, @huggingface and @nvidia to train the 3B, 7B and 15B models! The Stack v2 was built with @SoftwareHeritage and the full processed training data is coming soon! huggingface.co/datasets/bigc…

4

35

4,901

BigCode @BigCodeProject

29 Jun 2023

You can watch the full online session on how StarCoder was built and where BigCode might head next here: piped.video/watch?v=sQFWE__J…

9

31

4,543

BigCode @BigCodeProject

27 Jul 2023

📊 We are also releasing a leaderboard showcasing the performance of base multilingual code models. We additionally compare throughputs so you don't have to trade performance for efficiency: huggingface.co/spaces/bigcod…

1

4

33

12,216

BigCode @BigCodeProject

27 Jul 2023

💫 Each model demonstrates the strongest performance for its size across various programming languages. 7B StarCoderBase reaches 28.37% pass@1 on HumanEval. 7B: huggingface.co/bigcode/starc… 3B: huggingface.co/bigcode/starc… 1B: huggingface.co/bigcode/starc…

1

4

33

11,399

BigCode @BigCodeProject

27 Jul 2023

@nomic_ai team already added support for StarCoderBase-3B in their GPT4ALL local models. Download the model at: gpt4all.io/models/starcoderb… & follow the docs: docs.gpt4all.io/gpt4all_pyth… Stay tuned for the 7B model integration!

2

6

33

18,535

BigCode @BigCodeProject

4 May 2023

BigCode @BigCodeProject

4 May 2023

You can find all the links here: huggingface.co/bigcode

2

1

23

12,888

BigCode @BigCodeProject

22 Dec 2022

SantaCoder is trained on Python, Java, and JavaScript and outperforms other large multilingual models such as InCoder (6.7B) or CodeGen-multi (2.7B) considerably! A lot of pieces from a lot of collaborators came together to get to that result:

1

3

26

5,552

BigCode @BigCodeProject

25 Jan 2023

Today BigCode crossed 500 participants! It's been a great collaboration so far and looking forward to the next phase!🚀

5

26

4,280

BigCode @BigCodeProject

8 Jun 2023

Reminder that the online session is starting in 90min and we have an exciting model we'll release as well! Link: servicenow.zoom.us/j/9910373…

BigCode @BigCodeProject

6 Jun 2023

👀 A glimpse of our latest mystery model's performance. Not just acing the coding tasks, but also mastering natural language! Intrigued yet? Join us at our StarCoder webinar this Thursday to find out: servicenow.zoom.us/j/9910373…

2

7

26

10,294

BigCode @BigCodeProject

29 Apr 2024

We release all the code, datasets, and models with a permissive license: 🤖Model: huggingface.co/bigcode/starc… ⚙️Code: github.com/bigcode-project/s… 📚Dataset: huggingface.co/datasets/bigc…

bigcode/starcoder2-15b-instruct-v0.1 · Hugging Face

2

26

1,463

BigCode @BigCodeProject

27 Oct 2022

For easy experimentation we set up a small subset with 10k files per programming language here: hf.co/datasets/bigcode/the-s…

2

25

BigCode @BigCodeProject

12 Dec 2022

BigCode aims to give developers agency on how their open-source code is used, starting with a dataset opt-out mechanism: bigcode-project.org/docs/abo… We'll train the first larger models early next year so use it before January! #GitHub #Community #OpenSource #LLM #OptOut

Datasets

bigcode-project.org

1

6

23

BigCode @BigCodeProject

22 Dec 2022

Finally, we summarized our findings in a technical report with a wonderful group of collaborators: Paper: hf.co/datasets/bigcode/admin… So what's next? Scaling to larger models and training more languages early next year! 🚀

1

1

23

3,096

BigCode @BigCodeProject

21 Mar 2023

Join us tomorrow, Wednesday 22nd (6:30 PM - 8:00PM CET) at the @mozillafestival Science Fair to learn more about our work in the open and responsible development of large language models (LLMs) for code. schedule.mozillafestival.org… #Mozfest

5

23

4,124

BigCode @BigCodeProject

27 Oct 2022

Real fun facts: The dataset contains 1.5M files that include "hello world". It is also 150x larger than English Wikipedia and if printed (double sided) it would reach almost 11x the height of Mount Everest.

1

3

23

BigCode @BigCodeProject

22 Dec 2022

We release all models and intermediate checkpoints on the Hugging Face Hub; you can load them via the revision: huggingface.co/bigcode/santa… The compute for these experiments was sponsored by @ServiceNowRSRCH's research cluster.

1

19

3,667

BigCode @BigCodeProject

2 Dec 2022

It was a blast to meet some of the BigCode community last night in person. Thank you to Juan’s Flying Burrito and team for helping to make the event a success.

2

20

BigCode @BigCodeProject

27 Oct 2022

Usage: There are several ways to use the dataset. You can download it with git: git lfs install git clone huggingface.co/datasets/bigc… Or you can use the 🤗Datasets library to load the dataset. With streaming this requires no disk space at all!

1

20

BigCode @BigCodeProject

27 Oct 2022

Dataset collection: With gharchive.org over 220M repos were identified and 137M successfully cloned with over 50B files and 90TB of data. Filtered by extension and permissive licenses this yields 3TB of data. We also make a near-deduplicated version (1.5TB) available.

2

2

18

BigCode @BigCodeProject

22 Dec 2022

In Multi Head Attention every head has a set of queries, keys, and values. In MQA, the queries are unique while keys and values are shared. This saves memory and speeds up inference for large batches. We found it had only a minor impact on performance:

1

1

18

2,753
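To make the memory saving described in the tweet above concrete, here is a small back-of-the-envelope sketch of key/value cache size during inference. It assumes a standard decoder where each layer caches one key and one value vector per token per key/value head. The configuration numbers are hypothetical and do not correspond to any published BigCode model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical model shape used only for illustration."""
    n_layers: int
    n_heads: int
    head_dim: int
    bytes_per_value: int = 2  # e.g. 16-bit floats


def kv_cache_bytes(cfg: AttentionConfig, batch_size: int, seq_len: int,
                   multi_query: bool) -> int:
    """Size of the key/value cache for one batch of sequences.

    Multi-head attention stores keys and values for every head;
    multi-query attention stores a single shared key/value head.
    """
    kv_heads = 1 if multi_query else cfg.n_heads
    # Factor 2: one tensor for keys and one for values.
    per_token = 2 * cfg.n_layers * kv_heads * cfg.head_dim * cfg.bytes_per_value
    return per_token * batch_size * seq_len


def mqa_saving_factor(cfg: AttentionConfig) -> int:
    """How many times smaller the cache is with MQA under these assumptions."""
    return cfg.n_heads
```

Under these assumptions the cache shrinks by a factor equal to the number of heads, which is why the benefit grows with batch size: the cache scales with `batch_size * seq_len`, while the model weights do not.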

BigCode @BigCodeProject

27 Oct 2022

You can find much more information on the dataset and the models in the paper that is available: drive.google.com/file/d/17J-…

BigCode @BigCodeProject

27 Oct 2022

Introducing 📑 The Stack - a 3TB dataset of permissively licensed code in 30 programming languages. hf.co/datasets/bigcode/the-s… You want your code excluded from the model training? There is an opt-out form and data governance plan: bigcode-project.org/docs/abo… Let's take a tour🧵

2

2

18

BigCode @BigCodeProject

1 Feb 2023

Are you a front-end developer or just excited about building great demos and integrations? We are working on some cool large language models for code and could use your help! It's also a great opportunity to see how these models are built! Join here: bigcode-project.org/docs/abo…

How to join?

bigcode-project.org

2

11

18

5,142

BigCode @BigCodeProject

27 Jul 2023

✨ For more information on fine-tuning and deploying these models check the StarCoder and starcoder.cpp GitHub repos: github.com/bigcode-project/s… github.com/bigcode-project/s…

1

19

1,544

BigCode @BigCodeProject

27 Oct 2022

Evaluation: to see if the dataset is useful to train generative models for code we trained a 350M parameter GPT-2 model on several datasets: training without license filtering performs best. Interestingly, removing near duplicates improves the performance on all datasets!

2

1

17

BigCode @BigCodeProject

9 May 2023

You can now also try StarCoder in Vim!

Luc Georges

@LucSGeorges

9 May 2023

I hacked a code completion plugin for neovim this weekend : github.com/huggingface/hfcc.… Feedback much appreciated!

16

2,071

BigCode @BigCodeProject

29 Nov 2022

If you are at NeurIPS and would like to meet with people from BigCode: we are organizing a BigCode social event! Link: eventbrite.ca/e/bigcode-comm… Password: bigcode2022 Heads-up: we only have space for 75 people, so you might end up on the waitlist. Hope to see many of you there!

7

16

BigCode @BigCodeProject

27 Oct 2022

The dataset includes ~30 programming languages covering common languages such as Java, C/C++ and Python as well as lower resource languages (2GB of Dockerfiles 🐳). If you'd like to see a new language added, feel free to add it in this issue: github.com/orgs/bigcode-proj…

1

15

BigCode @BigCodeProject

22 Dec 2022

The SantaCoder models are licensed under an open & responsible AI license (OpenRAIL). These are AI-specific licenses enabling free use and distribution of the model while setting specific use restrictions (e.g. malware generation). cc @ResponsibleAIL FAQ: bigcode-project.org/docs/pag…

2

1

16

6,579

BigCode @BigCodeProject

27 Oct 2022

We think of the dataset as a living thing and plan to update it with more languages/licenses as well as exclude opt-out requests. If you want to help shape the dataset and the models built on top of it consider joining a vibrant community of 300+ members! bigcode-project.org/docs/abo…

1

1

17

BigCode @BigCodeProject

4 May 2023

Replying to @robinwittkamp

The links are here: huggingface.co/bigcode

2

3

17

5,486

BigCode @BigCodeProject

22 Dec 2022

The foundation to train SantaCoder is The Stack (v1.1) dataset. Given the relatively small size of our model (1B parameters) we chose three popular programming languages: Python, Java, and JavaScript. You can check if your code was used for training here: huggingface.co/spaces/bigcod…

1

2

16

3,384

BigCode @BigCodeProject

8 Jun 2023

📔 Resources: StarCoderPlus: huggingface.co/bigcode/starc… StarChat Beta: huggingface.co/HuggingFaceH4… StarChat demo: huggingface.co/spaces/Huggin… StarCoderPlus demo: huggingface.co/spaces/bigcod…

1

16

2,181

BigCode @BigCodeProject

4 May 2023

Twitter seems to block Hugging Face Chat. So to try the model go to huggingface.co slash chat and select StarCoder.

1

4

16

5,766

BigCode @BigCodeProject

8 Jun 2023

The result: ⭐ StarCoder+ a powerful English Language Model with strong coding abilities. It outperforms all LLaMa models and PaLM-540B on HumanEval and stands out in the LLM leaderboard for < 30B models with a 45.1 MMLU score! huggingface.co/spaces/Huggin…

2

3

15

1,943

BigCode @BigCodeProject

22 Dec 2022

In addition to the standard near-deduplication and heuristics pipeline, we ran 4 filtering experiments: GitHub stars, tokenizer fertility, comment-to-code ratio and more near-deduplication. Filtering for GitHub stars hurts performance while comments and near-dedup help!

1

14

3,395
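As an illustration of the comment-to-code ratio filter mentioned in the tweet above, the following sketch measures the share of non-whitespace characters that sit inside `#` comments of a Python file. The exact definition and thresholds used in the experiments are not given in the thread, so both the metric and the `low`/`high` bounds here are assumptions.

```python
import io
import tokenize


def comment_to_code_ratio(source: str) -> float:
    """Share of non-whitespace characters that belong to `#` comments.

    Uses the standard-library tokenizer so that `#` inside string
    literals is not counted. Docstrings are not treated as comments.
    """
    comment_chars = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_chars += len("".join(tok.string.split()))
    total_chars = len("".join(source.split()))
    return comment_chars / total_chars if total_chars else 0.0


def keep_file(source: str, low: float = 0.01, high: float = 0.8) -> bool:
    """Keep files whose ratio falls inside [low, high] (hypothetical bounds)."""
    return low <= comment_to_code_ratio(source) <= high
```

The idea behind such a filter is that files with no comments at all, or files that are almost entirely comments, may be less useful training examples than files that mix code with some explanation.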

BigCode @BigCodeProject

22 Dec 2022

Before training any models we looked into removing sensitive information from code such as email addresses, secret keys and IP addresses. For that purpose we annotated 400 samples and then built and continuously refined RegEx rules to remove the information before training.

1

13

2,667
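A rough sketch of what regex-based redaction of the kind described in the tweet above can look like for email addresses and IPv4 addresses. The patterns and the placeholder strings are illustrative assumptions, not the rules BigCode used, and secret keys are not covered here.

```python
import re

# Deliberately simple patterns; real rules would need tuning against annotated samples.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b")

# Hypothetical placeholder tokens.
EMAIL_PLACEHOLDER = "<EMAIL>"
IP_PLACEHOLDER = "<IP_ADDRESS>"


def redact(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholder tokens."""
    text = EMAIL_RE.sub(EMAIL_PLACEHOLDER, text)
    text = IPV4_RE.sub(IP_PLACEHOLDER, text)
    return text
```

Patterns like these produce false positives. For example, a four-part version string such as `1.2.3.4` is indistinguishable from an IP address, which is one reason such rules need to be checked against annotated samples and refined.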

BigCode @BigCodeProject

22 Dec 2022

Fill-in-the-Middle is a clever approach where a sequence is reordered such that prefix|middle|suffix becomes prefix|suffix|middle. With that you can use normal left-to-right generation to fill the middle part. While some claim FIM is for free we found it's rather FIM for cheap:

1

14

2,287
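The reordering described in the tweet above (prefix|middle|suffix becomes prefix|suffix|middle) can be sketched as follows. The sentinel strings are placeholders chosen for this example; the actual special tokens depend on the model's tokenizer. Splitting at random character positions is also an assumption.

```python
import random

# Placeholder sentinel strings; check the model's tokenizer for the real ones.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def split_document(text: str, rng: random.Random) -> tuple[str, str, str]:
    """Cut a document at two random character positions."""
    lo, hi = sorted(rng.randint(0, len(text)) for _ in range(2))
    return text[:lo], text[lo:hi], text[hi:]


def to_fim_training_example(prefix: str, middle: str, suffix: str) -> str:
    """Reorder so the middle comes last and can be learned left to right."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def infill_prompt(prefix: str, suffix: str) -> str:
    """Prompt for inference: the model continues after the middle sentinel."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
```

At inference time the editor supplies the code before and after the cursor as prefix and suffix, and whatever the model generates after the middle sentinel is inserted at the cursor.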

BigCode @BigCodeProject

27 Oct 2022

Governance: For over 80% of all repos a license couldn't be detected, so they were excluded. The rest were filtered for permissive licenses. Devs might not be comfortable with their code being used to train such models. We are working on an opt-out process: bigcode-project.org/docs/abo…

2

1

14

BigCode @BigCodeProject

8 Jun 2023

It can build HTML websites and much more... Give it a try 🚀

1

5

13

1,948

BigCode @BigCodeProject

26 Sep 2022

💪With that dataset we aim to train a state-of-the-art ~15B parameter language model for code using @ServiceNowRSRCH's in-house GPU cluster. With an adapted version of Megatron-LM, we’ll train the large model on the distributed infrastructure.

1

13

BigCode @BigCodeProject

22 Dec 2022

An important aspect of using these models is that they can copy code from the training data which requires attribution. To help users navigate this we built a search index of the pretraining data. hf.co/spaces/bigcode/santaco…

SantaCoder Search - a Hugging Face Space by bigcode

Discover amazing ML apps made by the community

1

1

12

7,242

BigCode @BigCodeProject

22 Dec 2022

In our experiments, we explored two questions: First, can we use Multi Query Attention (MQA) together with Fill-in-the-middle (FIM) without performance loss? Secondly, what's the best data filtering procedure for code models?

1

11

2,412

BigCode @BigCodeProject

9 Dec 2022

Are you at #EMNLP2022? We plan to host an informal BigCode networking event today at 3:00-4:30pm to bring the BigCode community together and to chat about LLMs for code! eventbrite.com/e/bigcode-net…

BigCode Networking Event

BigCode is an open scientific collaboration working on the responsible development of large language models for code.

12

BigCode @BigCodeProject

8 Jun 2023

We instruction-tuned StarCoder+ on the OpenAssistant Guanaco dataset to get StarChat-beta: a strong chat assistant Model: huggingface.co/HuggingFaceH4… Demo: huggingface.co/spaces/Huggin…

2

3

12

1,695

BigCode @BigCodeProject

8 Jun 2023

StarCoderBase showed promise in natural language reasoning despite being trained solely on GitHub code. So we fine-tuned it on the English web dataset used in Falcon pre-training: huggingface.co/bigcode/starc… huggingface.co/datasets/tiiu…

bigcode/starcoderplus · Hugging Face

1

1

12

1,808

BigCode @BigCodeProject

22 Dec 2022

The thorough evaluation was made possible by the evaluation working group. They evaluated on the MultiPL-E benchmark (multilingual HumanEval and MBPP) and CodeXGLUE and extended the code evaluation harness! github.com/bigcode-project/b…

1

10

1,900

BigCode @BigCodeProject

4 May 2023

Replying to @defenderbenx @_akhaliq

The links are here: huggingface.co/bigcode

3

10

9,519

BigCode @BigCodeProject

22 Dec 2022

With these new insights we trained a final model called SantaCoder. We applied both the extra near-deduplication and comment-to-code ratio filters and trained for 600K steps (236B tokens). The result is an efficient (MQA) and flexible (FIM) multilingual model:

1

1

10

1,891

BigCode @BigCodeProject

30 Sep 2022

The extended BigCode family on a boat in Amsterdam♥️

Emily Witko @witkoochocinco

29 Sep 2022

As promised… @huggingface boat photo!! Love working in Amsterdam 🤗

9

BigCode @BigCodeProject

22 Dec 2022

BigCode @BigCodeProject

22 Dec 2022

Announcing a holiday gift: 🎅SantaCoder - a 1.1B multilingual LM for code that outperforms much larger open-source models on both left-to-right generation and infilling! Demo: hf.co/spaces/bigcode/santa-d… Paper: hf.co/datasets/bigcode/admin… Attribution: hf.co/spaces/bigcode/santaco… A🧵:

1

9

2,849

BigCode @BigCodeProject

27 Oct 2022

"Fun" (actually quite stressful) fact: we planned to release the dataset earlier but discovered last minute that the HumanEval benchmark appeared in the training corpus. We decided to remove contaminated files from the near-deduped datasets and retrain the models (see above).

1

9

BigCode @BigCodeProject

27 Oct 2022

@luis_in_brief pointed out that LGPL, MPL, and EPL are actually not permissive but weak copyleft licenses (see blueoakcouncil.org). The files with these licenses make up less than 0.5% of the Python dataset and we are working on removing them from The Stack.

Blue Oak Council

free, practical materials about software licenses

blueoakcouncil.org

1

10

BigCode @BigCodeProject

4 May 2023

Replying to @CookingCodes

You can find all links here: huggingface.co/bigcode

8

3,464

BigCode @BigCodeProject

26 Sep 2022

🚀If you are excited about these topics join the project!

BigCode @BigCodeProject

26 Sep 2022

print("Hello world! 🎉") Excited to announce the BigCode project led by @ServiceNowRSRCH and @huggingface! In the spirit of BigScience we aim to develop large language models for code in an open and responsible way. Join here: bigcode-project.org/docs/abo… A thread with our goals🧵

1

1

7

BigCode @BigCodeProject

26 Sep 2022

🏎Academic research usually stops after evaluation, but this is where the work for practical applications starts. For use-cases such as auto-complete, inference speed is crucial. We are interested in making architectural changes as well as building tools for post-training optimization.

1

8

BigCode @BigCodeProject

27 Jul 2023

StarCoder models generate more tokens per second for a single batch, but things get more fun when we increase the batch size thanks to MQA 🚀 The sharing of values/keys among attention heads significantly reduces the memory bandwidth requirements.

2

7

1,257

BigCode @BigCodeProject

22 May 2023

Similarly to lm-evaluation-harness, we build a framework combining several benchmarks: - HumanEval, MBPP & APPS - MultiPL-E (HumanEval translated to 18 programming languages) - DS-1000 - PaL for GSM8K ...

1

7

711

BigCode @BigCodeProject

22 May 2023

Help us improve the evaluation harness, by adding more benchmarks and features ✨ Repo: github.com/bigcode-project/b…

GitHub - bigcode-project/bigcode-evaluation-harness: A framework for the evaluation of autoregres...

A framework for the evaluation of autoregressive code generation language models. - bigcode-project/bigcode-evaluation-harness

1

6

1,675

BigCode @BigCodeProject

26 Sep 2022

🤝We also want to follow, as well as establish, new responsible AI practices to train and share large language models. We welcome contributions from AI researchers and strive for openness and transparency in the LLM development process.

1

2

7

BigCode @BigCodeProject

1 Dec 2022

Code in the Stack will be used for model training. You can use 'Am I In The Stack?' to see if any of your code is included: huggingface.co/spaces/bigcod… If you would like to have your code removed from The Stack you can follow the instructions on the website: bigcode-project.org/docs/abo…

6

BigCode @BigCodeProject

8 Jun 2023

Back to the start:

BigCode @BigCodeProject

8 Jun 2023

📣 Introducing ⭐ StarCoder+ & StarChat Beta! We trained StarCoder on the Falcon model's English web dataset and Instruction-tuned it. Both models rank high in the LLM leaderboard, with strong natural language performance and coding capabilities. huggingface.co/HuggingFaceH4…

1

6

2,025

BigCode @BigCodeProject

6 Nov 2023

Make sure to opt out within the next two weeks: we set the cut-off date to November 20th.

6

1,006

BigCode @BigCodeProject

20 Mar 2023

With the new version of The Stack we remove all the opt-out requests we received until February. The code subset of The Stack is similar to v1.1. If you want to learn more about the code in The Stack check out this thread:

BigCode @BigCodeProject

1 Dec 2022

Today we are releasing The Stack v1.1! 🚀 We added more data, included more programming languages, and extended the list of permissive licenses used. huggingface.co/datasets/bigc… Also the first batch of opt-out requests was removed from the dataset.

1

5

497

BigCode @BigCodeProject

22 May 2023

🛡️ We provide Docker containers for improved security and reproducible execution. They are especially helpful for certain languages like Bash to avoid executing commands on your system. + Execution works out of the box for all supported programming languages.

1

6

1,188

BigCode @BigCodeProject

1 Feb 2023

Of course the demos will also be open source and not behind a closed API 🙂

5

1,053

BigCode @BigCodeProject

22 May 2023

A key difference from standard LLM evaluation is that code evaluation requires generating thousands of solutions and then executing them against unit tests. For this, we need: 🚄 - quick text generation 🛡️ - secure execution

1

4

605
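Once many solutions per problem have been generated and run against unit tests, the results are usually summarized as pass@k. The sketch below uses the unbiased estimator introduced alongside the HumanEval benchmark. The thread does not say which estimator the harness implements, so treat this as an illustration of the metric rather than of the harness itself.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of generated solutions
    c: number of those solutions that pass all unit tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    # Probability that k samples drawn without replacement all fail, inverted.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimate reduces to the fraction of generated solutions that pass, c / n. The benchmark score is the average over all problems.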

BigCode @BigCodeProject

15 Nov 2022

Learn more about the project and how to join it on the website: bigcode-project.org

Open and responsible development and use of LLMs for code

BigCode is an open scientific collaboration working on the responsible development and use of large language models for code

bigcode-project.org

3

BigCode @BigCodeProject

15 Nov 2022

"Am I in The Stack?" is an open governance interface for The Stack. You might also see forks of popular repos under your username. The released dataset does not include these cloned files as they are removed during the deduplication process.

2

3

BigCode @BigCodeProject

20 Mar 2023

Lots of exciting things to come in the next weeks - stay tuned!

3

894

BigCode @BigCodeProject

22 May 2023

Many code benchmarks have been released since HumanEval, but they are often scattered across different frameworks, making it hard to run them efficiently and reproduce results easily.

1

2

830