Daniel van Strien · Feb 19, 2024 · 8:57 AM UTC

@vanstriendaniel

BioMistral is a new 7B foundation model for medical domains, based on Mistral and further trained on PubMed Central. - Top open-source medical Large Language Model (LLM) in its weight class - Apache License - Includes base models, fine-tunes, and quantized versions.

ALT Screenshot of the model card on the Hub.

221

1,169

146,071

Daniel van Strien · Jan 30, 2025 · 9:07 AM UTC

dolphin-r1: a dataset for training R1-style models - 800k total samples dataset similar in composition to the data used to train DeepSeek-R1 Distill models. - 300k from DeepSeek-R1 - 300k from Gemini 2.0 flash thinking - 200k from Dolphin chat.

ALT Screenshot of the dataset card on the Hugging Face Hub

510

45,879
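
As a quick sanity check on the composition figures quoted in the dolphin-r1 post, the three sources can be summed and turned into shares. This is only a sketch: the counts are the rounded figures from the post, and the dictionary keys are illustrative labels rather than column or config names from the dataset.

```python
# Rounded sample counts as quoted in the post (not exact dataset sizes).
composition = {
    "DeepSeek-R1": 300_000,
    "Gemini 2.0 flash thinking": 300_000,
    "Dolphin chat": 200_000,
}

total = sum(composition.values())
# Share of each source in the overall mix.
shares = {source: count / total for source, count in composition.items()}

for source, share in shares.items():
    print(f"{source}: {share:.1%}")
print(f"total: {total:,}")
```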

Daniel van Strien · Oct 22, 2025 · 7:17 PM UTC

DeepSeek-OCR just got @vllm_project support 🚀 Currently processing @natlibscot's 27,915-page handbook collection with one command. Processing at ~350 images/sec on A100. Using @huggingface Jobs + @astral_sh uv - zero setup batch OCR! Will share final time + cost when done!

442

58,113

Daniel van Strien · Apr 7, 2025 · 8:48 AM UTC

OpenCodeReasoning: Competitive Coding Dataset - 735K Python solutions across 28K unique programming problems - Largest reasoning-based synthetic dataset for code generation - Generated by NVIDIA's R1 model with full reasoning steps - Sourced from 10 competitive coding platforms

373

45,201

Daniel van Strien · Aug 7, 2025 · 2:10 PM UTC

Another VLM-based OCR model! NuMarkdown-8B-Thinking is special: it reasons through documents before converting them to markdown. See it analyse this 1878 medical report step-by-step. Try with @huggingface Jobs + @astral uv - no GPU needed!

338

23,621

Daniel van Strien · Oct 28, 2025 · 6:26 PM UTC

NVIDIA Just Released 8M Sample Open Dataset + OCR Tooling on @huggingface - 3x larger than v1 (just 2 months ago!) - Image/video QA, reasoning, multilingual OCR - Commercial-ready (CC-BY-4.0) @NVIDIAAI is one of the few major AI labs releasing datasets 🤗

337

30,729

Daniel van Strien · Feb 5, 2024 · 10:31 AM UTC

"UltraTextbooks" on @huggingface is a comprehensive dataset of synthetic and human-written textbooks covering a range of subjects and programming languages. • 🎓 Over 5.5M examples • 💾 22.3 GB of educational content • 🌐 From math to science, and code

ALT Screenshot of the dataset card for the UltraTextBooks dataset on the Hugging Face Hub

309

42,095

Daniel van Strien · Apr 24, 2025 · 3:10 PM UTC

Need blazing-fast classifier inference with minimal code? ModernBERT now runs on @vllm_project — fast enough to process 200K+ arXiv papers in minutes. It makes running any of the 100s of ModernBERT models on the @huggingface Hub even quicker. Guide 👇 danielvanstrien.xyz/posts/20…

Efficient Inference for ModernBERT Classifiers Using vLLM

Modern Inference for modern classifier models. Using vLLM to scale inference for classifiers to clean and curate datasets

danielvanstrien.xyz

300

29,600

Daniel van Strien · Mar 26, 2025 · 9:49 AM UTC

AM-DeepSeek-R1-Distilled-1.4M: Massive reasoning dataset for LLM training - 1.4M high-quality reasoning problems with verified solutions - 900K entries distilled from DeepSeek-R1-671B - Covers math, code, and complex reasoning tasks - Bilingual (Chinese/English)

293

24,669

Daniel van Strien · Apr 1, 2025 · 7:55 AM UTC

RLVR: Breaking RL Beyond Math & Coding Boundaries - Extends reinforcement learning to 48 diverse academic fields - 638K expert-crafted QA pairs in medicine, law, economics & more - Proves RLVR effective for complex reasoning across all disciplines - Apache 2.0 licensed

250

29,374

Daniel van Strien · May 13, 2025 · 6:55 AM UTC

Ultra-FineWeb: A cleaner 1.1T-token foundation for better LLMs – 1T English + 120B Chinese, filtered for quality – +3.6 pts on MMLU, +3.7 on CMMLU vs FineWeb – Verification cut from 1200 GPUh → 110 – FastText classifier: 6,000 GPUh → 1,000 CPUh

243

44,297

Daniel van Strien · Jan 4, 2023 · 9:23 AM UTC

🤗 Ridiculously excited that today is my first day working at @huggingface! I genuinely believe Hugging Face has done so much to democratise machine learning and I'm looking forward to contributing to this mission every day!

226

43,830

Daniel van Strien · Jul 11, 2022 · 3:13 PM UTC

Excited to announce #BigLAM, a @BigscienceW x @huggingface Hackathon focused on making library, archive, and museum datasets related to machine learning more accessible and discoverable github.com/bigscience-worksh…

GitHub - bigscience-workshop/lam: Libraries, Archives and Museums (LAM)

Libraries, Archives and Museums (LAM). Contribute to bigscience-workshop/lam development by creating an account on GitHub.

github.com

213

Daniel van Strien · Aug 24, 2022 · 9:33 PM UTC

Want to get started using #machinelearning based computer vision methods for humanities research/#DH/#GLAM applications? 💡

Screenshot with the following text:

Computer Vision for the Humanities: An Introduction to Deep Learning for Image Classification (Part 1)

Daniel van Strien, Kaspar Beelen, Melvin Wevers, Thomas Smits and Katherine McDonough

This is the first of a two-part lesson introducing deep learning based computer vision methods for humanities research. Using a dataset of historical newspaper advertisements and the fastai Python library, the lesson walks through the pipeline of training a computer vision model to perform image classification.

166

Daniel van Strien · Sep 23, 2024 · 2:25 PM UTC

ColPali is revolutionizing multimodal retrieval, but could it be even more effective with domain-specific fine-tuning? Check out my latest blog post, where I guide you through creating a ColPali fine-tuning dataset using @Alibaba_Qwen's Qwen2-VL-7B-Instruct model to generate queries.

197

31,936

Daniel van Strien · Nov 8, 2023 · 11:47 AM UTC

Visualizing the embedding space for models from @huggingface Hub using @nomic_ai's Atlas tool and @JinaAI_ embeddings. 🧵 Check out the overall embedding map! Clear clusters emerge, along with intriguing outliers. Let's dive into this exploration!

ALT A screenshot of a 'map' showing various coloured clusters representing model cards on the Hugging Face Hub.

173

47,495

Daniel van Strien · Mar 3, 2025 · 9:52 AM UTC

KodCode: The Largest Verified Synthetic Coding Dataset - 447K question-solution-test triplets with verifiable correctness - 12 diverse subsets - 10-trial verification system for solution robustness - Each problem includes automated test case verification

167

21,250

Daniel van Strien · Nov 22, 2023 · 8:08 PM UTC

The Mistral-7B-v0.1 model by @MistralAI is the foundation for at least 178 models available on the @huggingface Hub. These fine-tuned models have collectively been downloaded over 300,000 times. Strong, openly shared base models genuinely impact open-source machine learning.

Screenshot of the following text:

Models fine-tuned from mistralai/Mistral-7B-v0.1

mistralai/Mistral-7B-v0.1 has 178 children

mistralai/Mistral-7B-v0.1's children have been downloaded 328,312 times

162

39,340

Daniel van Strien · Apr 3, 2025 · 8:05 AM UTC

MegaMath: Pushing the Limits of Open Math Corpora - 213.1B tokens across specialized math content domains - Largest math pretraining dataset, surpassing DeepSeekMath by 30% - Rigorous deduplication & fontset filtering for quality - Delivers 15-20% boost on math benchmarks

167

25,067

Daniel van Strien · Jan 31, 2024 · 10:22 AM UTC

Access over 32 Billion tokens of public domain historical newspaper data in 10 European languages. It's available on the @huggingface Hub, courtesy of biglam/europeana_newspapers. Documentation is a work in progress, but you can already explore this collection. ✨🗞️🌍📰🔎

154

24,411

Daniel van Strien · Nov 24, 2023 · 6:27 PM UTC

This is a work-in-progress Space where you can interact with the children of Mistral-7B-v0.1. If anyone has excellent network visualization skills, I would be thrilled to provide the data so that you can enhance its aesthetic appeal! huggingface.co/spaces/davans…

ALT A network diagram showing a bunch of nodes and links

147

38,787

Daniel van Strien · Jun 8, 2023 · 8:48 AM UTC

BERTopic now has @huggingface hub integration! We're excited to see which models you share on the Hub 🤗 🤔 Want some inspiration? Check out this notebook which shows you how to train a topic model on Transformers @github issues and shares it on the Hub colab.research.google.com/#f…

Google Colab

colab.research.google.com

143

26,753

Daniel van Strien · Dec 20, 2024 · 4:51 PM UTC

Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages. Inspired by FineWeb-Edu, the community is labelling the educational quality of texts for many languages. 318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

ALT Text saying FineWeb-c Educational content in many languages, labelled by the community

145

18,502

Daniel van Strien · Mar 12, 2025 · 10:55 AM UTC

DeepReviewer-13K: A dataset for training LLMs for academic paper review - 13,378 high-quality structured reviews - 33.24% accept rate - Multi-stage reasoning with novelty and reliability analysis - Available in 6 languages - Custom license preventing use in formal reviews

135

16,692

Daniel van Strien · Mar 20, 2025 · 9:17 AM UTC

Glaive Reasoning Dataset: Filling the Non-Technical Reasoning Gap - Addresses lack of large, general reasoning datasets - 177GB of reasoning traces beyond math/code - Creative writing & conversation scenarios - Apache-2.0 licensed for open research

136

17,376

Daniel van Strien · Mar 4, 2025 · 8:33 AM UTC

GeneralThought-195K: Diverse Reasoning Dataset - 195K reasoning traces from 7+ models - Expanded beyond math to sciences, humanities & conversations - Full reasoning traces with verification scores - MIT licensed with community contributions

137

19,680

Daniel van Strien · Feb 25, 2025 · 10:28 AM UTC

Big-Math: Massive Math Dataset for RL Training - 10x larger than GSM8k/MATH - 3 core properties: uniquely verifiable, open-ended, closed-form - Human-validated 90%+ precision filters - Difficulty metrics for curriculum learning

130

19,215

Daniel van Strien · Jan 31, 2025 · 9:05 AM UTC

WILDCHAT-50M: The largest open chat dataset - 125M+ chat transcripts - 1M+ conversations per model - Built on WildChat - Used to create the RE-WILD SFT mix. Outperforms existing benchmarks with 40% less data.

ALT Screenshot of the wildChat collection on Hugging Face

129

18,008

Daniel van Strien · Nov 4, 2025 · 10:43 AM UTC

Open models (GLM-4.6, Kimi K2, DeepSeek, etc.) + @opencode + @huggingface Inference Providers = automated GitHub code reviews. Tested on real repos. `/oc fix this` → bot creates PR. Works great, costs pennies. 5 minutes to set up! Guide: huggingface.co/docs/inferenc…

129

19,118

Daniel van Strien · Apr 8, 2025 · 7:48 AM UTC

Llama-Nemotron-Post-Training-Dataset-v1: Massive Model Training Dataset - 30M+ examples: 19.8M math, 9.6M code + science, instruction & safety data - Powers NVIDIA's Llama-3.3-Nemotron-Super-49B & 3.1-Nemotron-Nano-8B - Responses generated by 9 leading foundation models

122

4,721

Daniel van Strien · Jan 29, 2024 · 12:00 PM UTC

AutoMathText is a 200 GB dataset of mathematical texts, autonomously labelled for quality and relevance and available on @huggingface Hub 📚 Sourced from websites, arXiv, and GitHub 🤖 Scored by Qwen-72B Potentially very powerful for making open LLMs better at maths

ALT Screenshot of a dataset preview of the AutoMathText dataset on the Hugging Face Hub

117

14,106

Daniel van Strien · Nov 13, 2023 · 11:07 AM UTC

Recently, @ai4privacy released "the world's largest open-source privacy dataset" on the @huggingface Hub. Here are its features: 🏷️ 54 PII classes 💼 229 use cases 🌐 Multilingual: EN, FR, DE, IT

Screenshot of a Hugging Face dataset (https://huggingface.co/datasets/ai4privacy/pii-masking-200k). Shows some metadata and the first few rows of the dataset. The preview of the dataset shows columns related to masked and unmasked text and span labels.

115

38,003

Daniel van Strien · Feb 4, 2025 · 8:36 AM UTC

AceCode-89K: First automated test synthesis pipeline - 89K coding problems with reliable test cases - GPT-4o-mini generates 16 test cases per problem - Pass rates used as verifiable rewards - Enables RM training & RL for coding models

119

19,357
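
The idea of using pass rates as verifiable rewards, mentioned in the AceCode post, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the AceCode pipeline itself: `pass_rate_reward`, the toy "square an integer" problem, and both candidate solutions are hypothetical, and only the figure of 16 test cases per problem comes from the post.

```python
from typing import Callable, Sequence


def pass_rate_reward(
    solution: Callable[[int], int],
    test_cases: Sequence[tuple[int, int]],
) -> float:
    """Return the fraction of (input, expected) test cases the solution passes."""
    if not test_cases:
        return 0.0
    passed = 0
    for arg, expected in test_cases:
        try:
            if solution(arg) == expected:
                passed += 1
        except Exception:
            # A crashing solution simply fails that test case.
            pass
    return passed / len(test_cases)


# Hypothetical problem: square an integer, checked by 16 test cases.
tests = [(n, n * n) for n in range(16)]


def correct(n: int) -> int:
    return n * n


def buggy(n: int) -> int:
    # Deliberately wrong on odd inputs.
    return n * n if n % 2 == 0 else 0


reward_correct = pass_rate_reward(correct, tests)
reward_buggy = pass_rate_reward(buggy, tests)
print(reward_correct, reward_buggy)
```

Because the reward is just a pass fraction, it can be recomputed by anyone who has the test cases, which is what makes it "verifiable" in this sense.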

Daniel van Strien · Jan 31, 2025 · 4:23 PM UTC

Great @huggingface blog post from @ihor_step on replicating R1 for text-to-graph tasks. It's great to see GRPO adapted to reward an LLM for doing a novel task well. 🔗 huggingface.co/blog/Ihor/rep…

Replicating DeepSeek R1 for Information Extraction

A Blog post by Stepanov on Hugging Face

huggingface.co

120

6,035

Daniel van Strien · Jul 1, 2024 · 9:38 AM UTC

Arboretum is the largest public biodiversity dataset with 134.6M curated images from iNaturalist. It features image-language paired data for training multimodal AI models to support biodiversity and agriculture research. @huggingface dataset: huggingface.co/datasets/Chih…

ALT Screenshot of species images in the dataset.

114

16,801

Daniel van Strien · Feb 5, 2024 · 11:27 AM UTC

Synthetic data is going to be massively important in 2024, so we have recently launched a new tag on the @huggingface Hub to facilitate the discovery and sharing of synthetic datasets. To add this tag to your dataset card metadata, simply include the `synthetic` tag.

107

23,915
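
For readers unsure where the `synthetic` tag goes, here is a rough sketch of what the dataset card metadata might look like, assuming the usual Hub convention of a YAML block at the top of the dataset's README.md with a `tags` list. The helper `card_with_tags` and the placeholder card body are hypothetical and only build the text; they do not call any Hub API.

```python
def card_with_tags(tags: list[str], body: str) -> str:
    """Build README.md text with a YAML metadata block listing the given tags."""
    lines = ["---", "tags:"]
    lines += [f"- {tag}" for tag in tags]
    lines += ["---", "", body]
    return "\n".join(lines)


# Placeholder card body; a real card would describe the dataset.
readme = card_with_tags(["synthetic"], "# My dataset")
print(readme)
```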

Daniel van Strien · Sep 24, 2024 · 12:35 PM UTC

Yesterday, I shared a blog post on generating data for fine-tuning ColPali using @Alibaba_Qwen's Qwen2-VL-7B-Instruct. To simplify testing this approach, I created a @gradio Space that lets you generate queries from an input document page image. huggingface.co/spaces/davans…

115

19,412

Daniel van Strien · May 19, 2025 · 8:09 AM UTC

LEXam: Legal Reasoning Benchmark - 4,586 exam questions from Swiss, EU & international law - 3 configurations: standard MCQs, perturbed MCQs & open questions - Evaluated 20+ SoTA LLMs with expert verification - Rich metadata across jurisdictions & legal domains

115

17,555

Daniel van Strien · Apr 3, 2025 · 7:02 PM UTC

OpenThoughts2-1M: Million-sample dataset powering SOTA reasoning models - Systematically curated from 26 data generation approaches - Powers models outperforming DeepSeek-R1-32B on math benchmarks - Enables 76.7% on AIME24 & 90.8% on MATH500 with just SFT

114

6,406

Daniel van Strien · May 5, 2024 · 11:30 AM UTC

WildChat from @allen_ai consists of 1 million exchanges between ChatGPT and human users, complemented by demographic details like location, IP addresses, and request headers. Check it out at: huggingface.co/datasets/alle…

allenai/WildChat-1M · Datasets at Hugging Face

huggingface.co

111

30,587

Daniel van Strien · Jul 8, 2025 · 12:05 PM UTC

465 people. 122 languages. 58,185 annotations! FineWeb-C v1 is complete! Communities worldwide have built their own educational quality datasets, proving that we don't need to wait for big tech to support languages. Huge thanks to all who contributed! huggingface.co/blog/davanstr…

108

20,113

Daniel van Strien · Apr 9, 2025 · 8:26 AM UTC

DeepCoder: Comprehensive Programming Dataset - 24K+ verified problems with 5+ test cases each - Sources include TACO, PrimeIntellect, LiveCodeBench, Codeforces - Temporal split: train on pre-Aug 2024, test on newer problems - Powers SOTA open-source 14B DeepCoder model

108

4,533
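
The temporal split described in the DeepCoder post can be illustrated with a small sketch. It assumes "pre-Aug 2024" means strictly before 1 August 2024; the problem records and the `released` field are hypothetical and are not the dataset's actual schema.

```python
from datetime import date

# Assumption: "pre-Aug 2024" is read as strictly before 1 August 2024.
CUTOFF = date(2024, 8, 1)

# Hypothetical problem records with a release date.
problems = [
    {"id": "p1", "released": date(2023, 11, 5)},
    {"id": "p2", "released": date(2024, 7, 31)},
    {"id": "p3", "released": date(2024, 8, 1)},
    {"id": "p4", "released": date(2025, 1, 20)},
]

# Older problems go to training; newer ones are held out for evaluation,
# so the model cannot have seen the evaluation problems during training.
train = [p for p in problems if p["released"] < CUTOFF]
held_out = [p for p in problems if p["released"] >= CUTOFF]

print([p["id"] for p in train], [p["id"] for p in held_out])
```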

Daniel van Strien · Oct 22, 2025 · 8:55 AM UTC

OCR is one of AI's oldest challenges (first systems: early 1900s!) Modern vision-language models have transformed what's possible: handwriting, 100+ languages, math formulas, tables, signature extraction... New @huggingface guide on OCR huggingface.co/blog/ocr-open…

Supercharge your OCR Pipelines with Open Models

huggingface.co

112

5,339

Daniel van Strien · Apr 10, 2025 · 7:09 AM UTC

Reasoning datasets dominate @huggingface trending datasets, but focus mainly on code/math. We launched a competition with @bespokelabsai & @togethercompute to diversify this. Create a PoC reasoning dataset and win prizes to help scale it!

108

22,176

Daniel van Strien · Mar 25, 2025 · 12:49 PM UTC

REALM: Real-World Application of Large Language Models Dataset - 93k+ documented LLM use cases from 2020-2024 - Reddit posts + news articles with full content - Dual categorization: AI Use Taxonomy & O*NET - MIT-licensed with interactive exploration dashboard

107

16,724

Daniel van Strien · Aug 6, 2025 · 7:22 AM UTC

You can now generate synthetic data using @OpenAI's GPT OSS models on @huggingface Jobs! One command, no setup:

hf jobs uv run --flavor l4x4 [script-url] \
  --input-dataset your/dataset \
  --output-dataset your/output

Works on L4 GPUs ⚡ huggingface.co/datasets/uv-s…

uv-scripts/openai-oss · Datasets at Hugging Face

huggingface.co

104

16,317
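
If you want to launch the same job from a script, one way is to assemble the command from the GPT OSS post as an argument list. The sketch below only builds and prints the command; it does not run it. The helper `build_job_command` is hypothetical, the flags are the ones quoted in the post, and `[script-url]`, `your/dataset` and `your/output` are left as the post's own placeholders.

```python
import shlex


def build_job_command(
    script_url: str,
    input_dataset: str,
    output_dataset: str,
    flavor: str = "l4x4",
) -> list[str]:
    """Assemble the `hf jobs uv run` invocation quoted in the post as argv."""
    return [
        "hf", "jobs", "uv", "run",
        "--flavor", flavor,
        script_url,
        "--input-dataset", input_dataset,
        "--output-dataset", output_dataset,
    ]


# Placeholders copied from the post; replace them with real values before use.
cmd = build_job_command("[script-url]", "your/dataset", "your/output")
print(shlex.join(cmd))
```

Keeping the command as a list rather than one string avoids shell-quoting surprises if it is later passed to something like `subprocess.run`.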

Daniel van Strien · Sep 9, 2022 · 2:25 PM UTC

Can we combine the power of the @huggingface datasets hub and @LabelStudioHQ to create a workflow that allows you to move quickly between annotating and model training? new blog post exploring this 👉🏻 danielvanstrien.xyz/huggingf… 🧵

100

Daniel van Strien · Jan 28, 2025 · 9:57 AM UTC

SCP-116K: A massive scientific problem-solving dataset - 116K+ high-quality problem-solution pairs - Covers physics, chemistry & biology - University to PhD-level content - Model-generated solutions included

14,854

Daniel van Strien · Mar 11, 2025 · 4:53 PM UTC

Reasoning models aren't just for math and science! Using @Alibaba_Qwen's QwQ-32B + @bespokelabsai Curator to create reasoning datasets for structured data extraction from model cards. First in a series on using GRPO for increasingly weird things... danielvanstrien.xyz/posts/20…

Using QwQ to generate a reasoning dataset for structured data extraction

Learn how to use QwQ-32B to generate synthetic reasoning datasets for training smaller models on structured data extraction tasks

danielvanstrien.xyz

100

6,368

Daniel van Strien · May 29, 2024 · 5:52 PM UTC

Do you need a dataset to train a custom sentence transformer model? I've created a pipeline for using an LLM to create a synthetic dataset you can directly use for fine-tuning/training a Sentence Transformers model. *Link in next tweet

100

14,894

Daniel van Strien · Sep 23, 2024 · 7:00 PM UTC

Want to convert a bunch of PDFs into a dataset of single-page images? There's a @huggingface Space for that! huggingface.co/spaces/Datase…

ALT Illustration showing a single PDF document being split into separate pages

100

23,167

Daniel van Strien · Feb 3, 2025 · 9:37 AM UTC

WebInstruct-CFT: Teaching LLMs to critique - 600K instruction-critique pairs - 65% math, plus business & sciences - GPT-4 generated detailed critiques - 3 sizes: 4K/50K/600K examples

ALT Screenshot of the dataset viewer on the Hugging Face Hub.

4,527

Daniel van Strien · Aug 19, 2024 · 8:31 AM UTC

🚀 Introducing Hugging Face Similar: a Chrome extension to find relevant datasets! ✨ Adds a "Similar Datasets" section to @huggingface dataset pages 🔍 Recommendations based on dataset READMEs 🏗️ Powered by @trychroma and @SnowflakeDB Get it now! chromewebstore.google.com/de…

26,787

Daniel van Strien · Nov 3, 2023 · 5:43 PM UTC

Embedding @huggingface Hub dataset cards with existing models was challenging due to the limited context window. Now, with the new long context embedding models from @JinaAI_ and @nomic_ai Atlas, we can visualize the full context of these cards. 🔗: atlas.nomic.ai/map/393ab060-…

ALT Screenshot of an embedding map consisting of different clusters of dataset cards.

23,743

Daniel van Strien · Dec 20, 2024 · 9:34 AM UTC

Hot take: shipping BERT-sized models in 2025 will benefit far more people than sharing an LLM overfitted to some saturated leaderboards 🙊 We're already seeing ModernBERT finetunes on the @huggingface Hub. My guess is we'll see hundreds of these by the end of 2025.

ALT Screenshot of a model tree of ModernBERT-base fine tunes

6,438

Daniel van Strien · Oct 2, 2024 · 4:28 PM UTC

ColPali is an exciting new approach to multimodal document retrieval, but some doubt its practical use with existing vector DBs. It turns out it's super easy to use @qdrant_engine to index and search ColPali embeddings efficiently. Blog post here: danielvanstrien.xyz/posts/po…

ALT Screenshot of a search for "top secret" in a function and an image result below which has a page from a document with UFOs.

11,785

Daniel van Strien · May 19, 2025 · 7:06 PM UTC

WildDoc: Real-World Document Understanding Benchmark - 12K+ document images captured in natural environments - Each document photographed in 4 different conditions - Tests 5 real-world factors - Reveals 35.3% performance drop in leading MLLMs

16,202

Daniel van Strien · Oct 7, 2025 · 3:40 PM UTC

DoTS.ocr from @xiaohongshu just got native @vllm_project support! I built a UV script so you can run SOTA multilingual OCR in seconds with zero setup using @huggingface Jobs. Tested on 1800s library cards - works great ✨

37,531

Daniel van Strien · Apr 23, 2025 · 8:49 AM UTC

Academic Chains: Scientific Thinking for LLMs - Distils reasoning from papers in biology & economics - Captures researcher intuition & exploration - Fine-tuning shows 7.2% improvement on MMLU-Pro Economics - @huggingface /@bespokelabsai reasoning datasets competition submission

15,398

Daniel van Strien · Oct 24, 2024 · 7:29 PM UTC

Is your desktop full of random screenshots? I just wrote a blog post on using @MistralAI's Pixtral (via @LMStudioAI) to automatically organize screenshots—no cloud APIs needed, so you can keep the weird memes you collect private! Post here: danielvanstrien.xyz/posts/20…. More info 🧵

16,719

Daniel van Strien · Jun 13, 2024 · 9:17 AM UTC

APIGen from @sfresearch is a function-calling dataset created by APIGen, "an automated data generation pipeline designed to produce verifiable high-quality datasets for function-calling applications." 60,000 rows, 3,673 APIs, 21 categories huggingface.co/datasets/Sale…

ALT Image showing a workflow for generating data synthetically

17,765

Daniel van Strien · Jul 30, 2025 · 3:36 PM UTC

I just processed 1000s of prompts using Qwen3-235B-A22B-Instruct-2507 across 4 GPUs! How? Everyone plays their part: @astral_sh UV handles dependencies; @huggingface Jobs handles GPUs; @Alibaba_Qwen handles the model; @vllm_project handles inference. One command. Zero complexity!

17,213

Daniel van Strien · Feb 7, 2024 · 2:54 PM UTC

Awesome to see @hackernoon share a dataset of tech news on the @huggingface Hub. 📊 6.9 million rows 📜 MIT licence 🔍 Datasets server preview

ALT Screenshot of the huggingface server preview for the dataset

30,848

Daniel van Strien · May 8, 2025 · 8:38 AM UTC

Finally documented the Beyond Words dataset from @LC_Labs/@lee_bcg for BigLAM @huggingface org! - 3.5K annotated historical newspaper pages - Bounding boxes + category labels - Photos, ads, headlines, cartoons & more Also trained some YOLO 11 models with the dataset for fun!

15,621

Daniel van Strien · Mar 11, 2025 · 9:03 AM UTC

II-Thought RL v0: First large-scale multi-domain RL dataset - 960K+ high-quality question-answer pairs - 735K math problems from multiple sources - 98K code problems with execution validation - 74K science & specialized domain samples

4,789

Daniel van Strien · Feb 21, 2025 · 10:46 AM UTC

AgentTrek: Web Agent Training Dataset - 52K annotated dialogue turns - Web browsing & shopping tasks - User-agent interaction flows - Real-world web task simulations

17,554

Daniel van Strien · Jun 17, 2024 · 4:55 PM UTC

📁✨Meet Corpus Creator! This @Gradio app turns your local files into a chunked @huggingface dataset via @llama_index. Perfect for building datasets for synthetic data pipelines, annotation, and beyond. Try it here: huggingface.co/spaces/davans…

Corpus Creator - a Hugging Face Space by davanstrien

Upload text files to create a structured dataset, customize chunking parameters, and upload to the Hugging Face Hub for NLP tasks.

huggingface.co

23,114

Daniel van Strien · Feb 14, 2025 · 2:55 PM UTC

How do you make 1M+ @HuggingFace models & datasets more discoverable? 🤔 I fine-tuned SmolLM2-360M to generate one-line summaries from a README. Its own self-description? "A model for generating concise summaries of model & dataset cards from the Hugging Face Hub"

16,242

Daniel van Strien · May 21, 2025 · 7:57 AM UTC

EuroSpeech: Massive Multilingual Parliamentary Speech Corpus - 78,100+ hours across 22 European languages - 50,500+ hours of quality-filtered data (CER < 20%) - Robust alignment algorithm for non-verbatim texts - Dramatically expands resources for 19+ languages

6,911

Daniel van Strien · Jan 3, 2024 · 4:23 PM UTC

Get ready for 2024: The Year of Open Machine Learning Datasets! 🌟 Here's a mega thread for all the datasets I'm showcasing this year. 🧵

An illustration in soft pastel colors featuring a whimsical robot cat seated in a cozy armchair. The robot cat, combining feline features with robotic elements, has a friendly and playful expression. It is attentively reading a book titled '2024: The Year of Open Machine Learning Datasets'. Surrounding the robot cat are various open books and digital screens, each displaying colorful graphs and data from different machine learning datasets. The room is decorated in a storybook fashion, complete with fluffy cushions, plush furnishings, and gentle sunlight streaming through a window, creating a warm and inviting atmosphere.

29,326

Daniel van Strien · Jun 20, 2024 · 4:12 PM UTC

Excited to introduce Synthetic Data Workshop, a @HuggingFace Space that aims to make creating synthetic datasets easier! ✅ Pre-configured environment ✅ Ready-to-use notebooks ✅ No local GPU needed huggingface.co/spaces/davans…

ALT Screenshot of the landing page for the Space

5,629

Daniel van Strien · May 7, 2025 · 8:18 AM UTC

SwallowCode: LLM-Rewritten Python Dataset - 16.1B tokens from The Stack v2 - Filtered by syntax + pylint (≥7.0) - Rewritten twice via Llama-3.3 - +17.0 pass@1 (HumanEval)

19,618

Daniel van Strien · Mar 4, 2024 · 9:39 AM UTC

Multimodal ArXiv is a Dataset for Improving Scientific Comprehension of Large Vision-Language Models 🖼️ figure-caption dataset 📸 6.4M images and 3.9M captions 📄 from 572K ArXiv papers spanning various scientific domains. 🤗 Available on @huggingface Hub.

ALT Screenshot of MMInstruction/ArxivCap dataset on the Hugging Face Hub

20,745

Daniel van Strien · Apr 22, 2025 · 8:02 AM UTC

Daniel van Strien

@vanstriendaniel

22 Apr 2025

NVIDIA's ClimbLab: Setting a New Standard for Pretraining - 1.2 trillion tokens in 20 semantic clusters - Two-classifier system removes low-quality content - Demonstrates superior scaling properties in 1B models - CC BY-NC 4.0 licensed for research community

14,739

Daniel van Strien · Apr 15, 2025 · 2:45 PM UTC

Daniel van Strien

@vanstriendaniel

15 Apr 2025

Fine Reasoning Questions: Expanding Web Text to Reasoning Tasks - 144 complex reasoning questions from diverse web content - Both text-dependent and independent formats beyond math & science - Aims to show how to transform "in the wild" web content into reasoning questions

13,237

Daniel van Strien · May 14, 2025 · 6:32 AM UTC

Daniel van Strien

@vanstriendaniel

14 May 2025

MIRACLRetrieval: Massive Multilingual Search Dataset – 18 languages from 10 language families – 78K queries with 726K+ relevance judgments – 106M+ unique Wikipedia documents – Expert annotations by native speakers

15,541

Daniel van Strien · Sep 4, 2025 · 3:45 PM UTC

Daniel van Strien

@vanstriendaniel

4 Sep 2025

Turn any HF dataset into an interactive embedding visualization - now with @huggingface Jobs support! One command → GPU processing → deployed Space Built on @Apple's Embedding Atlas library ❤️

7,299

Daniel van Strien · Mar 7, 2025 · 10:00 AM UTC

Daniel van Strien

@vanstriendaniel

7 Mar 2025

GSM8K-Platinum: Enhanced Math Benchmark - 1,209 math problems after removing 110 ambiguous questions - Identified 8.3% of problems had errors in original benchmark - Prevents false "benchmark saturation" - Even 96% accurate models make elementary reasoning errors

15,281

Daniel van Strien · May 31, 2024 · 8:55 AM UTC

Daniel van Strien

@vanstriendaniel

31 May 2024

Thanks to @tomaarsen's work in the latest Sentence Transformers release, training custom models is easier than ever. With improved training support and synthetic data for fine-tuning, you can build a model in less than a day. Example here👇: huggingface.co/collections/d…

ALT Screenshot of a code model and some before/after metrics

35,862

Daniel van Strien · Feb 24, 2025 · 9:45 AM UTC

Daniel van Strien

@vanstriendaniel

24 Feb 2025

TimeTravel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts - 10K+ expert-verified artifacts - 266 cultures across 10 regions - Full metadata & image pairs

14,980

Daniel van Strien · Feb 2, 2025 · 12:07 PM UTC

Daniel van Strien

@vanstriendaniel

2 Feb 2025

NESTFUL: A benchmark for nested API calls from @IBMResearch - 1.8K+ executable function sequences - Tests math reasoning & coding tools - Evaluates variable handling & chaining - Shows gaps in current LLM capabilities Apache 2.0 licensed & fully reproducible

ALT Screenshot of the dataset viewer on the Hugging Face Hub

35,285

Daniel van Strien · Mar 16, 2022 · 1:33 PM UTC

Daniel van Strien

@vanstriendaniel

16 Mar 2022

Super excited to have a guest post on the @huggingface blog showing how you can use 🤗Datasets and Sentence Transformers to create an image search app for 19th Century book images huggingface.co/blog/image-se… 🧵

Image search with 🤗 datasets

Daniel van Strien · May 2, 2025 · 4:43 PM UTC

Daniel van Strien

@vanstriendaniel

2 May 2025

🗞️ Finally documented this massive multilingual newspaper dataset we curated for BigLAM — built from the amazing @Europeanaeu Newspapers collection. - 32B tokens - 12 languages - OCR scores + metadata Find it on @huggingface

16,489

Daniel van Strien · Jun 11, 2024 · 8:50 AM UTC

Daniel van Strien

@vanstriendaniel

11 Jun 2024

"Can language models (LLMs) understand protein sequences like natural language?" ProteinLMDataset is a dataset with 17.46 billion tokens for pretraining and 893,000 instructions for supervised fine-tuning (SFT) that helps explore these questions. huggingface.co/datasets/tsyn….

17,810

Daniel van Strien · May 14, 2024 · 2:16 PM UTC

Daniel van Strien

@vanstriendaniel

14 May 2024

Created an "Awesome Synthetic Datasets" list in my ongoing quest to learn more about building synthetic datasets using large language models. Currently includes important tools, datasets, and papers. Check it out here: github.com/davanstrien/aweso…

GitHub - davanstrien/awesome-synthetic-datasets: awesome synthetic (text) datasets

5,976

Daniel van Strien · Feb 20, 2025 · 8:52 AM UTC

Daniel van Strien

@vanstriendaniel

20 Feb 2025

CuratedThoughts: Clean Math Data for RL - Fixes critical flaws in math reasoning datasets - Removes 5-25% of problematic examples unsuitable for RL - Prevents models from learning invalid reasoning paths Enables reliable reward verification for GRPO training

33,979

Daniel van Strien · May 22, 2024 · 10:05 AM UTC

Daniel van Strien

@vanstriendaniel

22 May 2024

RLAIF-V-Dataset is a large multimodal feedback dataset featuring images + questions alongside chosen and rejected responses. Available on the @huggingface Hub: huggingface.co/datasets/Haoy…

18,346

Daniel van Strien · Jun 30, 2020 · 3:31 PM UTC

Daniel van Strien

@vanstriendaniel

30 Jun 2020

Hi folks wanting to use deep-learning with library collections 👋 I’m *trying* to start a very informal study group aimed at people wanting to use #AI in/with #GLAMs (Galleries, Libraries, Archives and Museums) and GLAM data 🖼📚🗂

Daniel van Strien · Dec 3, 2023 · 11:19 AM UTC

Daniel van Strien

@vanstriendaniel

3 Dec 2023

An intriguing new 7B merge model has just been shared on the @huggingface Hub. The model is an experimental combination of Zephyr 7B Beta and Notus 7B v1. Cool to see smaller merges also being created! Check it out here: huggingface.co/mergedlm/zeph…

9,074

Daniel van Strien · Oct 16, 2025 · 8:41 AM UTC

Daniel van Strien

@vanstriendaniel

16 Oct 2025

Great to see @Alibaba_Qwen sharing datasets on @huggingface! Many/most bigger labs are not doing this!

3,068

Daniel van Strien · Jan 13, 2022 · 6:46 PM UTC

Daniel van Strien

@vanstriendaniel

13 Jan 2022

I was excited to see that @huggingface datasets recently added support for image features 📸. I wrote a little blog post on how you can use datasets + faiss + Sentence Transformers to create an image search application for searching bl book images 👉🏻 danielvanstrien.xyz/metadata…

Daniel van Strien · Oct 13, 2025 · 6:04 PM UTC

Daniel van Strien

@vanstriendaniel

13 Oct 2025

@nanonets just shipped Nanonets-OCR2: new 3B VLM for OCR! LaTeX equations, tables, handwriting, charts, multilingual - it does it all! You can try it against your data with one command via @huggingface Jobs - no local GPU needed! The HF Jobs command/output from the model 👇

24,427

Daniel van Strien · Jan 15, 2024 · 6:43 PM UTC

Daniel van Strien

@vanstriendaniel

15 Jan 2024

Introducing Haiku DPO: a synthetically generated Direct Preference Optimization dataset. It consists of user prompts for a haiku and generations scored for "correctness". You can directly use this dataset for DPO, but it may also be interesting for exploring other questions 🧵

10,652

Daniel van Strien · Oct 14, 2025 · 5:35 PM UTC

Daniel van Strien

@vanstriendaniel

14 Oct 2025

Already doing a yolo fine-tune run of Qwen3-VL-4B Using a @huggingface Jobs script so just have to switch out a model ID!

4,024

Daniel van Strien · Jul 28, 2025 · 4:30 PM UTC

Daniel van Strien

@vanstriendaniel

28 Jul 2025

HF Jobs just launched! 🚀 One command VLM based OCR with uv Scripts: hf jobs uv run [script] ufo-images ufo-text Classified UFO docs → clean markdown. Zero setup! Try it → huggingface.co/uv-scripts

17,066

Daniel van Strien · Jun 3, 2025 · 10:24 AM UTC

Daniel van Strien

@vanstriendaniel

3 Jun 2025

Over 1.5M models on @huggingface… How do you pick the right one for your needs? 🔍 Try this semantic search prototype with size filters (0-1B to 70B+): 🔗 huggingface.co/spaces/librar…

14,522

Daniel van Strien · Feb 11, 2024 · 12:19 PM UTC

Daniel van Strien

@vanstriendaniel

11 Feb 2024

Great to see community building on top of each other's work for datasets. @argilla_io took the excellent OpenHermes-2.5 dataset from @teknium, generated new responses using a @NousResearch model and used Distilabel + PairRM from @billyuchenlin et al. to create a DPO version.

ALT Screenshot of trending datasets with two datasets highlighted with a yellow circle indicating one dataset was used to help create another dataset.

16,271

Daniel van Strien · Jun 21, 2024 · 9:05 AM UTC

Daniel van Strien

@vanstriendaniel

21 Jun 2024

Instruction pre-training is a new approach that enhances LLM pretraining by using instruction-response pairs from an instruction synthesizer instead of raw data. Explore this method in this @gradio Space: huggingface.co/spaces/davans…

ALT A workflow diagram showing the use of an instruction synthesizer

ALT An example screenshot of the gradio demo.

14,185

Daniel van Strien · Aug 17, 2023 · 5:35 PM UTC

Daniel van Strien

@vanstriendaniel

17 Aug 2023

Do you want to keep track of new models trained on a dataset from the @huggingface Hub? The Dataset-to-Model Monitor will notify you whenever a new model is released that uses a dataset you are tracking. huggingface.co/spaces/librar…

ALT A screenshot of a notification about a new model. The notification is a Hugging Face discussion which tags people tracking a dataset.

ALT A screenshot of the app which shows some text describing the app. It also shows a text box which allows you to input the ID for a dataset to track. Other buttons allow you to list datasets a user is tracking and unsubscribe from alerts.

28,637

Daniel van Strien · Sep 25, 2023 · 10:53 AM UTC

Daniel van Strien

@vanstriendaniel

25 Sep 2023

Replying to @deliprao

If you pay it's okay

3,952