Jim Fan · Sep 14, 2023 · 2:03 PM UTC

@DrJimFan

This is the way to unlock the next trillion high-quality tokens, currently frozen in textbook pixels that are not LLM-ready.

Nougat: an open-source OCR model that accurately scans books with heavy math and scientific notation. It's ages ahead of other open OCR options. Meta is doing extraordinary open-source AI, sometimes without as much fanfare as Llama.

My first serious AI research project (back @Columbia, 2012) was to convert chemical engineering PDFs into an NLP-ready corpus. I still remember the immense pain of Tesseract, a much older OCR system (github.com/tesseract-ocr/tes…).

Now Nougat runs a powerful Swin Transformer backbone and blows the benchmarks out of the water. We're talking about double-digit improvements across all metrics.

Now, textbooks are all we need for the next GPT!

Website: facebookresearch.github.io/n…
Open-source code: github.com/facebookresearch/…
Paper "Nougat: Neural Optical Understanding for Academic Documents": arxiv.org/abs/2308.13418
