This is the way to unlock the next trillion high-quality tokens, currently frozen in textbook pixels that are not LLM-ready.
Nougat: an open-source OCR model that accurately scans books with heavy math/scientific notation. It's ages ahead of other open OCR options. Meta is doing extraordinary open-source AI work, some of it with far less fanfare than Llama.
My first serious AI research project (back @Columbia, 2012) was to convert chemical engineering PDFs into an NLP-ready corpus. I still remember the immense pain of Tesseract, a much older OCR system (github.com/tesseract-ocr/tes…).
Now Nougat runs a powerful Swin Transformer backbone and blows the benchmarks out of the water. We're talking about double-digit improvements across all metrics.
Now, textbooks are all we need for the next GPT!
Website: facebookresearch.github.io/n…
Open-source code: github.com/facebookresearch/…
Paper "Nougat: Neural Optical Understanding for Academic Documents": arxiv.org/abs/2308.13418
Sep 14, 2023 · 2:03 PM UTC

