New follow-up work on the effects of synthetic data on model pre-training. It’s becoming increasingly clear that the model collapse predicted by prior work is not panning out, in theory or in practice. Industry labs now even have entire teams dedicated to synthetic-data pre-training.
📢New preprint📢
🔄Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World 🔄
A deeper dive into the effects of self-generated synthetic data on model-data feedback loops
w/ @JoshuaK92829 @ApratimDey2 @MGerstgrasser @rm_rafailov @sanmikoyejo
1/9
Oct 23, 2024 · 6:28 PM UTC


