1/ F5-TTS: A Fully Non-Autoregressive Text-to-Speech System based on Flow Matching with Diffusion Transformer (DiT)
Researchers from Shanghai Jiao Tong University, the University of Cambridge, and Geely Automobile Research Institute introduced F5-TTS, a non-autoregressive text-to-speech (TTS) system that utilizes flow matching with a Diffusion Transformer (DiT). Unlike many conventional TTS models, F5-TTS does not require complex elements like duration modeling, phoneme alignment, or a dedicated text encoder. Instead, it introduces a simplified approach where text inputs are padded to match the length of the speech input, leveraging flow matching for effective synthesis. F5-TTS is designed to address the shortcomings of its predecessor, E2 TTS, which faced slow convergence and alignment issues between speech and text. Notable improvements include a ConvNeXt architecture to refine text representation and a novel Sway Sampling strategy during inference, significantly enhancing performance without retraining.
Structurally, F5-TTS leverages ConvNeXt and DiT to overcome alignment challenges between the text and generated speech. The input text is first processed by ConvNeXt blocks to prepare it for in-context learning with speech, allowing smoother alignment. The character sequence, padded with filler tokens, is fed into the model alongside a noisy version of the input speech. The Diffusion Transformer (DiT) backbone is used for training, employing flow matching to map a simple initial distribution to the data distribution effectively. Additionally, F5-TTS includes an innovative inference-time Sway Sampling technique that helps control flow steps, prioritizing early-stage inference to improve the alignment of generated speech with the input text....
1/ ⤵️ ⤵️