ByteScale is a new LLM training framework
- Evaluated 7B to 141B param models
- 256K to 2048K context lengths
- 12,000 GPUs
- Optimized for mixed long and short sequences
The crux of it is a much more dynamic parallelism strategy (as opposed to a static mesh) to account for heterogeneity in sequence length. They call this strategy Hybrid Data Parallelism (HDP), which combines regular data parallelism with context parallelism in a dynamic manner.
Their data loading strategy is very network and CPU-memory intensive and requires global coordination across workers (as opposed to each worker doing its own thing). They use Ray actor for this coordination. There are
- Servers to fetch and preprocess raw data from HDFS and generate metadata
- A scheduler to collect global metadata from all servers, figure out the the loading plan, and broadcast the plan to clients
- Clients (on GPUs), which read the partial data from servers based on the loading plan