I'm extremely excited to announce "the big bomb"!: Neo and Matrix, that we're working on with colleagues and friends from open-source community,
M-A-P.ai, wuhan ai, and
01.ai. Neo is the first fully-transparent bilingual large language model, with fully open-sourced pretrain corpus, data processing pipeline, training framework manipulated from Megatron-LM, intermediate ckpts, and relatively smaller ckpts for investigating scaling law. Matrix is a 4.7 trillion tokens directly adoptable pretrain corpus, which has gone through strict heuristic rules-based filtering and deduplication. The computational resource is supported by
01.ai and
wuhan.ai. Kudos to my colleagues!
@01AI_Yi
@MM_Art_Project
Neo Model Series:
huggingface.co/collections/m…
Matrix:
huggingface.co/datasets/m-a-…
We name the series as Neo and Matrix as a salute to the movie, the MATRIX! Neo has notably better performance on the metrics of reasoning, math, code, and Chinese, as shown the following!