Jamba model launch takeaways - Potentially a new leader for ultra-long prompt use-cases (RAG)
‣ First open-source model of this size to combine the Mamba state-space model architecture, Mixture-of-Experts (MoE) and the transformer
‣ 256k context window, more than 2X the next largest among the open-source models we measure (Code Llama 70B's 100k)
‣ High expected throughput (tokens/s): with its MoE architecture, only 12B of its 52B parameters are active at inference. For shorter prompts, expect it to be faster than Grok-1 & Llama 2 but slower than Mixtral 8x7B
‣ A potentially very attractive option for long-input prompts / RAG, as throughput scales well with input token length due to the Mamba architecture. AI21 claims 3X Mixtral 8x7B's tokens/s at 128k context length
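For anyone wanting to sanity-check the ratios behind the bullets above, here is a rough sketch of the arithmetic. It uses only the figures quoted in this post. The function names are hypothetical, and treating "k" as 1,000 tokens is an assumption.

```python
# Figures quoted in the post; "k" is assumed to mean 1,000 tokens.
ACTIVE_PARAMS_B = 12          # parameters active at inference (billions)
TOTAL_PARAMS_B = 52           # total parameters (billions)
JAMBA_CONTEXT = 256_000       # Jamba context window (tokens)
CODE_LLAMA_CONTEXT = 100_000  # Code Llama 70B context window (tokens)


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used per token (hypothetical helper)."""
    return active_b / total_b


def context_ratio(window: int, baseline: int) -> float:
    """How many times larger one context window is than another."""
    return window / baseline


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    ratio = context_ratio(JAMBA_CONTEXT, CODE_LLAMA_CONTEXT)
    print(f"Active parameters at inference: {frac:.1%}")  # about 23%
    print(f"Context window vs Code Llama 70B: {ratio:.2f}x")
```

Roughly a quarter of the parameters are active per token, which is the basis for the throughput expectation. It says nothing about measured tokens/s.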
Congratulations @AI21Labs. We look forward to benchmarking these declared speeds, particularly over long input token lengths 👀