Friday, September 20, 2024

Microsoft finds success with Mixture of Experts technique in Phi-3.5


Microsoft has announced a new family of language models. The Phi-3.5 line comprises three models, among them, for the first time, one that uses the Mixture of Experts (MoE) technique. That technique brings the model close to the level of GPT-4o mini.

Microsoft has made Phi-3.5 available on Hugging Face in three variants: Phi-3.5-vision, Phi-3.5-MoE and Phi-3.5-mini. With this series, Microsoft is experimenting with the Mixture of Experts technique for the first time, and the approach appears to be paying off. Phi-3.5-MoE scores higher than Llama-3.1-8B, Gemma-2-9B and Gemini-1.5 Flash on the most commonly used AI benchmarks, even though the MoE model is built from much smaller expert models of 3.8B parameters each.

The Mixture of Experts technique combines multiple models, called "experts," within a single network; in this case there are sixteen. During inference, however, the model routes each input to only two experts at a time, so it uses just 6.6 billion active parameters.
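The routing idea can be illustrated with a short sketch. Below is a minimal top-2 MoE layer in PyTorch: the sixteen experts and the two active experts per token follow the article, but the layer dimensions, the gating details and the class name TopTwoMoELayer are assumptions for illustration, not Microsoft's actual implementation.

```python
# Minimal sketch of top-2 Mixture of Experts routing (illustrative only).
# 16 experts and 2 active experts per token follow the article; all sizes
# and gating details here are assumptions, not Phi-3.5-MoE's real code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopTwoMoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One small feed-forward network per "expert".
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
            )
            for _ in range(num_experts)
        )
        # The router scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                           # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)              # normalize over chosen experts
        out = torch.zeros_like(x)
        # Only the two selected experts run per token, so only a fraction of
        # the layer's total parameters is "active" for any given input.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


layer = TopTwoMoELayer()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```

The sketch makes the trade-off visible: all sixteen experts count toward the model's total size, but each token only pays the compute cost of the two experts the router selects.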

Another advantage of the technique lies in training the LLM: training requires less computing power and can be done with cheaper hardware. Phi-3.5-MoE was trained on 4.9 trillion tokens using 512 H100 GPUs. The mini model in the Phi-3.5 family had the same computing power available and was trained on 3.4 trillion tokens.

Text and images

Finally, Phi-3.5-vision was trained on 500 billion tokens using 256 A100 GPUs. The result is a 4.2B model. Its notable feature is the ability to process both text and images; images or videos can therefore be given as input.

All three models have a context window of 128K tokens. They are available through Hugging Face under an MIT license, so developers can use the AI models as Microsoft releases them or adapt them to their own needs.
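Because the models are published on Hugging Face, they can be pulled in with the standard transformers workflow. The snippet below is a hedged sketch for the mini model: the model id "microsoft/Phi-3.5-mini-instruct", the prompt and the generation settings are assumptions, so check the model card for the exact id and recommended usage.

```python
# Hedged sketch: loading a Phi-3.5 model from Hugging Face with transformers.
# The model id below is an assumption; consult the model card for the exact
# id, prompt format and hardware requirements.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # place weights on available hardware
    torch_dtype="auto",       # use the dtype recommended by the checkpoint
    trust_remote_code=True,   # harmless if the architecture is already built in
)

messages = [{"role": "user", "content": "Explain Mixture of Experts in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The same pattern applies to the MoE and vision variants, although the vision model additionally expects image inputs via its processor, as described on its model card.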

