Alibaba Cloud's High-Performance Network Architecture for AI: HPN

Published: 2025-01-23 Authors: QIAN Kun, ZHAI Ennan, CAO Jiamin

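Abstract: This paper introduces High-Performance Network (HPN), Alibaba Cloud's data center network architecture for large language model (LLM) training. Through a dual-uplink, multi-rail, dual-plane design, HPN avoids the severe loss of connectivity that a single link failure would otherwise cause, and it prevents hash polarization. Experiments show that HPN improves end-to-end LLM training performance by more than 14.9%. HPN has been deployed in Alibaba's production environment for more than one year.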

Keywords: large model training; network architecture; data center network

 

Abstract: This paper introduces the high-performance network (HPN), Alibaba Cloud's data center network architecture for training large language models (LLMs). HPN adopts a dual-ToR (top-of-rack), rail-optimized, dual-plane design, which avoids the severe connectivity impact of single-link failures and prevents hash polarization. Experiments show that HPN improves the end-to-end performance of LLM training by more than 14.9%. HPN has been deployed in Alibaba's production environment for over a year.

Keywords: large-scale model training; network architecture; data center network