Abstract: Training large generative models requires an ultra-large-scale network infrastructure that offers low latency, high bandwidth, and high availability. This paper investigates the technology development roadmap and implementation schemes of high-performance network infrastructure for large generative models. It concludes that commercial deployments need customized network architecture design and transport protocol optimization tailored to the workloads and traffic patterns of the different training stages. Flow control and congestion control, load balancing, automated operation and maintenance, and deterministic network transmission for wide-area remote direct memory access (RDMA) are identified as the key directions for future research.
Keywords: large generative models; RDMA; network congestion control; network load balancing