Abstract: In distributed machine learning (DML) based on the parameter server (PS) architecture, an unbalanced communication load distribution across PSs significantly slows down model synchronization in heterogeneous networks due to low bandwidth utilization. To address this problem, a network-aware adaptive PS load distribution scheme is proposed, which accelerates model synchronization by proactively adjusting the communication load on PSs according to network states. We evaluate the proposed scheme on MXNet, a real-world distributed training platform, and the results show that our scheme achieves up to a 2.68-times speed-up of model training in dynamic and heterogeneous network environments.
Keywords: distributed machine learning; network awareness; parameter server; load distribution; heterogeneous network
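To make the core idea concrete, the following minimal Python sketch illustrates one plausible form of network-aware load distribution: parameter blocks are placed on PSs in proportion to each server's currently measured bandwidth, so faster links carry more synchronization traffic. This is an illustrative assumption, not the paper's actual algorithm; the function name assign_ps_load and the greedy placement heuristic are hypothetical.

```python
import numpy as np

def assign_ps_load(param_sizes, bandwidths):
    """Hypothetical sketch: partition parameter blocks across PSs
    in proportion to each server's measured available bandwidth.

    param_sizes: sizes (in bytes) of the parameter blocks to place.
    bandwidths:  measured available bandwidth of each PS (e.g., MB/s).
    Returns a list mapping each block index to a PS index.
    """
    shares = np.asarray(bandwidths, dtype=float)
    shares /= shares.sum()               # target traffic fraction per PS
    capacity = shares * sum(param_sizes)  # byte budget for each PS
    load = np.zeros(len(bandwidths))
    placement = []
    # Greedy placement: largest blocks first, each assigned to the PS
    # with the most remaining byte budget, so faster links take more load.
    for idx in sorted(range(len(param_sizes)),
                      key=lambda i: param_sizes[i], reverse=True):
        ps = int(np.argmax(capacity - load))
        load[ps] += param_sizes[idx]
        placement.append((idx, ps))
    return [ps for _, ps in sorted(placement)]

# Example: a slow PS (25 MB/s) receives fewer parameter bytes
# than two fast PSs (100 MB/s each).
print(assign_ps_load([40, 30, 20, 10], [100, 25, 100]))
```

In this toy run, the slow server is assigned only the smallest block, which mirrors the abstract's claim that adapting load to network states avoids bottlenecking synchronization on low-bandwidth links.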