ZTE Intelligent Computing AI Platform: Facilitating AI Model Training and Inference

Release Date: 2024-05-16 | By Zhou Xiangsheng, Sun Wenqing

In the context of explosive data growth, continuous improvement in algorithm performance, and the ongoing iteration of computing products, we are in a phase where AI is leading all-round industrial transformation. In this process, the AI platform plays a critical role. Through intensive management of data, computing, algorithms, and services, the AI platform converts workshop-style, discrete algorithm research into standardized, automated production processes, avoiding redundant efforts and allowing users to focus on high-value challenges in intelligent services.

AI Platform: Key Infrastructure for Enterprise Intelligent Transformation

As a key bridge connecting computing and algorithms, the AI platform not only systematizes and formalizes the common requirements of the algorithm development process, but also offers users customized capabilities and services. In addition, the platform should support sharing and reuse, efficient training and inference, fast delivery, and continuous iteration. To address these needs, ZTE has developed a platform for heterogeneous computing management and AI model training and inference—the intelligent computing AI platform. This platform consists of the infrastructure layer, engine layer, service layer, and capability layer.

  • Infrastructure layer: Thousands of GPUs and CPUs provide computing power, with support for both mainstream international graphics cards and domestic Chinese graphics cards.
  • Engine layer: The engine layer includes a machine learning (ML) engine, a hyperparameter tuning engine, a training engine, a compilation engine, and an inference engine. It integrates multiple high-performance training and inference frameworks, such as TensorFlow, PyTorch, OneFlow, and DeepSpeed.
  • Service layer: The service layer consists of dataset management, data labeling, model training, hyperparameter tuning, as well as model evaluation, compiling, and inference, covering end-to-end services of the AI model.
  • Capability layer: The capability layer provides various built-in algorithm and inference packages to solve practical problems, available for direct deployment and calling.


From basic computing and scheduling technologies, through deep learning frameworks and engines, to perception and cognition capabilities such as NLP, CV, audio processing, and AI models, the AI platform serves as key infrastructure for enterprise intelligent transformation. It not only integrates computing hardware and software tools, but also provides R&D interfaces for AI algorithms. Through this comprehensive integration, the AI platform greatly improves resource utilization and accelerates AI implementation.

Entering the Era of AI Models

Currently, in AI implementation scenarios, many small models that solve intermediate or domain-specific tasks are being replaced by more universal AI models, driving the transformation of artificial intelligence toward artificial general intelligence (AGI). Additionally, there is a growing demand for comprehensive, stable, and efficient data storage and cleaning, training and inference capabilities, and cluster resources for AI models. This poses new challenges for the construction of AI platforms.

The emergence of AI models brings a unified model structure and training-inference paradigm. First, the transformer remains the preferred choice for the basic components of the backbone model. Second, concerning training and inference methods, the training methods (including pre-training, instruction fine-tuning, and reinforcement learning fine-tuning) and inference methods (such as random sampling decoding) initially proposed by OpenAI continue to be the mainstream solutions for AI model training and inference.
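The random sampling decoding mentioned above can be sketched in a few lines. The following is a minimal, illustrative Python example of temperature-scaled top-k sampling over a toy vocabulary; the function name and parameter values are assumptions for illustration, not part of the platform's actual API.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=3, rng=None):
    """Pick the next token id by temperature-scaled top-k random sampling."""
    rng = rng or random.Random(0)
    # Keep only the top_k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature-scaled softmax over the surviving logits
    # (lower temperature sharpens the distribution toward the argmax).
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token id according to those probabilities.
    return rng.choices(top, weights=probs, k=1)[0]

logits = [2.0, 0.5, 1.2, -1.0, 0.1]  # toy scores over a 5-token vocabulary
print(sample_next_token(logits))
```

With `top_k=1` this degenerates to greedy decoding; larger `top_k` and higher `temperature` trade determinism for diversity in the generated text.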

However, the unification of this structure and application paradigm does not close the gap between the industry’s average level and the leading AI companies. Instead, it shifts the focus of AI competition from algorithm R&D innovation to competition in the scale and efficiency of engineering AI model training and inference. This makes the integration of key technologies for AI model training and inference a primary requirement for AI platform construction.

Key Engineering Technologies for AI Model Training and Inference

The key technologies in the AI model training and inference process include distributed training, AI model inference acceleration, AI model evaluation, and AI model data engineering.

  • Distributed training: Distributed training extends training across multiple AI hardware devices, overcoming the memory and computing limits of a single device. The intelligent computing AI platform integrates 3D hybrid parallel technology and has independently developed automatic parallelization tools. These tools support AI model training technologies such as data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), and activation recomputation, automatically adjusting parallel hyperparameters based on cluster and model characteristics.
  • AI model inference acceleration: AI model inference acceleration is a comprehensive technique for reducing memory consumption and computational delay during the inference process. The intelligent computing AI platform improves inference efficiency through various means, such as service scheduling, memory optimization, and quantization compression. In ZTE’s industry-leading “Zhiyu” SMS anti-fraud governance system based on AI models, the inference solution provided by the intelligent computing AI platform reduces inference delay by 30% compared to the industry’s general solution.
  • AI model evaluation: AI model evaluation differs greatly from traditional approaches. Therefore, the AI platform provides a comprehensive objective evaluation dataset to assess the performance of AI models across multiple dimensions. Additionally, the platform integrates a model-based evaluation mechanism to evaluate the semantic accuracy and logical consistency of the generated content.
  • AI model data engineering: High-quality training data can mitigate AI model hallucination and shorten the training cycle. The intelligent computing AI platform provides intelligent data engineering pipelines such as model-in-the-loop data labeling, SFT data generation and expansion, data cleaning and deduplication, quality evaluation, and privacy protection.
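The core idea behind data parallelism (DP) described above can be illustrated with a toy example: each worker computes gradients on its own shard of the global batch, then the gradients are averaged (an all-reduce collective in a real cluster) so every replica applies the same update. This is a minimal sketch assuming a toy linear model, not the platform's actual implementation.

```python
# Data parallelism sketch: toy model y = w * x with squared-error loss.
# Function names here are illustrative.

def grad_on_shard(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    """Stand-in for the collective all-reduce that averages worker gradients."""
    return sum(grads) / len(grads)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0

# Split the global batch across two "workers" with equal shard sizes.
shards = [data[:2], data[2:]]
dp_grad = all_reduce_mean([grad_on_shard(w, s) for s in shards])

# Equivalent single-device gradient over the whole batch.
full_grad = grad_on_shard(w, data)
print(abs(dp_grad - full_grad) < 1e-12)  # averaging shards matches full batch
```

With equal shard sizes, the averaged per-shard gradients reproduce the full-batch gradient exactly, which is why DP scales the batch without changing the optimization trajectory; tensor and pipeline parallelism instead split the model itself across devices.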
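Quantization compression, one of the inference acceleration techniques listed above, replaces float weights with low-bit integers plus a scale factor, cutting memory traffic at a small, bounded accuracy cost. The following is a toy sketch of post-training symmetric int8 quantization; it is illustrative only and not the platform's actual method.

```python
# Symmetric per-tensor int8 quantization sketch (illustrative).

def quantize_int8(weights):
    """Map float weights to int8 values with a shared symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

weights = [0.31, -1.27, 0.05, 0.88]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Round-trip error is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(max_err <= scale / 2)
```

Storing `q` instead of `weights` cuts storage 4x versus float32, and the worst-case rounding error per weight is half the quantization step `scale / 2`, which is why int8 inference typically loses little accuracy.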


With the support of these key engineering technologies, ZTE’s intelligent computing AI platform has achieved preliminary success both within ZTE and in collaboration with Chinese telecom operators. At the company level, the AI platform supports the training of AI models across multiple domains, including telecommunications, coding, computer vision (CV), and multi-modal areas. For telecom operators, the AI platform has established training and inference clusters in 31 provinces of an operator group, offering nine core functions such as model training, management, and inference services. It has become an important AI development tool cloud for operators.