In the context of explosive data growth, continuous improvement in algorithm performance, and the ongoing iteration of computing products, we are in a phase where AI is leading all-round industrial transformation. In this process, the AI platform plays a critical role. Through intensive management of data, computing, algorithms, and services, the AI platform converts workshop-style, discrete algorithm research into standardized, automated production processes, avoiding redundant efforts and allowing users to focus on high-value challenges in intelligent services.
AI Platform: Key Infrastructure for Enterprise Intelligent Transformation
As a key bridge connecting computing and algorithms, the AI platform not only systematizes and formalizes the common requirements in the algorithm development process, but also offers users customized capabilities and services. In addition, the platform should support sharing and reuse, efficient training and inference, fast delivery, and continuous iteration. To address these needs, ZTE has developed a platform for heterogeneous computing management and AI model training and inference—the intelligent computing AI platform. This platform consists of the infrastructure layer, engine layer, service layer, and capability layer.
From basic computing and scheduling technologies to deep learning frameworks and engines, as well as perception and cognition capabilities such as NLP, CV, audio processing, and AI models, the AI platform serves as a key infrastructure for enterprise intelligent transformation. It not only integrates computing hardware and software tools, but also provides R&D interfaces for AI algorithms. Through this comprehensive integration, the AI platform greatly improves resource utilization efficiency and accelerates AI implementation.
Entering the Era of AI Models
Currently, in AI implementation scenarios, many small models that solve intermediate tasks or domain-specific tasks are being replaced by more universal AI models, driving the transformation of artificial intelligence toward artificial general intelligence (AGI). Additionally, there is a growing demand for comprehensive, stable, and efficient capabilities in data storage and cleaning, as well as training and inference, along with the cluster resources to support AI models. This poses new challenges to the construction of AI platforms.
The emergence of AI models brings about a unified model structure and a unified training-inference paradigm. First, the Transformer structure remains the preferred choice for the basic components of the backbone model. Second, concerning training and inference methods, taking the AI model as an example, the training methods (including pre-training, instruction fine-tuning, and reinforcement learning fine-tuning) and inference methods (such as random sampling decoding) initially proposed by OpenAI continue to be mainstream solutions for AI model training and inference.
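To make the inference side of this paradigm concrete, the sketch below shows one common form of random sampling decoding: temperature scaling followed by nucleus (top-p) truncation. This is an illustrative, self-contained toy over raw logits, not the decoding code of any particular platform; the function name and parameter defaults are assumptions for the example.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Toy random-sampling decoder: temperature scaling plus
    nucleus (top-p) truncation over a vector of token logits."""
    rng = rng or np.random.default_rng(0)
    # Temperature scaling: lower values sharpen the distribution,
    # higher values flatten it (more diverse samples).
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # Nucleus truncation: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))

# Example: sample one token id from a 4-token vocabulary.
logits = np.array([2.0, 1.0, 0.5, -1.0])
token = sample_next_token(logits)
```

At very low temperature the distribution collapses onto the highest-logit token, which is why temperature is the usual knob for trading determinism against diversity.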
However, the unification of this structure and application paradigm does not close the gap between the industry's average level and the leading AI companies. Instead, it shifts the focus of AI competition from algorithm R&D innovation to competition in the scale and efficiency of AI model training and inference engineering. This makes integrating key technologies for AI model training and inference a primary requirement for AI platform construction.
Key Engineering Technologies for AI Model Training and Inference
The key technologies in the AI model training and inference process include distributed training, AI model inference acceleration, AI model evaluation, and AI model data engineering.
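Among the technologies listed above, distributed training is the most foundational: each worker computes gradients on its own shard of a batch, and an all-reduce step averages them before a synchronized weight update. The toy sketch below simulates this data-parallel pattern in a single process with a linear model; the worker count, learning rate, and data are illustrative assumptions, and real systems perform the all-reduce across machines with collective-communication libraries.

```python
import numpy as np

def local_gradient(w, X, y):
    # Per-worker gradient of mean-squared error for a linear model y ≈ X @ w.
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
n_workers = 4  # simulated data-parallel workers
for step in range(200):
    # Shard the batch across workers, as a data-parallel runtime would.
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    g = np.mean(grads, axis=0)  # simulated all-reduce (gradient averaging)
    w -= 0.1 * g                # synchronized SGD update on shared weights
```

Because the shards are equal-sized, the averaged gradient equals the full-batch gradient, so the workers stay in lockstep; this synchronous averaging is the core invariant that production data-parallel frameworks preserve at scale.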
With the support of these key engineering technologies for AI models, ZTE's intelligent computing AI platform has achieved preliminary success both within ZTE and in collaboration with Chinese telecom operators. At the company level, the AI platform supports the training of AI models across multiple domains, including telecommunications, coding, computer vision (CV), and multi-modal areas. For telecom operators, the AI platform has established training and inference clusters in 31 provinces of an operator group, offering nine core functions such as model training, management, and inference services. It has become an important tool cloud for operators' AI development.