Heterogeneous Distributed Training Framework
Project Facts
Project Creation Date: March 20, 2025
Primary Contact: Lei Huang, huangleiyjy@chinamobile.com; Zhengwei Chen, chenzhengwei@sd.chinamobile.com
Project Lead: Lei Huang, huangleiyjy@chinamobile.com
Committers:
- Zhengwei Chen, chenzhengwei@sd.chinamobile.com
- Yutong Tian, tianyutongcxy@sd.chinamobile.com
Mailing List: computing-force-network@lists.opendev.org
Meetings: No dedicated sub-group meeting; the bi-weekly CFN WG meeting is used instead.
Repository: https://opendev.org/cfn/heterogeneous-distributed-training-framework
StoryBoard: N/A
Open Bugs: N/A
Introduction
Currently, the “resource wall” between different GPUs makes it difficult to build a single heterogeneous resource pool for large-scale model training, so heterogeneous distributed training has become a pressing challenge for the industry. We propose a set of key technologies named Heterogeneous Distributed Training (HDT). With the goal of generality, HDT realizes the industry's first cross-architecture unified heterogeneous training framework, which enables multiple LLMs to be deployed and trained across multiple types of GPUs. We innovatively propose the Inhomogeneous Task Distribution (ITD) algorithm for splitting heterogeneous training tasks: it supports heterogeneous data parallelism and heterogeneous pipeline parallelism, and adaptively adjusts parameters such as the micro-batch size, the number of micro-batches, and the degree of data parallelism on heterogeneous GPUs. We have verified this capability on the LLaMA2 7B and 13B models running on Nvidia and four other types of GPUs: the acceleration ratio reached 95%, the loss converged to 1.8, and the PPL curve converged normally.
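To make the idea behind ITD more concrete, the following minimal Python sketch shows one plausible way to split a global batch across heterogeneous data-parallel ranks in proportion to each GPU's measured throughput, so that faster GPUs process more micro-batches per step. This is an illustrative assumption, not code from the HDT repository; all names, signatures, and numbers are hypothetical.

# Illustrative sketch only -- NOT from the HDT code base.
# Idea: assign each data-parallel rank a number of micro-batches
# proportional to its measured throughput, so heterogeneous GPUs
# finish a training step in roughly the same wall-clock time.

from dataclasses import dataclass

@dataclass
class GpuRank:
    name: str
    throughput: float  # samples/second from a short profiling run (assumed)

def split_global_batch(global_batch: int, micro_batch: int,
                       ranks: list[GpuRank]) -> dict[str, int]:
    """Return a mapping rank name -> micro-batch count per step."""
    total_micro = global_batch // micro_batch
    total_tp = sum(r.throughput for r in ranks)
    # Initial proportional allocation (floored).
    alloc = {r.name: int(total_micro * r.throughput / total_tp) for r in ranks}
    # Hand the remaining micro-batches to the fastest ranks first.
    leftover = total_micro - sum(alloc.values())
    for r in sorted(ranks, key=lambda r: r.throughput, reverse=True)[:leftover]:
        alloc[r.name] += 1
    return alloc

if __name__ == "__main__":
    ranks = [GpuRank("nvidia-gpu", 42.0), GpuRank("vendor-b-gpu", 30.0),
             GpuRank("vendor-c-gpu", 18.0)]
    print(split_global_batch(global_batch=512, micro_batch=4, ranks=ranks))

A similar proportional-split principle could be applied along the pipeline dimension (assigning more transformer layers to faster GPUs), which is how the sketch relates to the heterogeneous pipeline parallelism mentioned above; the actual ITD algorithm may differ in its cost model and adjustment strategy.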
Documentation & Training
N/A
Release Planning & Release Notes
For the 2025 release:
1. Heterogeneous Distributed Training Technology Solution: introduction of the HDT technology solution, including a user guide, architecture description, software, etc.
2. Others: TBD