
Heterogeneous Distributed Training Framework

Project Facts

Project Creation Date: March 20, 2025

Primary Contacts: Lei Huang, huangleiyjy@chinamobile.com; Zhengwei Chen, chenzhengwei@sd.chinamobile.com

Project Lead: Lei Huang, huangleiyjy@chinamobile.com

Committers:

Mailing List: computing-force-network@lists.opendev.org

Meetings: No dedicated sub-group meeting time; the bi-weekly CFN WG meeting is used.

Repository: https://opendev.org/cfn/heterogeneous-distributed-training-framework

StoryBoard: N/A

Open Bugs: N/A

Introduction

Currently, the “resource wall” between different GPUs makes it difficult to build a single heterogeneous resource pool for large-scale model training, so heterogeneous distributed training has become a pressing challenge for the industry. We propose a set of key technologies named Heterogeneous Distributed Training (HDT). With the goal of generalization, HDT realizes the industry's first cross-architecture unified heterogeneous training framework, which enables multiple LLMs to be deployed and trained across multiple types of GPUs. We also propose the Inhomogeneous Task Distribution (ITD) algorithm for splitting heterogeneous training tasks. ITD supports heterogeneous data parallelism and heterogeneous pipeline parallelism, and adaptively adjusts parameters such as the micro-batch size, the number of micro-batches, and the data-parallel (DP) degree on heterogeneous GPUs. We have verified this capability on the LLaMA2 7B and 13B models using NVIDIA GPUs together with four other types of GPUs: the acceleration ratio reached 95%, the loss converged to 1.8, and the PPL curve converged normally.
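As a minimal sketch of the ITD idea (this is not the project's actual code; the function name, throughput figures, and rounding scheme are all assumptions), the following Python snippet splits a global batch across heterogeneous DP ranks in proportion to each GPU's measured throughput, so that faster devices receive larger micro-batches and all ranks finish a step at roughly the same time:

# Hypothetical illustration of ITD-style heterogeneous batch splitting.
# The global batch is divided among data-parallel (DP) ranks in
# proportion to each GPU's measured throughput (samples/sec).

from typing import List


def split_batch_by_throughput(global_batch: int,
                              throughputs: List[float]) -> List[int]:
    """Assign per-rank micro-batch sizes proportional to throughput.

    `throughputs` holds samples/sec measured per GPU type; the values
    and the largest-remainder rounding below are illustrative only.
    """
    total = sum(throughputs)
    raw = [global_batch * t / total for t in throughputs]
    sizes = [int(r) for r in raw]
    # Hand out the leftover samples to the ranks whose ideal share
    # was rounded down the most (largest fractional parts first).
    remainder = global_batch - sum(sizes)
    order = sorted(range(len(raw)),
                   key=lambda i: raw[i] - sizes[i], reverse=True)
    for i in order[:remainder]:
        sizes[i] += 1
    return sizes


if __name__ == "__main__":
    # Assumed mix: one fast NVIDIA GPU and two slower heterogeneous GPUs.
    tput = [100.0, 60.0, 40.0]
    print(split_batch_by_throughput(512, tput))  # -> [256, 154, 102]

With this hypothetical mix of one fast and two slower GPUs, a global batch of 512 is split into [256, 154, 102]; this per-rank adjustment of micro-batch size and count is the kind of adaptive parameter tuning described above.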

Documentation & Training

N/A

Release Planning & Release Notes

Planned for the 2025 release:

1. Heterogeneous Distributed Training Technology Solution: an introduction to the HDT technology solution, including a user guide, architecture description, software, etc.

2. Others: TBD
