Heterogeneous Distributed Training Framework
Project Facts
Project Creation Date: March 20, 2025
Primary Contact: Lei Huang, huangleiyjy@chinamobile.com; Zhengwei Chen, chenzhengwei@sd.chinamobile.com
Project Lead: Lei Huang, huangleiyjy@chinamobile.com
Committers:
- Zhengwei Chen, chenzhengwei@sd.chinamobile.com
- Yutong Tian, tianyutongcxy@sd.chinamobile.com
Mailing List: computing-force-network@lists.opendev.org
Meetings: No dedicated sub-group meeting; the bi-weekly CFN WG meeting is used instead.
Repository: https://opendev.org/cfn/heterogeneous-distributed-training-framework
StoryBoard: N/A
Open Bugs: N/A
Introduction
Currently, the “resource wall” between different GPUs makes it difficult to build a single heterogeneous resource pool for large-scale model training, so heterogeneous distributed training has become a pressing challenge for the industry. We propose a set of key technologies named Heterogeneous Distributed Training (HDT). With the goal of generality, HDT realizes the industry's first cross-architecture unified heterogeneous training framework, which enables multiple LLMs to be deployed and trained across multiple types of GPUs. We innovatively propose the Inhomogeneous Task Distribution (ITD) algorithm for splitting heterogeneous training tasks: it supports heterogeneous data parallelism and heterogeneous pipeline parallelism, and adaptively adjusts parameters such as the micro-batch size, the number of micro-batches, and the degree of data parallelism on heterogeneous GPUs. We have verified this capability on the LLaMA2 7B and 13B models running on Nvidia and four other types of GPUs: the acceleration ratio reached 95%, the loss converged to 1.8, and the PPL curve converged normally.
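To make the idea behind ITD more concrete, the following minimal Python sketch shows one plausible way to split a global batch across heterogeneous data-parallel ranks in proportion to each GPU's measured throughput, so that faster GPUs process more micro-batches per step. This is an illustrative assumption, not code from the HDT repository; all names, signatures, and numbers are hypothetical.

# Illustrative sketch only -- NOT from the HDT code base.
# Idea: assign each data-parallel rank a number of micro-batches
# proportional to its measured throughput, so heterogeneous GPUs
# finish a training step in roughly the same wall-clock time.

from dataclasses import dataclass

@dataclass
class GpuRank:
    name: str
    throughput: float  # samples/second from a short profiling run (assumed)

def split_global_batch(global_batch: int, micro_batch: int,
                       ranks: list[GpuRank]) -> dict[str, int]:
    """Return a mapping rank name -> micro-batch count per step."""
    total_micro = global_batch // micro_batch
    total_tp = sum(r.throughput for r in ranks)
    # Initial proportional allocation (floored).
    alloc = {r.name: int(total_micro * r.throughput / total_tp) for r in ranks}
    # Hand the remaining micro-batches to the fastest ranks first.
    leftover = total_micro - sum(alloc.values())
    for r in sorted(ranks, key=lambda r: r.throughput, reverse=True)[:leftover]:
        alloc[r.name] += 1
    return alloc

if __name__ == "__main__":
    ranks = [GpuRank("nvidia-gpu", 42.0), GpuRank("vendor-b-gpu", 30.0),
             GpuRank("vendor-c-gpu", 18.0)]
    print(split_global_batch(global_batch=512, micro_batch=4, ranks=ranks))

A similar proportional-split principle could be applied along the pipeline dimension (assigning more transformer layers to faster GPUs), which is how the sketch relates to the heterogeneous pipeline parallelism mentioned above; the actual ITD algorithm may differ in its cost model and adjustment strategy.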
Documentation & Training
N/A
Release Planning & Release Notes
For the 2025 release:
1. Heterogeneous Distributed Training Technology Solution: introduction of the HDT technology solution, including a user guide, architecture description, software, etc.
2. Others: TBD