distributed package

This package contains various utilities to finalize model weight gradients on each rank before the optimizer step. This includes a distributed data parallelism wrapper to all-reduce or reduce-scatter the gradients across data-parallel replicas, and a finalize_model_grads function to synchronize gradients across different parallelism modes (e.g., 'tied' layers on different pipeline stages, or gradients for experts in a MoE on different ranks due to expert parallelism).
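
The sketch below shows where these pieces sit in a training step: wrap each model chunk with the data-parallel wrapper, run forward/backward (bucketed gradient reduction can overlap with backprop), then call finalize_model_grads before the optimizer step. The Megatron-LM names are from core.distributed in 0.8.0; the driver function, its arguments, and zero_grad_buffer usage are assumptions for illustration, not the library's actual training loop.

    from megatron.core.distributed import (
        DistributedDataParallel,
        DistributedDataParallelConfig,
        finalize_model_grads,
    )


    def wrap_and_train(transformer_config, bare_model_chunks, data_loader,
                       forward_backward, optimizer):
        """Hypothetical driver showing how the pieces of this package fit together."""
        ddp_config = DistributedDataParallelConfig(overlap_grad_reduce=True)

        # Wrap each model chunk so its grads land in a contiguous, bucketed buffer.
        model_chunks = [
            DistributedDataParallel(transformer_config, ddp_config, chunk)
            for chunk in bare_model_chunks
        ]

        for batch in data_loader:
            for chunk in model_chunks:
                chunk.zero_grad_buffer()              # reset the contiguous grad buffers (assumed API)
            loss = forward_backward(model_chunks, batch)  # bucket reduces may overlap with backprop
            finalize_model_grads(model_chunks)        # remaining cross-rank syncs (DP / PP / EP)
            optimizer.step()
        return loss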

Submodules

distributed.distributed_data_parallel

Model wrapper for distributed data parallelism. Stores gradients in a contiguous buffer, and supports overlapping communication (all-reduce or reduce-scatter) with backprop computation by breaking the full model's gradients into smaller buckets and running the all-reduce / reduce-scatter on each bucket asynchronously; a configuration sketch follows this entry.

core.distributed.distributed_data_parallel
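
A minimal sketch of enabling bucketed, overlapped gradient reduction when wrapping a module. The DistributedDataParallelConfig field names and the constructor argument order follow the 0.8.0 API as understood here; treat them, and the bucket_size unit, as assumptions to be checked against the API reference linked above.

    from megatron.core.distributed import DistributedDataParallel, DistributedDataParallelConfig


    def wrap_for_overlap(transformer_config, module):
        """Hypothetical helper: wrap a bare module with bucketed, overlapped grad reduction."""
        ddp_config = DistributedDataParallelConfig(
            grad_reduce_in_fp32=True,         # accumulate / reduce gradients in fp32
            overlap_grad_reduce=True,         # launch per-bucket reduces asynchronously during backprop
            use_distributed_optimizer=False,  # True switches the all-reduce to a reduce-scatter
            bucket_size=40_000_000,           # max parameters per bucket (assumed to be an element count)
        )
        return DistributedDataParallel(transformer_config, ddp_config, module)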

distributed.finalize_model_grads

Finalize model gradients for the optimizer step across all used parallelism modes. Synchronizes the all-reduce / reduce-scatter of model gradients across DP replicas, all-reduces layernorm gradients for sequence parallelism, embedding gradients across the first and last pipeline stages (if not tied), and expert gradients for expert parallelism; a conceptual sketch of one such synchronization follows this entry.

core.distributed.finalize_model_grads
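
To make the kind of synchronization concrete, here is a conceptual stand-in (not Megatron's actual code) for one of these steps: summing the shared word-embedding gradient held by the first and last pipeline stages, written with plain torch.distributed. The embedding_group argument is a hypothetical process-group name, and the main_grad attribute (the gradient slot inside the contiguous buffer) is an assumption about the wrapped parameters.

    import torch.distributed as dist


    def allreduce_shared_embedding_grad(embedding_weight, embedding_group):
        """Conceptual sketch: sum the tied-embedding gradient across the
        first and last pipeline-stage ranks in embedding_group."""
        grad = getattr(embedding_weight, "main_grad", None)  # grad slot in the contiguous buffer, if any
        if grad is None:
            grad = embedding_weight.grad
        if grad is not None:
            dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=embedding_group)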

Module contents

Contains functionality to synchronize gradients across different ranks before the optimizer step.

core.distributed