[feat][BREAKING] Megatron: Support learning rate scheduler #1701
Merged
Conversation
vermouth1992 approved these changes on Jun 7, 2025.
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request on Jun 10, 2025. The commit message reproduces the PR description:
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Support the learning rate (lr) scheduler in Megatron.

### High-Level Design

There are still some API differences from the FSDP optimizer.

### Specific Changes

> List the specific changes.

### API

```yaml
optim:
  lr: 1e-6
  clip_grad: 1.0
  total_training_steps: -1  # must be overridden by the program
  lr_warmup_init: 0.0  # initial learning rate for warmup, defaults to 0.0
  lr_warmup_steps: -1  # prioritized; negative values delegate to lr_warmup_steps_ratio
  lr_warmup_steps_ratio: 0.  # the total steps will be injected at runtime
  lr_decay_steps: null
  lr_decay_style: linear  # select from constant/linear/cosine/inverse_square_root
  min_lr: 0.0  # minimum learning rate, defaults to 0.0
  weight_decay: 0.01
  weight_decay_incr_style: constant  # select from constant/linear/cosine
  lr_wsd_decay_style: exponential  # select from constant/exponential/cosine
  lr_wsd_decay_steps: null
  use_checkpoint_opt_param_scheduler: False  # use the checkpoint optimizer parameter scheduler
```

Note that there are some API differences between the Megatron optimizer and the FSDP optimizer:

- The Megatron optimizer scheduler names the period after `lr_warmup` as `lr_decay_steps`, so `warmup_style` actually refers to the style of lr decay after warmup.
- The Megatron optimizer also supports a weight-decay scheduling mechanism.
- `use_checkpoint_opt_param_scheduler` determines whether to use the checkpoint optimizer parameter scheduler. If set to True, the optimizer parameter scheduler is saved in the checkpoint and loaded from it when resuming training.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this
```

### Test

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
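To make the relationships between these config fields concrete, here is a minimal Python sketch of the schedule semantics described above: how the warmup step count is resolved and how each `lr_decay_style` behaves after warmup. This is only an illustration, not verl's or Megatron's actual scheduler; the function names, the inverse-square-root formula, and the exact warmup/decay boundary handling are assumptions.

```python
import math

# Illustrative sketch only (not verl's or Megatron's implementation).

def resolve_steps(total_training_steps, lr_warmup_steps=-1,
                  lr_warmup_steps_ratio=0.0, lr_decay_steps=None):
    """Resolve warmup/decay step counts from the optim config fields above."""
    # lr_warmup_steps is prioritized; negative values delegate to the ratio,
    # which is applied to the total_training_steps injected at runtime.
    if lr_warmup_steps >= 0:
        warmup = lr_warmup_steps
    else:
        warmup = int(lr_warmup_steps_ratio * total_training_steps)
    # A null lr_decay_steps is taken here to mean "decay over the remaining steps".
    decay = lr_decay_steps if lr_decay_steps is not None else total_training_steps - warmup
    return warmup, decay


def lr_at_step(step, lr=1e-6, lr_warmup_init=0.0, min_lr=0.0,
               warmup_steps=0, decay_steps=1, lr_decay_style="linear"):
    """Linear warmup from lr_warmup_init to lr, then decay toward min_lr."""
    if step < warmup_steps:
        return lr_warmup_init + (lr - lr_warmup_init) * step / max(warmup_steps, 1)
    t = min((step - warmup_steps) / max(decay_steps, 1), 1.0)  # decay progress in [0, 1]
    if lr_decay_style == "constant":
        return lr
    if lr_decay_style == "linear":
        return min_lr + (lr - min_lr) * (1.0 - t)
    if lr_decay_style == "cosine":
        return min_lr + (lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * t))
    if lr_decay_style == "inverse_square_root":
        return max(min_lr, lr / math.sqrt(1.0 + t * max(decay_steps, 1)))
    raise ValueError(f"unknown lr_decay_style: {lr_decay_style}")


# Example: 1000 total steps, 10% warmup (via the ratio), cosine decay down to min_lr.
warmup, decay = resolve_steps(1000, lr_warmup_steps=-1, lr_warmup_steps_ratio=0.1)
print(lr_at_step(500, lr=1e-6, warmup_steps=warmup, decay_steps=decay,
                 lr_decay_style="cosine"))
```

In the PR itself, these semantics are handled by the Megatron optimizer parameter scheduler; the sketch only shows how the config fields relate to one another.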
whatadayG pushed a commit to whatadayG/verl that referenced this pull request on Sep 5, 2025, with the same commit message.