Conversation

ETOgaosion (Collaborator)

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Support the LR scheduler in Megatron.

High-Level Design

The API still differs somewhat from FSDP's optimizer configuration.

Specific Changes

List the specific changes.

API

```yaml
optim:
  lr: 1e-6
  clip_grad: 1.0
  total_training_steps: -1  # must be overridden by the program at runtime
  lr_warmup_init: 0.0  # initial learning rate during warmup, defaults to 0.0
  lr_warmup_steps: -1  # takes priority; negative values delegate to lr_warmup_steps_ratio
  lr_warmup_steps_ratio: 0.0  # used when lr_warmup_steps is negative; the total step count is injected at runtime
  lr_decay_steps: null
  lr_decay_style: linear  # one of constant/linear/cosine/inverse_square_root
  min_lr: 0.0  # minimum learning rate, defaults to 0.0
  weight_decay: 0.01
  weight_decay_incr_style: constant  # one of constant/linear/cosine
  lr_wsd_decay_style: exponential  # one of constant/exponential/cosine
  lr_wsd_decay_steps: null
  use_checkpoint_opt_param_scheduler: False  # save/restore the optimizer parameter scheduler via checkpoints
```

Notice that there are some API differences between the Megatron optimizer and the FSDP optimizer (see the sketch after this list for how the step-related fields interact):

  • The Megatron optimizer scheduler names the period after lr_warmup as lr_decay_steps, so warmup_style here actually describes the style of LR decay after warmup.
  • The Megatron optimizer also supports a weight-decay schedule (configured via weight_decay_incr_style).
  • use_checkpoint_opt_param_scheduler determines whether to use the checkpoint optimizer parameter scheduler. If set to True, the scheduler state is saved in the checkpoint and restored from it when resuming training.
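
To make the interplay between these fields concrete, here is a minimal, illustrative sketch of how the warmup and decay step counts could be resolved from the config above. The helper name and the defaulting of lr_decay_steps to the full training horizon are assumptions for illustration, not verl's actual implementation.

```python
# Illustrative sketch only: resolving warmup/decay step counts from the `optim`
# block above. `resolve_scheduler_steps` is a hypothetical helper, not verl code.

def resolve_scheduler_steps(optim_cfg: dict, total_training_steps: int) -> tuple[int, int]:
    # `total_training_steps` is injected by the trainer at runtime
    # (the config default of -1 is only a placeholder).
    warmup_steps = optim_cfg.get("lr_warmup_steps", -1)
    if warmup_steps is None or warmup_steps < 0:
        # A negative value delegates to the ratio-based setting.
        warmup_steps = int(optim_cfg.get("lr_warmup_steps_ratio", 0.0) * total_training_steps)

    decay_steps = optim_cfg.get("lr_decay_steps")
    if decay_steps is None:
        # Assumption for illustration: decay over the whole training horizon when unset.
        decay_steps = total_training_steps
    return warmup_steps, decay_steps


if __name__ == "__main__":
    cfg = {"lr_warmup_steps": -1, "lr_warmup_steps_ratio": 0.1, "lr_decay_steps": None}
    print(resolve_scheduler_steps(cfg, total_training_steps=1000))  # -> (100, 1000)
```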

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 
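
The PR leaves the usage example empty, so here is a hedged sketch of overriding the new scheduler fields with OmegaConf (which verl's Hydra-based configs build on). The nesting of the optim block inside the full trainer config and the chosen values are illustrative assumptions.

```python
# Hedged usage sketch: merging scheduler-related overrides into an `optim`
# config block with OmegaConf. Field names follow the YAML above; where this
# block sits inside verl's full trainer config is an assumption here.
from omegaconf import OmegaConf

base = OmegaConf.create({
    "optim": {
        "lr": 1e-6,
        "clip_grad": 1.0,
        "lr_decay_style": "linear",
        "use_checkpoint_opt_param_scheduler": False,
    }
})

overrides = OmegaConf.create({
    "optim": {
        "lr_warmup_steps_ratio": 0.05,               # spend 5% of total steps in warmup
        "lr_decay_style": "cosine",                  # cosine decay after warmup
        "min_lr": 1e-7,
        "use_checkpoint_opt_param_scheduler": True,  # persist scheduler state in checkpoints
    }
})

cfg = OmegaConf.merge(base, overrides)
print(OmegaConf.to_yaml(cfg))
```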

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@vermouth1992 merged commit 01ae019 into volcengine:main on Jun 7, 2025
36 of 41 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 10, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025