v0.2 release

@eric-haibin-lin released this 15 Feb 15:18 · 1304 commits to main since this release · 828df7e

Highlights

New algorithms and features (detailed in the Changelog below)

Performance optimization:

  • Remove padding tokens (i.e., sequence packing). A significant throughput increase is expected for Llama, Mistral, Gemma, and Qwen2 transformer models; see the sketch after this list. Documentation
actor_rollout_ref.model.use_remove_padding=True
critic.model.use_remove_padding=True
  • Dynamic batch size. A significant throughput increase for variable-length sequences; also covered in the sketch after this list. Documentation and example
actor_rollout_ref.actor.ppo_max_token_len_per_gpu
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu
critic.ppo_max_token_len_per_gpu
critic.forward_micro_batch_size_per_gpu
reward_model.forward_micro_batch_size_per_gpu
  • Ulysses sequence parallelism for long sequences.
actor_rollout_ref.actor.ulysses_sequence_parallel_size
critic.ulysses_sequence_parallel_size
reward_model.ulysses_sequence_parallel_size
  • vLLM v0.7+ integration (preview). For the Qwen2 PPO example, rollout time drops by 25% compared to vLLM v0.6.3, and by 45% when CUDA graph is enabled; see the note after this list. Documentation
actor_rollout_ref.rollout.enforce_eager=False
actor_rollout_ref.rollout.free_cache_engine=False
  • Liger kernel integration.
model.use_liger=True
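
The remove-padding and dynamic-batch-size features above share one mechanism: operate on real tokens only, never on pad positions. Below is a minimal, illustrative sketch of the idea; the function names and tensor layout are assumptions for exposition, not verl's actual API:

```python
import torch

def remove_padding(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Flatten a padded [batch, seq] batch into a single 1-D stream of real
    tokens, plus cumulative sequence lengths (the layout consumed by
    varlen/flash-attention style kernels)."""
    seqlens = attention_mask.sum(dim=1)            # real length of each row
    packed = input_ids[attention_mask.bool()]      # drop all pad positions
    cu_seqlens = torch.nn.functional.pad(torch.cumsum(seqlens, dim=0), (1, 0))
    return packed, cu_seqlens

def dynamic_micro_batches(seqlens, max_tokens_per_gpu):
    """Greedily group samples so each micro-batch stays under a token budget,
    the same idea behind the *_max_token_len_per_gpu knobs listed above."""
    batches, current, used = [], [], 0
    for idx, n in enumerate(seqlens):
        n = int(n)
        if current and used + n > max_tokens_per_gpu:
            batches.append(current)
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        batches.append(current)
    return batches
```

With padding removed, compute scales with the number of real tokens rather than the padded length, and the token-budget batching keeps every micro-batch near the budget no matter how sequence lengths vary.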
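On the vLLM flags: enforce_eager and CUDA graphs are vLLM's own concepts, which verl's rollout config passes through. A small standalone example of the underlying vLLM behavior (the model name is chosen arbitrarily for illustration):

```python
from vllm import LLM

# enforce_eager=False lets vLLM capture CUDA graphs for decoding, which is
# where the extra rollout speedup quoted above comes from. Keeping the
# KV-cache engine alive between rollouts (free_cache_engine=False in verl)
# is required for the CUDA-graph path, per the linked documentation.
llm = LLM(model="Qwen/Qwen2-7B-Instruct", enforce_eager=False)
print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```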

Changelog

New Features

  1. Algorithm Support:

    • Added support for the GRPO algorithm (#124); see the sketch after this list.
    • Implemented REINFORCE++ algorithm (#228).
    • Added the ReMax algorithm (#234).
  2. Performance Improvements:

    • Enabled dynamic batch size support (#118).
    • Added meta device initialization and parallel load for FSDP to avoid OOMs during init (#123).
    • Improved gradient accumulation in sequence balance (#141).
    • Added ref/RM offload support (#121).
    • Added LoRA support for SFT (#127).
    • Added support for rmpad/data-packing in FSDP with transformers (#91).
    • Integrated the Liger kernel (#133).
  3. Experiment Tracking:

    • Integrated SwanLab for experiment tracking with online/offline mode and local dashboard support (#218).
    • Added MLflow support (#74).
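
For context on GRPO (item 1 above): its core departure from PPO is a critic-free, group-normalized advantage, where several responses are sampled per prompt and each reward is standardized within its group. A minimal sketch of that computation (shapes and names are illustrative, not verl's API):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages in the style of GRPO.

    rewards: [num_prompts, group_size], one scalar reward per sampled
    response. Standardizing within each group replaces PPO's learned critic.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```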

Bug Fixes

  1. Critical Fixes:

    • Fixed checkpoint save with existing directories (#174).
    • Fixed incorrect response_attention_mask in vLLM rollout (#213).
    • Fixed gradient accumulation loss value (#102).
    • Fixed reward model issues with TokenClassification models (#99).
  2. Code Fixes:

    • Fixed redundant non_zero_mask (#152).
    • Fixed validation dp_size (#90).
    • Fixed response_mask index (#60).

Improvements

  1. Performance:

    • Improved memory efficiency in logprobs_from_logits_v2 (#220); see the sketch after this list.
    • Enabled multiprocess dataloader in SFT trainer (#122).
    • Added MFU calculation support (#117).
  2. Miscellaneous:

    • Added option to log validation generations to wandb (#177).
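
Regarding the logprobs_from_logits_v2 change: the memory win in this kind of optimization comes from never materializing a full [batch, seq, vocab] log-softmax tensor just to read out one value per position. A minimal sketch of the idea (not verl's exact implementation):

```python
import torch

def logprobs_of_labels(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Log-probabilities of the chosen labels without allocating a full
    log-softmax output.

    logits: [batch, seq, vocab]; labels: [batch, seq].
    """
    # Pick out only the logit of each chosen token: [batch, seq].
    chosen = torch.gather(logits, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    # log p(label) = logit[label] - logsumexp(logits). The logsumexp reduces
    # over the vocab axis immediately instead of keeping a full
    # [batch, seq, vocab] intermediate around.
    return chosen - torch.logsumexp(logits, dim=-1)
```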

Deprecations and Breaking Changes

  1. Breaking Changes:
    • Changed micro_batch_size to micro_batch_size_per_gpu (#136); see the migration note after this list.
    • Removed @ray.remote on workers to allow inheritance (#61).
    • Refactored old_log_prob into a separate function (#129).
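
For the micro-batch rename: batch-size knobs are now specified per GPU rather than as a single global count, so the old global value corresponds to the new per-GPU value times the data-parallel world size. A hypothetical migration for an 8-GPU run (the exact key prefix depends on your config):

Before: actor_rollout_ref.actor.ppo_micro_batch_size=8
After: actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1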

Contributors

A big thank you to all the contributors who made this release possible:
@zhanluxianshen @xingyaoww @fzyzcjy @emergenz @openhands-agent @ZSL98 @YSLIU627 @ZefanW @corbt @jaysonfrancis @hiyouga @Jiayi-Pan @hongpeng-guo @eltociear @chujiezheng @PanAndy @zwhe99 @pcmoritz @huiyeruzhou @VPeterV @uygnef @zhiqi-0 @ExtremeViscent @liziniu @nch0w @Cppowboy @TonyLianLong @4332001876 @tyler-romero @ShaohonChen @kinman0224 @willem-bd @bebetterest @WeiXiongUST @dignfei


The PyPI package will be available soon! Please let us know on GitHub if there's a problem extending an RL training recipe based on the pip-installed version of verl.

Full Changelog: v0.1...v0.2
