Releases: volcengine/verl
v0.5.0: agentic RL rollout, prototypes for disaggregated async training & GenerativeRM, better rollout load balance & improved sglang+megatron/vlm support
Highlights
Agentic RL rollout interface [beta]
verl v0.5 introduces the AgentLoop abstraction that allows easy extension to custom rollout with tool/agent interactions. Server-based asynchronous rollout is adopted to efficiently utilize GPUs. verl provides a few example agent loop implementations including:
- Multi-turn conversations and tool calls
- LangGraph-based Agent
Please check the documentation for the system architecture design.
Disaggregated placement & async training [prototype]
verl v0.5 includes a community-contributed one-step-off async training recipe, with trainer and rollout deployed on disaggregated resources and off-policy model updates with staleness = 1. In a small-scale experiment, the reference recipe provides a 20-40% throughput gain over the on-policy baseline, depending on the configuration. Please check out the code and documentation for example configurations.
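As a rough illustration of what staleness = 1 means, the sketch below simulates a one-step-off schedule in plain Python. It assumes the rollout for the next step is launched before the trainer applies the current update, so every batch after the first is trained on by a policy one version newer than the one that generated it. The function and field names (`one_step_off_schedule`, `generated_by`, `trained_by`) are hypothetical and are not taken from the recipe.

```python
def one_step_off_schedule(num_steps):
    """Simulate which policy version generates and trains on each batch.

    Illustrative only: in a real deployment rollout and training run
    concurrently on separate resources; here they are interleaved.
    """
    policy_version = 0
    # Prefetch: the first batch is generated with the initial policy.
    pending = {"step": 0, "generated_by": policy_version}
    log = []
    for step in range(num_steps):
        batch = pending
        # Launch the next rollout before the update, so it uses the
        # current weights, which are about to become stale.
        pending = {"step": step + 1, "generated_by": policy_version}
        # Train on the batch that was generated earlier.
        log.append(
            {
                "step": batch["step"],
                "generated_by": batch["generated_by"],
                "trained_by": policy_version,
                "staleness": policy_version - batch["generated_by"],
            }
        )
        policy_version += 1  # the update produces a new policy version
    return log


schedule = one_step_off_schedule(4)
for entry in schedule:
    print(entry)
```

In this toy schedule only the very first batch is on-policy; every later batch is exactly one update behind, which is the trade that lets generation and training overlap.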
Remote generative reward models [prototype]
A recipe is provided as a prototype to demonstrate the recommended way to use generative reward models in verl. Documentation and code.
New features
- LoRA RL support for VLMs: #2182
- Better checkpoint manager support for SFT trainer #2292
- Support rollout trajectory tracing and RolloutViewer with improved debug-ability and visualization
- Megatron with mbridge integration, which better supports hf model loading into megatron #2064
Important fixes & improvements
- Fixed an issue with FSDP2 state_dict memory usage caused by torch 2.6. Using either verl v0.5 or torch 2.7 avoids the OOMs #2606
- Significantly reduced the overhead of the vllm async server (vs. the vllm engine) #2246
- Fixed sglang + Megatron TP16 #2336
- Improved SGLang + Megatron weight resharding by 10x #2418 and MoE weight resharding by 3x #2692
- Significantly improved rollout load balancing for GRPO-like algorithms by repeating samples before dispatching them #2324
Breaking changes and deprecations
Full list: #2270
Rollout
- When generate_sequences is called with sampling params n>1, the DataProto repeat behavior changes:
  - chunk-dispatch-repeat: DataProto is chunked and dispatched to rollout workers, then repeated in rollout workers.
  - repeat-chunk-dispatch: DataProto is repeated by n in driver, then chunked and dispatched to rollout workers.
  verl switches from chunk-dispatch-repeat to repeat-chunk-dispatch. This change may break almost all recipes and projects using verl GRPO as submodules. #2324
- verl.workers.rollout.sglang_rollout.AsyncSglangServer is now renamed to AsyncSGLangServer
- vllm <= v0.6 support is dropped
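To make the two orderings in the DataProto change concrete, here is a minimal sketch that uses plain Python lists in place of DataProto. The helper names (`chunk`, `repeat_interleave`, `chunk_dispatch_repeat`, `repeat_chunk_dispatch`) are hypothetical, and the uneven chunking is a simplification for illustration; verl's own chunking rules may differ.

```python
def chunk(items, num_workers):
    """Split items into num_workers contiguous chunks, as evenly as possible."""
    base, extra = divmod(len(items), num_workers)
    chunks, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def repeat_interleave(items, n):
    """Repeat every item n times, keeping the copies adjacent."""
    return [item for item in items for _ in range(n)]


def chunk_dispatch_repeat(prompts, num_workers, n):
    # Old order: chunk in the driver, then repeat inside each worker.
    return [repeat_interleave(c, n) for c in chunk(prompts, num_workers)]


def repeat_chunk_dispatch(prompts, num_workers, n):
    # New order: repeat in the driver, then chunk and dispatch.
    return chunk(repeat_interleave(prompts, n), num_workers)


prompts = ["p0", "p1"]
old = chunk_dispatch_repeat(prompts, num_workers=4, n=2)
new = repeat_chunk_dispatch(prompts, num_workers=4, n=2)
print("old:", old)
print("new:", new)
```

With two prompts and four workers, the old order leaves two workers idle while the new order gives every worker one sample. The flip side is that code which assumed workers receive unrepeated prompts and return n samples each now sees already-repeated inputs, which is the likely source of breakage in downstream recipes.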
Multi-turn
- We are moving multi-turn support from ChatScheduler to AgentLoop to improve usability. #2124
Megatron
- Megatron recomputation options are moved to *.megatron.override_transformer_config. #2651 Default values are:
override_transformer_config:
  recompute_granularity: null
  recompute_modules:
    - core_attn
  recompute_method: null
  recompute_num_layers: null
- Merged config actor_rollout_ref.(actor, ref, rollout).profiler to actor_rollout_ref.profiler
What's Changed
Trainer & FSDP
- [fsdp] fix: Change the data in the update_actor function from to.('cpu') to to.(get_device_id()) by @Keilo001 in #2477
- [fsdp] fix: vlm dynamic batch & unify dynamic batch api by @hiyouga in #2524
- [fsdp] fix: change geo3k model name from non-vl to vl by @nanjiangwill in #2555
- [trainer, recipe] feat: add support for external generative reward models by @yyDing1 in #2121
- [trainer] fix: fix split placement by @vermouth1992 in #2227
- [trainer, vllm] feat: add lora exclude_modules to support VL model lora training by @Cccei000 in #2182
- [trainer] fix: pre-commit broken by #2354 by @ETOgaosion in #2358
- [trainer, cfg] feat: add BaseConfig for all dataclass configs. Introduce dataclass for algorithm related configs by @eric-haibin-lin in https://github.com/
- [trainer] fix: Use safe masked mean/sum to handle NaN values outside the mask by @Yangruipis in #2377
- [trainer, data] feat: Dynamic Data Generation by @jwong8314 in #2312
- [trainer] fix: use .keys() to check 'response_mask' in TensorDict by @askender in #2491
- [trainer] fix: Allow FSDP2 when doing strategy check by @HollowMan6 in #2497
- [trainer] refactor: no need to call load_reward_manager in compute_reward_async by @eric-haibin-lin in #2557
- [trainer, fsdp, vllm, recipe] feat: one step off async training recipe by @imh966 in #2231
- [trainer] fix: maybe_filter_out_long_prompts on image and video by @firefighter-eric in #2553
- [trainer] refactor: Training Engine Interface and Development Plan by @ZihengJiang in #1977
- [trainer] feat: Add FSDPCheckpointManager for SFTtrainer, support resume training, manage the number of CKPTS in keep by @Pursuer-Hsf in #2292
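The "safe masked mean/sum" fix listed above (#2377) addresses a general pitfall that is easy to reproduce: multiplying by a zero mask does not neutralize NaN, because NaN * 0 is still NaN. The sketch below shows the pitfall and one way around it using plain Python floats. It is not the code from that PR, and both function names are hypothetical.

```python
import math


def naive_masked_mean(values, mask):
    # Multiplying by a 0 mask does not remove NaN: nan * 0 is still nan.
    total = sum(v * m for v, m in zip(values, mask))
    return total / sum(mask)


def safe_masked_mean(values, mask):
    # Select the masked-in values first, so entries outside the mask
    # never enter the arithmetic.
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept)


values = [1.0, 3.0, math.nan]  # the last entry lies outside the mask
mask = [1, 1, 0]

print(naive_masked_mean(values, mask))
print(safe_masked_mean(values, mask))
```

With tensors, the same idea is commonly expressed by replacing out-of-mask entries with zeros (for example via `torch.where`) before reducing.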
Rollout & SGLang
- [rollout] feat: add agent loop by @wuxibin89 in #2124
- [rollout] feat: add zeromq vllm distributed executor by @wuxibin89 in #2246
- [BREAKING][rollout] refactor: drop vllm v0.5.4 and v0.6.3 support by @eric-haibin-lin in #2257
- [rollout] feat: Allow customization of async server class by @ultmaster in #2326
- [rollout] fix: fix hf rollout and add single gpu test by @eric-haibin-lin in #2371
- [BREAKING][rollout] feat: repeat DataProto when n>1 in driver instead of rollout workers by @wuxibin89 in #2324
- [misc] feat: trace rollout generation and tool calls using weave by @chenhaiq in #2345
- [cfg] refactor: make the rollout & ref configs more modular by @eric-haibin-lin in #2410
- [perf] feat: add range tag to start/stop profile; clean actor_rollout_ref.profiler by @davidmlw in #2456
- [rollout] feat: support mlflow in rollout trace by @chenhaiq in #2440
- [rollout] feat: add ReactAgentLoop based on LangGraph by @wuxibin89 in #2463
- [rollout] fix: fix bug for remax when the rollout mode is async by @none0663 in #2574
- [tool] chore: introduce RolloutViewer TUI tools by @Yangruipis in #2469
- [rollout,vllm] fix: A major issue in random sampling of vllm engine by @guanning03 in #2646
- [tool] chore: Add log for AsyncRolloutRequest ID, and rollout viewr to support request id display and search by @Hecate0821 in https://github.com/volcengine/
- [rollout] fix: use flashattn3 backend in sglang to avoid error in tool call by @chenhaiq in #2244
- [rollout] fix: Make free_cache_engine option workable in latest vLLM/SGLang by @HollowMan6 in #1464
- [rollout] fix: #1646 stop words for sglang rollout by @linxxx3 in #1991
- [sglang, rollout] refactor: use torch.Tensor in async rollout schemas by @nanjiangwill in #2362
- [rollout] fix: sglang async fail with Multi-stage Awake feature by @chenhaiq in #2365
- [sglang] feat: Add multi-interaction registry support and testing by @SwordFaith in #2184
- [sglang] feat: Repeat sampling parameter n into requests of GRPO in SGLang by @zhaochenyang20 in #2258
- [sglang,tool] feat: Add support for tools that generate multimodal data by @nanjiangwill in #2146
- [sglang] fix: only wake up weights on infer_tp 0 by @zhaochenyang20 in #2403
- [sglang] fix: Import Error in the latest sglang by @yyDing1 in #2275
- [sglang] fix: Fix qwen2vl weight keys issue by @hebiao064 in #2434
- [sglang] fix: Only flush cache on TP rank=0. by @SuperCB in https...
v0.4.1 patch release: checkpoint fixes for MoE EP & LoRA, OpenAI/MCP tool calling schema, and SGLang memory optimizations
Key changes
PPO fixes and enhancements
- Fixed a bug related to vf_loss coefficient for PPO, which was introduced in v0.4 #2016
- Improved numerical stability when clamping KL divergence-related values #1779
Checkpoints related
- Switched Megatron checkpointer to mcore's dist_checkpoint, which reduces peak memory usage and improves distributed model saving performance via *.checkpoint.async_save=True.
- [BREAKING] Megatron's checkpoint directory layout is updated accordingly. Documentation
- [BREAKING] Checkpoint manager constructor now takes checkpoint_config as the keyword to replace checkpoint_contents #2125
- Checkpoint merger for LoRA is fixed #1821 via python -m verl.model_merger merge ... Documentation
Experimental function calling & MCP interfaces
These features are experimental and subject to changes in the future
- Chat completion scheduler now speaks the OpenAI function-calling schema with an OpenAI server #1831
- SGLang rollout with MCP client #1948 Documentation
- SGLang multi-turn rollout code walk-through documentation
- Multi-turn interaction system with SGLang, enabling dynamic conversational feedback and iterative problem-solving scenarios #1630, the building block for SCoRe
New models and recipes
- New recipe/entropy to reproduce the paper The Entropy Mechanism of Reinforcement Learning for Large Language Model Reasoning with Clip-Cov and KL-Cov methods
- Megatron support for Qwen-2.5-VL #1286
- Multi-turn SFT support for Qwen-3 #1889
- Enhanced kimi-vl with sequence parallelism #1899
SGLang optimizations
- Memory usage of rollout with SGLang is further optimized. Blog (requires sglang v0.4.8 #2187)
- Async multi-turn rollout with multi-modal support is now available in SGLang #2014
Other performance profiling & optimizations
- Nsight system profiling is available. Documentation
- FSDP prefetch can be enabled via [actor|ref].fsdp_config.forward_prefetch=True #1927
- The memory usage for entropy computation can be drastically reduced with fused kernels using [actor|ref].entropy_checkpointing=True and [actor|ref].entropy_from_logits_with_chunking=True #1927
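For intuition about computing entropy from logits in chunks, the sketch below uses pure Python and the identity H = logsumexp(z) - sum_i softmax(z)_i * z_i, processing tokens a fixed number at a time. It is a conceptual illustration only, not verl's kernel; the function names and the `chunk_size` parameter are hypothetical.

```python
import math


def entropy_from_logits(logits):
    """Entropy of softmax(logits) for one token, computed stably."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    log_z = m + math.log(total)
    # H = logsumexp(z) - sum_i softmax(z)_i * z_i
    return log_z - sum(e / total * z for e, z in zip(exps, logits))


def entropy_with_chunking(token_logits, chunk_size):
    """Compute per-token entropy chunk by chunk.

    In a tensor framework, this bounds the softmax intermediates to one
    chunk of tokens instead of the whole sequence.
    """
    entropies = []
    for start in range(0, len(token_logits), chunk_size):
        chunk = token_logits[start:start + chunk_size]
        entropies.extend(entropy_from_logits(row) for row in chunk)
    return entropies


token_logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]]
chunked = entropy_with_chunking(token_logits, chunk_size=2)
unchunked = entropy_with_chunking(token_logits, chunk_size=len(token_logits))
print(chunked)
```

Chunking changes only the peak size of the intermediates, not the result, so the chunked and unchunked values agree.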
Other breaking changes and deprecations
- See #1902
- vllm v0.6.3 support will be removed in the next release.
What's Changed
- [feat] Wandb Timing: Add more detailed timing of gen_sequence and weights resharding by @ETOgaosion in #1834
- [rollout] feat: follow OpenAI tool calling schema in chat scheduler by @wuxibin89 in #1831
- [release] chore: bump version to v0.4 by @eric-haibin-lin in #1897
- Dockerfile.rocm update tensordict==0.6.2 by @vickytsang in #1898
- [feat] add validation shuffle by @mlpod in #1886
- [feat][BREAKING] Megatron: Support learning rate scheduler by @ETOgaosion in #1701
- fix errors in megatron_workers.py by @davidjsonn in #1906
- [tests] chore: add PR title check by @eric-haibin-lin in #1901
- fix qwen2vl grpo for vllm 0.9 and transformers 4.52 by @hiyouga in #1880
- [rollout] fix: error in __collect_lora_params() in FSDPVLLMShardingManager by @rocke2020 in #1909
- [recipe] feat: char count by @vermouth1992 in #1908
- fix typos by @davidjsonn in #1912
- [trainer] refactor: refactor reward manager, advantage estimator by @eric-haibin-lin in #1916
- set CUDA and HIP VISIBLE DEVICES by @YangWang92 in #1914
- [ppo] feat: add critic valuehead model support for multi-modal PPO by @Yangruipis in #1839
- [bugfix] fix megatron model merger by @ShareLer in #1774
- revert HIP_VISIBLE_DEVICES in worker.py by @YangWang92 in #1920
- [worker] fix: do not break dynamic bsz in dp critic by @hiyouga in #1922
- [sglang] feat: Efficient and model-agnostic multi-turn messages tokenization and masking by @jybsuper in #1668
- [rollout] fix: fix async llm config passing by @eric-haibin-lin in #1933
- [sglang] fix: Fix tool call parser not found error for SGLang==0.4.6.post5 by @jybsuper in #1852
- fix sequence parallelism conflict in kimiVL by @ShareLer in #1899
- [megatron] refactor: support MLATransformerConfig abstraction for DeepSeek V3 by @jinqinn in #1836
- [rollout] feat: add async llm perf script by @wuxibin89 in #1930
- [megatron] feat: qwen2.5vl by @ISEEKYAN in #1286
- [ckpt] feat: model_merger.py support processing checkpoints with LoRA adapters by @thelongestusernameofall in #1821
- [hardware] fix: fix issue when sp>1 on ASCEND NPU by @as12138 in #1942
- [megatron] fix: rope_type typo in config_converter.py by @donpromax in #1944
- [training_utils] Add qwen3 multi-turn sft support by @SwordFaith in #1889
- [fsdp] fix: fsdp entropy metrics by @ETOgaosion in #1943
- [FSDP] feat: Add FSDP forward pefetch and recompute chunking entropy by @CurryRice233 in #1927
- [rollout] fix: set repetition_penalty=1.0 to AsyncLLM by @wuxibin89 in #1949
- [fsdp] feat: Memory efficient cross entropy with a linear layer fused by @Jianbing-D in #462
- [recipe] feat: qwen2.5vl 7b report and guide by @ISEEKYAN in #1969
- [ckpt] refactor: enhance FSDP checkpoint manager flexibility by @0x404 in #1350
- [env] fix: npu ray verion to 2.46.0 for CI problem by @wyz649296016 in #1987
- Fix TypeError by Removing Duplicate Arguments in run_deepseek671b_math_megatron.sh by @none0663 in #1996
- [megatron] feat: Config NCCL Timeout for Megatron Backend Model Loading by @none0663 in #1983
- [tests] chore: ppo workflow runs on volcengine machine learning platform by @htc070011 in #1979
- [megatron] fix: multiple key error when trying to override megatron tr… by @donpromax in #1990
- [megatron] feat: robust and efficient mcore converter with meta device init and numel check for dpsk by @Yangruipis in #1995
- Stabilize loss calculations by clamping KL divergence values by @syo093c in #1779
- [ckpt] fix: run converter_hf_to_mcore with --test will raise an AttributeError by @lxg2015 in #2010
- [algo] fix: vf_loss factor by @tongyx361 in #2016
- [data] fix: fix retool sft data source by @vermouth1992 in #2018
- [fsdp] fix: position_ids in qwen-vl by @ShareLer in #1947
- [hardware] refactor: refactor part of device management by @FightingZhen in #1974
- [trainer] fix: fix sft max_position_embeddings by @vermouth1992 in #2019
- [misc] fix: fix format by @vermouth1992 in #2023
- [megatron] fix: dpskv3 convert src and dst mixed up bug by @Yangruipis in #2029
- fix: TensorDict usage error by @zhihe-wang in #2046
- [hardware] feat: support qwen2_5_vl on ASCEND NPU by @as12138 in #1924
- [trainer] chore: Reducing the number of calls to the write by @RuixiangMa in #2043
- [Bug] fix None check in ...
v0.4.0 release: large MoEs, tool calling, and low resource friendly
Highlights
Large MoE models support: DeepSeek 671b & Qwen3 235b
Preview features are provided to enable large MoE RL training with Megatron backend, such as DeepSeek 671b documentation. The Megatron backend now supports:
- expert parallelism, context parallelism, gradient checkpointing
- DeepSeek-V3, Qwen3-235b, Mixtral, Moonlight
- dist-ckpt support
Tool-calling, multi-turn RL, SGLang rollout
Sample-level rollout with tool calling and multi-turn RL is supported via SGLang. We provide the Search-R1 recipe built on top of that.
A prototype for sample-level async tool calling is also available with vllm AsyncLLM server.
Multiple enhancements and improvements are made to SGLang rollout, supporting multi-node and multimodal.
Sandbox fusion is integrated.
Low resource friendly
LoRA support is available, enabling 70B+ models on a single node with A100x8 GPUs.
Fused cross entropy kernel to drastically reduce peak memory: actor_rollout_ref.model.use_fused_kernels=True
New models, algorithms and recipes
- Documentation for PPO and GRPO
- Recipe: DAPO
- Recipe: Self-Play Fine-Tuning (SPIN)
- Recipe: Self-Play Preference Optimization (SPPO)
- OPO: On-Policy RL with Optimal Reward Baseline, DrGRPO, REINFORCE++, Dual-Clip PPO
New models and training utils include:
- kimi_vl example
- qwen3 example
- video inputs support
- Warmup-Stable-Decay scheduler
- rope scaling
- evals for GPQA, livecodebench
- logging to ClearML
FSDP2 and training optimizations
FSDP2 is recommended to replace FSDP1, providing better throughput and memory usage, and is composable with other features (e.g. torch.compile):
actor_rollout_ref.ref.strategy=fsdp2
actor_rollout_ref.actor.strategy=fsdp2
critic.strategy=fsdp2
reward_model.strategy=fsdp2
Furthermore, FSDP2 cpu offloading is compatible with gradient accumulation. You can turn it on to save memory with actor_rollout_ref.actor.offload_policy=True.
Other optimizations include:
- Activation offloading
- ulysses sequence parallelism for vlm
- compute reward during log_prob for ppo trainer
- timeline for ray profiling
Deployment and hardware
- Easy deployment with dstack
- Enhancements to non-nvidia GPUs
Breaking changes and deprecations
- FSDPSFTTrainer now requires the dataset arguments #1282
- SFTDataset and RLHFDataset now take a config as the input #924
- entropy_coeff now defaults to 0 #1770
- FSDP1 support will be dropped in the next release.
- vllm v0.5.4 support will be dropped in the next release.
- A few options are now included in the default yaml file, so existing scripts may throw errors such as +{config}={value}. Please try removing the + to fix such errors.
  - ppo_trainer.yaml: trainer.val_before_train
  - sft_trainer.yaml: data.{prompt,response}_dict_keys
- ppo_trainer.yaml: verl.utils.reward_score._default_compute_score is deprecated. Use verl.utils.reward_score.default_compute_score instead.
- The name of ray actors will change from "WorkerDict_xxxx" to "FusedWorker_xxxx", and the name of tasks will change from "{cls_name}_{method_name}" to "fuw_execute".
New Contributors
@zhao9797 @frederrx @dingyuan-shi @SwordFaith @CJReinforce @linjc16 @wkcn @hijkzzz @JustinTong0323 @mertunsall @Altair-Alpha @czczup @SparkJiao @sunjin-k @tsaoyu @XueruiSu @zhaochenyang20 @NascentAscension @corgilee @lei-lei @pengsun @silverriver @mingruimingrui @Ann-Qin @lilei199908 @YeonwooSung @himalalps @tao-githup @as12138 @thibautbar @aoshen524 @MantasBaksys @YangWang92 @patrik-bartak @mansicer @wangfuchun-fc @survivi @RainBowLuoCS @gzpan @HuaizhengZhang @HollowMan6 @zTonyZhao @lxg2015 @estsauver @jhinpan @yhyang201 @qingquansong @chenhaiq @ShareLer @Artessay @Jackory @swtheing @U-rara @Andrewzh112 @mansoor-s @Necolizer @llkn-2 @yuyuz @linxxx3 @gaokaiz2 @ccchow @ezyang @zw0610 @pavelgein @plutoZZZZ @jybsuper @hebiao064 @GaotangLi @zhangyongxin121 @spacegoing @cedricbeta @Geaming2002 @imh966 @zyzshishui @zzong2006 @langfengQ @zheliuyu @casper-hansen @Bihan @czx6858 @GHGmc2 @DtYXs @thelongestusernameofall @xichengpro @Irvingwangjr @shinytang6 @qyhfrank @mlpod @popomen @liyc-ai @leo-pony @LiuXTao @Lins-01 @yzlnew @vllbc @ZDJeffrey @sukrucildirr @Moyu-42 @YRdddream @jdf-prog @HUGHNew @ElliottYan @NileZhou @shizhediao @rj42 @Crispig @omahs @CurryRice233 @china10s
Thank you for your first contributions!
Full Changelog: v0.3.0.post1...v0.4.0
v0.3.0.post1
This release includes fixes for sequence parallelism and sglang:
- Fixed ulysses sequence parallel issue, which may hang with specific kv head num #850
- SGLang stability & memory improvements #773 #756
Full Changelog: v0.3.0.post0...v0.3.0.post1
v0.3.0.post0 release
Highlights
New algorithms and recipes
- Vision language reasoning with qwen2.5-vl #386
- PRIME, RLOO, remax #753 #234 #341
- FIRE sampling algorithm, math-verify rewards #545 #683
Engine
- sglang integration is available for preview (single node with FSDP). Blazing fast! Please try it and give us feedback! We recommend using the verl main branch for ongoing sglang-related fixes and improvements based on feedback.
--actor_rollout_ref.rollout.name='sglang'
- Megatron is now upgraded to v0.11, with support for the checkpoint manager, qwen models, and the GRPO algorithm
- vllm upgraded to v0.8.2, much faster than vllm v0.7 & v0.6.3 during rollout with the v1 engine! Please remember to enable cuda graph with the following option. There were memory leak issues before vllm v0.8.2, so we recommend using either vllm v0.6.3 or v0.8.2.
actor_rollout_ref.rollout.enforce_eager=False \
actor_rollout_ref.rollout.free_cache_engine=False \
Hardware:
- AMD support is available for vllm and FSDP backend. Getting started one pager is here
Docs:
- tutorial for distributed training setup, debugging, and the programming model
Roadmap for Q2: #710. Contributions are welcome!
Changelog
New Features
Algorithm Support
- Support for extra_info in reward calculation
- RLOO advantage estimator
- PRIME algorithm (recipe and baseline)
- Initial support for VLMs (Vision-Language Models), including Qwen2.5VL GRPO example
- Math-Verify Support
- Support for GRPO with Megatron backend
- Added FIRE sampling in rollout
- Replaced DataLoader with StatefulDataLoader for checkpoint resuming
- Support for external reward function loading
Performance Improvements
- Support for SGLang as a rollout engine
- Support for Ulysses sequence parallel (transformers >= 4.48)
- Support offloading parameters and optimizer during rollout
- Tracking support for vemlp and TensorBoard
- MFU (Model FLOPS Utilization) calculation for Megatron workers
- Support for AMD (ROCm kernel)
- Improved checkpoint loading (Megatron support for Llama/Qwen models)
- Remove unnecessary torch.cuda.empty_cache() calls
- Optimized weight loading (replaced custom VLLM loader with model.load_weights)
Bug Fixes
- Fixed wrong args description
- Fixed Gemma2 example and NGC Dockerfile
- Fixed offload/load optimizer implementation
- Fixed VLLM documentation links
- Fixed typos and spelling errors
- Fixed evaluation file path in Remax training scripts
- Fixed OOM when resuming from checkpoint
- Fixed position embedding for Qwen2.5-VL
- Fixed PRIME algorithm issues (filtering long prompts, padding side, xformers)
- Fixed FSDP checkpoint loading
- Fixed SGLang rollout under multi-node
- Fixed Python environment issues in installation
- Fixed validation batch repeat before feeding into rollout
Deprecations and Breaking Changes
- Deprecated val_batch_size
- Removed redundant config parameters
- Reverted RLHFDataset truncation config
Improvements
Documentation
- Added Ray on Slurm example
- Added FAQ for VLLM illegal memory access
- Added distributed training docs (RLOO, VolcEngine)
- Updated VLLM (>=0.7, >=0.8) documentation
- Added meetup info, blogs, and project references
- Improved Slurm example parameters
- Added multi-node training and debug tutorial
Tooling & CI/CD
- Added Dependabot action
- Added secrets scan action
- Added CI timeout and auto-cancel previous CI runs
- Added e2e_ascend CI
- Improved dataset handling in CI
Miscellaneous
- Added assertion checks for PPO mini-batch size
- Improved logging (SwanLab integration)
- Pre-check resource pool availability to prevent hangs
- Added tqdm progress bar for RayPPOTrainer
- Skip special tokens in processing
- Support for faster model downloads from ModelScope
- Added Dockerfile for AWS SageMaker
New Contributors
This new release is contributed by 60 contributors, of which 47 are new contributors!
@AnselCmy @BASARANOMO @BaiqingL @BeSkyer @BearBiscuit05 @CajZella @Django-Jiang @DolbyUUU @ETOgaosion @HaoshengZou @ISEEKYAN @Kunlun-Zhu @PeterSH6 @PzySeere @Raf-Chen @WillemJiang @Yifan-Song793 @ZSL98 @Zeetc @ZefanW @Zeyi-Lin @caaatch22 @celestialli @danielz02 @dependabot @dirtyDan0 @eltociear @eric-haibin-lin @fyqqyf @gameofdimension @ganler @haoy-zzz @hiyouga @hongpeng-guo @iceflame89 @jayl940712 @kinman0224 @laonahongchen @liudayuan-carrot @maksimstw @mi804 @minleminzui @nomadlx @none0663 @nwiad @ocss884 @pat-jj @thomZ1 @tongyx361 @uygnef @vermouth1992 @wangchengnuo @wuxibin89 @xffxff @yaguanghu @yushengsu-thu @yyDing1 @zhanluxianshen @zhr2001 @zpqiu
Thank you all for making verl better!!
Full Changelog: v0.2.0.post2...v0.3.0.post0
Known issues tracker: #827
v0.2.0.post2
What's Changed
- Fixed installation issues.
- Fixed the remove padding flags in the gemma example.
New Contributors
Full Changelog: v0.2...v0.2.0.post2
v0.2 release
Highlights
New algorithms and features
- GRPO
- ReMax
- REINFORCE++
- Checkpoint manager for FSDP backend
- Sandbox for reward verification and scoring in PRIME
Performance optimization:
- Remove padding tokens (i.e. sequence packing). Significant throughput increase expected for Llama, Mistral, Gemma, Qwen2 transformer models. Documentation
actor_rollout_ref.model.use_remove_padding=True
critic.model.use_remove_padding=True
- Dynamic batch size. Significant throughput increase for variable length sequences. Documentation and example
actor_rollout_ref.actor.ppo_max_token_len_per_gpu
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu
critic.ppo_max_token_len_per_gpu
critic.forward_micro_batch_size_per_gpu
reward_model.forward_micro_batch_size_per_gpu
- Sequence parallelism for long context training. Documentation and example
actor_rollout_ref.actor.ulysses_sequence_parallel_size
critic.ulysses_sequence_parallel_size
reward_model.ulysses_sequence_parallel_size
- vllm v0.7+ integration (preview). For the qwen2 ppo example, 25% time reduction in rollout compared to v0.6.3, and 45% time reduction when cuda graph is enabled. Documentation
actor_rollout_ref.rollout.enforce_eager=False
actor_rollout_ref.rollout.free_cache_engine=False
- Liger-kernel integration for SFT. Documentation
model.use_liger=True
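The dynamic batch size options listed above (the `*_max_token_len_per_gpu` keys) suggest a token budget rather than a fixed number of sequences per micro-batch. A minimal sketch of that idea, assuming a simple greedy grouping in arrival order, is shown below; `pack_by_token_budget` is a hypothetical helper, and verl's own partitioning logic may differ (for example, by balancing work across ranks).

```python
def pack_by_token_budget(seq_lens, max_tokens):
    """Greedily group sequence indices so that each group's total
    length stays within max_tokens."""
    batches, current, used = [], [], 0
    for idx, length in enumerate(seq_lens):
        if length > max_tokens:
            raise ValueError(f"sequence {idx} exceeds the token budget")
        if used + length > max_tokens:
            # Close the current micro-batch and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        batches.append(current)
    return batches


seq_lens = [900, 100, 500, 500, 1000, 20]
micro_batches = pack_by_token_budget(seq_lens, max_tokens=1000)
print(micro_batches)
```

Short sequences end up sharing a micro-batch while long ones get their own, which is why a token budget tends to help most when sequence lengths vary widely.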
Changelog
New Features
- Algorithm Support:
- Performance Improvements:
  - Enabled dynamic batch size support (#118).
  - Added meta device initialization and parallel load for FSDP to avoid OOMs during init (#123).
  - Improved gradient accumulation in sequence balance (#141).
  - Added ref/RM offload support (#121).
  - Added LoRA support for SFT (#127).
  - feat: support rmpad/data-packing in FSDP with transformers (#91)
  - Liger kernel integration (#133)
- Experiment Tracking:
Bug Fixes
- Critical Fixes:
- Code Fixes:
Improvements
- Performance:
- Miscellaneous:
  - Added option to log validation generations to wandb (#177).
Deprecations and Breaking Changes
- Breaking Changes:
Contributors
A big thank you to all the contributors who made this release possible:
@zhanluxianshen @xingyaoww @fzyzcjy @emergenz @openhands-agent @ZSL98 @YSLIU627 @ZefanW @corbt @jaysonfrancis @hiyouga @Jiayi-Pan @hongpeng-guo @eltociear @chujiezheng @PanAndy @zwhe99 @pcmoritz @huiyeruzhou @VPeterV @uygnef @zhiqi-0 @ExtremeViscent @liziniu @nch0w @Cppowboy @TonyLianLong @4332001876 @tyler-romero @ShaohonChen @kinman0224 @willem-bd @bebetterest @WeiXiongUST @dignfei
The PyPI package will be available soon! Please let us know on GitHub if there's a problem extending RL training recipes based on the pip-installed version of verl.
Full Changelog: v0.1...v0.2
v0.1
What's Changed
- [misc] feat: update tutorial for opensource version by @PeterSH6 in #4
- [misc] fix: vllm gpu executor issue when world_size is 1 and typo in doc by @PeterSH6 in #9
- [ci] feat: add test files for ray hybrid programming model by @PeterSH6 in #23
- [chore] remove unnecessary updating of _worker_names by @kevin85421 in #19
- [misc] feat: add gemma example for small scale debug and fix gradient checkpoint in critic by @PeterSH6 in #27
- [misc] fix issue in hf_weight_loader and fix typo in doc by @PeterSH6 in #30
- [ci] test lint ci and lint tests dir by @PeterSH6 in #28
- [example] fix: fix math circular dependency by @eric-haibin-lin in #31
- [example] fix: make wandb optional dependency. allow extra args in existing scripts by @eric-haibin-lin in #32
- [docs] feat: add related publications by @eric-haibin-lin in #35
- [tokenizer] feat: support tokenizers whose pad_token_id is none by @eric-haibin-lin in #36
- [rollout] feat: support vLLM v0.6.3 and fix hf rollout import issue by @PeterSH6 in #33
- [distro] feat: add docker support by @eric-haibin-lin in #41
- [example] add a split placement tutorial by @PeterSH6 in #43
- [doc] add a new quickstart section by @PeterSH6 in #44
- [BREAKING][core] move single_controller into verl directory by @PeterSH6 in #45
New Contributors
- @eric-haibin-lin made their first contribution in #31
Full Changelog: v0.1rc...v0.1
v0.1rc
What's Changed
- [init] feat: first commit for open source
- [doc] feat: fix typo and delete deprecated config element by @PeterSH6 in #2
- [misc] fix: resolve pypi missing directory by @PeterSH6 in #3
Credit To
@PeterSH6 @vermouth1992 @zw0610 @wuxibin89 @YipZLF @namizzz @pengyanghua @eric-haibin-lin @Meteorix and others in Seed Foundation MLSys Team