Last updated: 08/20/2025.
verl integrates Megatron to support large MoE models such as Qwen3-235B-A22B and deepseek-ai/DeepSeek-V3. This is an ongoing community effort.
Along the way, the community has added the following features and optimizations that enable verl to run larger models:

- per-tensor weight resharding between rollout and training
- context parallelism and expert parallelism enabled via Megatron
- dynamic batch size (sequence balancing) for Megatron
- reduced Ray-related serialization overhead
- optimizer offloading, recomputation, and efficient kernels
- various debugging metrics and utilities
- hybrid optimizer
The Megatron backend now supports a wider list of models:

- DeepSeek-V3
- Moonlight
- Qwen3
- Qwen2.5-VL (to be merged soon)
- Qwen2
- Mixtral
The recommended image with pre-built Megatron dependency is `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.13.0-preview`, built from the Dockerfile at `docker/verl0.4-cu124-torch2.6-fa2.7.4/Dockerfile.app.vllm.mcore0.13.preview`.
The image is built for Hopper GPUs with DeepEP and does not support non-Hopper GPUs such as the A100; you may need to reinstall DeepEP to work on A100.
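A minimal sketch for pulling and entering the image; the mount path and shared-memory size below are placeholders to adapt to your cluster setup:

```bash
# Pull the recommended pre-built image (Hopper GPUs with DeepEP).
docker pull verlai/verl:app-verl0.4-vllm0.8.5-mcore0.13.0-preview

# Start an interactive container with GPU access; the mount path and
# --shm-size are illustrative, not required values.
docker run -it --gpus all --shm-size=10g \
  -v "$PWD":/workspace/verl \
  verlai/verl:app-verl0.4-vllm0.8.5-mcore0.13.0-preview /bin/bash
```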
With `OFFLOAD_FRACTION=1`, the system's minimum requirements are lowered: DeepSeek-V3 can run on as few as 96 H20 (96GB) GPUs, and Qwen3-235B-A22B on as few as 32 H20 (96GB) GPUs. However, this configuration uses 1.6TB of CPU memory per node. If you run out of CPU memory or require faster training, add more nodes.
For DeepSeek-V3 671b, please refer to `examples/grpo_trainer/run_deepseek671b_math_megatron_96gb.sh`. MTP and quantization are disabled during RL training.
To train your project, configure the following environment variables based on the number of available GPUs. These are recommended settings and can be adjusted for your specific hardware; a launch sketch follows the table.
| num gpus | NNODES | TP | PP | EP | OFFLOAD_FRACTION | OFFLOAD_OPTIM | LAST_LAYER |
|---|---|---|---|---|---|---|---|
| 96 | 12 | 8 | 12 | 8 | 1.0 | False | 6 |
| 128 | 16 | 8 | 16 | 8 | 0.5 | True | 1 |
| 256 | 32 | 8 | 16 | 8 | 0.0 | True | 1 |
| 512 | 64 | 1 | 16 | 32 | 0.0 | True | 1 |
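As a concrete example, here is a minimal launch sketch for the 96-GPU row above. The variable names mirror the table headers; that the example script reads exactly these names is an assumption:

```bash
# Values taken from the 96-GPU row of the table above (assumed to be
# consumed as environment variables by the example script).
export NNODES=12
export TP=8
export PP=12
export EP=8
export OFFLOAD_FRACTION=1.0
export OFFLOAD_OPTIM=False
export LAST_LAYER=6

bash examples/grpo_trainer/run_deepseek671b_math_megatron_96gb.sh
```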
For Qwen3-235b, please refer to `examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh`.
Configure the following environment variables based on the number of available GPUs. These are recommended settings and can be adjusted for your specific hardware; a launch sketch follows the table.
| num gpus | NNODES | TP | PP | EP | OFFLOAD_FRACTION | OFFLOAD_OPTIM | LAST_LAYER |
|---|---|---|---|---|---|---|---|
| 32 | 4 | 4 | 8 | 4 | 1.0 | False | 6 |
| 64 | 8 | 4 | 8 | 4 | 0.5 | True | 6 |
| 128 | 16 | 4 | 8 | 4 | 0.0 | True | 6 |
| 256 | 32 | 4 | 8 | 4 | 0.0 | True | 6 |
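For example, the 32-GPU configuration from the table can also be passed inline, again assuming the script reads these variable names:

```bash
# 32 x H20 (96GB) configuration for Qwen3-235B-A22B, from the first row above.
NNODES=4 TP=4 PP=8 EP=4 OFFLOAD_FRACTION=1.0 OFFLOAD_OPTIM=False LAST_LAYER=6 \
  bash examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh
```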
Here are some benchmark results for DeepSeek / Qwen3-235B. All configurations match the recommended settings based on the number of GPUs.
| model | num gpus | mean response length | rollout time (s) | GPU memory (GB) | CPU memory (GB) | MFU | step time (s) |
|---|---|---|---|---|---|---|---|
| DeepSeek 671b | 96 | 1960 | 1050 | 66 | 1500 | 0.19 | 1700 |
For Qwen3-30b, please refer to `examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh`.
Configure the following environment variables based on the number of available GPUs. These are recommended settings and can be adjusted for your specific hardware; a launch sketch follows the table.
| num gpus | NNODES | TP | PP | EP | OFFLOAD_FRACTION | OFFLOAD_OPTIM | MFU |
|---|---|---|---|---|---|---|---|
| 8 | 1 | 1 | 1 | 8 | 1.0 | True | 0.4 |
| 16 | 2 | 1 | 1 | 8 | 1.0 | True | 0.37 |
| 32 | 4 | 1 | 1 | 8 | 1.0 | True | 0.31 |
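For instance, the single-node (8 GPU) row reduces to a sketch like the following, with the usual caveat that the variable names are taken from the table headers and assumed to be what the script expects:

```bash
# Single-node Qwen3-30B run: no tensor/pipeline parallelism, expert
# parallelism across 8 GPUs, optimizer fully offloaded to CPU memory.
export NNODES=1
export TP=1
export PP=1
export EP=8
export OFFLOAD_FRACTION=1.0
export OFFLOAD_OPTIM=True

bash examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh
```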
The community continues to optimize large MoE models; ongoing efforts include:

- further optimizing memory consumption and providing recommended/tuned configurations for various machine types
- optimizing long-context RL training performance
- performance improvements with SGLang x Megatron
We invite the community to try verl and improve it together. Get connected with us on Slack/WeChat/GitHub issues!
@vermouth1992 @ISEEKYAN @ETOgaosion @yzlnew @ShareLer @BearBiscuit05 @ccclyu @ann-qin-lu @SwordFaith @zzong2006 @zhaochenyang20 @ocss884 @eric-haibin-lin @chenhaiq @techkang