Training DeepSeek 671b

Last updated: 08/20/2025.

verl integrates Megatron to support large MoE models such as Qwen3-235B-A22B and deepseek-ai/DeepSeek-V3. This is an ongoing community effort.

Along the way, the community has added the following features and optimizations that enable verl to scale to larger models:

  • per tensor weight resharding between rollout and training

  • context parallelism and expert parallelism enabled via megatron

  • dynamic batch size (sequence balance) for megatron

  • reduced ray-related serialization overhead

  • optimizer offloading, recomputation, and efficient kernels

  • various debugging metrics and utils

  • hybrid optimizer

and the Megatron backend now supports a wider range of models:

  • DeepSeek-V3

  • Moonlight

  • Qwen3

  • Qwen2.5-VL (to be merged soon)

  • Qwen2

  • Mixtral

Getting Started

Preparation

The recommended image with pre-built Megatron dependency is verlai/verl:app-verl0.4-vllm0.8.5-mcore0.13.0-preview, which is built using the Dockerfile at docker/verl0.4-cu124-torch2.6-fa2.7.4/Dockerfile.app.vllm.mcore0.13.preview.

The image is built on Hopper GPUs with DeepEP. It does not support non-Hopper GPUs such as the A100; you may need to reinstall DeepEP for it to work on A100.
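As a rough sketch, the image can be started with a standard docker run command; the mount path and shared-memory size below are placeholders to adapt to your cluster:

    # Sketch: start the pre-built image on a Hopper node; the mount path and
    # --shm-size value are placeholders, adjust them for your environment.
    docker run --runtime=nvidia --gpus all --net=host --shm-size=32g \
        -v /path/to/verl:/workspace/verl \
        -it verlai/verl:app-verl0.4-vllm0.8.5-mcore0.13.0-preview bash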

With OFFLOAD_FRACTION=1, the minimum hardware requirements are lowered: DeepSeek-V3 can run on as few as 96 H20 (96GB) GPUs, and Qwen3-235B-A22B on as few as 32 H20 (96GB) GPUs. However, this configuration uses 1.6TB of CPU memory per node. If you run out of CPU memory or need faster training, add more nodes.

DeepSeek 671b

For DeepSeek-V3 671b, please refer to examples/grpo_trainer/run_deepseek671b_math_megatron_96gb.sh.

MTP and quantization are disabled during RL training.

To launch training, configure the following environment variables according to the number of available GPUs. These are recommended settings and can be adjusted for your specific hardware; an example invocation follows the table.

num gpus   NNODES   TP   PP   EP   OFFLOAD_FRACTION   OFFLOAD_OPTIM   LAST_LAYER
--------   ------   --   --   --   ----------------   -------------   ----------
96         12       8    12   8    1.0                False           6
128        16       8    16   8    0.5                True            1
256        32       8    16   8    0.0                True            1
512        64       1    16   32   0.0                True            1
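For example, the 128-GPU row above could be exported before launching the script; this is a sketch that assumes the run script reads these variables from the environment:

    # Sketch: 128-GPU DeepSeek-V3 setting from the table above; assumes the
    # run script picks these variables up from the environment.
    export NNODES=16
    export TP=8
    export PP=16
    export EP=8
    export OFFLOAD_FRACTION=0.5
    export OFFLOAD_OPTIM=True
    export LAST_LAYER=1
    bash examples/grpo_trainer/run_deepseek671b_math_megatron_96gb.sh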

Qwen3 235b

For Qwen3-235b, please refer to examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh.

To launch training, configure the following environment variables according to the number of available GPUs. These are recommended settings and can be adjusted for your specific hardware; an example invocation follows the table.

num gpus   NNODES   TP   PP   EP   OFFLOAD_FRACTION   OFFLOAD_OPTIM   LAST_LAYER
--------   ------   --   --   --   ----------------   -------------   ----------
32         4        4    8    4    1.0                False           6
64         8        4    8    4    0.5                True            6
128        16       4    8    4    0.0                True            6
256        32       4    8    4    0.0                True            6
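For example, the 32-GPU row above could be exported before launching the script; this is a sketch that assumes the run script reads these variables from the environment:

    # Sketch: 32-GPU Qwen3-235B-A22B setting from the table above; assumes the
    # run script picks these variables up from the environment.
    export NNODES=4
    export TP=4
    export PP=8
    export EP=4
    export OFFLOAD_FRACTION=1
    export OFFLOAD_OPTIM=False
    export LAST_LAYER=6
    bash examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh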

Benchmark

Here are some benchmark results for DeepSeek / Qwen3-235B. All configurations match the recommended settings based on the number of GPUs.

model           num gpus   mean response length   rollout time (s)   GPU memory (GB)   CPU memory (GB)   MFU    step time (s)
-----           --------   --------------------   ----------------   ---------------   ---------------   ----   -------------
DeepSeek 671b   96         1960                   1050               66                1500              0.19   1700

Qwen3-30B-A3B MoE

For Qwen3-30b, please refer to examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh.

To launch training, configure the following environment variables according to the number of available GPUs. These are recommended settings and can be adjusted for your specific hardware; an example invocation follows the table.

num gpus   NNODES   TP   PP   EP   OFFLOAD_FRACTION   OFFLOAD_OPTIM   MFU
--------   ------   --   --   --   ----------------   -------------   ----
8          1        1    1    8    1.0                True            0.4
16         2        1    1    8    1.0                True            0.37
32         4        1    1    8    1.0                True            0.31
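For example, the single-node (8 GPU) row above could be exported before launching the script; this is a sketch that assumes the run script reads these variables from the environment:

    # Sketch: 8-GPU Qwen3-30B-A3B setting from the table above; assumes the
    # run script picks these variables up from the environment.
    export NNODES=1
    export TP=1
    export PP=1
    export EP=8
    export OFFLOAD_FRACTION=1
    export OFFLOAD_OPTIM=True
    bash examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh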

Upcoming Optimizations

The community continues to further optimize training of large MoE models; ongoing efforts include:

  • further optimizing memory consumption and providing recommended/tuned configurations for various machine types

  • optimizing long context RL training performance

  • performance improvement with SGLang x Megatron

We invite the community to try and improve verl together. Get connected with us on Slack/WeChat/GitHub issues!

Acknowledgement

@vermouth1992 @ISEEKYAN @ETOgaosion @yzlnew @ShareLer @BearBiscuit05 @ccclyu @ann-qin-lu @SwordFaith @zzong2006 @zhaochenyang20 @ocss884 @eric-haibin-lin @chenhaiq @techkang
