Adding verl support #5498

Merged

rsareddy0329 merged 2 commits into aws:master-v2 from Harsh270519:verl-support-master-v2 on Feb 3, 2026

Conversation


@Harsh270519 Harsh270519 commented Jan 20, 2026

Issue #, if available:

Description of changes:
Adding verl support to v2

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Steps for manual testing

  1. From the private-sagemaker-hyperpod-recipes-staging repo (https://github.com/aws/private-sagemaker-hyperpod-recipes-staging), use the commands below to launch the jobs on SMTJ.
  2. Make sure to set cluster.sm_jobs_config.api_type=model_trainer to launch a job using the ModelTrainer; the other option is the estimator.
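Throughout the launch commands below, ++key=value is Hydra's force-add override syntax: it sets a dotted config key, creating it if the recipe does not already define it (a bare key=value only overrides existing keys). A minimal sketch of the idea, not Hydra's actual parser, with values kept as plain strings:

```python
def apply_override(cfg, override):
    """Apply one Hydra-style dotted override (e.g. '++a.b.c=1') to a nested dict.

    Leading '+'/'++' markers are stripped; intermediate dicts are created as
    needed, mirroring how force-add overrides extend a recipe config.
    """
    key, _, value = override.lstrip("+").partition("=")
    parts = key.split(".")
    node = cfg
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value  # real Hydra also coerces types; kept as str here
    return cfg

cfg = {}
apply_override(cfg, "++cluster.sm_jobs_config.wait=False")
# cfg == {"cluster": {"sm_jobs_config": {"wait": "False"}}}
```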

Launch command for llmft job

HYDRA_FULL_ERROR=1 python main.py \
  cluster=sm_jobs \
  cluster_type=sm_jobs \
  cluster.sm_jobs_config.api_type=model_trainer \
  instance_type="ml.p4de.24xlarge" \
  recipes=fine-tuning/llama/llmft_llama3_2_1b_instruct_seq4k_gpu_sft_lora \
  base_results_dir="$(pwd)/results" \
  ++cluster.sm_jobs_config.inputs.s3=null \
  ++cluster.sm_jobs_config.inputs.file_system.id=fs-079b3411789c02c3f \
  ++cluster.sm_jobs_config.inputs.file_system.type=FSxLustre \
  ++cluster.sm_jobs_config.inputs.file_system.directory_path=/olyr5bev \
  cluster.sm_jobs_config.output_path="s3://hyperpod-recipes-validation-artifacts/validation_run" \
  "cluster.sm_jobs_config.tensorboard_config=''" \
  cluster.sm_jobs_config.wait=False \
  ++cluster.sm_jobs_config.additional_estimator_kwargs.max_run=30000 \
  ++cluster.sm_jobs_config.additional_estimator_kwargs.instance_count=1 \
  '++cluster.sm_jobs_config.additional_estimator_kwargs.subnets=["subnet-0193afce112b25931"]' \
  '++cluster.sm_jobs_config.additional_estimator_kwargs.security_group_ids=["sg-0cd8958d241530753"]' \
  ++cluster.sm_jobs_config.recipe_overrides.run.results_dir="/opt/ml/model" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.datasets.train_data.name="tatqa_train" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.datasets.train_data.file_path="/opt/ml/input/data/training/hp-recipe-validator/datasets/tatqa/zc_train_10k.jsonl" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.datasets.val_data.name="tatqa_val" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.datasets.val_data.file_path="/opt/ml/input/data/training/hp-recipe-validator/datasets/tatqa/zc_dev.jsonl" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.model_config.model_name_or_path="/opt/ml/input/data/training/users/changnit/models/meta-llama/Llama-3.2-1B-Instruct" \
  ++cluster.sm_jobs_config.additional_estimator_kwargs.image_uri="839249767557.dkr.ecr.us-west-2.amazonaws.com/hyperpod-recipes:llmft-v1.0.0" \
  +model.model_type=llm_finetuning_aws \
  container="839249767557.dkr.ecr.us-west-2.amazonaws.com/hyperpod-recipes:llmft-v1.0.0"

Launch command for verl job

HYDRA_FULL_ERROR=1 python main.py \
  cluster=sm_jobs \
  cluster_type=sm_jobs \
  cluster.sm_jobs_config.api_type=model_trainer \
  instance_type="ml.p4de.24xlarge" \
  recipes=fine-tuning/llama/verl-grpo-rlvr-llama-3-dot-2-1b-instruct-lora \
  base_results_dir="$(pwd)/results" \
  ++cluster.sm_jobs_config.inputs.s3=null \
  ++cluster.sm_jobs_config.inputs.file_system.id=fs-079b3411789c02c3f \
  ++cluster.sm_jobs_config.inputs.file_system.type=FSxLustre \
  ++cluster.sm_jobs_config.inputs.file_system.directory_path=/olyr5bev \
  cluster.sm_jobs_config.output_path="s3://hyperpod-recipes-validation-artifacts/validation_run" \
  "cluster.sm_jobs_config.tensorboard_config=''" \
  cluster.sm_jobs_config.wait=False \
  ++cluster.sm_jobs_config.additional_estimator_kwargs.image_uri="920498770698.dkr.ecr.us-west-2.amazonaws.com/hyperpod-recipes:verl-v1.0.0-smtj" \
  ++cluster.sm_jobs_config.additional_estimator_kwargs.use_training_recipe=true \
  ++cluster.sm_jobs_config.additional_estimator_kwargs.max_run=86400 \
  ++cluster.sm_jobs_config.additional_estimator_kwargs.instance_count=1 \
  '++cluster.sm_jobs_config.additional_estimator_kwargs.subnets=["subnet-0193afce112b25931"]' \
  '++cluster.sm_jobs_config.additional_estimator_kwargs.security_group_ids=["sg-0cd8958d241530753"]' \
  ++cluster.sm_jobs_config.recipe_overrides.run.results_dir="/opt/ml/model" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.data.train_files="/opt/ml/input/data/training/hp-recipe-validator/datasets/gsm8k/train.parquet" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.data.val_files="/opt/ml/input/data/training/hp-recipe-validator/datasets/gsm8k/test.parquet" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.actor_rollout_ref.model.path="/opt/ml/input/data/training/users/changnit/models/meta-llama/Llama-3.2-1B-Instruct" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.critic.model.path="/opt/ml/input/data/training/users/changnit/models/deepseek-ai/deepseek-llm-7b-chat" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.critic.model.tokenizer_path="/opt/ml/input/data/training/users/changnit/models/meta-llama/Llama-3.2-1B-Instruct" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.reward_model.model.path="/opt/ml/input/data/training/users/changnit/models/sfairX/FsfairX-LLaMA3-RM-v0.1" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.reward_model.model.input_tokenizer="/opt/ml/input/data/training/users/changnit/models/meta-llama/Llama-3.2-1B-Instruct" \
  ++cluster.sm_jobs_config.recipe_overrides.training_config.custom_reward_function.lambda_arn="" \
  container="920498770698.dkr.ecr.us-west-2.amazonaws.com/hyperpod-recipes:verl-v1.0.0-smtj"

Monitor the job on SMTJ.
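Since the commands set cluster.sm_jobs_config.wait=False, the launcher returns without blocking, and the job is polled separately. A hedged sketch of one way to check job state with boto3's standard describe_training_job call (the job name here is hypothetical; substitute the name the launcher prints):

```python
def training_job_status(job_name, client=None):
    """Return the current TrainingJobStatus for a SageMaker training job.

    The boto3 client is created lazily so the helper can also be driven by a
    stub in tests; describe_training_job is the SageMaker API call for
    polling job state ("InProgress", "Completed", "Failed", ...).
    """
    if client is None:
        import boto3  # deferred so the function imports without AWS credentials
        client = boto3.client("sagemaker")
    response = client.describe_training_job(TrainingJobName=job_name)
    return response["TrainingJobStatus"]

# Hypothetical usage (job name is an assumption, not from this PR):
# training_job_status("verl-grpo-llama-3-2-1b-2026-01-20")
```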

@Harsh270519 Harsh270519 requested a review from a team as a code owner January 20, 2026 18:47
@Harsh270519 Harsh270519 requested a review from jam-jee January 20, 2026 18:47
@rsareddy0329 rsareddy0329 merged commit 2e37c82 into aws:master-v2 Feb 3, 2026
1 of 2 checks passed