Add GRPO main entry point and scripts (GRPO olmo-core: PR 5 of 5) #1399
finbarrtimbers merged 15 commits into allenai/open-instruct:main
Conversation
…ion: PR 1 of 4) This refactoring extracts the shared configuration class that both grpo_fast.py (existing DeepSpeed trainer) and the upcoming grpo.py (new OLMo-core trainer) need.
- Create grpo_utils.py with ExperimentConfig dataclass (moved from grpo_fast.py Args)
- Update grpo_fast.py to import from grpo_utils
- Update benchmark_generators.py to import from grpo_utils
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…: PR 2 of 4) Add foundational components for the OLMo-core GRPO trainer:
- grpo_callbacks.py: VLLMWeightSyncCallback, RefPolicyUpdateCallback, olmo_core_to_hf_name()
- olmo_core_train_modules.py: GRPOTrainModule class for OLMo-core training
- pyproject.toml: Add both files to type checking
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add the Ray actor that wraps OLMo-core training:
- grpo_olmo_core_actor.py: PolicyTrainerOLMoCoreProcess and OLMoCoreModelGroup classes
- pyproject.toml: Add file to type checking
The actor coordinates distributed training via torch.distributed for FSDP gradient synchronization.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… PR 4 of 4) Add the final integration for the OLMo-core GRPO trainer:
- grpo.py: Main training orchestration script using OLMo-core's Trainer
- scripts/train/debug/single_gpu_grpo.sh: Single GPU test script
- scripts/train/debug/multi_node_grpo.sh: Multi-node test script
- scripts/train/debug/tool_grpo.sh: Tool use test script
- pyproject.toml: Add grpo.py to type checking
- CHANGELOG.md: Document the new trainer
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
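For readers unfamiliar with the pattern described in the actor commit above, the following is a generic sketch of a Ray actor that joins a torch.distributed process group for FSDP gradient synchronization. The class and method names are illustrative, not the repository's actual PolicyTrainerOLMoCoreProcess API.

```python
import os

import ray
import torch.distributed as dist


@ray.remote(num_gpus=1, num_cpus=4)
class TrainerProcess:
    """Illustrative stand-in for a per-GPU training actor."""

    def setup_distributed(self, master_addr: str, master_port: int, world_size: int, rank: int) -> None:
        # Every actor joins the same NCCL process group, so FSDP can
        # all-reduce gradients across learners during the backward pass.
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        dist.init_process_group(backend="nccl", world_size=world_size, rank=rank)
```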
Summary of Changes
Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request delivers the foundational implementation for Group Relative Policy Optimization (GRPO) training integrated with the OLMo-core framework. It enables distributed training via FSDP (rather than DeepSpeed) and adds scripts for single-GPU, multi-node, and tool-augmented training. This completes the OLMo-core GRPO component integration with a unified, scalable training setup.
Code Review
This pull request introduces the main entry point and associated scripts for GRPO training using OLMo-core's Trainer. It also updates the CHANGELOG.md and pyproject.toml for type checking. The changes integrate various components for distributed training with Ray and vLLM, including experiment tracking, tool initialization, and model saving. The new scripts provide debug configurations for single-GPU, multi-node, and tool-use scenarios.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 811ba9adde
wait_for_gpus(sum(args.num_learners_per_node))
bundles = [{"GPU": n, "CPU": n} for n in args.num_learners_per_node]
Reserve enough CPUs in the Ray placement group
The placement-group bundles reserve CPU equal to the GPU count ({"GPU": n, "CPU": n}), but each PolicyTrainerOLMoCoreProcess actor requests 4 CPUs (num_cpus_per_actor = 4 in open_instruct/grpo_olmo_core_actor.py:391-409). On a 1‑GPU run this makes the bundle provide only 1 CPU while the actor needs 4, so the actors cannot be scheduled and the training will hang at pg.ready() or actor creation. The bundle CPU should scale to at least 4 * n (or whatever the actor CPU requirement is) to make the placement group feasible.
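A minimal sketch of the suggested adjustment, assuming four CPUs per actor as cited in the comment; the constant name is illustrative, not from the repository.

```python
# Sketch of the suggested fix: reserve enough CPUs in each bundle for all the
# actors scheduled on that node, rather than matching the GPU count one-to-one.
CPUS_PER_ACTOR = 4  # the per-actor CPU requirement cited from grpo_olmo_core_actor.py

bundles = [
    {"GPU": n, "CPU": CPUS_PER_ACTOR * n}  # n actors per node, 4 CPUs each
    for n in args.num_learners_per_node
]
```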
| logger.info(f"Only {available_gpus} GPUs available, waiting for {expected_gpus}...") | ||
| time.sleep(poll_interval) | ||
| logger.error(f"Timeout waiting for GPUs. Only {available_gpus} available, needed {expected_gpus}") |
There was a problem hiding this comment.
Fail fast when GPUs never appear in the cluster
When the Ray cluster never reaches the expected GPU count, wait_for_gpus only logs an error and then returns, so the code proceeds to create a placement group that will block indefinitely. This means a misconfigured or undersized cluster will hang the job instead of terminating with a clear failure. Consider raising an exception (or exiting) after the timeout so the run fails fast in that scenario.
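A sketch of the fail-fast variant, assuming wait_for_gpus polls ray.cluster_resources(); the signature and default values are guesses based on the quoted context, not the repository's actual code.

```python
import logging
import time

import ray

logger = logging.getLogger(__name__)


def wait_for_gpus(expected_gpus: int, timeout_s: float = 600.0, poll_interval: float = 5.0) -> None:
    """Block until the Ray cluster reports enough GPUs, raising on timeout."""
    deadline = time.time() + timeout_s
    available_gpus = 0
    while time.time() < deadline:
        available_gpus = int(ray.cluster_resources().get("GPU", 0))
        if available_gpus >= expected_gpus:
            return
        logger.info(f"Only {available_gpus} GPUs available, waiting for {expected_gpus}...")
        time.sleep(poll_interval)
    # Raise instead of only logging, so a misconfigured or undersized cluster
    # fails fast here rather than hanging later at placement-group creation.
    raise RuntimeError(f"Timeout waiting for GPUs. Only {available_gpus} available, needed {expected_gpus}")
```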
# Conflicts:
#	CHANGELOG.md
#	open_instruct/grpo_callbacks.py
#	open_instruct/grpo_fast.py
#	open_instruct/grpo_utils.py
#	open_instruct/olmo_core_train_modules.py
#	pyproject.toml
…of ~30 individual params, matching grpo_fast.py pattern Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	open_instruct/grpo_olmo_core_actor.py
#	pyproject.toml
… consolidate changelog entries Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ed-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…alified imports, docstrings Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hamishivi left a comment
From a quick scan, seems good minus one comment. Would like to test this more tho!
… (1M context) <noreply@anthropic.com>
Summary
- grpo.py: main training orchestration script using OLMo-core's Trainer with Ray actors
- Debug scripts: single_gpu_grpo.sh, multi_node_grpo.sh, tool_grpo.sh
- Shared logic between grpo.py and grpo_fast.py: grpo.py now calls grpo_fast.setup_runtime_variables and grpo_fast.create_generation_configs instead of maintaining its own copies
- Moved the is_beaker_job() guard into maybe_get_beaker_config() so callers don't need to wrap every call (see the sketch below)
- Fixed grpo_fast.create_generation_configs, where vllm_config was referenced but not passed as a parameter
- Added grpo.py to type checking in pyproject.toml
- Updated CHANGELOG.md
Depends on: #1398
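For illustration, a minimal sketch of the maybe_get_beaker_config() guard consolidation mentioned above; the function bodies and the environment variable are assumptions, not the repository's actual implementation.

```python
import os
from typing import Optional


def is_beaker_job() -> bool:
    # Assumed check: running inside a Beaker job (env var name is illustrative).
    return "BEAKER_JOB_ID" in os.environ


def maybe_get_beaker_config() -> Optional[dict]:
    # The is_beaker_job() guard now lives here, so callers can simply write
    #   beaker_config = maybe_get_beaker_config()
    # and handle None, instead of wrapping every call in `if is_beaker_job():`.
    if not is_beaker_job():
        return None
    return {"beaker_job_id": os.environ["BEAKER_JOB_ID"]}
```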
GPU_TESTS=01KKY8PKQYXPDTJFCT37Q20E9X
Runs: