Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Add OLMo-core Ray actor (GRPO olmo-core: PR 4 of 5)#1398

Merged
finbarrtimbers merged 10 commits intomainallenai/open-instruct:mainfrom
finbarr/grpo-olmo-core-actorallenai/open-instruct:finbarr/grpo-olmo-core-actorCopy head branch name to clipboard
Mar 16, 2026
Merged

Add OLMo-core Ray actor (GRPO olmo-core: PR 4 of 5)#1398
finbarrtimbers merged 10 commits intomainallenai/open-instruct:mainfrom
finbarr/grpo-olmo-core-actorallenai/open-instruct:finbarr/grpo-olmo-core-actorCopy head branch name to clipboard

Conversation

@finbarrtimbers
Copy link
Copy Markdown
Collaborator

@finbarrtimbers finbarrtimbers commented Jan 20, 2026

Summary

  • Add grpo_olmo_core_actor.py with PolicyTrainerOLMoCoreProcess and OLMoCoreModelGroup classes
  • Add file to type checking in pyproject.toml

This Ray actor wraps OLMo-core's training infrastructure for distributed GRPO training across multiple GPUs and nodes.

finbarrtimbers and others added 3 commits January 20, 2026 12:39
…ion: PR 1 of 4)

This refactoring extracts the shared configuration class that both grpo_fast.py
(existing DeepSpeed trainer) and the upcoming grpo.py (new OLMo-core trainer) need.

- Create grpo_utils.py with ExperimentConfig dataclass (moved from grpo_fast.py Args)
- Update grpo_fast.py to import from grpo_utils
- Update benchmark_generators.py to import from grpo_utils

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…: PR 2 of 4)

Add foundational components for the OLMo-core GRPO trainer:

- grpo_callbacks.py: VLLMWeightSyncCallback, RefPolicyUpdateCallback, olmo_core_to_hf_name()
- olmo_core_train_modules.py: GRPOTrainModule class for OLMo-core training
- pyproject.toml: Add both files to type checking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add the Ray actor that wraps OLMo-core training:

- grpo_olmo_core_actor.py: PolicyTrainerOLMoCoreProcess and OLMoCoreModelGroup classes
- pyproject.toml: Add file to type checking

The actor coordinates distributed training via torch.distributed for FSDP gradient synchronization.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates OLMo-core's training capabilities into a distributed GRPO training system by introducing a dedicated Ray actor and its manager. This enables scalable training of large language models across multiple GPUs and nodes, leveraging Ray for efficient distributed orchestration. The changes facilitate the use of OLMo-core models within a distributed reinforcement learning framework, supporting various training configurations and essential callbacks for monitoring and synchronization.

Highlights

  • New OLMo-core Ray Actor: Introduced PolicyTrainerOLMoCoreProcess, a new Ray actor designed to wrap OLMo-core's training infrastructure for distributed GRPO (Generalized Reinforcement Learning with Policy Optimization) training across multiple GPUs and nodes. Each actor manages training on a single GPU.
  • OLMo-core Model Group Manager: Added OLMoCoreModelGroup, a manager class responsible for orchestrating multiple PolicyTrainerOLMoCoreProcess actors within a Ray placement group, facilitating distributed training setup and coordination.
  • Distributed Training Setup: The new actor handles the initialization of the distributed training environment (e.g., torch.distributed, OLMo-core's prepare_training_environment), loading HuggingFace model weights into an OLMo-core model, and configuring data loaders and optimizers.
  • Callback Integration: The PolicyTrainerOLMoCoreProcess supports various callbacks for critical functionalities such as vLLM weight synchronization, reference policy updates, Beaker progress tracking, and Weights & Biases logging.
  • Type Checking Update: The pyproject.toml file has been updated to include the newly added grpo_olmo_core_actor.py for type checking, ensuring code quality and consistency.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new Ray actor (grpo_olmo_core_actor.py) for distributed GRPO training using OLMo-core. The implementation wraps OLMo-core's training infrastructure, including model setup, data loading, and callback management for weight synchronization and reference policy updates. The pyproject.toml file has been updated to include the new file for type checking. Overall, the code is well-structured and follows the intended design for integrating OLMo-core with Ray for distributed training. However, there are several areas for improvement regarding robustness, consistency, and clarity, particularly in parameter handling, type consistency, and resource allocation.

Comment thread open_instruct/grpo_olmo_core_actor.py
Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
Comment thread open_instruct/grpo_olmo_core_actor.py
Comment thread open_instruct/grpo_olmo_core_actor.py Outdated

model_basename = self.model_name_or_path.split("/")[-1]
config_name = model_basename.replace("-", "_").replace(".", "_")
config_name = config_name[:-1].lower() + "B" if config_name.endswith("B") else config_name.lower()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for deriving config_name from model_basename is specific and might be fragile. It assumes a particular naming convention (e.g., ending with 'B' for billion) and converts to lowercase. This could lead to issues if model names deviate from this pattern or if TransformerConfig expects specific casing. Consider making this more robust or adding comments about expected model naming conventions.

Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
Comment on lines +297 to +302
if hasattr(self, "vllm_engines") and self.vllm_engines:
trainer_callbacks["vllm_sync"] = VLLMWeightSyncCallback(
vllm_engines=self.vllm_engines,
model_update_group=getattr(self, "model_update_group", None),
actor_manager=getattr(self, "actor_manager", None),
name_mapper=olmo_core_to_hf_name,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using getattr(self, "vllm_engines", None) and getattr(self, "actor_manager", None) with a default of None can mask potential issues. If vllm_engines or actor_manager are expected to be initialized by setup_model_update_group and setup_callbacks respectively, it's better to ensure they are always set or raise an explicit error if they are not, rather than silently proceeding with None.

Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 386a649e76

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
@finbarrtimbers finbarrtimbers force-pushed the finbarr/grpo-callbacks-module branch 4 times, most recently from 8c2e075 to 0baad0b Compare January 22, 2026 17:23
@finbarrtimbers finbarrtimbers changed the title Add OLMo-core Ray actor (GRPO olmo-core implementation: PR 3 of 4) Add OLMo-core Ray actor (GRPO olmo-core: PR 4 of 5) Jan 22, 2026
@finbarrtimbers finbarrtimbers force-pushed the finbarr/grpo-callbacks-module branch from 563461f to 477fe56 Compare January 26, 2026 16:25
@finbarrtimbers finbarrtimbers force-pushed the finbarr/grpo-callbacks-module branch from fdd29d8 to 92963ef Compare March 11, 2026 23:57
Base automatically changed from finbarr/grpo-callbacks-module to main March 12, 2026 14:26
…e-actor

# Conflicts:
#	open_instruct/grpo_callbacks.py
#	open_instruct/grpo_fast.py
#	open_instruct/grpo_utils.py
#	open_instruct/olmo_core_train_modules.py
#	pyproject.toml
…l, add comments Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@hamishivi hamishivi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally seems good, some questions around future scalability

Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
Comment thread open_instruct/grpo_olmo_core_actor.py
Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
self.model = self.model_config.build(init_device="cpu")

logger.info(f"[Rank {self.rank}] Loading HuggingFace weights from {self.model_name_or_path}")
load_hf_model(self.model_name_or_path, self.model.state_dict(), work_dir=self.grpo_config.output_dir)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this handle distributed loading? If we have a really, really large model that can't fit into RAM on the machine, does this only load the required tensors?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't handle that. Neither does DPO. We'll have to handle this later.

Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
if self.grpo_config.single_gpu_mode:
# In distributed mode, FSDP handles dtype via dp_config.param_dtype
logger.info(f"[Rank {self.rank}] Converting model to bfloat16 for single_gpu_mode")
self.model = self.model.to(dtype=torch.bfloat16)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hardcoded dtype?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to be configurable via the config.

Comment thread open_instruct/grpo_olmo_core_actor.py
@finbarrtimbers finbarrtimbers added this pull request to the merge queue Mar 16, 2026
Merged via the queue into main with commit d879149 Mar 16, 2026
6 of 7 checks passed
@finbarrtimbers finbarrtimbers deleted the finbarr/grpo-olmo-core-actor branch March 16, 2026 20:46
finbarrtimbers added a commit that referenced this pull request Mar 17, 2026
…-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-merge-queue Bot pushed a commit that referenced this pull request Mar 19, 2026
)

* Extract ExperimentConfig to grpo_utils.py (GRPO olmo-core implementation: PR 1 of 4)

This refactoring extracts the shared configuration class that both grpo_fast.py
(existing DeepSpeed trainer) and the upcoming grpo.py (new OLMo-core trainer) need.

- Create grpo_utils.py with ExperimentConfig dataclass (moved from grpo_fast.py Args)
- Update grpo_fast.py to import from grpo_utils
- Update benchmark_generators.py to import from grpo_utils

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add GRPO callbacks and training module (GRPO olmo-core implementation: PR 2 of 4)

Add foundational components for the OLMo-core GRPO trainer:

- grpo_callbacks.py: VLLMWeightSyncCallback, RefPolicyUpdateCallback, olmo_core_to_hf_name()
- olmo_core_train_modules.py: GRPOTrainModule class for OLMo-core training
- pyproject.toml: Add both files to type checking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add OLMo-core Ray actor (GRPO olmo-core implementation: PR 3 of 4)

Add the Ray actor that wraps OLMo-core training:

- grpo_olmo_core_actor.py: PolicyTrainerOLMoCoreProcess and OLMoCoreModelGroup classes
- pyproject.toml: Add file to type checking

The actor coordinates distributed training via torch.distributed for FSDP gradient synchronization.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add GRPO main entry point and scripts (GRPO olmo-core implementation: PR 4 of 4)

Add the final integration for the OLMo-core GRPO trainer:

- grpo.py: Main training orchestration script using OLMo-core's Trainer
- scripts/train/debug/single_gpu_grpo.sh: Single GPU test script
- scripts/train/debug/multi_node_grpo.sh: Multi-node test script
- scripts/train/debug/tool_grpo.sh: Tool use test script
- pyproject.toml: Add grpo.py to type checking
- CHANGELOG.md: Document the new trainer

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Refactor PolicyTrainerOLMoCoreProcess to take config objects instead of ~30 individual params, matching grpo_fast.py pattern Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix logging convention violation, NameError bug in wait_for_gpus, and consolidate changelog entries Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restore OLMo-core GRPO actor changelog entry for PR #1398 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cleaned up PR.

* Unify duplicated functions between grpo.py and grpo_fast.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Clean up grpo.py: use backoff for wait_for_gpus, walrus operators, qualified imports, docstrings Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* cleaned up code

* Extend GPU wait timeout to 20 minutes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.