Add GRPO callbacks for OLMo-core Trainer (GRPO olmo-core: PR 3 of 5) #1397
finbarrtimbers merged 1 commit into allenai/open-instruct:main from allenai/open-instruct:finbarr/grpo-callbacks-module
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces foundational components for Group Relative Policy Optimization (GRPO) within the OLMo-core framework. It establishes new callback mechanisms for managing model weight synchronization with vLLM inference engines and for updating reference policies, alongside a dedicated training module that implements the GRPO algorithm. These additions lay the groundwork for reinforcement learning from human feedback (RLHF) training workflows, specifically for the OLMo-core actor.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3a8a137290
Code Review
This pull request introduces GRPO-specific callbacks and a training module for OLMo-core, which are foundational components for the OLMo-core actor. The changes include VLLMWeightSyncCallback for synchronizing weights, RefPolicyUpdateCallback for Polyak updates, and GRPOTrainModule for the GRPO training algorithm. The pyproject.toml file has been updated to include these new files for type checking. Overall, the code is well structured and follows existing patterns, but there are a few areas for improvement regarding type safety, code duplication, and error-handling specificity.
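To make the weight-sync mechanism concrete, here is a minimal sketch of a post-step callback in the general style the review describes. The `Callback` base class, the `get_state_dict` callable, and the `load_weights` method on the engines are all illustrative stand-ins, not the actual OLMo-core or vLLM APIs.

```python
class Callback:
    """Stand-in for an OLMo-core-style trainer callback base class."""

    def post_step(self) -> None:
        """Hook invoked by the trainer after each optimizer step."""


class VLLMWeightSyncSketch(Callback):
    """Hypothetical sketch: push current policy weights to inference engines
    after every training step, as VLLMWeightSyncCallback is described."""

    def __init__(self, get_state_dict, engines):
        # get_state_dict: callable returning {param_name: weight} for the policy.
        # engines: inference engines assumed to expose a load_weights(items) method.
        self.get_state_dict = get_state_dict
        self.engines = engines

    def post_step(self) -> None:
        state = self.get_state_dict()
        for engine in self.engines:
            engine.load_weights(state.items())  # assumed vLLM-like interface
```

In a real implementation the state dict would hold tensors (possibly gathered across ranks) rather than plain values, and the sync would typically be batched or throttled rather than run unconditionally every step.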
hamishivi left a comment:
Happy to merge this, although we probably want to test these code chunks more thoroughly once more of the GRPO implementation is in.
Agreed! I did add some tests, but I agree completely.
…R 3 of 5) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds callbacks for GRPO training with OLMo-core's Trainer:
- VLLMWeightSyncCallback: syncs weights to vLLM engines after each step
- RefPolicyUpdateCallback: Polyak averaging for reference policy updates

Based on PR #1412 (GRPOTrainModule)
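The Polyak averaging that RefPolicyUpdateCallback is described as performing can be sketched as a simple exponential moving average of parameters. The version below operates on plain Python dicts of floats for clarity; the real callback would apply the same rule in-place to the reference model's parameter tensors, and the function name and signature here are illustrative, not the actual open-instruct API.

```python
def polyak_update(ref_params: dict, policy_params: dict, alpha: float) -> dict:
    """Polyak (EMA) update of a reference policy toward the current policy:

        ref <- alpha * ref + (1 - alpha) * policy

    applied element-wise. alpha close to 1.0 means the reference policy
    moves slowly, which keeps the KL anchor in GRPO-style training stable.
    """
    return {
        name: alpha * ref_params[name] + (1.0 - alpha) * policy_params[name]
        for name in ref_params
    }
```

For example, with `alpha=0.9`, a reference weight of `0.0` and a policy weight of `1.0` yield an updated reference weight of about `0.1`: the reference policy drifts toward the current policy by a tenth of the gap each update.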
GPU_TESTS=01KFKCWJQKNEB71EZSA93XRKWF