Add GRPO main entry point and scripts (GRPO olmo-core: PR 5 of 5) #1399
finbarrtimbers merged 15 commits into allenai/open-instruct:main
Conversation
…ion: PR 1 of 4) This refactoring extracts the shared configuration class that both grpo_fast.py (existing DeepSpeed trainer) and the upcoming grpo.py (new OLMo-core trainer) need.
- Create grpo_utils.py with ExperimentConfig dataclass (moved from grpo_fast.py Args)
- Update grpo_fast.py to import from grpo_utils
- Update benchmark_generators.py to import from grpo_utils
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…: PR 2 of 4) Add foundational components for the OLMo-core GRPO trainer:
- grpo_callbacks.py: VLLMWeightSyncCallback, RefPolicyUpdateCallback, olmo_core_to_hf_name()
- olmo_core_train_modules.py: GRPOTrainModule class for OLMo-core training
- pyproject.toml: Add both files to type checking
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add the Ray actor that wraps OLMo-core training:
- grpo_olmo_core_actor.py: PolicyTrainerOLMoCoreProcess and OLMoCoreModelGroup classes
- pyproject.toml: Add file to type checking
The actor coordinates distributed training via torch.distributed for FSDP gradient synchronization.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… PR 4 of 4) Add the final integration for the OLMo-core GRPO trainer:
- grpo.py: Main training orchestration script using OLMo-core's Trainer
- scripts/train/debug/single_gpu_grpo.sh: Single GPU test script
- scripts/train/debug/multi_node_grpo.sh: Multi-node test script
- scripts/train/debug/tool_grpo.sh: Tool use test script
- pyproject.toml: Add grpo.py to type checking
- CHANGELOG.md: Document the new trainer
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
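For readers unfamiliar with the pattern described in the actor commit above, the following is a generic sketch of a Ray actor that joins a torch.distributed process group for FSDP gradient synchronization. The class and method names are illustrative, not the repository's actual PolicyTrainerOLMoCoreProcess API.

```python
import os

import ray
import torch.distributed as dist


@ray.remote(num_gpus=1, num_cpus=4)
class TrainerProcess:
    """Illustrative stand-in for a per-GPU training actor."""

    def setup_distributed(self, master_addr: str, master_port: int, world_size: int, rank: int) -> None:
        # Every actor joins the same NCCL process group, so FSDP can
        # all-reduce gradients across learners during the backward pass.
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        dist.init_process_group(backend="nccl", world_size=world_size, rank=rank)
```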
Summary of Changes
Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request delivers the foundational implementation for Group Relative Policy Optimization (GRPO) training integrated with the OLMo-core framework. It enables distributed training via FSDP (rather than DeepSpeed) and adds scripts for single-GPU, multi-node, and tool-augmented training. This completes the OLMo-core GRPO component integration with a unified, scalable training setup.
Code Review
This pull request introduces the main entry point and associated scripts for GRPO training using OLMo-core's Trainer. It also updates the CHANGELOG.md and pyproject.toml for type checking. The changes integrate various components for distributed training with Ray and vLLM, including experiment tracking, tool initialization, and model saving. The new scripts provide debug configurations for single-GPU, multi-node, and tool-use scenarios.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 811ba9adde
wait_for_gpus(sum(args.num_learners_per_node))
bundles = [{"GPU": n, "CPU": n} for n in args.num_learners_per_node]
Reserve enough CPUs in the Ray placement group
The placement-group bundles reserve CPU equal to the GPU count ({"GPU": n, "CPU": n}), but each PolicyTrainerOLMoCoreProcess actor requests 4 CPUs (num_cpus_per_actor = 4 in open_instruct/grpo_olmo_core_actor.py:391-409). On a 1‑GPU run this makes the bundle provide only 1 CPU while the actor needs 4, so the actors cannot be scheduled and the training will hang at pg.ready() or actor creation. The bundle CPU should scale to at least 4 * n (or whatever the actor CPU requirement is) to make the placement group feasible.
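A minimal sketch of the suggested adjustment, assuming four CPUs per actor as cited in the comment; the constant name is illustrative, not from the repository.

```python
# Sketch of the suggested fix: reserve enough CPUs in each bundle for all the
# actors scheduled on that node, rather than matching the GPU count one-to-one.
CPUS_PER_ACTOR = 4  # the per-actor CPU requirement cited from grpo_olmo_core_actor.py

bundles = [
    {"GPU": n, "CPU": CPUS_PER_ACTOR * n}  # n actors per node, 4 CPUs each
    for n in args.num_learners_per_node
]
```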
| logger.info(f"Only {available_gpus} GPUs available, waiting for {expected_gpus}...") | ||
| time.sleep(poll_interval) | ||
| logger.error(f"Timeout waiting for GPUs. Only {available_gpus} available, needed {expected_gpus}") |
There was a problem hiding this comment.
Fail fast when GPUs never appear in the cluster
When the Ray cluster never reaches the expected GPU count, wait_for_gpus only logs an error and then returns, so the code proceeds to create a placement group that will block indefinitely. This means a misconfigured or undersized cluster will hang the job instead of terminating with a clear failure. Consider raising an exception (or exiting) after the timeout so the run fails fast in that scenario.
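A sketch of the fail-fast variant, assuming wait_for_gpus polls ray.cluster_resources(); the signature and default values are guesses based on the quoted context, not the repository's actual code.

```python
import logging
import time

import ray

logger = logging.getLogger(__name__)


def wait_for_gpus(expected_gpus: int, timeout_s: float = 600.0, poll_interval: float = 5.0) -> None:
    """Block until the Ray cluster reports enough GPUs, raising on timeout."""
    deadline = time.time() + timeout_s
    available_gpus = 0
    while time.time() < deadline:
        available_gpus = int(ray.cluster_resources().get("GPU", 0))
        if available_gpus >= expected_gpus:
            return
        logger.info(f"Only {available_gpus} GPUs available, waiting for {expected_gpus}...")
        time.sleep(poll_interval)
    # Raise instead of only logging, so a misconfigured or undersized cluster
    # fails fast here rather than hanging later at placement-group creation.
    raise RuntimeError(f"Timeout waiting for GPUs. Only {available_gpus} available, needed {expected_gpus}")
```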
# Conflicts:
#	CHANGELOG.md
#	open_instruct/grpo_callbacks.py
#	open_instruct/grpo_fast.py
#	open_instruct/grpo_utils.py
#	open_instruct/olmo_core_train_modules.py
#	pyproject.toml
…of ~30 individual params, matching grpo_fast.py pattern Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	open_instruct/grpo_olmo_core_actor.py
#	pyproject.toml
… consolidate changelog entries Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ed-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…alified imports, docstrings Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hamishivi left a comment
From a quick scan, seems good minus one comment. Would like to test this more tho!
… (1M context) <noreply@anthropic.com>
Summary
- grpo.py: main training orchestration script using OLMo-core's Trainer with Ray actors
- Debug scripts: single_gpu_grpo.sh, multi_node_grpo.sh, tool_grpo.sh
- Shared logic between grpo.py and grpo_fast.py: grpo.py now calls grpo_fast.setup_runtime_variables and grpo_fast.create_generation_configs instead of maintaining its own copies
- Moved the is_beaker_job() guard into maybe_get_beaker_config() so callers don't need to wrap every call (see the sketch below)
- Fixed grpo_fast.create_generation_configs, where vllm_config was referenced but not passed as a parameter
- Added grpo.py to type checking in pyproject.toml
- Updated CHANGELOG.md
Depends on: #1398
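For illustration, a minimal sketch of the maybe_get_beaker_config() guard consolidation mentioned above; the function bodies and the environment variable are assumptions, not the repository's actual implementation.

```python
import os
from typing import Optional


def is_beaker_job() -> bool:
    # Assumed check: running inside a Beaker job (env var name is illustrative).
    return "BEAKER_JOB_ID" in os.environ


def maybe_get_beaker_config() -> Optional[dict]:
    # The is_beaker_job() guard now lives here, so callers can simply write
    #   beaker_config = maybe_get_beaker_config()
    # and handle None, instead of wrapping every call in `if is_beaker_job():`.
    if not is_beaker_job():
        return None
    return {"beaker_job_id": os.environ["BEAKER_JOB_ID"]}
```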
GPU_TESTS=01KKY8PKQYXPDTJFCT37Q20E9X
Runs: