Fixes some checkpointing bugs in dpo.py #1468
finbarrtimbers merged 6 commits into allenai/open-instruct:main from allenai/open-instruct:finbarr/fix-checkpointing

Conversation
- Add `ignore_errors=True` to `shutil.rmtree` in checkpoint cleanup to handle race conditions on shared filesystems where multiple nodes may delete the same directory (sketched below)
- Change `is_local_main_process` to `is_main_process` for checkpoint cleanup in dpo_tune_cache.py so only global rank 0 does cleanup on shared Weka
- Remove non-deterministic `int(time.time())` from `exp_name` construction in dpo_tune_cache.py, finetune.py, and utils.py so `output_dir` is stable across Beaker retries
- Add tests for `clean_last_n_checkpoints`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
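A minimal sketch of the cleanup pattern these bullets describe, assuming a simplified `clean_last_n_checkpoints` signature and a `step_*` checkpoint directory layout (the real helper in open-instruct may differ):

```python
import os
import shutil


def clean_last_n_checkpoints(output_dir: str, keep_last_n_checkpoints: int) -> None:
    """Delete all but the newest N step_* checkpoint directories."""
    steps = sorted(
        (d for d in os.listdir(output_dir) if d.startswith("step_")),
        key=lambda d: int(d.removeprefix("step_")),
    )
    cutoff = len(steps) - keep_last_n_checkpoints
    for name in steps[: max(cutoff, 0)]:
        # ignore_errors=True tolerates another process or node having already
        # removed the directory on a shared filesystem (the race condition above).
        shutil.rmtree(os.path.join(output_dir, name), ignore_errors=True)
```

At the call site, gating on `accelerator.is_main_process` (global rank 0) rather than `is_local_main_process` (rank 0 on each node) means only one process runs this cleanup on shared storage such as Weka.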
… integration test

- New scripts/train/debug/dpo/multi_node_cache.sh: 2-node dpo_tune_cache run that exercises checkpoint cleanup with `checkpointing_steps=5` and `keep_last_n_checkpoints=2`
- Enable checkpointing in dpo_integration_test.sh to exercise the `is_main_process` and `ignore_errors` fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes (from Gemini Code Assist)

This pull request significantly enhances the robustness and reproducibility of checkpointing mechanisms within the training pipelines, particularly for distributed DPO and finetuning. By eliminating non-deterministic elements from experiment naming and fortifying checkpoint cleanup against race conditions and errors, it ensures more reliable training runs and easier resumption from failures. Additionally, it improves testing coverage for checkpoint management and updates related scripts to reflect these changes and facilitate multi-node debugging.
Updated changelog with recent fixes and improvements.
Code Review
This pull request introduces crucial fixes for two checkpointing-related bugs, significantly improving the robustness of multi-node training and job resumption. The first fix addresses a race condition during checkpoint cleanup on shared filesystems by correctly restricting the cleanup operation to the global main process and handling cases where directories might already be removed. The second fix ensures deterministic experiment paths by removing timestamps from experiment names, which is essential for reliable checkpoint resumption after a job restart. The changes are well-implemented, and the addition of new unit tests for checkpoint cleanup is a commendable improvement that strengthens the codebase. Overall, this is an excellent contribution that enhances the stability of the training pipeline.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8f5b678037
…mps to DPO scripts

The `add_seed_and_date_to_exp_name` flag is removed. Instead, seed is always appended to `exp_name` in `setup_experiment_paths`, and each DPO script includes a timestamp via `$(date +%s)`. Also fixes `main_process_only` kwarg bug in dpo_tune_cache.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
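As a minimal sketch of that naming scheme, assuming a simplified `setup_experiment_paths` signature and a `__{seed}` separator (both illustrative; the real function likely takes more arguments):

```python
import os


def setup_experiment_paths(exp_name: str, seed: int, output_root: str) -> str:
    """The seed is always appended (the add_seed_and_date_to_exp_name flag is
    gone), so the Python side stays deterministic; any timestamp now comes
    from the launch script, e.g. --exp_name "dpo_debug_$(date +%s)"."""
    # The "__{seed}" separator is an assumption for illustration.
    return os.path.join(output_root, f"{exp_name}__{seed}")


# e.g. setup_experiment_paths("dpo_debug_1718000000", seed=42, output_root="/weka/out")
# -> "/weka/out/dpo_debug_1718000000__42"
```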
…gration test

The end-of-training cleanup was hardcoded to delete all checkpoints (`keep_last_n_checkpoints=0`), preventing checkpoint resume from working. Now respects the user's `--keep_last_n_checkpoints` setting. Adds a multi-node integration test that runs two sequential Beaker jobs to verify checkpoint save/resume works end-to-end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
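A hedged sketch of the cleanup change that commit describes; the `CleanupArgs` container and the `>= 0` semantics are assumptions, while `clean_last_n_checkpoints` matches the helper sketched earlier in this thread:

```python
from dataclasses import dataclass


@dataclass
class CleanupArgs:
    output_dir: str
    keep_last_n_checkpoints: int  # a negative value is assumed to mean "keep everything"


def end_of_training_cleanup(args: CleanupArgs, is_main_process: bool) -> None:
    # Previously this call was hardcoded to keep_last_n_checkpoints=0, which
    # deleted every checkpoint at the end of training and broke resume; now
    # the user's --keep_last_n_checkpoints value is passed through unchanged.
    # clean_last_n_checkpoints: see the sketch near the top of this thread.
    if is_main_process and args.keep_last_n_checkpoints >= 0:
        clean_last_n_checkpoints(args.output_dir, args.keep_last_n_checkpoints)
```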
Fixes several bugs in dpo_tune_cache.py that prevented checkpoint save/resume from working correctly, and adds an integration test to verify the fix.

Bug fixes:

- `clean_last_n_checkpoints` was hardcoded to `keep_last_n_checkpoints=0`, wiping all checkpoints at the end of training. Now respects the user's `--keep_last_n_checkpoints` setting.
- `beaker_config` UnboundLocalError: `beaker_config` was only assigned inside the `push_to_hub` block but referenced unconditionally by `with_tracking` and `try_auto_save_to_beaker`. Moved initialization outside `push_to_hub` and added `None` guards (see the sketch after this list).
- `build_reference_logprobs_cache` stale call signature: the call site passed `accelerator` and `reference_cache_hash`, but the function signature was updated to take `device`, `cache_path`, `is_main_process`, and `model_dims`. Updated to match.
- `gradient_checkpointing` AttributeError: `ExperimentConfig` doesn't have a `gradient_checkpointing` field. Replaced with `activation_memory_budget < 1` to match the actual config.
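A runnable sketch of the `beaker_config` fix from the second bullet; the helper name `maybe_get_beaker_config` and the stub bodies are assumptions, while `try_auto_save_to_beaker`, `with_tracking`, and `push_to_hub` come from the bullet above:

```python
from typing import Optional


def maybe_get_beaker_config() -> Optional[dict]:
    """Stub: assumed to return a config on Beaker and None elsewhere."""
    return None


def try_auto_save_to_beaker(beaker_config: dict) -> None:
    """Stub for the consumer named in the bullet above."""
    print("auto-saving results to Beaker:", beaker_config)


def end_of_training(push_to_hub: bool, with_tracking: bool) -> None:
    # The fix: assign beaker_config unconditionally, not only inside the
    # push_to_hub branch, so later references never raise UnboundLocalError.
    beaker_config = maybe_get_beaker_config()

    if push_to_hub:
        pass  # hub upload path, unchanged by this sketch

    # None guards: both consumers now tolerate running outside Beaker.
    if with_tracking and beaker_config is not None:
        print("attaching Beaker metadata to tracking run:", beaker_config)
    if beaker_config is not None:
        try_auto_save_to_beaker(beaker_config)
```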
Other changes:

- scripts/train/debug/dpo/checkpoint_integration_test.sh: a two-run integration test that trains 1 epoch (saving checkpoints), then launches a second run that resumes from those checkpoints.
- scripts/train/debug/dpo/multi_node_cache.sh for multi-node dpo_tune_cache testing.
- Removed `add_seed_and_date_to_exp_name`, always append seed, added timestamps to DPO scripts.

Ran the checkpoint integration test and it succeeded: