Fixes some checkpointing bugs in dpo.py #1468

Merged
finbarrtimbers merged 6 commits into allenai/open-instruct:main from finbarr/fix-checkpointing on Feb 10, 2026

Conversation

finbarrtimbers (Collaborator) commented Feb 10, 2026

Fixes several bugs in dpo_tune_cache.py that prevented checkpoint save/resume from working correctly, and adds an integration test to verify the fixes.

Bug fixes:

  • End-of-training checkpoint cleanup deleted all checkpoints: clean_last_n_checkpoints was hardcoded to keep_last_n_checkpoints=0, wiping all checkpoints at the end of training. Now respects the user's --keep_last_n_checkpoints setting (see the sketch after this list).
  • beaker_config UnboundLocalError: beaker_config was only assigned inside the push_to_hub block but referenced unconditionally by with_tracking and try_auto_save_to_beaker. Moved initialization outside push_to_hub and added None guards.
  • build_reference_logprobs_cache stale call signature: The call site passed accelerator and reference_cache_hash, but the function signature was updated to take device, cache_path, is_main_process, and model_dims. Updated to match.
  • gradient_checkpointing AttributeError: ExperimentConfig doesn't have a gradient_checkpointing field. Replaced with activation_memory_budget < 1 to match the actual config.
  • Checkpoint resume race condition and non-deterministic output dir (earlier commits).
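
A rough sketch of the fix for the first bullet, assuming a clean_last_n_checkpoints helper along these lines (the step_* directory naming and exact signature are assumptions, not copied from the diff):

```python
import os
import shutil


def clean_last_n_checkpoints(output_dir: str, keep_last_n_checkpoints: int) -> None:
    """Delete all but the newest `keep_last_n_checkpoints` step_* directories."""
    checkpoints = sorted(
        (d for d in os.listdir(output_dir) if d.startswith("step_")),
        key=lambda name: int(name.split("_")[-1]),
    )
    n_to_delete = max(0, len(checkpoints) - keep_last_n_checkpoints)
    for name in checkpoints[:n_to_delete]:
        # ignore_errors=True tolerates another rank having already removed
        # the directory on a shared filesystem.
        shutil.rmtree(os.path.join(output_dir, name), ignore_errors=True)


# The bug: the end-of-training path called this with a hardcoded 0, deleting
# every checkpoint. The fix passes the user's setting through instead:
# clean_last_n_checkpoints(args.output_dir, args.keep_last_n_checkpoints)
```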

Other changes:

  • Added scripts/train/debug/dpo/checkpoint_integration_test.sh: a two-run integration test that trains 1 epoch (saving checkpoints), then launches a second run that resumes from those checkpoints (sketched below).
  • Added scripts/train/debug/dpo/multi_node_cache.sh for multi-node dpo_tune_cache testing.
  • Removed add_seed_and_date_to_exp_name; the seed is now always appended to exp_name, and the DPO scripts add explicit timestamps.
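
Roughly, the two-run test does the following (a Python rendering of the shell script's structure; --checkpointing_steps and --keep_last_n_checkpoints appear elsewhere in this PR, while --output_dir and --num_train_epochs are illustrative assumptions):

```python
import subprocess

common = [
    "python", "open_instruct/dpo_tune_cache.py",
    "--output_dir", "/tmp/dpo_checkpoint_test",  # shared between both runs
    "--checkpointing_steps", "5",
    "--keep_last_n_checkpoints", "2",
]

# Run 1: train one epoch and leave checkpoints behind.
subprocess.run(common + ["--num_train_epochs", "1"], check=True)

# Run 2: same output_dir, so resume logic must find the newest surviving
# step_* checkpoint and continue training from it.
subprocess.run(common + ["--num_train_epochs", "2"], check=True)
```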

Ran the checkpoint integration test and it succeeded.

finbarrtimbers and others added 3 commits February 10, 2026 15:20
- Add ignore_errors=True to shutil.rmtree in checkpoint cleanup to handle
  race conditions on shared filesystems where multiple nodes may delete
  the same directory
- Change is_local_main_process to is_main_process for checkpoint cleanup
  in dpo_tune_cache.py so only global rank 0 does cleanup on shared Weka
  (see the sketch after this commit message)
- Remove non-deterministic int(time.time()) from exp_name construction
  in dpo_tune_cache.py, finetune.py, and utils.py so output_dir is stable
  across Beaker retries
- Add tests for clean_last_n_checkpoints

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
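
The gating change, as a sketch (accelerator is a Hugging Face accelerate.Accelerator; the barrier is illustrative, and clean_last_n_checkpoints is the helper sketched above):

```python
from accelerate import Accelerator


def cleanup_checkpoints(accelerator: Accelerator, output_dir: str, keep_last_n: int) -> None:
    # is_local_main_process is True for rank 0 of *every* node, so on a
    # shared filesystem like Weka several nodes would race to delete the
    # same directories. is_main_process is True on exactly one global rank.
    if accelerator.is_main_process:
        clean_last_n_checkpoints(output_dir, keep_last_n)
    # Illustrative barrier so no rank moves on mid-cleanup.
    accelerator.wait_for_everyone()
```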
… integration test

- New scripts/train/debug/dpo/multi_node_cache.sh: 2-node dpo_tune_cache
  run that exercises checkpoint cleanup with checkpointing_steps=5 and
  keep_last_n_checkpoints=2
- Enable checkpointing in dpo_integration_test.sh to exercise the
  is_main_process and ignore_errors fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gemini-code-assist (Contributor) commented

Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and reproducibility of checkpointing mechanisms within the training pipelines, particularly for distributed DPO and finetuning. By eliminating non-deterministic elements from experiment naming and fortifying checkpoint cleanup against race conditions and errors, it ensures more reliable training runs and easier resumption from failures. Additionally, it improves testing coverage for checkpoint management and updates related scripts to reflect these changes and facilitate multi-node debugging.

Highlights

  • Checkpointing Reliability: Addressed a race condition during checkpoint cleanup on shared filesystems by using ignore_errors=True with shutil.rmtree and ensuring cleanup is restricted to the global main process.
  • Reproducibility and Resume Stability: Fixed checkpoint resume failures, particularly on Beaker retries, by removing non-deterministic timestamps from experiment names (exp_name) in DPO and finetuning scripts, ensuring consistent output directory naming.
  • Distributed Process Handling: Changed checkpoint cleanup logic from accelerator.is_local_main_process to accelerator.is_main_process to correctly handle cleanup in distributed training environments.
  • Checkpoint Cleanup Testing: Introduced new unit tests for the clean_last_n_checkpoints utility function to verify its behavior under various scenarios, including keeping a specific number of checkpoints and handling already removed directories.
  • Benchmarking and Debugging Scripts: Updated benchmarking scripts to use more VLLM engines and added a new multi-node DPO debug script with checkpointing enabled, along with enabling checkpointing in an existing DPO integration test.


Changelog

  • CHANGELOG.md
    • Fixed checkpoint cleanup race condition on shared filesystems by using ignore_errors=True and restricting cleanup to global rank 0.
    • Fixed checkpoint resume failing on Beaker retries by removing non-deterministic timestamp from exp_name.

Updated changelog with recent fixes and improvements.
gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces crucial fixes for two checkpointing-related bugs, significantly improving the robustness of multi-node training and job resumption. The first fix addresses a race condition during checkpoint cleanup on shared filesystems by correctly restricting the cleanup operation to the global main process and handling cases where directories might already be removed. The second fix ensures deterministic experiment paths by removing timestamps from experiment names, which is essential for reliable checkpoint resumption after a job restart. The changes are well-implemented, and the addition of new unit tests for checkpoint cleanup is a commendable improvement that strengthens the codebase. Overall, this is an excellent contribution that enhances the stability of the training pipeline.


chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f5b678037


Comment thread: open_instruct/dpo_tune_cache.py
finbarrtimbers and others added 2 commits February 10, 2026 15:43
…mps to DPO scripts

The add_seed_and_date_to_exp_name flag is removed. Instead, the seed is always
appended to exp_name in setup_experiment_paths, and each DPO script includes
a timestamp via $(date +%s) (see the sketch below). Also fixes a
main_process_only kwarg bug in dpo_tune_cache.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
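
A sketch of the resulting naming scheme (setup_experiment_paths is named above; its actual signature and the separator are assumptions):

```python
import os


def setup_experiment_paths(exp_name: str, seed: int, base_output_dir: str) -> str:
    # The seed is always appended and no int(time.time()) is mixed in, so
    # the directory is identical across Beaker retries and a restarted job
    # can find its own checkpoints.
    return os.path.join(base_output_dir, f"{exp_name}__seed_{seed}")
```

Per-experiment uniqueness then comes from the launch script, which bakes $(date +%s) into exp_name once at submission time, so retries of the same job reuse the same path.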
…gration test

The end-of-training cleanup was hardcoded to delete all checkpoints
(keep_last_n_checkpoints=0), preventing checkpoint resume from working.
Now respects the user's --keep_last_n_checkpoints setting. Adds a
multi-node integration test that runs two sequential Beaker jobs to
verify checkpoint save/resume works end-to-end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@finbarrtimbers finbarrtimbers added this pull request to the merge queue Feb 10, 2026
Merged via the queue into main with commit c9daa73 Feb 10, 2026
7 checks passed
@finbarrtimbers finbarrtimbers deleted the finbarr/fix-checkpointing branch February 10, 2026 22:59