Fixes some checkpointing bugs in dpo.py #1468
finbarrtimbers merged 6 commits into allenai/open-instruct:main from allenai/open-instruct:finbarr/fix-checkpointing

Conversation
- Add `ignore_errors=True` to `shutil.rmtree` in checkpoint cleanup to handle race conditions on shared filesystems where multiple nodes may delete the same directory (sketched below)
- Change `is_local_main_process` to `is_main_process` for checkpoint cleanup in dpo_tune_cache.py so only global rank 0 does cleanup on shared Weka
- Remove non-deterministic `int(time.time())` from `exp_name` construction in dpo_tune_cache.py, finetune.py, and utils.py so `output_dir` is stable across Beaker retries
- Add tests for `clean_last_n_checkpoints`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
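A minimal sketch of the cleanup pattern these bullets describe, assuming a simplified `clean_last_n_checkpoints` signature and a `step_*` checkpoint directory layout (the real helper in open-instruct may differ):

```python
import os
import shutil


def clean_last_n_checkpoints(output_dir: str, keep_last_n_checkpoints: int) -> None:
    """Delete all but the newest N step_* checkpoint directories."""
    steps = sorted(
        (d for d in os.listdir(output_dir) if d.startswith("step_")),
        key=lambda d: int(d.removeprefix("step_")),
    )
    cutoff = len(steps) - keep_last_n_checkpoints
    for name in steps[: max(cutoff, 0)]:
        # ignore_errors=True tolerates another process or node having already
        # removed the directory on a shared filesystem (the race condition above).
        shutil.rmtree(os.path.join(output_dir, name), ignore_errors=True)
```

At the call site, gating on `accelerator.is_main_process` (global rank 0) rather than `is_local_main_process` (rank 0 on each node) means only one process runs this cleanup on shared storage such as Weka.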
… integration test

- New scripts/train/debug/dpo/multi_node_cache.sh: 2-node dpo_tune_cache run that exercises checkpoint cleanup with `checkpointing_steps=5` and `keep_last_n_checkpoints=2`
- Enable checkpointing in dpo_integration_test.sh to exercise the `is_main_process` and `ignore_errors` fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes (from Gemini Code Assist)

This pull request significantly enhances the robustness and reproducibility of checkpointing mechanisms within the training pipelines, particularly for distributed DPO and finetuning. By eliminating non-deterministic elements from experiment naming and fortifying checkpoint cleanup against race conditions and errors, it ensures more reliable training runs and easier resumption from failures. Additionally, it improves testing coverage for checkpoint management and updates related scripts to reflect these changes and facilitate multi-node debugging.
Updated changelog with recent fixes and improvements.
Code Review
This pull request introduces crucial fixes for two checkpointing-related bugs, significantly improving the robustness of multi-node training and job resumption. The first fix addresses a race condition during checkpoint cleanup on shared filesystems by correctly restricting the cleanup operation to the global main process and handling cases where directories might already be removed. The second fix ensures deterministic experiment paths by removing timestamps from experiment names, which is essential for reliable checkpoint resumption after a job restart. The changes are well-implemented, and the addition of new unit tests for checkpoint cleanup is a commendable improvement that strengthens the codebase. Overall, this is an excellent contribution that enhances the stability of the training pipeline.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8f5b678037
…mps to DPO scripts

The `add_seed_and_date_to_exp_name` flag is removed. Instead, seed is always appended to `exp_name` in `setup_experiment_paths`, and each DPO script includes a timestamp via `$(date +%s)`. Also fixes `main_process_only` kwarg bug in dpo_tune_cache.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
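As a minimal sketch of that naming scheme, assuming a simplified `setup_experiment_paths` signature and a `__{seed}` separator (both illustrative; the real function likely takes more arguments):

```python
import os


def setup_experiment_paths(exp_name: str, seed: int, output_root: str) -> str:
    """The seed is always appended (the add_seed_and_date_to_exp_name flag is
    gone), so the Python side stays deterministic; any timestamp now comes
    from the launch script, e.g. --exp_name "dpo_debug_$(date +%s)"."""
    # The "__{seed}" separator is an assumption for illustration.
    return os.path.join(output_root, f"{exp_name}__{seed}")


# e.g. setup_experiment_paths("dpo_debug_1718000000", seed=42, output_root="/weka/out")
# -> "/weka/out/dpo_debug_1718000000__42"
```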
…gration test

The end-of-training cleanup was hardcoded to delete all checkpoints (`keep_last_n_checkpoints=0`), preventing checkpoint resume from working. Now respects the user's `--keep_last_n_checkpoints` setting. Adds a multi-node integration test that runs two sequential Beaker jobs to verify checkpoint save/resume works end-to-end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
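A hedged sketch of the cleanup change that commit describes; the `CleanupArgs` container and the `>= 0` semantics are assumptions, while `clean_last_n_checkpoints` matches the helper sketched earlier in this thread:

```python
from dataclasses import dataclass


@dataclass
class CleanupArgs:
    output_dir: str
    keep_last_n_checkpoints: int  # a negative value is assumed to mean "keep everything"


def end_of_training_cleanup(args: CleanupArgs, is_main_process: bool) -> None:
    # Previously this call was hardcoded to keep_last_n_checkpoints=0, which
    # deleted every checkpoint at the end of training and broke resume; now
    # the user's --keep_last_n_checkpoints value is passed through unchanged.
    # clean_last_n_checkpoints: see the sketch near the top of this thread.
    if is_main_process and args.keep_last_n_checkpoints >= 0:
        clean_last_n_checkpoints(args.output_dir, args.keep_last_n_checkpoints)
```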
Fixes several bugs in dpo_tune_cache.py that prevented checkpoint save/resume from working correctly, and adds an integration test to verify the fix.

Bug fixes:

- `clean_last_n_checkpoints` was hardcoded to `keep_last_n_checkpoints=0`, wiping all checkpoints at the end of training. Now respects the user's `--keep_last_n_checkpoints` setting.
- `beaker_config` UnboundLocalError: `beaker_config` was only assigned inside the `push_to_hub` block but referenced unconditionally by `with_tracking` and `try_auto_save_to_beaker`. Moved initialization outside `push_to_hub` and added `None` guards (see the sketch after this list).
- `build_reference_logprobs_cache` stale call signature: the call site passed `accelerator` and `reference_cache_hash`, but the function signature was updated to take `device`, `cache_path`, `is_main_process`, and `model_dims`. Updated to match.
- `gradient_checkpointing` AttributeError: `ExperimentConfig` doesn't have a `gradient_checkpointing` field. Replaced with `activation_memory_budget < 1` to match the actual config.
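A runnable sketch of the `beaker_config` fix from the second bullet; the helper name `maybe_get_beaker_config` and the stub bodies are assumptions, while `try_auto_save_to_beaker`, `with_tracking`, and `push_to_hub` come from the bullet above:

```python
from typing import Optional


def maybe_get_beaker_config() -> Optional[dict]:
    """Stub: assumed to return a config on Beaker and None elsewhere."""
    return None


def try_auto_save_to_beaker(beaker_config: dict) -> None:
    """Stub for the consumer named in the bullet above."""
    print("auto-saving results to Beaker:", beaker_config)


def end_of_training(push_to_hub: bool, with_tracking: bool) -> None:
    # The fix: assign beaker_config unconditionally, not only inside the
    # push_to_hub branch, so later references never raise UnboundLocalError.
    beaker_config = maybe_get_beaker_config()

    if push_to_hub:
        pass  # hub upload path, unchanged by this sketch

    # None guards: both consumers now tolerate running outside Beaker.
    if with_tracking and beaker_config is not None:
        print("attaching Beaker metadata to tracking run:", beaker_config)
    if beaker_config is not None:
        try_auto_save_to_beaker(beaker_config)
```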
Other changes:

- scripts/train/debug/dpo/checkpoint_integration_test.sh: a two-run integration test that trains 1 epoch (saving checkpoints), then launches a second run that resumes from those checkpoints.
- scripts/train/debug/dpo/multi_node_cache.sh for multi-node dpo_tune_cache testing.
- Removed `add_seed_and_date_to_exp_name`, always append seed, added timestamps to DPO scripts.

Ran the checkpoint integration test and it succeeded: