Fix ZeRO-2 discarding gradients during manual accumulation #1498
hamishivi merged 3 commits into allenai/open-instruct:main from allenai/open-instruct:fix/zero2-gradient-accumulation
Conversation
With `gradient_accumulation_steps=1` in the DeepSpeed config, every `backward()` call is treated as a gradient accumulation boundary. In ZeRO-2 (`stage_1_and_2.py`), hitting the boundary resets the internal gradient accumulator (`all_grad_tensors = None`), so only the last backward's gradient survives; all previous micro-batches are discarded. ZeRO-3 doesn't have this bug because it uses a separate `micro_step_id` counter that only resets in `step()`/`zero_grad()`, correctly accumulating across all backward calls.

Fix: use DeepSpeed's `set_gradient_accumulation_boundary()` API to explicitly mark only the final backward before `step()` as the boundary. This works with varying accumulation step counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
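A minimal sketch of the accumulation loop this fix enables. `set_gradient_accumulation_boundary()` is a real DeepSpeed engine method, but `model_engine`, `micro_batches`, and the forward signature here are illustrative assumptions, not code from the PR diff:

```python
# Hedged sketch: manual gradient accumulation under ZeRO-2, marking only
# the final backward() before step() as the accumulation boundary.
# `model_engine` is assumed to come from deepspeed.initialize();
# `micro_batches` and the .loss access are illustrative.
def train_step(model_engine, micro_batches):
    for i, batch in enumerate(micro_batches):
        is_last = i == len(micro_batches) - 1
        # Without this call, gradient_accumulation_steps=1 in the DS config
        # makes every backward a boundary, and ZeRO-2 drops the gradients
        # of all earlier micro-batches.
        model_engine.set_gradient_accumulation_boundary(is_last)
        loss = model_engine(**batch).loss  # e.g. an HF causal-LM style output
        model_engine.backward(loss)
    model_engine.step()  # applies the fully accumulated gradient
```

Because the boundary is set per call rather than baked into the config, the same loop handles a varying number of micro-batches per optimizer step.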
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue in DeepSpeed's ZeRO-2 optimization, which previously caused incorrect gradient accumulation during manual training loops. By explicitly signaling gradient accumulation boundaries, the fix ensures that all micro-batch gradients are correctly aggregated, thereby restoring the intended learning rate and improving training stability and performance.
Code Review
This pull request addresses a critical bug in the gradient accumulation logic when using DeepSpeed ZeRO-2, where gradients were being discarded during manual accumulation. The fix correctly uses set_gradient_accumulation_boundary() to inform DeepSpeed about the accumulation boundaries, ensuring gradients from all micro-batches are properly accumulated. The implementation is correct and improves code clarity. I've only suggested a minor update to the changelog to replace the placeholder pull request URL.
* Fix ZeRO-2 discarding gradients during manual gradient accumulation
* Update CHANGELOG with PR number
* clean comment

Co-authored-by: root <root@saturn-cs-aus-230.reviz.ai2.in>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
With `gradient_accumulation_steps=1` in the DS config, every `backward()` call is treated as a gradient accumulation boundary. In ZeRO-2 (`stage_1_and_2.py:881`), hitting the boundary resets the internal gradient accumulator (`all_grad_tensors[i] = None`), so only the last micro-batch's gradient survives to `step()`. ZeRO-3 doesn't have this bug because it uses a separate `micro_step_id` counter that only resets in `step()`/`zero_grad()`.

Fix: use DeepSpeed's `set_gradient_accumulation_boundary()` API to explicitly mark only the final backward before `step()` as the boundary. Handles varying accumulation step counts naturally.

Details
Root cause (in DeepSpeed `stage_1_and_2.py`): With `gradient_accumulation_steps=1`, every backward is a boundary, so `all_grad_tensors` is reset each time and `averaged_gradients` only ever holds the last backward's gradient.

ZeRO-3 is unaffected (in `stage3.py`): its `micro_step_id` counter only resets in `step()`/`zero_grad()`, so it accumulates correctly across all backward calls.
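To make the failure mode concrete, here is a deliberately simplified sketch, not DeepSpeed's actual code, of how a per-backward boundary wipes the stage-1/2 accumulator (class and method names here are hypothetical, loosely modeled on the `all_grad_tensors`/`averaged_gradients` names above):

```python
# Simplified illustration of the ZeRO-2 reset behavior described above.
class Zero2LikeAccumulator:
    def __init__(self):
        self.all_grad_tensors = None    # running sum across micro-batches
        self.averaged_gradients = None  # what step() eventually consumes

    def on_backward(self, grads, is_boundary):
        # Accumulate this micro-batch's gradients into the running buffer.
        if self.all_grad_tensors is None:
            self.all_grad_tensors = list(grads)
        else:
            self.all_grad_tensors = [a + g for a, g in zip(self.all_grad_tensors, grads)]
        if is_boundary:
            # Boundary hit: publish and reset. If every backward is a
            # boundary (gradient_accumulation_steps=1), each publish
            # overwrites the last, so only the final micro-batch's
            # gradient ever reaches step().
            self.averaged_gradients = self.all_grad_tensors
            self.all_grad_tensors = None
```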
Impact: With `num_mini_batches=1` and ~256 samples per rank, ZeRO-2 was using only the last sample's gradient (divided by the total token count of all 256 samples), giving an effective learning rate ~256x smaller than intended.
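A back-of-the-envelope check of that scale factor, with made-up numbers and the simplifying assumption of roughly equal per-sample gradient magnitudes:

```python
# Rough arithmetic behind the ~256x figure (illustrative numbers only).
samples_per_rank = 256
# Correct update: sum of 256 per-sample gradients / total token count.
# Buggy update:   1 per-sample gradient           / the same total count.
effective_lr_ratio = 1 / samples_per_rank
print(f"buggy step is ~{effective_lr_ratio:.4f}x the intended one")  # ~0.0039
```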
Validation

Running on a wordle sample.
🤖 Generated with Claude Code