Replaces lambda collators with a single_example_collator to use a consistent interface #1472

finbarrtimbers merged 3 commits into allenai/open-instruct:main from finbarr/consistent-collation

Conversation
Summary of Changes

This pull request streamlines data collation and DPO training by replacing ad-hoc lambda functions with a consistent single_example_collator.
- Add single_example_collator to replace `lambda x: x[0]` everywhere, ensuring batch["index"] is always a tensor
- Track overflow in HFDataLoader._iter_batches so features filtered by the DPO collator are carried to the next batch instead of being silently dropped
- Overflow persists across epoch reshuffles
- Remove unnecessary `"index" in features[0]` guards in DPO collators

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
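For context, a minimal sketch of what such a named collator might look like, assuming the dataloader hands each collator a list containing exactly one pre-built example dict (the exact implementation in open-instruct may differ):

```python
import torch

def single_example_collator(examples: list[dict]) -> dict:
    # Replaces `lambda x: x[0]`: unwrap the single pre-collated example.
    example = examples[0]
    # Normalize "index" so downstream code can always treat
    # batch["index"] as a tensor.
    if "index" in example and not torch.is_tensor(example["index"]):
        example["index"] = torch.as_tensor(example["index"])
    return example
```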
Code Review
This pull request introduces a significant set of improvements for DPO training. The primary change is enabling packing with torch.compile, which was previously unsupported. This is achieved through a major refactoring of the data collators in padding_free_collator.py, introducing padding to a fixed max_seq_length to work around torch.compile's limitations with variable shapes.
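To illustrate the fixed-shape workaround described above, here is a hedged sketch; names like `max_seq_length` and `pad_id` are illustrative, and the PR's actual collator logic is more involved:

```python
import torch
import torch.nn.functional as F

def pad_to_fixed_length(input_ids: torch.Tensor, max_seq_length: int, pad_id: int) -> torch.Tensor:
    # torch.compile specializes on shapes; padding every packed batch to
    # one constant length avoids recompiling for each new batch width.
    pad_amount = max_seq_length - input_ids.shape[-1]
    assert pad_amount >= 0, "packed batch exceeds max_seq_length"
    return F.pad(input_ids, (0, pad_amount), value=pad_id)
```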
The refactoring is well-executed, with helper functions extracted for better modularity and readability. Several correctness bugs are also fixed, including an off-by-one error in seq_idx calculation for packed DPO batches and a more robust implementation of get_batch_logps using torch.bucketize for better performance and correctness with variable-length sequences.
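As a hedged illustration of the torch.bucketize approach mentioned above (variable names are ours, not necessarily the repo's): given cumulative sequence lengths for a packed batch, each token position can be mapped to its sequence index, and per-sequence log-probs reduced in one pass:

```python
import torch

cu_seq_lens = torch.tensor([0, 5, 12, 20])   # packed-sequence boundaries
token_logps = torch.randn(20)                # per-token log-probs
positions = torch.arange(20)

# right=True assigns a boundary position (e.g. 5) to the *next* sequence,
# so tokens 0-4 -> seq 0, 5-11 -> seq 1, 12-19 -> seq 2.
seq_idx = torch.bucketize(positions, cu_seq_lens[1:], right=True)

# Sum log-probs per packed sequence despite variable lengths.
per_seq_logps = torch.zeros(3).index_add_(0, seq_idx, token_logps)
```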
The PR also includes other valuable changes such as adding a profiling option, improving performance metric calculations in PerfCallback to correctly handle packed sequences, and replacing lambda collators with a named function for consistency, as stated in the title. The addition of comprehensive tests for the new DPO packing logic in test_padding_free_collator.py is excellent and provides high confidence in the correctness of these complex changes.
Overall, this is a high-quality contribution that improves performance, correctness, and maintainability of the DPO training code.
Force-pushed the finbarr/consistent-collation branch from c18148a to 1a578cf.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c18148ab25
```diff
 if len(batch_examples) == self._per_rank_batch_size:
-    yield to_device(self._collator(batch_examples), self._device)
+    all_examples = self._overflow + batch_examples
+    batch = to_device(self._collator(all_examples), self._device)
+    self._overflow = all_examples[len(batch["index"]) :]
+    yield batch
```
Drain deferred overflow examples before ending an epoch
HFDataLoader._iter_batches() now defers trimmed samples into _overflow, but iteration still stops after effective_size source rows and never flushes any remaining overflow. With packing (TensorDataCollatorWithFlatteningDPO._prefilter_features) this means long-sequence workloads can repeatedly keep fewer than per_rank_batch_size items, leaving many deferred samples untrained in that epoch (and potentially accumulating across epochs), which silently skews DPO sampling and can invalidate experiment results.
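One way to address this, sketched under our own assumptions rather than as the repo's actual fix: flush whatever remains in the overflow buffer once the source rows are exhausted.

```python
from typing import Callable, Iterable, Iterator

def iter_batches_with_drain(
    examples: Iterable[dict], batch_size: int, collate: Callable[[list], dict]
) -> Iterator[dict]:
    # `collate` may keep fewer than batch_size items (e.g. packing limits);
    # dropped items are deferred, and anything still deferred when the
    # epoch's source runs out is flushed rather than silently carried over.
    overflow: list = []
    pending: list = []
    for ex in examples:
        pending.append(ex)
        if len(pending) == batch_size:
            candidates = overflow + pending
            batch = collate(candidates)
            overflow = candidates[len(batch["index"]):]
            pending = []
            yield batch
    leftovers = overflow + pending
    if leftovers:  # end-of-epoch drain
        yield collate(leftovers)
```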
```python
if num_seqs is None:
    num_seqs = self.per_device_train_batch_size * 2
self._interval_num_sequences += num_seqs * self.num_training_gpus
```
Include grad accumulation in non-packing sequence counts
In PerfCallback.pre_step, the non-packing fallback uses per_device_train_batch_size * 2, but DPO batches are built with per_device_train_batch_size * gradient_accumulation_steps per rank. When gradient_accumulation_steps > 1 and cu_seq_lens_k is absent, _interval_num_sequences is undercounted, so avg_sequence_length and MFU are systematically inflated, making performance metrics misleading for accumulation-heavy runs.
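A sketch of the adjustment Codex appears to be suggesting; the exact attribute name for the accumulation factor is an assumption on our part:

```python
if num_seqs is None:
    # chosen + rejected per DPO example, across all accumulation micro-batches
    num_seqs = self.per_device_train_batch_size * 2 * self.gradient_accumulation_steps
self._interval_num_sequences += num_seqs * self.num_training_gpus
```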
…r guard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This lets us avoid some branching in #1466.
Experiment links: