Adds type checking for dataset_transformation.py. #1390

finbarrtimbers merged 6 commits into allenai/open-instruct:main from allenai/open-instruct:finbarr/more-type-checking
Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the type safety and robustness of the dataset transformation code.
Code Review
The pull request primarily focuses on enhancing type safety and robustness across various dataset transformation functions and configurations. Key improvements include adding explicit type assertions for loaded datasets, refining conditional checks to prevent potential None related errors, and updating type hints for better clarity and correctness. These changes contribute to a more reliable and maintainable codebase.
```diff
     self.pad_token_id = pad_token_id

-    def __call__(self, batch: List[Dict[str, int]]):
+    def __call__(self, batch: List[Dict[str, List[int]]]):
```
The type hint for the batch parameter in SimplePreferenceCollator.__call__ has been updated from List[Dict[str, int]] to List[Dict[str, List[int]]]. This is a critical correction as the input_ids_chosen and input_ids_rejected fields within the batch are lists of integers, not single integers. The previous type hint was misleading and could lead to incorrect assumptions about the data structure.
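To make the corrected hint concrete, here is a minimal sketch of a collator matching it. The class and field names (`SimplePreferenceCollator`, `input_ids_chosen`, `input_ids_rejected`) come from the review comment; the padding logic is an assumption for illustration, not the repository's actual implementation.

```python
from typing import Dict, List


class SimplePreferenceCollator:
    """Sketch of a preference collator; padding behavior is assumed."""

    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id

    def __call__(self, batch: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
        # Each example carries token-id *lists*, which is why the values
        # are List[int] rather than int.
        chosen = [ex["input_ids_chosen"] for ex in batch]
        rejected = [ex["input_ids_rejected"] for ex in batch]
        max_len = max(len(seq) for seq in chosen + rejected)

        def pad(seq: List[int]) -> List[int]:
            return seq + [self.pad_token_id] * (max_len - len(seq))

        return {
            "input_ids_chosen": [pad(s) for s in chosen],
            "input_ids_rejected": [pad(s) for s in rejected],
        }
```

With the old `List[Dict[str, int]]` hint, a type checker would have flagged the padding arithmetic above, since you cannot take `len()` of an `int`.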
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c1d35e8a6c
```python
if commit_hash is None:
    raise ValueError(f"Could not extract commit hash from {file}")
```
Avoid hard failure when commit hash is missing
The new ValueError here makes get_commit_hash fail whenever custom_cached_file returns None, which is the normal case on a fresh machine or first run where the dataset/model files haven’t been cached yet. Because DatasetConfig.__post_init__ calls get_commit_hash before load_dataset, this change now aborts the run instead of downloading the dataset. Consider allowing None (or triggering a download) so first-time usage doesn’t crash.
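A lenient variant along the lines the reviewer suggests might look like the sketch below. The `custom_cached_file` helper is only represented by its result (`cached_path`), and the Hugging Face `snapshots/<commit_hash>/` cache-layout parsing is an assumption for illustration, not the repository's code.

```python
from typing import Optional


def get_commit_hash_lenient(cached_path: Optional[str]) -> Optional[str]:
    """Return the commit hash if available, or None instead of raising.

    `cached_path` stands in for the return value of custom_cached_file();
    on a fresh machine it is None because nothing has been downloaded yet.
    """
    if cached_path is None:
        # First run / cold cache: defer resolution instead of crashing,
        # so load_dataset() still gets a chance to download the files.
        return None
    # Assumed HF cache layout: .../snapshots/<commit_hash>/<file>
    parts = cached_path.split("/")
    if "snapshots" in parts:
        idx = parts.index("snapshots")
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return None
```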
```python
if self.tokenizer_revision is None:
    raise ValueError("tokenizer_revision must be set")
```
Allow tokenizer revision to default when unset
This now raises if tokenizer_revision is unset, but several entrypoints use TokenizerConfig without providing a revision (e.g., scripts/data/convert_sft_data_for_olmocore.py only passes --tokenizer_name_or_path). Transformers accept revision=None and default to main, so this change turns previously valid invocations into hard errors. Either supply a default or keep accepting None here.
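One way to keep those entrypoints working is to default the revision rather than raise, as sketched below. This is a hypothetical fragment of `TokenizerConfig` (only the fields relevant here), not the repository's actual dataclass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenizerConfigSketch:
    """Hypothetical slice of TokenizerConfig for illustration."""

    tokenizer_name_or_path: str
    tokenizer_revision: Optional[str] = None

    def __post_init__(self):
        # Mirror transformers' behavior, where revision=None resolves to
        # the "main" branch, instead of turning unset into a hard error.
        if self.tokenizer_revision is None:
            self.tokenizer_revision = "main"
```

With this approach `mypy` still sees a non-None revision after `__post_init__`, while callers such as the conversion script keep working without changes.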
```python
frac_or_num_samples: Optional[Union[int, float]] = None
original_dataset_size: Optional[int] = None
is_upsampled: bool = False
dataset: Dataset = field(init=False)
```
Exclude Dataset objects from config hashing
Adding dataset as a dataclass field means asdict(dc) in compute_config_hash will now include a datasets.Dataset instance; json.dumps will raise TypeError: Object of type Dataset is not JSON serializable. This will break cache hash computation whenever dataset_config_hash is not provided (the default). The dataset object should be excluded from the config dict or marked so it isn’t serialized.
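One fix is to tag runtime-only fields with dataclass `metadata` and skip them when building the hash payload, as in this sketch. The field names follow the review comment, but the `exclude_from_hash` flag and the hashing helper are assumptions, not the repository's implementation.

```python
import hashlib
import json
from dataclasses import dataclass, field, fields
from typing import Optional, Union


@dataclass
class DatasetConfigSketch:
    """Hypothetical slice of DatasetConfig for illustration."""

    frac_or_num_samples: Optional[Union[int, float]] = None
    original_dataset_size: Optional[int] = None
    is_upsampled: bool = False
    # Runtime-only object (stands in for datasets.Dataset); the metadata
    # flag marks it as excluded from hash computation.
    dataset: object = field(init=False, default=None, metadata={"exclude_from_hash": True})


def compute_config_hash_sketch(dc) -> str:
    # Serialize only JSON-safe config fields, skipping runtime objects,
    # so json.dumps never sees a Dataset instance.
    payload = {
        f.name: getattr(dc, f.name)
        for f in fields(dc)
        if not f.metadata.get("exclude_from_hash", False)
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```

Iterating `fields()` with a metadata filter, rather than calling `asdict()`, is the key design choice: `asdict()` unconditionally includes every field, which is exactly what makes the reported `TypeError` possible.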
* Now, type checking should pass
* Added changelog update.
* lots of type checking fixes
* type checking passes
* updated changelog.
Also enables rule UP006, which bans `from typing import List, Dict`, etc., in favor of the native `list`, `dict`, etc. types.
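As a minimal before/after illustration of what UP006 flags (assuming Python 3.9+, where the builtin types are subscriptable in annotations):

```python
# Flagged by UP006 (deprecated typing aliases):
#   from typing import Dict, List
#   def tokenize(rows: List[Dict[str, str]]) -> List[int]: ...

# Preferred: subscript the builtin types directly, no typing import needed.
# `tokenize` is a toy stand-in, not a function from the repository.
def tokenize(rows: list[dict[str, str]]) -> list[int]:
    # Return the length of each row's "text" field.
    return [len(row["text"]) for row in rows]
```

Ruff's `--fix` mode can rewrite most UP006 violations automatically, which makes it a cheap rule to enable alongside a type-checking pass like this one.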