Adds type checking for dataset_transformation.py. #1390

Merged
finbarrtimbers merged 6 commits into allenai/open-instruct:main from allenai/open-instruct:finbarr/more-type-checking
Jan 20, 2026

Conversation

@finbarrtimbers
Collaborator

@finbarrtimbers finbarrtimbers commented Jan 20, 2026

Also enables rule UP006, which bans importing List, Dict, etc. from typing in favor of the built-in list, dict, etc. generic types.
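As an illustration of what a UP006 fix looks like (the function below is hypothetical, not from this PR), the deprecated typing aliases are replaced with built-in generics:

```python
# Before (flagged by ruff rule UP006):
#   from typing import Dict, List
#   def count_tokens(batches: List[Dict[str, int]]) -> Dict[str, int]: ...

# After: built-in generics, valid on Python 3.9+, no typing import needed.
def count_tokens(batches: list[dict[str, int]]) -> dict[str, int]:
    totals: dict[str, int] = {}
    for batch in batches:
        for key, value in batch.items():
            totals[key] = totals.get(key, 0) + value
    return totals
```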

@gemini-code-assist
Contributor

Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the type safety and robustness of the dataset_transformation.py module. By introducing more explicit type hints, runtime type assertions, and comprehensive None value checks, the changes aim to prevent common programming errors, enhance code readability, and facilitate easier maintenance. The module is now also included in the project's type checking configuration.

Highlights

  • Enhanced Type Safety and Error Handling: Added explicit ValueError checks for None values in get_commit_hash and within the TokenizerConfig class for tokenizer_name_or_path and tokenizer_revision. This prevents potential runtime errors due to uninitialized or missing values.
  • Improved Type Hinting and Runtime Assertions: Introduced assert isinstance checks for Dataset objects after load_dataset calls and for torch.Tensor results from apply_chat_template. Type hints for chat_template_kwargs and CollatorForPreferenceDataset.__call__'s batch parameter were also refined for better clarity and type checking.
  • Conditional Logic Refinement: Modified the get_tokenizer_tulu_v2_2 function to correctly handle cases where tc.chat_template_name might be None. The upsampling_factor calculation now also explicitly checks for dc.dataset_range being None.
  • Type Checking Configuration: The pyproject.toml file has been updated to include open_instruct/dataset_transformation.py in the type checking scope, ensuring that these new type annotations and checks are enforced.



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The pull request primarily focuses on enhancing type safety and robustness across various dataset transformation functions and configurations. Key improvements include adding explicit type assertions for loaded datasets, refining conditional checks to prevent potential None related errors, and updating type hints for better clarity and correctness. These changes contribute to a more reliable and maintainable codebase.

Comment thread open_instruct/dataset_transformation.py Outdated
         self.pad_token_id = pad_token_id

-    def __call__(self, batch: List[Dict[str, int]]):
+    def __call__(self, batch: List[Dict[str, List[int]]]):
Severity: high

The type hint for the batch parameter in SimplePreferenceCollator.__call__ has been updated from List[Dict[str, int]] to List[Dict[str, List[int]]]. This is a critical correction as the input_ids_chosen and input_ids_rejected fields within the batch are lists of integers, not single integers. The previous type hint was misleading and could lead to incorrect assumptions about the data structure.
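To see why the corrected hint matters, here is a minimal collator sketch (a stand-in, not the actual SimplePreferenceCollator implementation): each batch element maps field names to token-id lists, which the collator pads to a common length.

```python
class SimplePreferenceCollator:
    """Illustrative sketch: pads every token-id list in the batch to the
    length of the longest one, using pad_token_id."""

    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id

    def __call__(self, batch: list[dict[str, list[int]]]) -> dict[str, list[list[int]]]:
        # Each value is a list of ints, so len() and padding make sense;
        # under the old hint (dict[str, int]) this code would be ill-typed.
        max_len = max(len(ids) for row in batch for ids in row.values())
        return {
            key: [row[key] + [self.pad_token_id] * (max_len - len(row[key])) for row in batch]
            for key in batch[0]
        }
```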


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c1d35e8a6c


Comment thread open_instruct/dataset_transformation.py Outdated
Comment on lines +80 to +81
if commit_hash is None:
raise ValueError(f"Could not extract commit hash from {file}")

P1: Avoid hard failure when commit hash is missing

The new ValueError here makes get_commit_hash fail whenever custom_cached_file returns None, which is the normal case on a fresh machine or first run where the dataset/model files haven’t been cached yet. Because DatasetConfig.__post_init__ calls get_commit_hash before load_dataset, this change now aborts the run instead of downloading the dataset. Consider allowing None (or triggering a download) so first-time usage doesn’t crash.


Comment thread open_instruct/dataset_transformation.py Outdated
Comment on lines +884 to +885
if self.tokenizer_revision is None:
raise ValueError("tokenizer_revision must be set")

P2: Allow tokenizer revision to default when unset

This now raises if tokenizer_revision is unset, but several entrypoints use TokenizerConfig without providing a revision (e.g., scripts/data/convert_sft_data_for_olmocore.py only passes --tokenizer_name_or_path). Transformers accept revision=None and default to main, so this change turns previously valid invocations into hard errors. Either supply a default or keep accepting None here.


frac_or_num_samples: Optional[Union[int, float]] = None
original_dataset_size: Optional[int] = None
is_upsampled: bool = False
dataset: Dataset = field(init=False)

P1: Exclude Dataset objects from config hashing

Adding dataset as a dataclass field means asdict(dc) in compute_config_hash will now include a datasets.Dataset instance; json.dumps will raise TypeError: Object of type Dataset is not JSON serializable. This will break cache hash computation whenever dataset_config_hash is not provided (the default). The dataset object should be excluded from the config dict or marked so it isn’t serialized.
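One way to implement the suggested exclusion, sketched with hypothetical names (the real DatasetConfig and compute_config_hash differ): tag the runtime-only field via dataclass metadata and filter it out before serializing.

```python
import hashlib
import json
from dataclasses import dataclass, field, fields

@dataclass
class DatasetConfigSketch:
    dataset_name: str
    dataset_split: str = "train"
    # Runtime-only state: tagged so the hash helper below skips it,
    # keeping json.dumps from seeing a non-serializable object.
    dataset: object = field(init=False, default=None, metadata={"exclude_from_hash": True})

def compute_config_hash(dc: DatasetConfigSketch) -> str:
    hashable = {
        f.name: getattr(dc, f.name)
        for f in fields(dc)
        if not f.metadata.get("exclude_from_hash")
    }
    return hashlib.sha256(json.dumps(hashable, sort_keys=True).encode()).hexdigest()[:10]
```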


Collaborator

@natolambert natolambert left a comment


LGTM

@finbarrtimbers finbarrtimbers added this pull request to the merge queue Jan 20, 2026
Merged via the queue into main with commit 6947a0d Jan 20, 2026
6 of 7 checks passed
@finbarrtimbers finbarrtimbers deleted the finbarr/more-type-checking branch January 20, 2026 16:59
sang1583535 pushed a commit to sang1583535/open-instruct that referenced this pull request Feb 3, 2026
* Now, type checking should pass

* Added changelog update.

* lots of type checking fixes

* type checking passes

* updated changelog.
lukashelff pushed a commit to lukashelff/open-instruct-slurm that referenced this pull request Feb 19, 2026
* Now, type checking should pass

* Added changelog update.

* lots of type checking fixes

* type checking passes

* updated changelog.