Adds type checking for dataset_transformation.py. #1390

finbarrtimbers merged 6 commits into allenai/open-instruct:main from allenai/open-instruct:finbarr/more-type-checking
Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the type safety and robustness of the dataset transformation code.
Code Review
The pull request primarily focuses on enhancing type safety and robustness across various dataset transformation functions and configurations. Key improvements include adding explicit type assertions for loaded datasets, refining conditional checks to prevent potential None related errors, and updating type hints for better clarity and correctness. These changes contribute to a more reliable and maintainable codebase.
```diff
     self.pad_token_id = pad_token_id

-    def __call__(self, batch: List[Dict[str, int]]):
+    def __call__(self, batch: List[Dict[str, List[int]]]):
```
The type hint for the batch parameter in SimplePreferenceCollator.__call__ has been updated from List[Dict[str, int]] to List[Dict[str, List[int]]]. This is a critical correction as the input_ids_chosen and input_ids_rejected fields within the batch are lists of integers, not single integers. The previous type hint was misleading and could lead to incorrect assumptions about the data structure.
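To make the corrected hint concrete, here is a minimal sketch of a collator matching it. The class and field names (`SimplePreferenceCollator`, `input_ids_chosen`, `input_ids_rejected`) come from the review comment; the padding logic is an assumption for illustration, not the repository's actual implementation.

```python
from typing import Dict, List


class SimplePreferenceCollator:
    """Sketch of a preference collator; padding behavior is assumed."""

    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id

    def __call__(self, batch: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
        # Each example carries token-id *lists*, which is why the values
        # are List[int] rather than int.
        chosen = [ex["input_ids_chosen"] for ex in batch]
        rejected = [ex["input_ids_rejected"] for ex in batch]
        max_len = max(len(seq) for seq in chosen + rejected)

        def pad(seq: List[int]) -> List[int]:
            return seq + [self.pad_token_id] * (max_len - len(seq))

        return {
            "input_ids_chosen": [pad(s) for s in chosen],
            "input_ids_rejected": [pad(s) for s in rejected],
        }
```

With the old `List[Dict[str, int]]` hint, a type checker would have flagged the padding arithmetic above, since you cannot take `len()` of an `int`.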
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c1d35e8a6c
```python
if commit_hash is None:
    raise ValueError(f"Could not extract commit hash from {file}")
```
Avoid hard failure when commit hash is missing
The new ValueError here makes get_commit_hash fail whenever custom_cached_file returns None, which is the normal case on a fresh machine or first run where the dataset/model files haven’t been cached yet. Because DatasetConfig.__post_init__ calls get_commit_hash before load_dataset, this change now aborts the run instead of downloading the dataset. Consider allowing None (or triggering a download) so first-time usage doesn’t crash.
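A lenient variant along the lines the reviewer suggests might look like the sketch below. The `custom_cached_file` helper is only represented by its result (`cached_path`), and the Hugging Face `snapshots/<commit_hash>/` cache-layout parsing is an assumption for illustration, not the repository's code.

```python
from typing import Optional


def get_commit_hash_lenient(cached_path: Optional[str]) -> Optional[str]:
    """Return the commit hash if available, or None instead of raising.

    `cached_path` stands in for the return value of custom_cached_file();
    on a fresh machine it is None because nothing has been downloaded yet.
    """
    if cached_path is None:
        # First run / cold cache: defer resolution instead of crashing,
        # so load_dataset() still gets a chance to download the files.
        return None
    # Assumed HF cache layout: .../snapshots/<commit_hash>/<file>
    parts = cached_path.split("/")
    if "snapshots" in parts:
        idx = parts.index("snapshots")
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return None
```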
```python
if self.tokenizer_revision is None:
    raise ValueError("tokenizer_revision must be set")
```
Allow tokenizer revision to default when unset
This now raises if tokenizer_revision is unset, but several entrypoints use TokenizerConfig without providing a revision (e.g., scripts/data/convert_sft_data_for_olmocore.py only passes --tokenizer_name_or_path). Transformers accept revision=None and default to main, so this change turns previously valid invocations into hard errors. Either supply a default or keep accepting None here.
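One way to keep those entrypoints working is to default the revision rather than raise, as sketched below. This is a hypothetical fragment of `TokenizerConfig` (only the fields relevant here), not the repository's actual dataclass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenizerConfigSketch:
    """Hypothetical slice of TokenizerConfig for illustration."""

    tokenizer_name_or_path: str
    tokenizer_revision: Optional[str] = None

    def __post_init__(self):
        # Mirror transformers' behavior, where revision=None resolves to
        # the "main" branch, instead of turning unset into a hard error.
        if self.tokenizer_revision is None:
            self.tokenizer_revision = "main"
```

With this approach `mypy` still sees a non-None revision after `__post_init__`, while callers such as the conversion script keep working without changes.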
```python
frac_or_num_samples: Optional[Union[int, float]] = None
original_dataset_size: Optional[int] = None
is_upsampled: bool = False
dataset: Dataset = field(init=False)
```
Exclude Dataset objects from config hashing
Adding dataset as a dataclass field means asdict(dc) in compute_config_hash will now include a datasets.Dataset instance; json.dumps will raise TypeError: Object of type Dataset is not JSON serializable. This will break cache hash computation whenever dataset_config_hash is not provided (the default). The dataset object should be excluded from the config dict or marked so it isn’t serialized.
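One fix is to tag runtime-only fields with dataclass `metadata` and skip them when building the hash payload, as in this sketch. The field names follow the review comment, but the `exclude_from_hash` flag and the hashing helper are assumptions, not the repository's implementation.

```python
import hashlib
import json
from dataclasses import dataclass, field, fields
from typing import Optional, Union


@dataclass
class DatasetConfigSketch:
    """Hypothetical slice of DatasetConfig for illustration."""

    frac_or_num_samples: Optional[Union[int, float]] = None
    original_dataset_size: Optional[int] = None
    is_upsampled: bool = False
    # Runtime-only object (stands in for datasets.Dataset); the metadata
    # flag marks it as excluded from hash computation.
    dataset: object = field(init=False, default=None, metadata={"exclude_from_hash": True})


def compute_config_hash_sketch(dc) -> str:
    # Serialize only JSON-safe config fields, skipping runtime objects,
    # so json.dumps never sees a Dataset instance.
    payload = {
        f.name: getattr(dc, f.name)
        for f in fields(dc)
        if not f.metadata.get("exclude_from_hash", False)
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```

Iterating `fields()` with a metadata filter, rather than calling `asdict()`, is the key design choice: `asdict()` unconditionally includes every field, which is exactly what makes the reported `TypeError` possible.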
* Now, type checking should pass
* Added changelog update.
* lots of type checking fixes
* type checking passes
* updated changelog.
Also enables rule UP006, which bans `from typing import List, Dict`, etc., in favor of the native `list`, `dict`, etc. types.
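As a minimal before/after illustration of what UP006 flags (assuming Python 3.9+, where the builtin types are subscriptable in annotations):

```python
# Flagged by UP006 (deprecated typing aliases):
#   from typing import Dict, List
#   def tokenize(rows: List[Dict[str, str]]) -> List[int]: ...

# Preferred: subscript the builtin types directly, no typing import needed.
# `tokenize` is a toy stand-in, not a function from the repository.
def tokenize(rows: list[dict[str, str]]) -> list[int]:
    # Return the length of each row's "text" field.
    return [len(row["text"]) for row in rows]
```

Ruff's `--fix` mode can rewrite most UP006 violations automatically, which makes it a cheap rule to enable alongside a type-checking pass like this one.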