Add TextRLEnvironment for text-based RL environments #1489
Conversation
Summary of Changes

Hello @hamishivi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the reinforcement learning capabilities by introducing a new abstraction for text-based environments, unifying the handling of diverse interactive tasks. It streamlines the integration of these environments into the model generation pipeline and refines the DPO training process with performance and feature improvements, including dynamic rubric generation for more adaptive reward signals.
Code Review
This is a substantial pull request that introduces a new abstraction for text-based reinforcement learning environments, TextRLEnvironment, and refactors the existing tool system to unify it under a common RLEnvironment base class. The addition of WordleTextEnv is a great example of how to use the new text-based environment. The changes to support this new abstraction, especially in vllm_utils.py and grpo_fast.py, are well-implemented.
I'm particularly impressed by the correctness fixes included here, such as the proper handling of gradient accumulation for packed sequences in DPO, the fix for the weight sync hang, and the correction to the MFU calculation. These show great attention to detail.
I have one main point of feedback regarding the new appworld_env.py file, which is quite large and could benefit from being broken down into smaller, more focused modules for better maintainability. I've also highlighted a TODO in that file that points to a potential resource leak, which should be addressed.
Force-pushed from c09ad4a to 33ddc0f
Force-pushed from 1874b5a to 1b206d7
/gemini review
Code Review
This pull request introduces TextRLEnvironment, a new base class for text-based reinforcement learning environments. This is a significant feature that allows models to be trained on interactive tasks where the entire model generation is treated as a single string input to the environment. The implementation includes a WordleTextEnv example, modifications to the tool parsing and processing pipeline to accommodate text-based environments, and a comprehensive set of unit tests.
The overall approach of using a "shadow tool call" to integrate text environments into the existing tool-based dispatch logic is clever and minimizes code duplication. The changes are well-structured across the different files.
I have one main concern regarding the logic in vllm_utils.py that decides when to create this shadow tool call, which could lead to the text environment being skipped if the model's output is incidentally parsed as a tool call. I've left a detailed comment with a suggestion to address this.
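To make that concern concrete, here is a minimal sketch of a dispatch guard along those lines. Apart from `TextRLEnvironment` and `EnvCall` (names from the PR), everything here is hypothetical stand-in code, not the PR's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class EnvCall:                         # stand-in for the PR's EnvCall
    name: str
    args: dict = field(default_factory=dict)

class RLEnvironment: ...               # stand-in base class
class TextRLEnvironment(RLEnvironment): ...

def build_env_call(env, generation, parsed_tool_call):
    # For text environments, always wrap the full generation in a shadow
    # EnvCall -- even if the output incidentally parses as a tool call --
    # so the text env is neither skipped nor triggered twice.
    if isinstance(env, TextRLEnvironment):
        return EnvCall(name="text_env", args={"text": generation})
    return parsed_tool_call            # regular tools use the parsed call
```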
Force-pushed from deb338c to 4c95f75
/gemini review
Code Review
The pull request introduces TextRLEnvironment for text-based RL environments, along with a WordleTextEnv example. It also updates the ToolParser to handle roles when formatting outputs and integrates text environments into the process_request flow. The changes are well-structured and include necessary updates to the CHANGELOG.md and unit tests. The new WordleTextEnv provides a clear demonstration of the text-based environment capabilities.
Force-pushed from d4b4fb2 to c9fbbfb
/gemini review
Code Review
The introduction of TextRLEnvironment is a valuable addition, enabling training on interactive text-based tasks like Wordle where the model's full generation is treated as the environment input. The implementation of WordleTextEnv is robust, correctly handling positional feedback and duplicate letters. The integration with vLLM rollouts via role_template in the tool parsers is well-designed, allowing environment responses to be formatted naturally within the conversation. I have identified one high-priority issue in the request processing logic where text environments could be triggered twice if the model output happens to match a tool call pattern.
Force-pushed from 360ebbe to 27c8d3b
…ast match

- Regex now accepts any-length word in <guess> tags, length validated in step
- Wrong-length guesses get specific feedback + 0.2 format reward
- Repeated guesses rejected with feedback, don't consume a guess
- Multiple <guess> tags: take last match (matches prime-rl parse_answer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
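A hedged sketch of the parsing this commit describes; the PR's exact regex and flags may differ:

```python
import re

GUESS_RE = re.compile(r"<guess>\s*(.*?)\s*</guess>", re.IGNORECASE | re.DOTALL)

def parse_guess(text: str) -> str | None:
    """Accept any-length word here (length is validated later in step);
    with multiple <guess> tags, take the LAST match, like prime-rl's
    parse_answer."""
    matches = GUESS_RE.findall(text)
    return matches[-1].strip().lower() if matches else None
```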
TextArena's set_invalid_move does NOT end the game — it gives feedback and lets the player retry (with an error allowance). Match that:

- Invalid moves (no tags, wrong length, repeats) don't set done=True
- Invalid moves don't count as guesses (no turn increment)
- Carry forward last valid guess's reward so reward_aggregator=last sees actual game performance, not a meaningless rejection reward

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…utive errors

TextArena's SinglePlayerState allows 1 error before ending the game. Match that: first invalid move (no tags, wrong length, repeat) gets a retry with feedback. Second consecutive invalid move ends the game with reward=0. Valid moves reset the error count.

Removes _last_valid_reward carry-forward — cleaner and matches how prime-rl's rubric would score an invalid final move.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
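A minimal sketch of this retry logic, assuming a hypothetical per-episode `state` dict rather than the PR's actual state:

```python
def update_on_move(state: dict, move_is_valid: bool, error_allowance: int = 1) -> bool:
    """Return True if the game should end. First invalid move gets a retry;
    a second consecutive one ends the game; valid moves reset the count."""
    if move_is_valid:
        state["consecutive_errors"] = 0
        return False
    state["consecutive_errors"] = state.get("consecutive_errors", 0) + 1
    return state["consecutive_errors"] > error_allowance
```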
Previously async_steps only controlled how many prompts were seeded initially, but the data preparation loop would run unboundedly ahead of training (40+ steps observed in practice). This caused significant staleness in prepared data. Now the prep loop blocks when it gets more than async_steps ahead of the last step consumed by training. This makes async_steps a true window: data prep stays within [consumed, consumed + async_steps]. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
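A sketch of the windowing described here, with all names hypothetical and polling standing in for whatever blocking primitive the real loop uses:

```python
import time

def data_prep_loop(prep_step_fn, get_consumed_step, async_steps: int, total_steps: int):
    """Keep prepared data within [consumed, consumed + async_steps]: block
    whenever prep would run more than async_steps ahead of the last step
    training consumed."""
    for step in range(total_steps):
        while step - get_consumed_step() >= async_steps:
            time.sleep(0.1)        # the real loop would block on a queue/event
        prep_step_fn(step)
```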
When the game ends due to consecutive invalid moves, compute reward like prime-rl's rubric would: partial_answer from last valid guess scoring + format_reward based on fraction of properly tagged messages. Previously returned reward=0, losing all signal from good play. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match TextArena's full validation chain:

1. Format check (no <guess> tags)
2. Invalid characters (spaces, non-alpha in guess)
3. Length check (not 5 letters)
4. Repeat check (already guessed)
5. Dictionary check (not an English word via NLTK)

All invalid moves use error_allowance=1. Previously fake words like "doule", "volfe", "tlope" consumed guess slots; now they're rejected with feedback, giving the model a chance to retry with a real word.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
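A sketch of that chain; the feedback strings are illustrative, and `english_words` stands in for the NLTK word list:

```python
def validate_guess(guess: str | None, guessed: set[str], english_words: set[str]) -> str | None:
    """Return feedback for an invalid move (retry allowed, per
    error_allowance=1) or None if the guess is playable."""
    if guess is None:
        return "No <guess> tags found."              # 1. format
    if not guess.isalpha():
        return "Guess contains invalid characters."  # 2. characters
    if len(guess) != 5:
        return "Guess must be exactly 5 letters."    # 3. length
    if guess in guessed:
        return "You already guessed that word."      # 4. repeat
    if guess not in english_words:
        return "That is not an English word."        # 5. dictionary
    return None
```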
- Add --per_turn_max_tokens option: caps each generation turn while overall response_length still applies. Matches prime-rl's max_tokens=1024 per turn.
- Fix feedback format: remove extra blank line between scoring and remaining guesses to match TextArena's format.
- Fix length bonus: count all <guess> attempts (including invalid) to match prime-rl's length_bonus function.
- Wordle 8GPU script set to --per_turn_max_tokens 1024.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
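How the two caps interact, as a minimal sketch (the function name is hypothetical):

```python
def next_turn_budget(per_turn_max_tokens: int, response_length: int, tokens_used: int) -> int:
    """Cap each generation turn at per_turn_max_tokens while the overall
    response_length budget still applies across the whole rollout."""
    return max(0, min(per_turn_max_tokens, response_length - tokens_used))
```

With --per_turn_max_tokens 1024, a single turn can never exceed 1024 tokens even when most of the total response_length budget remains.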
Format reward is now 0.2 * (turns_with_tags / total_turns), averaged across all turns like prime-rl's rubric. Both valid and invalid-but-tagged turns count toward format compliance. Length bonus uses total_turns (including untagged) so every wasted turn penalizes the bonus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
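A sketch of both reward terms; the format reward follows the formula stated above, while the length-bonus shape is an assumption, not prime-rl's literal formula:

```python
def format_reward(turns_with_tags: int, total_turns: int) -> float:
    """0.2 * fraction of turns containing <guess> tags; valid and
    invalid-but-tagged turns both count."""
    return 0.2 * (turns_with_tags / total_turns) if total_turns else 0.0

def length_bonus(total_turns: int, max_turns: int = 6) -> float:
    """Uses ALL turns (including untagged), so every wasted turn reduces
    the bonus. Linear shape assumed for illustration."""
    return max(0.0, (max_turns - total_turns) / max_turns)
```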
Add --truncated_importance_sampling_ratio_cap 5.0 to reweight samples based on policy divergence. Should help with the flat loss/ratio observed over 40 steps where the policy wasn't changing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
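The core of truncated importance sampling, as a minimal sketch (function and argument names are hypothetical):

```python
import torch

def truncated_is_weights(new_logprobs: torch.Tensor,
                         old_logprobs: torch.Tensor,
                         cap: float = 5.0) -> torch.Tensor:
    """Per-token importance ratio between the current policy and the policy
    that generated the (possibly stale) sample, clamped at `cap` (the
    --truncated_importance_sampling_ratio_cap value) to bound variance."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    return torch.clamp(ratio, max=cap)
```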
Replace the three overlapping role-formatting mechanisms (default_role_template, role_map, role_templates) with a single role_templates dict where each value has the role name baked in. Also remove unnecessary kwargs guards in example envs and fix pre-existing Wordle test failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
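A sketch of the unified scheme: one dict, role baked into each value. The template strings are assumptions (chat-markup style for illustration only):

```python
role_templates = {
    "tool": "<|im_start|>tool\n{output}<|im_end|>\n",
    "user": "<|im_start|>user\n{output}<|im_end|>\n",
}

def format_env_output(role: str, output: str) -> str:
    # one lookup replaces default_role_template / role_map / role_templates
    return role_templates[role].format(output=output)
```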
/gemini review
Code Review
This pull request introduces TextRLEnvironment for text-based reinforcement learning environments, a valuable abstraction that simplifies handling of non-tool-based interactions. The changes are well-structured, including a new WordleTextEnv example, role-aware formatting in parsers, and updates to the vLLM processing loop to support these new environments. The accompanying tests are thorough and cover the new functionality well. I have one minor suggestion to improve code conciseness in data_loader.py.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Replace generic "task_id" with the specific kwargs each environment expects: "target" for CounterEnv and "number" for GuessNumberEnv. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
natolambert left a comment
Looks good! One thing to check: WordleTextEnv imports nltk at the top of examples.py, and tools.py pulls WordleTextEnvConfig into TOOL_REGISTRY -- so if nltk isn't added to project dependencies, it'll break all tool/env imports, not just Wordle. Make sure it's in pyproject.toml.
We already had nltk for the ifeval stuff so should be fine!
Summary

Adds TextRLEnvironment, a new RLEnvironment subclass for environments that work with plain text instead of parsed tool calls, and makes tool output formatting role-aware across all parsers. This opens up the possibility of training on simulated users in true multi-turn RL :D

TextRLEnvironment (environments/base.py), sketched after this list:
- text_step(text: str) -> StepResult instead of step(EnvCall)
- response_role = "user" by default (base RLEnvironment defaults to "tool")
- step() extracts args["text"] from a shadow EnvCall and forwards to text_step
- reset() always returns an empty tool list
- Adds WordleTextEnv, which matches the TextArena implementation.

Multi-turn improvements:
- env_response_role is passed to process_tool_tokens → format_tool_outputs for role-aware formatting (only supported by vllm parsers rn).
- per_turn_max_tokens support added
- Removed task_id from the reset() interface; envs read their own kwargs (more natural than shoehorning things in via task_id)
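A minimal sketch of the interface described by the bullets above. The method signatures and defaults follow the summary; StepResult's field names and everything else are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class EnvCall:
    name: str
    args: dict = field(default_factory=dict)

@dataclass
class StepResult:                    # field names here are assumptions
    observation: str
    reward: float = 0.0
    done: bool = False

class RLEnvironment:
    response_role = "tool"           # base default

class TextRLEnvironment(RLEnvironment):
    response_role = "user"           # env replies render as user turns

    def step(self, call: EnvCall) -> StepResult:
        # unwrap the shadow EnvCall and forward the raw generation
        return self.text_step(call.args["text"])

    def text_step(self, text: str) -> StepResult:
        raise NotImplementedError

    def reset(self, **kwargs) -> list:
        return []                    # text envs expose no tools
```

Tests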
I've validated our results match Prime-RL's wordle experiments:


us:
prime:
Note the rewards are off a bit because we don't have as complex a formatting reward, but the curves increase by similar amounts.
We also match in final evals:
[Final eval results: prime vs. us]
The script for recreating these results is scripts/train/debug/envs/wordle_8gpu.sh. Wandb id for the successful job is https://wandb.ai/ai2-llm/open_instruct_internal/runs/vi0zqrvc.

Some other runs to ensure other things aren't broken:
successful 'regular' tool run: https://beaker.org/orgs/ai2/workspaces/open-instruct-dev/work/01KJE7T8N7S0Q1R55ACTJCJAQY?taskId=01KJE7T8P57ME0QJ9XJB6MWK61&jobId=01KJE7T8V7858N07DS4RGMEQAC
successful 'regular' math run: https://beaker.org/orgs/ai2/workspaces/open-instruct-dev/work/01KJE7G3RB75QCTMXZ7H1D4QAG?taskId=01KJE7G3RK4RKWN1WTFKVFC5YD&jobId=01KJE7G3Z8GGZBBTZQFTAAQ4QD