Add TextRLEnvironment for text-based RL environments #1489
Conversation
Summary of Changes

Hello @hamishivi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the reinforcement learning capabilities by introducing a new abstraction for text-based environments, unifying the handling of diverse interactive tasks. It streamlines the integration of these environments into the model generation pipeline and refines the DPO training process with performance and feature improvements, including dynamic rubric generation for more adaptive reward signals.
Code Review
This is a substantial pull request that introduces a new abstraction for text-based reinforcement learning environments, TextRLEnvironment, and refactors the existing tool system to unify it under a common RLEnvironment base class. The addition of WordleTextEnv is a great example of how to use the new text-based environment. The changes to support this new abstraction, especially in vllm_utils.py and grpo_fast.py, are well-implemented.
I'm particularly impressed by the correctness fixes included here, such as the proper handling of gradient accumulation for packed sequences in DPO, the fix for the weight sync hang, and the correction to the MFU calculation. These show great attention to detail.
I have one main point of feedback regarding the new appworld_env.py file, which is quite large and could benefit from being broken down into smaller, more focused modules for better maintainability. I've also highlighted a TODO in that file that points to a potential resource leak, which should be addressed.
Force-pushed from c09ad4a to 33ddc0f
Force-pushed from 1874b5a to 1b206d7
/gemini review
Code Review
This pull request introduces TextRLEnvironment, a new base class for text-based reinforcement learning environments. This is a significant feature that allows models to be trained on interactive tasks where the entire model generation is treated as a single string input to the environment. The implementation includes a WordleTextEnv example, modifications to the tool parsing and processing pipeline to accommodate text-based environments, and a comprehensive set of unit tests.
The overall approach of using a "shadow tool call" to integrate text environments into the existing tool-based dispatch logic is clever and minimizes code duplication. The changes are well-structured across the different files.
I have one main concern regarding the logic in vllm_utils.py that decides when to create this shadow tool call, which could lead to the text environment being skipped if the model's output is incidentally parsed as a tool call. I've left a detailed comment with a suggestion to address this.
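To make that concern concrete, here is a minimal sketch of a dispatch guard along those lines. Apart from `TextRLEnvironment` and `EnvCall` (names from the PR), everything here is hypothetical stand-in code, not the PR's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class EnvCall:                         # stand-in for the PR's EnvCall
    name: str
    args: dict = field(default_factory=dict)

class RLEnvironment: ...               # stand-in base class
class TextRLEnvironment(RLEnvironment): ...

def build_env_call(env, generation, parsed_tool_call):
    # For text environments, always wrap the full generation in a shadow
    # EnvCall -- even if the output incidentally parses as a tool call --
    # so the text env is neither skipped nor triggered twice.
    if isinstance(env, TextRLEnvironment):
        return EnvCall(name="text_env", args={"text": generation})
    return parsed_tool_call            # regular tools use the parsed call
```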
Force-pushed from deb338c to 4c95f75
/gemini review
Code Review
The pull request introduces TextRLEnvironment for text-based RL environments, along with a WordleTextEnv example. It also updates the ToolParser to handle roles when formatting outputs and integrates text environments into the process_request flow. The changes are well-structured and include necessary updates to the CHANGELOG.md and unit tests. The new WordleTextEnv provides a clear demonstration of the text-based environment capabilities.
Force-pushed from d4b4fb2 to c9fbbfb
/gemini review
Code Review
The introduction of TextRLEnvironment is a valuable addition, enabling training on interactive text-based tasks like Wordle where the model's full generation is treated as the environment input. The implementation of WordleTextEnv is robust, correctly handling positional feedback and duplicate letters. The integration with vLLM rollouts via role_template in the tool parsers is well-designed, allowing environment responses to be formatted naturally within the conversation. I have identified one high-priority issue in the request processing logic where text environments could be triggered twice if the model output happens to match a tool call pattern.
Force-pushed from 360ebbe to 27c8d3b
…ast match

- Regex now accepts any-length word in <guess> tags, length validated in step
- Wrong-length guesses get specific feedback + 0.2 format reward
- Repeated guesses rejected with feedback, don't consume a guess
- Multiple <guess> tags: take last match (matches prime-rl parse_answer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
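A hedged sketch of the parsing this commit describes; the PR's exact regex and flags may differ:

```python
import re

GUESS_RE = re.compile(r"<guess>\s*(.*?)\s*</guess>", re.IGNORECASE | re.DOTALL)

def parse_guess(text: str) -> str | None:
    """Accept any-length word here (length is validated later in step);
    with multiple <guess> tags, take the LAST match, like prime-rl's
    parse_answer."""
    matches = GUESS_RE.findall(text)
    return matches[-1].strip().lower() if matches else None
```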
TextArena's set_invalid_move does NOT end the game — it gives feedback and lets the player retry (with an error allowance). Match that:

- Invalid moves (no tags, wrong length, repeats) don't set done=True
- Invalid moves don't count as guesses (no turn increment)
- Carry forward last valid guess's reward so reward_aggregator=last sees actual game performance, not a meaningless rejection reward

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…utive errors

TextArena's SinglePlayerState allows 1 error before ending the game. Match that: first invalid move (no tags, wrong length, repeat) gets a retry with feedback. Second consecutive invalid move ends the game with reward=0. Valid moves reset the error count.

Removes _last_valid_reward carry-forward — cleaner and matches how prime-rl's rubric would score an invalid final move.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
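A minimal sketch of this retry logic, assuming a hypothetical per-episode `state` dict rather than the PR's actual state:

```python
def update_on_move(state: dict, move_is_valid: bool, error_allowance: int = 1) -> bool:
    """Return True if the game should end. First invalid move gets a retry;
    a second consecutive one ends the game; valid moves reset the count."""
    if move_is_valid:
        state["consecutive_errors"] = 0
        return False
    state["consecutive_errors"] = state.get("consecutive_errors", 0) + 1
    return state["consecutive_errors"] > error_allowance
```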
Previously async_steps only controlled how many prompts were seeded initially, but the data preparation loop would run unboundedly ahead of training (40+ steps observed in practice). This caused significant staleness in prepared data. Now the prep loop blocks when it gets more than async_steps ahead of the last step consumed by training. This makes async_steps a true window: data prep stays within [consumed, consumed + async_steps]. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
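A sketch of the windowing described here, with all names hypothetical and polling standing in for whatever blocking primitive the real loop uses:

```python
import time

def data_prep_loop(prep_step_fn, get_consumed_step, async_steps: int, total_steps: int):
    """Keep prepared data within [consumed, consumed + async_steps]: block
    whenever prep would run more than async_steps ahead of the last step
    training consumed."""
    for step in range(total_steps):
        while step - get_consumed_step() >= async_steps:
            time.sleep(0.1)        # the real loop would block on a queue/event
        prep_step_fn(step)
```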
When the game ends due to consecutive invalid moves, compute reward like prime-rl's rubric would: partial_answer from last valid guess scoring + format_reward based on fraction of properly tagged messages. Previously returned reward=0, losing all signal from good play. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match TextArena's full validation chain:

1. Format check (no <guess> tags)
2. Invalid characters (spaces, non-alpha in guess)
3. Length check (not 5 letters)
4. Repeat check (already guessed)
5. Dictionary check (not an English word via NLTK)

All invalid moves use error_allowance=1. Previously fake words like "doule", "volfe", "tlope" consumed guess slots; now they're rejected with feedback, giving the model a chance to retry with a real word.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
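A sketch of that chain; the feedback strings are illustrative, and `english_words` stands in for the NLTK word list:

```python
def validate_guess(guess: str | None, guessed: set[str], english_words: set[str]) -> str | None:
    """Return feedback for an invalid move (retry allowed, per
    error_allowance=1) or None if the guess is playable."""
    if guess is None:
        return "No <guess> tags found."              # 1. format
    if not guess.isalpha():
        return "Guess contains invalid characters."  # 2. characters
    if len(guess) != 5:
        return "Guess must be exactly 5 letters."    # 3. length
    if guess in guessed:
        return "You already guessed that word."      # 4. repeat
    if guess not in english_words:
        return "That is not an English word."        # 5. dictionary
    return None
```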
- Add --per_turn_max_tokens option: caps each generation turn while overall response_length still applies. Matches prime-rl's max_tokens=1024 per turn.
- Fix feedback format: remove extra blank line between scoring and remaining guesses to match TextArena's format.
- Fix length bonus: count all <guess> attempts (including invalid) to match prime-rl's length_bonus function.
- Wordle 8GPU script set to --per_turn_max_tokens 1024.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
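How the two caps interact, as a minimal sketch (the function name is hypothetical):

```python
def next_turn_budget(per_turn_max_tokens: int, response_length: int, tokens_used: int) -> int:
    """Cap each generation turn at per_turn_max_tokens while the overall
    response_length budget still applies across the whole rollout."""
    return max(0, min(per_turn_max_tokens, response_length - tokens_used))
```

With --per_turn_max_tokens 1024, a single turn can never exceed 1024 tokens even when most of the total response_length budget remains.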
Format reward is now 0.2 * (turns_with_tags / total_turns), averaged across all turns like prime-rl's rubric. Both valid and invalid-but-tagged turns count toward format compliance. Length bonus uses total_turns (including untagged) so every wasted turn penalizes the bonus.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
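A sketch of both reward terms; the format reward follows the formula stated above, while the length-bonus shape is an assumption, not prime-rl's literal formula:

```python
def format_reward(turns_with_tags: int, total_turns: int) -> float:
    """0.2 * fraction of turns containing <guess> tags; valid and
    invalid-but-tagged turns both count."""
    return 0.2 * (turns_with_tags / total_turns) if total_turns else 0.0

def length_bonus(total_turns: int, max_turns: int = 6) -> float:
    """Uses ALL turns (including untagged), so every wasted turn reduces
    the bonus. Linear shape assumed for illustration."""
    return max(0.0, (max_turns - total_turns) / max_turns)
```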
Add --truncated_importance_sampling_ratio_cap 5.0 to reweight samples based on policy divergence. Should help with the flat loss/ratio observed over 40 steps where the policy wasn't changing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
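The core of truncated importance sampling, as a minimal sketch (function and argument names are hypothetical):

```python
import torch

def truncated_is_weights(new_logprobs: torch.Tensor,
                         old_logprobs: torch.Tensor,
                         cap: float = 5.0) -> torch.Tensor:
    """Per-token importance ratio between the current policy and the policy
    that generated the (possibly stale) sample, clamped at `cap` (the
    --truncated_importance_sampling_ratio_cap value) to bound variance."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    return torch.clamp(ratio, max=cap)
```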
Replace the three overlapping role-formatting mechanisms (default_role_template, role_map, role_templates) with a single role_templates dict where each value has the role name baked in. Also remove unnecessary kwargs guards in example envs and fix pre-existing Wordle test failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
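A sketch of the unified scheme: one dict, role baked into each value. The template strings are assumptions (chat-markup style for illustration only):

```python
role_templates = {
    "tool": "<|im_start|>tool\n{output}<|im_end|>\n",
    "user": "<|im_start|>user\n{output}<|im_end|>\n",
}

def format_env_output(role: str, output: str) -> str:
    # one lookup replaces default_role_template / role_map / role_templates
    return role_templates[role].format(output=output)
```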
/gemini review
Code Review
This pull request introduces TextRLEnvironment for text-based reinforcement learning environments, a valuable abstraction that simplifies handling of non-tool-based interactions. The changes are well-structured, including a new WordleTextEnv example, role-aware formatting in parsers, and updates to the vLLM processing loop to support these new environments. The accompanying tests are thorough and cover the new functionality well. I have one minor suggestion to improve code conciseness in data_loader.py.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Replace generic "task_id" with the specific kwargs each environment expects: "target" for CounterEnv and "number" for GuessNumberEnv. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
natolambert left a comment
Looks good! One thing to check: WordleTextEnv imports nltk at the top of examples.py, and tools.py pulls WordleTextEnvConfig into TOOL_REGISTRY -- so if nltk isn't added to project dependencies, it'll break all tool/env imports, not just Wordle. Make sure it's in pyproject.toml.
We already had nltk for the ifeval stuff so should be fine!
Summary

Adds TextRLEnvironment, a new RLEnvironment subclass for environments that work with plain text instead of parsed tool calls, and makes tool output formatting role-aware across all parsers. This opens up the possibility of training on simulated users in true multi-turn RL :D

TextRLEnvironment (environments/base.py), sketched after this list:
- text_step(text: str) -> StepResult instead of step(EnvCall)
- response_role = "user" by default (base RLEnvironment defaults to "tool")
- step() extracts args["text"] from a shadow EnvCall and forwards to text_step
- reset() always returns an empty tool list
- Adds WordleTextEnv, which matches the TextArena implementation.

Multi-turn improvements:
- env_response_role is passed to process_tool_tokens → format_tool_outputs for role-aware formatting (only supported by vllm parsers rn).
- per_turn_max_tokens support added
- Removed task_id from the reset() interface; envs read their own kwargs (more natural than shoehorning things in via task_id)
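A minimal sketch of the interface described by the bullets above. The method signatures and defaults follow the summary; StepResult's field names and everything else are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class EnvCall:
    name: str
    args: dict = field(default_factory=dict)

@dataclass
class StepResult:                    # field names here are assumptions
    observation: str
    reward: float = 0.0
    done: bool = False

class RLEnvironment:
    response_role = "tool"           # base default

class TextRLEnvironment(RLEnvironment):
    response_role = "user"           # env replies render as user turns

    def step(self, call: EnvCall) -> StepResult:
        # unwrap the shadow EnvCall and forward the raw generation
        return self.text_step(call.args["text"])

    def text_step(self, text: str) -> StepResult:
        raise NotImplementedError

    def reset(self, **kwargs) -> list:
        return []                    # text envs expose no tools
```

Tests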
I've validated our results match Prime-RL's wordle experiments:


us:
prime:
Note the rewards are off a bit because we don't have as complex a formatting reward, but the curves increase by similar amounts.
We also match in final evals:
[Final eval results: prime vs. us]
The script for recreating these results is scripts/train/debug/envs/wordle_8gpu.sh. Wandb id for the successful job is https://wandb.ai/ai2-llm/open_instruct_internal/runs/vi0zqrvc.

Some other runs to ensure other things aren't broken:
successful 'regular' tool run: https://beaker.org/orgs/ai2/workspaces/open-instruct-dev/work/01KJE7T8N7S0Q1R55ACTJCJAQY?taskId=01KJE7T8P57ME0QJ9XJB6MWK61&jobId=01KJE7T8V7858N07DS4RGMEQAC
successful 'regular' math run: https://beaker.org/orgs/ai2/workspaces/open-instruct-dev/work/01KJE7G3RB75QCTMXZ7H1D4QAG?taskId=01KJE7G3RK4RKWN1WTFKVFC5YD&jobId=01KJE7G3Z8GGZBBTZQFTAAQ4QD